GPSYCHO
A GPL'd Psycho-Acoustic Model
GPSYCHO is an open source psycho-acoustic model for ISO-based MP3 encoders.
GPSYCHO fixes some substantial bugs in the ISO demonstration source psycho-acoustic
model (ISO psy-model). In addition, GPSYCHO adds mid/side stereo,
real bit reservoir control, much improved critical band bit allocation
routines, variable bit rate (optional) and very good pre-echo control.
At 128kbs, the quality is significantly better than that produced by the
ISO psy-model (as found in almost all other free encoders). An example
of these improvements is shown in Screenshots. GPSYCHO is close to the quality of the FhG encoder, but there is
still room for improvement. Read on if you want to help!
As this code is released under the GPL, it can only be used in other
GPL'd projects. I would also encourage others to help improve
GPSYCHO. Some things that would help:
-
Find (and send me) samples where your favorite encoder does a better job
than GPSYCHO.
-
Run your own listening tests and try tuning some of the algorithms below.
Most have parameters that are set via trial and error.
-
Try out new algorithms!
-
Click the above link for results and descriptions of the listening tests
I use.
New Features (which may need some tuning):
-
Bit allocation outer loop improved based on ideas in an MPEG2 J. Audio
Eng. Soc. 1997 paper. The ISO demonstration source outer loop can
produce some very poor quality frames in certain situations.
-
VBR (variable bit rate) is now working! See the VBR link above
for details.
-
MS_STEREO switch. ISO formula is primitive. I use a switch
described here.
-
MS_STEREO ISO sparsing formula does not work. It will remove
95% of the side channel coefficients. I don't sparse the side channel
at all, but allocate fewer bits for encoding it. Martin Weghofer
<e9427483@student.tuwien.ac.at> has a coder which does effectively
use side channel sparsing, but the algorithm does not work well with the
LAME quantization procedure. This is an area that needs further work.
-
MS_STEREO now uses ideas in a Johnson ICASSP 1992 paper to compute true
Mid and Side thresholds which compensate for stereo de-masking.
It is used in PAC and AAC. Enabled with -h in LAME 3.21.
This will eventually become the default.
-
Bit reservoir use. Again the ISO formula performs poorly. At
128kbs, it always thinks it needs to drain the reservoir, and thus
the reservoir can never build up. It will also use up all the bits
for the left channel before even looking at the right channel. I
put in a kludged up formula that seems to work ok, but could use some tuning.
-
Mid/Side bit allocation. I allocate bits based on the differences
between left and right masking thresholds. Anyone have a better idea?
-
Remove data in scalefactor band 21 when encoding at 128kbs or less.
This was done because FhG also does it (you can see this by running the
graphical frame analyzer). It amounts to a 16kHz low-pass filter
and makes a few more bits available for more important information.
If it offends you, you probably should not be encoding at 128kbs in the
first place! Try 160kbs. At 128kbs, more often than not the
psy-model will remove these frequencies even without this mod.
-
Improved shortblock switching. It is now based on surges in PE or
large fluctuations in energy within a single granule. These improvements
trigger some critical window switching that LAME used to miss.
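The mid/side items above all rest on rotating the left and right channels into mid and side signals. A minimal sketch of that rotation, assuming the orthonormal 1/sqrt(2) scaling. The function names and the side_fraction measure are hypothetical illustrations only; they are not the MS_STEREO switch or the bit allocation formula used in LAME.

```python
import math


def mid_side(left, right):
    """Rotate left/right samples into mid/side (orthonormal scaling assumed)."""
    s = 1.0 / math.sqrt(2.0)
    mid = [(l + r) * s for l, r in zip(left, right)]
    side = [(l - r) * s for l, r in zip(left, right)]
    return mid, side


def energy(samples):
    """Sum of squared sample values."""
    return sum(v * v for v in samples)


def side_fraction(left, right):
    """Share of the total energy that lands in the side channel.

    Hypothetical measure for illustration: 0.0 for a mono signal,
    1.0 when the channels are exactly out of phase.
    """
    mid, side = mid_side(left, right)
    total = energy(mid) + energy(side)
    return energy(side) / total if total > 0 else 0.0
```

With this scaling the rotation preserves total energy, so a nearly mono signal puts almost everything into the mid channel. That is the intuition behind giving the side channel fewer bits rather than sparsing it.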
Features to try out:
-
Add a high-pass filter. 20Hz?
-
Shorter FFT for the long block noise threshold calculation. A 768-point
FFT centered over the 576 sample granule would be more accurate for the
high frequency energies than the 1024-point FFT. This should also
improve the perceptual entropy (pe) calculation since there will be less
interference from data outside the granule. Another advantage might
be for the applaud.wav test - see the Quality
section for details. It will of course make the low frequency energy
estimates less accurate.
-
subblock_gain. This seems to be important. FhG uses it for
most short blocks. LAME and other dist10 based codes do not make
any use of this. One subblock gain algorithm (LAME 3.21) is enabled
with -Z.
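The trade-off in the shorter FFT idea above can be put into numbers. A rough sketch, assuming a 44.1kHz sampling rate and ignoring the shape of the analysis window; the helper names are made up for this example.

```python
GRANULE = 576  # samples per granule


def centered_window(fft_len, granule_start, granule_len=GRANULE):
    """Return (start, end) sample indices of an FFT window centered on the granule."""
    overhang = (fft_len - granule_len) // 2
    start = granule_start - overhang
    return start, start + fft_len


def outside_fraction(fft_len, granule_len=GRANULE):
    """Fraction of the FFT input that lies outside the granule (unweighted)."""
    return (fft_len - granule_len) / fft_len


def bin_width_hz(fft_len, sample_rate=44100):
    """Spacing between FFT bins in Hz (sample rate is an assumption)."""
    return sample_rate / fft_len


for n in (1024, 768):
    print(n, centered_window(n, 0), outside_fraction(n), bin_width_hz(n))
```

Under these assumptions the 768 FFT reaches 96 samples past each side of the granule instead of 224, but its bins are about 57Hz wide instead of about 43Hz. That matches the expectation of less interference from outside the granule at the cost of coarser low frequency estimates.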
Bug fixes:
-
Encoding delay is 528, not 1120 as assumed by ISO psy model. This
was due to a combination of bugs and an error in the published delay of
the polyphase filterbank (240, not 256). It uses a 512 sample window, but
shifts in 32 samples at a time. Note that this means that the
ISO psycho-acoustics are over 1/2 a frame out of sync with the frame being
encoded. This is why the ISO models are so bad at pre-echo control
- they switch to short blocks one frame too late!
-
Serious bugs in mapping of energies into partition bands and partition
bands into critical bands. Big effect on MPEG2, smaller but significant
effect on MPEG1.
-
Some typos, such as norm_l being used instead of norm_s.
-
pre-emphasis seemed buggy - ISO was applying pre-emph and then amplifying
scalefactor bands without recomputing the distortion.
-
The short FFTs were using a 128 shift length, instead of 192.
-
The short and long FFTs were not correctly centered over the 576 samples
that would eventually be encoded.
-
overflow in calc_noise().
-
In some cases, the initial quantization step size chosen by quantanf_init
is too big.
-
short block, long block, short block sequence would retroactively change
the middle long block to a short block, but then there would be no psycho-acoustic
info for the new short block.
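The delay and shift length numbers in the list above can be checked with a little arithmetic. A small sketch, assuming MPEG1 layer 3 framing (1152 samples per frame, two 576 sample granules, three short blocks per granule); the variable and function names are illustrative only.

```python
FRAME = 1152               # samples per MPEG1 layer 3 frame
GRANULE = 576              # samples per granule, two granules per frame
SHORT_BLOCKS_PER_GRANULE = 3


def misalignment(assumed_delay, actual_delay):
    """Samples by which the psycho-acoustic analysis is offset from the encoded data."""
    return assumed_delay - actual_delay


offset = misalignment(1120, 528)
print(offset)              # 592 samples
print(offset / FRAME)      # a bit more than half a frame
print(offset / GRANULE)    # a bit more than one granule

# Three short blocks have to tile one granule, so each short FFT
# advances by a third of a granule.
short_hop = GRANULE // SHORT_BLOCKS_PER_GRANULE
print(short_hop)           # 192
```

An offset of 592 samples is slightly more than one granule, so a transient can land in the granule being encoded while the analysis is still looking at other data.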
Things I've learned from analyzing FhG produced .mp3 frames (layer 3, 128kbs):
-
I've never seen FhG use mixed_blocks
-
I've never seen FhG use intensity stereo
-
I've never seen FhG use scfsi<>0
-
Removes data in scalefactor band 21 at 128kbs.
-
Almost always uses ms_stereo. Does not use ISO formula for ms_stereo
switching.
-
More sophisticated mid/side bit allocation.
-
Excellent short block detection. How do they do this? They
can't be using the ISO pe formula.
-
Good bit reservoir use. Not totally based on pe, since they often
allocate extra bits to long blocks.
-
Does not produce variable bit rate frames.
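The scalefactor band 21 cutoff mentioned above can be tied to a frequency. A short sketch, assuming 44.1kHz input and the long block scalefactor band boundaries from the MPEG1 layer 3 tables for that sampling rate; the names below are chosen for this example and do not come from the LAME source.

```python
# Long block scalefactor band boundaries (in MDCT lines) for 44.1kHz.
SFB_LONG_44100 = [0, 4, 8, 12, 16, 20, 24, 30, 36, 44, 52, 62, 74, 90,
                  110, 134, 162, 196, 238, 288, 342, 418, 576]


def band_edges_hz(band, sample_rate=44100, table=SFB_LONG_44100):
    """Return (low, high) edge of a scalefactor band in Hz."""
    lines = table[-1]                      # 576 MDCT lines per granule
    hz_per_line = (sample_rate / 2) / lines
    return table[band] * hz_per_line, table[band + 1] * hz_per_line


low, high = band_edges_hz(21)
print(low, high)
```

Band 21 starts at MDCT line 418, which works out to roughly 16kHz under these assumptions, so dropping it behaves like the 16kHz low-pass filter described earlier.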