GPSYCHO
Quality and Listening Test Information
A rough estimate of where the GPSYCHO quality improvements come from:
- 35% bug fixes in the psychoacoustic model and pre-echo detection
- 35% bit allocation improvements
- 20% bit reservoir control
- 10% joint stereo
Tuning by listening tests:
- Most improvements in GPSYCHO require detailed listening tests. I
think the best way to go about this is to find a sample where GPSYCHO does
something bad. Then see if you can figure out which algorithm/tuning
is at fault, and how it can be improved without breaking something else!
- The best way to perform a listening test is the "ABC hidden reference test".
Signal A is always the original .wav file. B and C are the encoded
and the original signal, in a random order. Listen to ABC three times,
always in that order, and rate B and C on a scale from 1-5, 5 being for
the signal you perceive as the original.
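The ABC procedure above can be sketched in a few lines. This is only an illustration, not part of LAME: the function names `make_abc_trial` and `score_trial` and the file names are hypothetical, and actual playback is left out.

```python
import random

def make_abc_trial(original, encoded, rng=random):
    """Build one ABC hidden-reference trial.

    A is always the original. B and C hold the original and the encoded
    signal in random order. Returns (playlist, answer), where answer is
    the slot ("B" or "C") that hides the original.
    """
    hidden = [("original", original), ("encoded", encoded)]
    rng.shuffle(hidden)
    playlist = {"A": original, "B": hidden[0][1], "C": hidden[1][1]}
    answer = "B" if hidden[0][0] == "original" else "C"
    return playlist, answer

def score_trial(answer, rating_b, rating_c):
    """Return True if the listener rated the hidden original higher (scale 1-5)."""
    for rating in (rating_b, rating_c):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be between 1 and 5")
    if rating_b == rating_c:
        return False  # the listener expressed no preference
    return (rating_b > rating_c) == (answer == "B")

# Hypothetical file names; a real test would play A, B, C three times in order.
playlist, answer = make_abc_trial("sample.wav", "sample_decoded.wav")
```

If the listener cannot pick out the hidden original more often than chance over many trials, the encoding is effectively transparent for that sample.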
Here is a detailed example on how the pre-echo algorithm in LAME 3.10 was
tuned and dramatically improved by doing a frame by frame comparison with
the FhG encoder. First a sample is found where LAME produces
noticeably worse results than the state-of-the-art FhG encoder. Listening
tests are used to determine which frames are causing most of the problems.
MP3x (the frame analyzer) is then used to compare the troublesome frames
produced by LAME to those produced by the FhG encoder. In the case
presented, the problem was that LAME was not switching to short blocks
when it should have.
Some Test Samples
Check out SQAM - Sound Quality Assessment Material. I haven't had a chance to
try these yet, but if you find samples where another encoder does noticeably
better than LAME, I would be very interested.
I am not an audiophile, my ears are in their 30's, and I perform my
tests with a motherboard sound card and $40 headphones. However,
the quality of the ISO psy-model is so bad it is quite easy to detect the
flaws and I have not yet needed detailed listening tests to improve it.
I've done most of my work with a few samples:
Note that BladeEnc, 8hz-mp3, LAME 2.1 all produce identical results.
Only the BladeEnc result is given.
Test cases which need work
testsignal2.wav Subtle pre-echo test case. (700K, about 5 seconds)
This is a very nice pre-echo test case from Jan Rafaj <rafaj@cedric.vabo.cz>.
It has some clear, isolated drums. If your MP3 encoder does
not switch to short blocks at the precise moment, you will have very noticeable
pre-echo. The pre-echo actually sounds like a snare, but this
snare is completely artificial - there is no trace of it in the original
.wav file! ISO based encoders do very poorly, mostly because
the short block switching is completely broken in the psy model (even if
it detects a pre-echo event, it will switch to short blocks 1 granule too
late). LAME 3.03 does noticeably better, but it still uses
the ISO pre-echo detection criterion, and misses many of the pre-echo events.
If you go into l3psy.c and set switch_pe = 1000 (instead of 1800), LAME
will do much better, maybe 90% as good as FhG.
FhG does great. They seem to have excellent pre-echo detection.
I would love to know what their algorithm is based on.
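To see why lowering switch_pe catches more pre-echo events, here is a minimal sketch of a threshold test on perceptual entropy (PE). It assumes the encoder switches to short blocks when a granule's PE exceeds switch_pe; the function name and the PE values are made up for illustration and are not taken from l3psy.c.

```python
SWITCH_PE_ISO = 1800    # value the text says l3psy.c uses
SWITCH_PE_TUNED = 1000  # value the text suggests trying

def choose_block_types(pe_per_granule, switch_pe):
    """Label a granule 'short' when its perceptual entropy exceeds switch_pe.

    Assumption: higher PE marks a likely attack, so a lower threshold
    triggers short blocks more often.
    """
    return ["short" if pe > switch_pe else "long" for pe in pe_per_granule]

# Invented PE values for five consecutive granules around a drum hit.
pe_values = [400, 650, 1200, 2100, 900]

iso_blocks = choose_block_types(pe_values, SWITCH_PE_ISO)
tuned_blocks = choose_block_types(pe_values, SWITCH_PE_TUNED)
```

With these invented numbers the lower threshold also flags the third granule, which is the kind of attack the ISO criterion would miss.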
Note 5/99: LAME 3.05 has a much improved pre-echo detection algorithm,
and fixes some of the above problems!
Note 7/99: LAME 3.16 has a better pre-echo detection, and allocates
more bits from the reservoir.
testsignal4.wav Subtle distortion case. (800K, about 6 seconds)
Another difficult and subtle case from Jan Rafaj <rafaj@cedric.vabo.cz>.
I believe this is by Enya. There is a slight trill as the volume
increases. I can barely hear it, but the FhG encoder manages to avoid
it. Using mid/side masking thresholds seems to help a lot (-h in LAME 3.21
and higher).
main_theme.wav Strange artifact, mid/side stereo test. (1.7M, about 11 seconds)
This sample is from an old Pink Floyd song. It was found by Robert
Hegemann <Robert.Hegemann@gmx.de>.
In the beginning, while the foreground pans from right to left there
is a slight twinkling sound. This goes away
with -X, but the true cause and a better fix should be found.
It also contains a lot of distortion if mid/side stereo is used.
The new (lame3.12) mid/side switching algorithm solves this problem and
can detect that almost none of the frames should use mid/side stereo.
The FhG encoder also does not use mid/side encoding for this sample.
Fools.wav Good range of effects (5M, about 30 seconds)
I got this off an MP3 encoder comparison web site that later vanished.
I think it is a Lemon Heads song.
It was heavily used to tune the LAME 3.12 mid/side
switch. I use a mono, downsampled version for the current MPEG2
quality improvements.
Test cases previously used to improve LAME
castanets.wav FhG pre-echo reference sample (1.2M, about 7 seconds)
The castanets should sound like a sharp, crisp clack. In the
ISO psy-model, they are smeared out into long, soft, thwack-like sounds.
GPSYCHO makes a dramatic improvement in this, which is detectable on any
sound system. This is due to correctly switching to short blocks
and encoding them with extra bits from the reservoir. The attacks
are very mono in nature, so jstereo also helps because it allows even more
bits for encoding the mid channel. The sample is very close to mono,
but if you really decimate the side it will result in noticeable artifacts.
The FhG encoder does an even better job on this sample, mostly because
it detects some of the later castanets. They are muffled by other sounds
and GPSYCHO fails to recognize them as needing short blocks. Later
on in the sample, the castanets come fast and furious, and even the FhG
encoder cannot maintain enough bits in the bit reservoir. VBR would
be great in this situation. It is very easy to put into an encoder, but
I don't have a player to debug it with.
Normally you have to perform listening tests to determine the quality
of an mp3 encoding. You cannot generally say anything
about the quality by looking at the original and encoded pcm signal.
Pre-echo problems like in castanets.wav are an exception to this.
In a bad encoding, the sharp attack of the castanets will create noise
that is heard before the actual castanets. This flaw is very
visible in the encoded pcm signal, and is shown for several different
encoders in Screenshots.
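The visible pre-echo described above can also be measured. The following is a rough sketch, assuming the decoded PCM has already been time-aligned with the original (real decoders add a delay that must be removed first). The helper names and the synthetic click are hypothetical stand-ins for castanets.wav.

```python
def find_attack(samples, threshold=0.5):
    """Return the index of the first sample whose magnitude reaches threshold."""
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            return i
    return None

def pre_echo_energy(original, decoded, attack_index, window=576):
    """Mean squared error over the `window` samples just before the attack."""
    start = max(0, attack_index - window)
    count = attack_index - start
    if count == 0:
        return 0.0
    err = sum((decoded[i] - original[i]) ** 2 for i in range(start, attack_index))
    return err / count

# Synthetic stand-in for a castanet hit: silence, then a sharp click.
original = [0.0] * 1000 + [1.0, -0.8, 0.6] + [0.0] * 100
clean = list(original)    # an ideal encoding: nothing before the attack
smeared = list(original)  # a bad encoding: noise leaks in before the attack
for i in range(800, 1000):
    smeared[i] = 0.05 if i % 2 == 0 else -0.05

attack = find_attack(original)
clean_score = pre_echo_energy(original, clean, attack)
smeared_score = pre_echo_energy(original, smeared, attack)
```

A larger score before the attack means more audible pre-echo. The 576-sample window is one granule, chosen here only because that is the span a long block would smear across.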
With the castanets.wav file it's easy to try out new short block detection
schemes. You don't have to rely on listening tests since the pre-echo
is so easy to see in the output pcm data. Just modify the graphical
interface to display the new criterion and then go through castanets.wav frame
by frame and see if it is triggered in the correct spots. For an
interesting comparison, run lame with -g (the graphical frame analyzer)
on MP3 files produced by other encoders to see how well they do.
mstest.wav Mid/Side stereo encoding test sample (700K, about 5 seconds)
A good jstereo test case sent to me by Scott Miller <scgmille@indiana.edu>.
It contains some higher frequency modes which are isolated to the left
channel. LAME sounds fine in Stereo mode (-m s), but using any type
of mid/side stereo will spread these modes to the right channel.
Switching between stereo and ms_stereo will result in the annoying effect
of having them turn on and off in the right channel. The FhG encoder
avoids this problem by using very few mid/side stereo frames.
But the LAME mid/side stereo switching criterion cannot detect that this
sample should not be encoded with mid/side stereo, and produces too many
mid/side frames. Suggestions for a better switching criterion are
welcome! I've tried a few things, but anything that works is usually
too restrictive, i.e. it will turn off mid/side stereo for half the frames
in castanets.wav, but this sample should have all frames mid/side stereo.
NOTE 6/99: This problem is fixed with new mid/side
switch added to LAME 3.12!
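The way mid/side coding spreads left-only content into the right channel can be shown with a toy example. This sketch uses the standard mid/side transform, M = (L+R)/sqrt(2) and S = (L-R)/sqrt(2); the uniform quantizer and its step sizes are invented stand-ins for the encoder's real bit allocation.

```python
import math

def to_mid_side(left, right):
    """Standard mid/side transform: M = (L+R)/sqrt(2), S = (L-R)/sqrt(2)."""
    k = math.sqrt(2.0)
    mid = [(l + r) / k for l, r in zip(left, right)]
    side = [(l - r) / k for l, r in zip(left, right)]
    return mid, side

def from_mid_side(mid, side):
    """Inverse transform: L = (M+S)/sqrt(2), R = (M-S)/sqrt(2)."""
    k = math.sqrt(2.0)
    left = [(m + s) / k for m, s in zip(mid, side)]
    right = [(m - s) / k for m, s in zip(mid, side)]
    return left, right

def quantize(values, step):
    """Toy uniform quantizer (hypothetical; real encoders are far more complex)."""
    return [step * round(v / step) for v in values]

# A tone that exists only in the left channel.
left = [math.sin(2.0 * math.pi * i / 16.0) for i in range(64)]
right = [0.0] * 64

# Mid/side coding with a coarser step on the side channel.
mid, side = to_mid_side(left, right)
_, ms_right = from_mid_side(quantize(mid, 0.05), quantize(side, 0.2))

# Plain stereo coding: the silent right channel stays silent.
lr_right = quantize(right, 0.2)

ms_leak = max(abs(v) for v in ms_right)
lr_leak = max(abs(v) for v in lr_right)
```

Because M and S are quantized differently, their errors no longer cancel when the right channel is reconstructed, so part of the left-only tone shows up on the right. Switching between the two modes from frame to frame would turn that leak on and off.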
t1.wav Dire Straits sample (1.4M, about 9 seconds)
This case has some subtle pre-echos that were missed by older versions
of LAME, and it greatly confused the old LAME mid/side stereo switching
criterion. It was found by Nils Faerber <Nils.Faerber@unix-ag.org>.
It was heavily used to tune the LAME 3.12 mid/side
switch, and for more fine tuning of the pre-echo detection algorithm
in LAME 3.15. Nils reports that with LAME 3.12, the quality
is now very close to the FhG encoder.
else3.wav Bit allocation tests. (1.0M, about 6 seconds)
A sample from Sarah McLachlan's "Elsewhere". I first checked out
an MP3 of this song from the Internet (a very high quality encoding).
Later I bought the CD and encoded it myself with an ISO based encoder,
and was surprised at the difference in quality. This is what motivated
me to start looking at the encoder source.
This song contains a lot of very tonal piano music for which even the
ISO encoder usually does ok. But in certain situations it produces
very noticeable distortion in the piano notes. (Particularly in frames
50-70). GPSYCHO fixes this mostly due to the improved outer_loop
in the bit allocation subroutine. This sample also has some attacks
(drums) that are greatly improved with GPSYCHO. I cannot detect a
difference between GPSYCHO and FhG for this sample.
Other test cases
KMFDM-Dogma.wav LAME actually sounds better than FhG! (1.1M, about 6 seconds.)
Found by Kevin Burtch <kburtch@bellsouth.net>.
fatboy.wav Even FhG has trouble with this. (900K, about 5 seconds.)
Found by Jake Hamby <jehamby@anobject.com>
applaud.wav 1.4M, about 9 seconds.
This is a very difficult test sample because of the lack of tonality and
all the sharp attacks. All encoders produce results noticeably
different than the original, but the FhG encoder still has the edge.
The extra quality of the FhG encoder is not due to simple fixes like better
use of short blocks and the bit reservoir. They do switch to
ms_stereo, (and GPSYCHO does not), but forcing GPSYCHO into ms_stereo doesn't
improve things.
One thing I would like to try is switching to a 768 FFT instead of 1024.
The FFT is used to compute the energies in the 576 sample (1 granule) window.
With an FFT of almost twice the size of the granule, the higher frequency
energies within the granule are easily contaminated by data from outside
the granule. Looking at the spectrum with MP3x, you can see that
the signal is dominated by higher than normal frequencies which change
substantially from granule to granule.
For example, one wavelength of a 1kHz signal spans about 44 sample points
(at a 44.1kHz sampling rate), so about 13 wavelengths
will fit in one granule. Estimating the energy in the 1kHz mode with
a 1024 FFT will use the 13 wavelengths within the granule plus 5 wavelengths
on either side of the granule. This is fine if the signal is very
tonal, meaning the energy does not change much from granule to granule,
but this is not the case for the 1kHz signal in applaud.wav.
A 768 FFT would only consider 2 extra wavelengths on each side of the granule,
and they would be mostly in the taper of the Hann window.
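The arithmetic behind these wavelength counts can be checked with a short sketch. It assumes a 44.1kHz sampling rate (implied by the 44 samples per 1kHz wavelength), an FFT window centered on the granule, and a Hann window defined as 0.5*(1 - cos(2*pi*n/N)); the function names are hypothetical.

```python
import math

SAMPLE_RATE = 44100  # assumed CD sampling rate
GRANULE = 576        # samples per granule

def wavelengths(freq_hz, fft_size):
    """Return (wavelengths inside the granule, wavelengths outside on each side)."""
    period = SAMPLE_RATE / freq_hz            # samples per wavelength
    inside = GRANULE / period
    outside_each_side = (fft_size - GRANULE) / 2 / period
    return inside, outside_each_side

def hann_weight(n, size):
    """Hann window weight at sample n of a window with `size` samples."""
    return 0.5 * (1.0 - math.cos(2.0 * math.pi * n / size))

def edge_weight(fft_size):
    """Hann weight at the granule boundary, for a window centered on the granule."""
    return hann_weight((fft_size - GRANULE) // 2, fft_size)

inside, outside_1024 = wavelengths(1000, 1024)
_, outside_768 = wavelengths(1000, 768)
weight_1024 = edge_weight(1024)
weight_768 = edge_weight(768)
```

Rounded, this gives 13 wavelengths inside the granule, 5 outside on each side for a 1024 FFT, and 2 for a 768 FFT. The Hann weight at the granule boundary is also lower for the 768 window, so the outside samples it does include are attenuated more strongly.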
Another possibility would be to try and estimate the energy from the
3 overlapping 256 FFTs used to compute the high frequency tonality.
If anyone has other suggestions, let me know!
Information on the applaud.wav test sample: