GPSYCHO

Some Test Samples



I am not an audiophile: my ears are in their 30s, and I perform my tests with a motherboard sound card and $40 headphones.  However, the quality of the ISO psy-model is so poor that its flaws are easy to detect, and I have not yet needed detailed listening tests to improve it.  I've done most of my work with a few samples:

Note that BladeEnc, 8hz-mp3, CDEX, and LAME 2.1 all produce identical results.  Only the BladeEnc result is given.


mstest.wav

The newest test case, sent to me by Scott Miller <scgmille@indiana.edu>.  It contains some higher-frequency modes which are isolated to the left channel.  LAME sounds fine in stereo mode (-m s), but using any type of mid/side stereo will spread these modes to the right channel.  Switching between stereo and ms_stereo produces the annoying effect of having them turn on and off in the right channel.  The FhG encoder avoids this problem by using very few mid/side stereo frames, but the LAME mid/side stereo switching criterion cannot detect that this sample should not be encoded with mid/side stereo, and produces too many mid/side frames.  Suggestions for a better switching criterion are welcome!  I've tried a few things, but anything that works is usually too restrictive, i.e. it will turn off mid/side stereo for half the frames in castanets.wav, even though that sample should have all frames mid/side stereo.

NOTE 6/99: This problem is fixed with the new mid/side switch added to LAME 3.12!
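To make the kind of per-frame decision involved concrete, here is a minimal energy-ratio sketch in C.  This is purely illustrative: it is not the criterion LAME actually uses, and the function name and threshold are made up.  It compares the energy in the side channel to the energy in the mid channel for one frame:

```c
#include <stddef.h>

/* Hypothetical mid/side switching criterion (a sketch, not LAME's
 * actual code): allow mid/side stereo only when the side-channel
 * energy is a small fraction of the mid-channel energy for this
 * frame.  The threshold value is illustrative. */
int use_mid_side(const float *left, const float *right, size_t n,
                 double threshold)
{
    double e_mid = 0.0, e_side = 0.0;
    for (size_t i = 0; i < n; i++) {
        double mid  = 0.5 * (left[i] + right[i]);
        double side = 0.5 * (left[i] - right[i]);
        e_mid  += mid * mid;
        e_side += side * side;
    }
    if (e_side == 0.0) return 1;   /* identical channels: pure mono */
    if (e_mid  == 0.0) return 0;   /* anti-phase: nothing in the mid */
    return e_side / e_mid < threshold;
}
```

For a signal isolated to one channel, as in mstest.wav, the mid and side energies are equal, so this simple ratio test would correctly reject mid/side coding; the hard part, as described above, is doing that without also rejecting frames that should be mid/side.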

mstest, Mid/Side stereo encoding test sample  (about 5 seconds)

castanets.wav

The castanets should sound like a sharp, crisp clack.  In the ISO psy-model, they are smeared out into long, soft thwack-like sounds.  GPSYCHO makes a dramatic improvement here, detectable on any sound system, by correctly switching to short blocks and encoding them with extra bits from the reservoir.  The attacks are very mono in nature, so jstereo also helps because it allows even more bits for encoding the mid channel.  The sample is very close to mono, but if you decimate the side channel too aggressively it will result in noticeable artifacts.

The FhG encoder does an even better job on this sample, mostly because it detects some of the later castanets.  They are muffled by other sounds, and GPSYCHO fails to recognize them as needing short blocks.  Later on in the sample, the castanets come fast and furious, and even the FhG encoder cannot maintain enough bits in the bit reservoir.  VBR would be great in this situation.  It is very easy to put into an encoder, but I don't have a player to debug it with.

Normally you have to perform listening tests to determine the quality of an MP3 encoding.  You generally cannot say anything about the quality by looking at the original and encoded PCM signals.  Pre-echo problems like those in castanets.wav are an exception to this.  In a bad encoding, the sharp attack of the castanets creates noise that is heard before the actual castanets.  This flaw is very visible in the encoded PCM signal, and is shown for several different encoders in Screenshots.

With the castanets.wav file it's easy to try out new short-block detection schemes.  You don't have to rely on listening tests, since the pre-echo is so easy to see in the output PCM data.  Just modify the graphical interface to display the new criterion, then go through castanets.wav frame by frame and see if it triggers in the correct spots.  For an interesting comparison, run LAME with -g (the graphical frame analyzer) on MP3 files produced by other encoders to see how well they do.

Castanets, FhG reference sample (about 5 seconds)


else3.wav

A 5-second sample from Sarah McLachlan's "Elsewhere".  I first downloaded an MP3 of this song from the Internet (a very high quality encoding).  Later I bought the CD, encoded it myself with an ISO-based encoder, and was surprised at the difference in quality.  This is what motivated me to start looking at the encoder source.

This song contains a lot of very tonal piano music, for which even the ISO encoder usually does OK.  But in certain situations it produces very noticeable distortion in the piano notes (particularly in frames 50-70).  GPSYCHO fixes this, mostly due to the improved outer_loop in the bit allocation subroutine.  This sample also has some attacks (drums) that are greatly improved with GPSYCHO.  I cannot detect a difference between GPSYCHO and FhG for this sample.

Elsewhere, Sarah McLachlan (5 second sample)



testsignal2.wav

This is a very nice pre-echo test case from Jan Rafaj <rafaj@cedric.vabo.cz>.  It has some clear, isolated drums.  If your MP3 encoder does not switch to short blocks at the precise moment, you will get very noticeable pre-echo.  The pre-echo actually sounds like a snare, but this snare is completely artificial: there is no trace of it in the original .wav file!  ISO-based encoders do very poorly, mostly because the short-block switching is completely broken in the psy-model (even if it detects a pre-echo event, it will switch to short blocks 1 granule too late).  LAME 3.03 does noticeably better, but it still uses the ISO pre-echo detection criterion and misses many of the pre-echo events.  If you go into l3psy.c and set switch_pe = 1000 (instead of 1800), LAME will do much better, maybe 90% as good as FhG.
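The decision being tuned here boils down to comparing the perceptual entropy of a granule against the switch_pe threshold and stepping through the long/start/short/stop block sequence.  The sketch below is a simplified illustration of that kind of state machine, not the actual l3psy.c logic; note how an attack detected in a long block only reaches short blocks one granule later, via the start block, which is the lag described above:

```c
/* Simplified ISO-style block-switch decision on perceptual entropy.
 * The block-type names follow MP3 conventions; the state machine and
 * threshold handling are illustrative, not LAME's exact code. */
enum block_type { NORM_TYPE, START_TYPE, SHORT_TYPE, STOP_TYPE };

enum block_type next_block_type(enum block_type prev, double pe,
                                double switch_pe)
{
    int attack = pe > switch_pe;
    switch (prev) {
    case NORM_TYPE:
    case STOP_TYPE:
        /* Attack detected: must pass through a start block first,
         * so short blocks arrive one granule after the detection. */
        return attack ? START_TYPE : NORM_TYPE;
    case START_TYPE:
        return SHORT_TYPE;
    case SHORT_TYPE:
        return attack ? SHORT_TYPE : STOP_TYPE;
    }
    return NORM_TYPE;
}
```

Lowering switch_pe, as suggested above, simply makes the `attack` test fire on smaller perceptual-entropy spikes, catching more pre-echo events at the cost of using short blocks more often.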

FhG does great.  They seem to have excellent pre-echo detection.  I would love to know what their algorithm is based on.

Note 5/99: LAME 3.05 has a much improved pre-echo detection algorithm, and fixes most of the above problems!


applaud.wav

This is a very difficult test sample because of the lack of tonality and all the sharp attacks.  All encoders produce results noticeably different from the original, but the FhG encoder still has the edge.  The extra quality of the FhG encoder is not due to simple fixes like better use of short blocks and the bit reservoir.  They do switch to ms_stereo (and GPSYCHO does not), but forcing GPSYCHO into ms_stereo doesn't improve things.

One thing I would like to try is switching to a 768-point FFT instead of 1024.  The FFT is used to compute the energies in the 576-sample (1 granule) window.  With an FFT of almost twice the size of the granule, the higher-frequency energies within the granule are easily contaminated by data from outside the granule.  Looking at the spectrum with MP3x, you can see that the signal is dominated by higher-than-normal frequencies which change substantially from granule to granule.

For example, one wavelength of a 1 kHz signal spans about 44 sample points (at 44.1 kHz sampling), so 13 wavelengths will fit in one granule.  Estimating the energy in the 1 kHz mode with a 1024-point FFT will use the 13 wavelengths within the granule plus 5 wavelengths on either side of the granule.  This is fine if the signal is very tonal, meaning the energy does not change much from granule to granule, but this is not the case for the 1 kHz signal in applaud.wav.  A 768-point FFT would only consider 2 extra wavelengths on each side of the granule, and they would fall mostly in the taper of the Hann window.
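The wavelength counts above follow from simple arithmetic, sketched here (assuming 44.1 kHz sampling and an FFT window centered on the granule):

```c
/* Samples of context on each side of a 576-sample granule when a
 * centered FFT of size fft_len is used. */
int side_samples(int fft_len)
{
    return (fft_len - 576) / 2;
}

/* Whole wavelengths of a 1 kHz tone (44.1 samples per period at
 * 44.1 kHz) that fit in n samples. */
int wavelengths(int n)
{
    return (int)(n / 44.1);
}
```

This reproduces the figures in the text: 13 wavelengths per granule, 5 extra per side with a 1024-point FFT, and only 2 per side with a 768-point FFT.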

Another possibility would be to estimate the energy from the 3 overlapping 256-point FFTs used to compute the high-frequency tonality.

If anyone has other suggestions, let me know!

Information on the applaud.wav test sample: