python_speech_features mfcc vs NWaves mfcc
davidbelle opened this issue · 5 comments
I've built a Keras model in Python and want to use this model in C#. I need to be able to re-create the same MFCCs with NWaves that I get from the Python library. I've tried lots of options but can't seem to get them to match.
Below are the options used when training the model. I've also added a comment next to each line indicating what I think the matching field in NWaves should be.
mfccs = python_speech_features.base.mfcc(signal,
    winlen=0.256,          # FrameDuration, fft/sr
    winstep=0.050,         # HopDuration, hop_length / sr
    numcep=16,             # FeatureCount?
    nfilt=26,              # FilterBankSize
    nfft=2048,             # FftSize
    preemph=0.0,           # PreEmphasis
    ceplifter=0,           # LifterSize
    appendEnergy=False,    # IncludeEnergy
    winfunc=np.hanning)    # Window = NWaves.Windows.WindowType.Hann
The python version returns array[16][16]
But my NWaves equivalent returns array[15][16]. Here are my options:
var mfccOptions = new MfccOptions
{
    SamplingRate = sampleRate,
    FeatureCount = 16,
    FrameDuration = (double)fftSize / sampleRate,
    HopDuration = 0.05,
    FftSize = 2048,
    Window = NWaves.Windows.WindowType.Hann,
    FilterBankSize = 26,
    LifterSize = 0,
    PreEmphasis = 0
};
Any thoughts on where I might be going wrong?
Also worth noting: I experimented with the example given on the "Non-expert DSP" page comparing librosa and NWaves, and I couldn't get those to match either. Below is my code.
sr = 8000
mfccs = librosa.feature.mfcc(signal, sr, n_mfcc=13,
dct_type=2, norm='ortho', window='hamming',
htk=False, n_mels=40, fmin=100, fmax=4000,
n_fft=1024, hop_length=int(0.010 * sr), center=False)
int sr = 8000; // sampling rate
int fftSize = 1024;
double lowFreq = 100; // if not specified, will be 0
double highFreq = 4000; // if not specified, will be samplingRate / 2
int filterbankSize = 40; // or 24 for htk=true (usually)
// if 'htk=False' in librosa:
var melBank = FilterBanks.MelBankSlaney(filterbankSize, fftSize, sr, lowFreq, highFreq);
// if 'htk' parameter in librosa will be set to True, replace the previous line with these lines:
// var melBands = FilterBanks.MelBands(filterbankSize, sr, lowFreq, highFreq);
// var melBankHtk = FilterBanks.Triangular(fftSize, sr, melBands, null, Scale.HerzToMel);
var opts = new MfccOptions
{
    SamplingRate = sr,
    FrameDuration = (double)fftSize / sr,
    HopDuration = 0.010,
    FeatureCount = 13,
    FilterBank = melBank, // or melBankHtk if htk=True
    NonLinearity = NonLinearityType.ToDecibel, // mandatory
    Window = NWaves.Windows.WindowType.Hamming, // in librosa 'hann' is by default
    LogFloor = 1e-10f, // mandatory
    DctType = "2N",
    LifterSize = 0
};
var extractor = new MfccExtractor(opts);
var mfccs = extractor.ParallelComputeFrom(signal);
Seems very strange. I have looked at the values of the signal in Python and in .NET and they match. They are between -1 and 1.
Would love any insight anyone might have.
I'll take a look at python_speech_features a bit later.
Meanwhile, you can read this thread regarding librosa nuances.
It took me more time than I expected, but anyway...
Essentially, python_speech_features.base.mfcc is very simple and straightforward. But there are some nuances.
Here's the example of NWaves settings:
var mfccOptions = new MfccOptions
{
    SamplingRate = sampleRate,
    FeatureCount = 16,
    FrameDuration = 2048.0 / sampleRate,
    HopDuration = 0.05,
    FilterBankSize = 26,
    SpectrumType = SpectrumType.PowerNormalized,
    NonLinearity = NonLinearityType.LogE,
    DctType = "2N",
    Window = WindowType.Hann,
    FftSize = 2048,
    IncludeEnergy = false
};
If you run MfccExtractor with these options, you'll get results that differ slightly from python_speech_features (although they are pretty close). Don't forget to normalize samples in the Python version: signal = signal / 32768 (or set normalize: false in the WaveFile constructor/loader in NWaves).
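As a minimal sketch of that normalization step (the helper name and the sample values are made up for illustration, and 16-bit PCM input is assumed), dividing by 32768 maps the integer range onto [-1, 1):

```python
import array


def int16_to_unit_float(samples):
    """Scale 16-bit PCM integers into [-1.0, 1.0) by dividing by 32768."""
    return [s / 32768.0 for s in samples]


# Hypothetical 16-bit samples covering the extremes of the range.
pcm = array.array("h", [-32768, -16384, 0, 16384, 32767])
scaled = int16_to_unit_float(pcm)
```

Whichever side does the scaling, the point is that both extractors should see samples in the same range before the results are compared.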
Let's compare 31st vectors, for example:
Personally I would be OK with these coeffs. But if you need to get as close as possible to python_speech_features, you'll have to add some code.
There are 2 reasons why there are discrepancies:
- Mel-filterbank evaluation.
- Normalization of power spectra.
In python_speech_features the mel-filterbank is constructed differently (in comparison with NWaves, librosa, Kaldi, etc.). I wrote a function that gives identical weights, and you can use it:
// requires: using System; using System.Linq;
float[][] PsfFilterbank(int samplingRate, int filterbankSize, int fftSize, double lowFreq = 0, double highFreq = 0)
{
    var filterbank = new float[filterbankSize][];

    if (highFreq <= lowFreq)
        highFreq = samplingRate / 2.0;

    var low = NWaves.Utils.Scale.HerzToMel(lowFreq);
    var high = NWaves.Utils.Scale.HerzToMel(highFreq);

    // frequency-to-bin factor: note (fftSize + 1), not fftSize
    var res = (fftSize + 1) / (float)samplingRate;

    // filterbankSize + 2 band edges, equally spaced on the mel scale, floored to FFT bins
    var bins = Enumerable
        .Range(0, filterbankSize + 2)
        .Select(i => (float)Math.Floor(res * NWaves.Utils.Scale.MelToHerz(low + i * (high - low) / (filterbankSize + 1))))
        .ToArray();

    for (var i = 0; i < filterbankSize; i++)
    {
        filterbank[i] = new float[fftSize / 2 + 1];
        // rising slope of the triangle
        for (var j = (int)bins[i]; j < (int)bins[i + 1]; j++)
            filterbank[i][j] = (j - bins[i]) / (bins[i + 1] - bins[i]);
        // falling slope of the triangle
        for (var j = (int)bins[i + 1]; j < (int)bins[i + 2]; j++)
            filterbank[i][j] = (bins[i + 2] - j) / (bins[i + 2] - bins[i + 1]);
    }

    return filterbank;
}
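For cross-checking the C# function, here is a pure-Python sketch of the same construction. It assumes the 2595 * log10(1 + f / 700) mel formula and uses hypothetical helper names; the 8000 Hz sampling rate is also an assumption, inferred from winlen=0.256 with nfft=2048. The resulting weights should be comparable against python_speech_features.get_filterbanks.

```python
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def psf_bins(sample_rate, nfilt, nfft, low_freq=0.0, high_freq=None):
    """Band edges as FFT bin indices: mel-spaced, then floored with (nfft + 1)."""
    high_freq = high_freq or sample_rate / 2
    low_mel, high_mel = hz_to_mel(low_freq), hz_to_mel(high_freq)
    step = (high_mel - low_mel) / (nfilt + 1)
    return [
        math.floor((nfft + 1) * mel_to_hz(low_mel + i * step) / sample_rate)
        for i in range(nfilt + 2)
    ]


def psf_filterbank(sample_rate, nfilt, nfft, low_freq=0.0, high_freq=None):
    """Triangular weights, one row of nfft // 2 + 1 values per filter."""
    bins = psf_bins(sample_rate, nfilt, nfft, low_freq, high_freq)
    fbank = [[0.0] * (nfft // 2 + 1) for _ in range(nfilt)]
    for i in range(nfilt):
        for j in range(bins[i], bins[i + 1]):        # rising slope
            fbank[i][j] = (j - bins[i]) / (bins[i + 1] - bins[i])
        for j in range(bins[i + 1], bins[i + 2]):    # falling slope
            fbank[i][j] = (bins[i + 2] - j) / (bins[i + 2] - bins[i + 1])
    return fbank


# Assumed settings: 8 kHz audio, 26 filters, 2048-point FFT.
bins = psf_bins(8000, 26, 2048)
fbank = psf_filterbank(8000, 26, 2048)
```

Dumping a few rows from both implementations and comparing them is a quick way to confirm the port before looking at the MFCCs themselves.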
Now, use it in MFCC options:
var mfccOptions = new MfccOptions
{
    SamplingRate = sampleRate,
    FeatureCount = 16,
    FrameDuration = 2048.0 / sampleRate,
    HopDuration = 0.05,
    FilterBank = PsfFilterbank(sampleRate, 26, 2048),
    SpectrumType = SpectrumType.PowerNormalized,
    NonLinearity = NonLinearityType.LogE,
    DctType = "2N",
    Window = WindowType.Hann,
    FftSize = 2048,
    IncludeEnergy = false
};
With these settings everything's good, except the first MFCC coefficient:
This is because of different power spectrum normalization in NWaves and python_speech_features. Here's the code compensating for this difference:
// call this on already computed mfccVectors:
for (var i = 0; i < mfccVectors.Count; i++)
mfccVectors[i][0] -= (float)(Math.Log(2) * Math.Sqrt(filterbankSize));
Now, the first coeff is -48.8676...
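One way to read the Math.Log(2) * Math.Sqrt(filterbankSize) term: if the two power spectra differ by a constant factor of 2 (an assumption on my part, suggested by the Math.Log(2) term rather than stated in the thread), every log filterbank energy shifts by log(2), and an orthonormal DCT-II turns a constant shift into a change in the first coefficient only. A small sketch with a hand-written DCT and arbitrary input values:

```python
import math


def dct2_ortho(x):
    """Orthonormal DCT-II, written out directly from its definition."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * k * (2 * i + 1) / (2 * n)) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(s * scale)
    return out


nfilt = 26
# Arbitrary stand-in values for one frame of log filterbank energies.
log_energies = [math.log(1.0 + i) for i in range(nfilt)]
# The same frame if every band energy were twice as large.
shifted = [e + math.log(2) for e in log_energies]

delta = [b - a for a, b in zip(dct2_ortho(log_energies), dct2_ortho(shifted))]
expected_c0_shift = math.log(2) * math.sqrt(nfilt)
```

Here delta[0] matches expected_c0_shift and the remaining entries are zero up to rounding, which is consistent with only the first coefficient needing a correction.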
If you set appendEnergy=True, the compensation is simpler (although less precise):
// call this on already computed mfccVectors:
for (var i = 0; i < mfccVectors.Count; i++)
mfccVectors[i][0] -= (float)Math.Log(2);
PS. Also note that python_speech_features auto-pads the last (incomplete) frame of the signal with zeros. This is why you get 16 MFCC vectors instead of 15 (as in NWaves). You can simply discard this last vector, or zero-pad the signal manually in NWaves to match the behaviour. Personally I prefer the first solution.
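The frame-count difference can be sketched as follows. The function names are hypothetical, and the input (one second of audio at 8000 Hz) is an assumption chosen because it reproduces the 16 vs 15 result with winlen=0.256 and winstep=0.050. The second function assumes an extractor that emits only complete frames, which is what the 15-vector result suggests.

```python
import math


def padded_frame_count(num_samples, frame_len, frame_step):
    """Frames when the last partial frame is zero-padded and kept."""
    if num_samples <= frame_len:
        return 1
    return 1 + math.ceil((num_samples - frame_len) / frame_step)


def complete_frame_count(num_samples, frame_len, frame_step):
    """Frames when only complete frames are kept."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // frame_step


def pad_to_whole_frames(signal, frame_len, frame_step):
    """Append zeros so that the last partial frame becomes a complete one."""
    n = padded_frame_count(len(signal), frame_len, frame_step)
    target = frame_len + (n - 1) * frame_step
    return list(signal) + [0.0] * (target - len(signal))


# Assumed input: 1 second at 8 kHz.
sample_rate = 8000
frame_len = round(0.256 * sample_rate)   # 2048 samples
frame_step = round(0.050 * sample_rate)  # 400 samples
signal = [0.0] * 8000

with_padding = padded_frame_count(len(signal), frame_len, frame_step)
without_padding = complete_frame_count(len(signal), frame_len, frame_step)
padded = pad_to_whole_frames(signal, frame_len, frame_step)
```

Under these assumptions the padded count is 16 and the complete-frame count is 15. After padding to 8048 samples, both counts agree at 16.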
Amazing work. Thank you.
I will try it this week and report back.