jsk-ros-pkg/jsk_recognition

[audio_to_spectrogram] audio_to_spectrum.py publishes wrong amplitude?

pazeshun opened this issue · 6 comments

I know it is probably too late to fix this even if I am right, so I am writing this just for reference.

Currently, audio_to_spectrum.py calculates "amplitude" by applying abs and log to FFT result:

amplitude = np.log(np.abs(amplitude))

However, I think this calculation cannot produce the "real" amplitude (one consistent with the amplitude of the original signal).
To get the "real" amplitude, you have to divide the FFT result by self.audio_buffer.audio_buffer_len / 2 and then apply abs:
https://helve-blog.com/posts/python/numpy-fast-fourier-transform/
https://ryo-iijima.com/fftresult/
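
A minimal sketch of the scaling described above, using a synthetic sine wave instead of the node's real audio buffer. The names `sample_rate`, `n`, `true_amplitude`, and `freq` are illustrative assumptions and do not come from audio_to_spectrum.py; `n` stands in for `audio_buffer_len`. The sketch also assumes no window function and a tone that falls exactly on an FFT bin.

```python
import numpy as np

sample_rate = 16000      # Hz, assumed for illustration
n = 1024                 # buffer length, stands in for audio_buffer_len
true_amplitude = 0.5     # amplitude of the test sine wave
freq = 1000.0            # Hz, falls exactly on bin 64 (1000 * 1024 / 16000)

t = np.arange(n) / sample_rate
signal = true_amplitude * np.sin(2 * np.pi * freq * t)

fft_result = np.fft.rfft(signal)

# Unscaled magnitude: grows with the buffer length.
raw = np.abs(fft_result)

# Dividing by n / 2 recovers the amplitude of the sine wave.
# (The DC and Nyquist bins would need n instead of n / 2.)
scaled = np.abs(fft_result) / (n / 2)

peak_bin = int(np.argmax(scaled))
peak_freq = peak_bin * sample_rate / n
print(peak_freq, raw[peak_bin], scaled[peak_bin])
```

With these assumptions the peak of `scaled` matches `true_amplitude`, while the peak of `raw` is `n / 2` times larger.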

Unfortunately, if we fix this issue, the spectrogram images will change, and networks trained on the previous images will stop working.

iory commented

Unfortunately, if we fix this issue, the spectrogram images will change, and networks trained on the previous images will stop working.

Making the correct calculation selectable as an option would be a good direction.

I am sorry for the lack of explanation.

However, I think this calculation cannot produce the real amplitude.
To get the real amplitude, you have to divide the FFT result by self.audio_buffer.audio_buffer_len / 2 and then apply abs:

The word "amplitude" was not appropriate. Sorry.

I used log so that the spectrogram would also capture quiet sounds.
When the spectrogram was made without log, quiet sounds could not be represented once the intensity was scaled to 0 to 255 across the entire image.
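
A small sketch of this effect with made-up numbers. The magnitudes below are illustrative, not measured data, and `to_uint8` is a hypothetical helper that mimics min-max scaling to the 0 to 255 range; it is not taken from the package.

```python
import numpy as np

def to_uint8(values):
    """Min-max scale an array to 0..255 (hypothetical helper)."""
    values = np.asarray(values, dtype=np.float64)
    span = values.max() - values.min()
    scaled = (values - values.min()) / span * 255.0
    return scaled.astype(np.uint8)

# Made-up spectrum magnitudes: noise floor, a quiet sound, a loud sound.
magnitudes = np.array([0.001, 0.01, 10.0])

linear_image = to_uint8(magnitudes)          # scale the raw magnitudes
log_image = to_uint8(np.log(magnitudes))     # scale the log magnitudes

print(linear_image, log_image)
```

In the linear image the quiet sound lands on the same pixel value as the noise floor, while in the log image it gets a clearly distinct value.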

In addition, I used the log scale because I found opinions that it is better suited to learning and closer to how humans hear.
The log-scaled spectrogram is called melspectrogram as far as I know.

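When training deep networks, it is often advantageous to use a logarithmic representation of the signal, because the logarithm acts like a dynamic range compressor and boosts representation values that are small in magnitude (amplitude) but still carry important information. In this example, the log spectrogram performs better than the spectrogram.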
https://jp.mathworks.com/help/signal/ug/spoken-digit-recognition-with-custom-log-spectrogram-layer-and-deep-learning.html

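The mel scale is a scale based on human hearing, that is, on how sounds are perceived.
It was devised from the property that human hearing is sensitive to low-frequency sounds and less sensitive to high-frequency sounds.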
https://fast-d.hmcom.co.jp/techblog/melspectrum-mfcc/

Maybe, the correct thing to do is the following. (But it seems too late...)

  • Avoid using log in the spectrum calculation in audio_to_spectrum.py (follow pazeshun's calculation)
  • Create a new audio_to_melspectrogram.py that applies log to the intensity of the spectrum. (This node outputs the same image as our previous spectrogram.)

@708yamaguchi I see, thank you for your explanation.
How about setting the following pipeline as default? Is this OK from your point of view?

audio_to_spectrum.py -> spectrum
                     -> log_spectrum -> spectrum_to_spectrogram.py -> spectrogram -> recognition node
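
A rough sketch of this data flow in plain NumPy, without any ROS code. The function names `compute_spectra` and `stack_spectrogram`, the chunk size, and the small offset added before the log are all assumptions made for illustration; they are not taken from the package. Whether the log should be applied to the normalized or the unnormalized magnitude is also an assumption here; a constant divisor only shifts the log by a constant.

```python
import numpy as np

def compute_spectra(audio_chunk):
    """Return (spectrum, log_spectrum) for one audio chunk (hypothetical)."""
    n = len(audio_chunk)
    # Amplitude spectrum, scaled as proposed in this issue.
    spectrum = np.abs(np.fft.rfft(audio_chunk)) / (n / 2)
    # Small offset avoids log(0); this guard is an assumption.
    log_spectrum = np.log(spectrum + 1e-12)
    return spectrum, log_spectrum

def stack_spectrogram(log_spectra):
    """Stack per-chunk log spectra: rows are frequency bins, columns are time."""
    return np.stack(log_spectra, axis=1)

# Random noise stands in for successive audio buffers.
rng = np.random.default_rng(0)
chunks = [rng.standard_normal(512) for _ in range(8)]

log_spectra = [compute_spectra(chunk)[1] for chunk in chunks]
spectrogram = stack_spectrogram(log_spectra)
print(spectrogram.shape)
```
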

My understanding is that all of our recognition nodes use spectrogram, not spectrum. Is this correct?
Also, I am not using the name melspectrum, because according to https://fast-d.hmcom.co.jp/techblog/melspectrum-mfcc/ a mel spectrum seems to be calculated from a more complicated equation. Is this correct? Suggestions for another name are also welcome.

Thank you for your suggestion.
I think it's OK.

My understanding is that all of our recognition nodes use spectrogram, not spectrum. Is this correct?

I have heard of JSK programmers looking at the spectrum to check the properties of sounds, but I have never heard of anyone feeding the spectrum into a recognition node.
So changing the topic name from spectrum to log_spectrum is not a big problem.

Also, I am not using the name melspectrum, because according to fast-d.hmcom.co.jp/techblog/melspectrum-mfcc a mel spectrum seems to be calculated from a more complicated equation. Is this correct? Suggestions for another name are also welcome.

I think this is correct. The log scale and the mel scale are similar, but they are not the same. fast-d.hmcom.co.jp/techblog/melspectrum-mfcc.

$$mel = 2595.0 \log_{10} \left( 1.0 + \frac{f}{700.0} \right)$$
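
The formula above can be written as a short Python function. The names `hz_to_mel` and `mel_to_hz` are hypothetical; the inverse is simply the same formula solved for f. Note that this conversion warps the frequency axis, whereas the log in audio_to_spectrum.py is applied to the magnitude.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mel using the formula above."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel, obtained by solving the formula for f."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

for f in (0.0, 1000.0, 4000.0):
    print(f, hz_to_mel(f))
```

With these constants, 0 Hz maps to 0 mel and 1000 Hz maps to roughly 1000 mel, which a plain logarithm of the frequency would not do.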

Thank you, I'll create a PR introducing the new pipeline.
One note:

So changing the topic name from spectrum to log_spectrum is not a big problem.

I'll have the node publish both spectrum and log_spectrum.