swharden/FftSharp

Smoothing in the WebAudio specification

Closed this issue · 9 comments

MV10 commented

The WebAudio API requires a "smoothing" pass, and I'm wondering whether FftSharp has anything like this.

I don't know if this is something "standard" or something they came up with specifically for WebAudio (which I have read is a somewhat bizarre standard; I don't know enough about signals or audio to have an opinion either way). I'm referring to the middle few steps of this gist describing how Shadertoy's audio textures are implemented ... steps 6 and 7 basically:

The spectrum is calculated according to the Web Audio API specification:

  1. Take 2048 samples of audio data as an array of floating point data
  2. Multiply it with Blackman window
  3. Convert samples into complex numbers (imaginary parts are all zeros)
  4. Apply the Fourier transform with fftSize = 2048, as a result we get 1024 FFT bins
  5. Convert complex result into real values using cabs() function
  6. Divide each value by fftSize
  7. Apply smoothing by using previously calculated spectrum values:
    v = k * v_prev + (1 - k) * v
    
    Where k is smoothing constant equal to 0.8.
    If calculating spectrum the first time, the previous value is assumed to be 0.
  8. Convert resulting values to dB: dB = 20 * log10(v)
  9. Convert floating point dB spectrum into 8-bit values:
    1. Clamp the value between dB_min = -100 and dB_max = -30
    2. Scale the dB_min..dB_max range into 0..255 range:
    t = clamp(255 / (dB_max - dB_min) * (dB - dB_min), 0, 255)
    
  10. Write 8-bit values into texture
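
Just to make sure I'm reading steps 8 and 9 correctly, here is my rough C# interpretation of them (the names are mine, not from any library):

byte ToTextureByte(double v)
{
    const double dB_min = -100d;
    const double dB_max = -30d;

    // step 8: convert the value to dB
    double dB = 20d * Math.Log10(v);

    // step 9: scale the dB_min..dB_max range onto 0..255 and clamp
    double t = 255d / (dB_max - dB_min) * (dB - dB_min);
    return (byte)Math.Clamp(t, 0d, 255d);
}
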
MV10 commented

Disregard. Went through the code... nothing even close.

I really don't understand why WebAudio doesn't just use decibels. Bah. Web junk...

swharden commented

Hi @MV10, thanks for opening this issue though! I learned some things by reviewing the WebAudio spec.

In case someone ends up here later, I'll note that the step 7 you referred to earlier is essentially "smoothing in the time domain": rather than ever reporting 100% of the current signal, it reports a blend of 80% of the previous spectrum mixed with 20% of the current one.

It's a little weird in my opinion, and FftSharp doesn't have that functionality built in, but it would be easy to add for any user who wants it.
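
For example, a user could keep the previous spectrum around and blend each new frame into it, something like this (a minimal sketch, not part of FftSharp):

double[] previousSpectrum; // persists between frames

double[] SmoothSpectrum(double[] spectrum, double k = 0.8)
{
    // exponential smoothing across frames, k = 0.8 as in the WebAudio spec
    if (previousSpectrum is null)
        previousSpectrum = new double[spectrum.Length]; // first frame: previous values are 0

    for (int i = 0; i < spectrum.Length; i++)
        previousSpectrum[i] = k * previousSpectrum[i] + (1 - k) * spectrum[i];

    return previousSpectrum;
}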

why WebAudio doesn't just use decibels

That log transformation is an extra computational step, and if your goal is peak frequency detection you can find the peak faster by skipping it. Other than that, I'm not sure of the motivation behind their decisions. A lot of effort seems to go into writing code that performs well with byte data, whereas FftSharp is a lot more precise and is designed to work with double data. Perhaps this makes WebAudio more performant on devices like low-end smartphones.
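
(Regarding skipping the log step: because log10 is monotonic, the loudest bin is the same whether you search the raw power values or the dB values, so a peak finder can skip the transform entirely. A sketch, not an FftSharp method:)

// the bin with the largest power is also the bin with the largest dB value
int PeakBin(double[] power)
{
    int peak = 0;
    for (int i = 1; i < power.Length; i++)
        if (power[i] > power[peak])
            peak = i;
    return peak;
}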

Hope it helps!
Scott

MV10 commented

@swharden thanks for the reply. Yes, their "pseudo-Decibels" data struck me as very odd, which is why I thought perhaps there might be some corner of DSP where this was commonplace. On the plus side, this prompted me to look through the FftSharp source for myself, so I at least have a better understanding of how it all works.

MV10 commented

Edit: After some Quality Time with Google, I think what I'm asking about below is properly referred to as "scaling"...

Oops -- clicked "reopen" before I finished writing the comment, as I'd like to get more of your input on this algorithm when you have a moment. I've been writing a streaming music visualization program, and I went down this WebAudio rabbit-hole because I wanted to be able to support Shadertoy code. Feeding those shaders with straight Decibels data wasn't working, for reasons which became obvious after I found the writeup linked earlier.

As I mentioned, I put on my big-boy pants and had a look through the FftSharp code for myself and noted that the Fft.Power Decibels calculation operates on frequency magnitude data. Does the magnitude calculation correspond to steps 5 and 6 in the WebAudio writeup?

  5. Convert complex result into real values using cabs() function
  6. Divide each value by fftSize

I assume it must, in order for the subsequent WebAudio Decibels calc to make any sense.
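
In code terms, my understanding of those two steps is roughly this (using System.Numerics.Complex rather than any FftSharp-specific type):

using System.Numerics;

// my reading of WebAudio steps 5-6: complex magnitude, scaled by the FFT size
double[] WebAudioValues(Complex[] bins, int fftSize)
{
    double[] v = new double[bins.Length];
    for (int i = 0; i < bins.Length; i++)
        v[i] = Complex.Abs(bins[i]) / fftSize; // cabs(), then divide by fftSize
    return v;
}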

I've been visually comparing my program's output side-by-side with Shadertoy, running the same simple shader in both, which merely draws the spectrum as it is generated. I have an audio-loopback driver installed which lets my program capture streaming audio, and in Shadertoy I can select mic-in as the audio source, which is the loopback driver.

For the question I want to ask, we can ignore the byte-conversion steps (although I wonder if their use of byte-range PCM input data plays a role, as I'm using OpenAL capturing, which produces higher-fidelity short-int samples). The data fed to the shaders is ultimately normalized, so I think it doesn't matter much what the input range is prior to normalization (again, for the purposes of my question; I realize there would be accuracy differences).

I tried to apply the 30dB / 100dB clipping -- another thing which makes no sense to me -- but not surprisingly, at anything but the highest volume levels, much of the signal is below 30dB, so I've temporarily omitted that.

So in the end, the most visually-similar result is to only apply the smoothing to the magnitude data prior to the dB calculation. The "behavior" over time is similar now: the smoothing produces a sort of slow-motion effect compared to un-smoothed Decibels data. However, what puzzles me is that the WebAudio data has much more pronounced peaks and valleys. In the image below, my application's data is on top, and the Shadertoy data is below.

I'm trying to figure out how to more closely replicate this -- a simple multiplier won't do, as the variance is in both directions (higher peaks, deeper troughs).

[image: side-by-side spectra -- the application's output on top, Shadertoy's below]

The code, of course, is trivial, but for what it's worth:

private void CalculateWebAudio()
{
    const double k = 0.8d;

    for (int i = 0; i < SampleSize; i++)
    {
        // value from the previous WebAudio calcs
        double v_prev = BufferWebAudioSmoothing[i];

        // dB is derived from magnitude
        double sample = InternalBuffers.FrequencyMagnitude[i];

        // time-domain smoothing (why???)
        sample = k * v_prev + (1d - k) * sample;

        // store for the next batch of samples
        BufferWebAudioSmoothing[i] = sample;

        // apply the normal Decibels calculation
        sample = 20d * Math.Log10(sample);

        // skip clipping and byte-range scaling

        // store for output
        InternalBuffers.FrequencyWebAudio[i] = sample;
    }
}

Really appreciate your time! Thank you.

MV10 commented

For some reason, arbitrarily normalizing by dividing by 60 seems very, very close. Why 60? Heck if I know. Previously I was using 90, since I had read somewhere that PC audio normally wouldn't exceed 90dB (and my own max-volume testing on three completely different PCs and one laptop always peaked around 87 or 88).

Maybe this smoothing stuff also brings the range down? I'll have to check on that. I suppose this is good enough for "eyecandy" purposes, but I'd be interested if you have more sensible explanations or ideas!

[image: side-by-side comparison after dividing by 60]

swharden commented

Hi @MV10, sorry I'm at a loss!

Note that FftSharp produces FFT output values identical to those of Python's NumPy and SciPy libraries (which I consider to be very high quality scientific libraries), so I'm confident that FftSharp is outputting the most accurate values possible.

I suspect the WebAudio approach is aiming for speed and sacrificing precision, and their applications are more geared toward eyecandy and such. Perhaps if you ask a similar question on their GitHub they can do a better job of justifying their design decisions.

Overall I agree that the values you are producing look very similar!

The "behavior" over time is similar now: the smoothing produces a sort of slow-motion effect

That's a great description. I'll use that phrase in the future when I describe time averaging!

If you disable time averaging, I imagine you are likely to get sharper troughs.

smoothing to the magnitude data prior to the dB calculation

WebAudio data has much more pronounced peaks and valleys

Are you 100% sure the smoothing is applied before the conversion to dB? If I were designing a visualizer from scratch, I think I'd want to do the time averaging after dB conversion. Maybe this will help restore your troughs?
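
E.g. roughly this, reusing the names from your snippet (untested, just to illustrate the ordering):

// reordered sketch: convert to dB first, then smooth in dB space
for (int i = 0; i < SampleSize; i++)
{
    double dB = 20d * Math.Log10(InternalBuffers.FrequencyMagnitude[i]);
    BufferWebAudioSmoothing[i] = 0.8d * BufferWebAudioSmoothing[i] + 0.2d * dB;
    InternalBuffers.FrequencyWebAudio[i] = BufferWebAudioSmoothing[i];
}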

Actually, what may have a lot of influence on your troughs is the size of your window function too. Not how long it is in memory, but rather how "wide" the bell of the bell curve is. Consider playing with Blackman windows of different widths.
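
For reference, the textbook Blackman window applied in place looks like this (FftSharp has window functions built in, so this is just to show the shape being discussed):

// standard Blackman window, applied in place
void ApplyBlackman(double[] samples)
{
    int n = samples.Length;
    for (int i = 0; i < n; i++)
    {
        double w = 0.42 - 0.5 * Math.Cos(2 * Math.PI * i / (n - 1))
                        + 0.08 * Math.Cos(4 * Math.PI * i / (n - 1));
        samples[i] *= w;
    }
}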

Also are you 100% sure that WebAudio and FftSharp are both using the identical sampling rate? That could also influence the frequency resolution.

Have you tried visualizing pure sine waves? https://onlinetonegenerator.com/ is a tool I find really helpful sometimes, and it may be interesting to compare your two apps side by side using it.

One more thing, I glanced at https://mcguirev10.com and I swear I've run across that site before!

I think it was related to your WPF system tray post; I landed there while toying with https://github.com/swharden/Tmoji

It's funny how the internet connects people in different ways! 🚀

MV10 commented

Yeah sorry, this sort of wandered far afield of FftSharp's area of concern! I have no doubts FftSharp is more accurate -- I did try playing with the sequence, but smoothing before the dB calc was the most similar. I appreciate the attention and suggestions!

As for the blog -- yep, I get my hands into just about everything. I have around 100K readers (to my very great surprise) and when I start writing about this project, FftSharp will definitely get a call out!

As for Tmoji, did you know you can hit Win+. for a system emoji pop-up? You can even type the name of the emoji while that is visible...

Thanks again!

swharden commented

Sounds good!

You're right about Win+. ... I use TMoji to quickly access commonly used symbols that are hard to locate in that dialog. For emoji alone the Windows picker is indeed better, and in hindsight the name and screenshots over-emphasize the emoji capabilities 😅

[screenshot]