NVIDIA/waveglow

How many flops in the inference for 1 second of audio?

Closed this issue · 5 comments

jjoe1 commented

What is the total ops that are computed in Waveglow to generate 1 second of audio? Assume the generated audio is 22050Hz, meaning 22K samples per second.

Also, it's claimed that the model is faster than WaveNet, but the total trainable params for this PyTorch model seem to be ~270 million, whereas most TensorFlow WaveNet models are much smaller, around 90x smaller at ~3 million params.

How does WaveGlow have almost 90X more params while the documents here say it's faster than WaveNet?
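For concreteness, here's how I'm counting params per layer (a minimal pure-Python sketch; `conv1d_params` is my own helper, not from either repo):

```python
def conv1d_params(in_ch, out_ch, kernel, bias=True):
    """Trainable parameters in one Conv1d layer: weights plus optional bias."""
    return in_ch * out_ch * kernel + (out_ch if bias else 0)

# e.g. a 1x1 conv mapping 128 channels to 2 channels:
# 128 * 2 * 1 = 256 weights + 2 biases = 258 params
print(conv1d_params(128, 2, 1))  # -> 258
```

Summing this over every layer is what gives the totals above.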

WaveNet is an autoregressive model: it generates one sample at a time, each conditioned on the previous ones. WaveGlow generates all samples in parallel.
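The difference can be sketched with toy stand-ins (`step` and `transform` below are placeholders, not the real networks): an autoregressive model must take 22,050 dependent steps for one second of audio, while a flow like WaveGlow maps a noise vector to all samples in one pass.

```python
import numpy as np

T = 22050  # samples in one second of 22.05 kHz audio

# Autoregressive (WaveNet-style): sample t depends on sample t-1,
# so the T network evaluations must run one after another.
def step(prev):
    return 0.5 * prev + 0.1  # toy stand-in for one WaveNet forward pass

x, samples = 0.0, []
for _ in range(T):
    x = step(x)
    samples.append(x)

# Parallel (WaveGlow-style): one pass over a noise vector yields every sample.
def transform(z):
    return 0.5 * z + 0.1  # toy stand-in for the stack of invertible layers

z = np.random.default_rng(0).standard_normal(T)
audio = transform(z)  # all 22050 samples in a single call
```

So even with more params and more total FLOPs, the parallel model can finish sooner on hardware that executes those FLOPs concurrently.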

jjoe1 commented

So how many ops does it have? Does that mean it has fewer FLOPs than WaveNet, even though the WaveGlow model is ~90X bigger (more params)?

This paper will provide you with more information about WaveGlow FLOPs, etc. The main difference is that WaveNet is sequential and WaveGlow is parallel, even though WaveGlow has considerably more params than WaveNet.

Imagine you're adding a list of numbers, e.g. [1, 2, 3, 4]. In the naive sequential approach, you would create an accumulator variable, call it total_sum=0, and then add the numbers on the list to total_sum one by one. In the parallel approach, you would take advantage of the fact that order does not matter for addition and add the numbers in parallel.
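The two approaches can be written out (a minimal sketch; `tree_sum` is just an illustration of a parallel-style reduction):

```python
nums = [1, 2, 3, 4]

# Sequential: a single accumulator forces a strict order of operations.
total_sum = 0
for n in nums:
    total_sum += n

# Parallel-style tree reduction: add neighbouring pairs level by level.
# With enough workers, each level runs at once: 4 numbers take 2 steps, not 3.
def tree_sum(xs):
    while len(xs) > 1:
        if len(xs) % 2:          # pad odd-length lists
            xs = xs + [0]
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

print(total_sum, tree_sum(nums))  # -> 10 10
```

Both do the same additions; only the dependency structure differs, which is what lets the parallel version finish in fewer time steps.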

jjoe1 commented

@rafaelvalle Useful paper link, thanks.

I was wondering about the audio output of WaveNet/WaveGlow and saw the output layer for WaveNet. It shows a conv1d kernel producing 2 float values, because out_channels is 2 (from line 139 at https://github.com/Rayhane-mamah/Tacotron-2/blob/ab5cb08a931fc842d3892ebeb27c8b8734ddd4b8/wavenet_vocoder/models/wavenet.py#L139)

inference/final_convolution_2/kernel:0 (float32_ref 1x128x2) [256, bytes: 1024]
inference/final_convolution_2/bias:0 (float32_ref 2) [2, bytes: 8]

Isn't the output in the form of audio samples, where each sample is just 1 float value? Shouldn't final_convolution_2 then produce only 1 float? Any reason why the kernel is shaped 1x128x2 and produces 2 float values?

I see the same with the WaveGlow output. The final layer looks like this, with out_channels 4, so it seems it would produce 4 float values. How would they then translate to audio samples, each of which I assume is just 1 float?

....
(10): Invertible1x1Conv(
(conv): Conv1d(4, 4, kernel_size=(1,), stride=(1,), bias=False)
)
(11): Invertible1x1Conv(
(conv): Conv1d(4, 4, kernel_size=(1,), stride=(1,), bias=False)
)

Take a look at WaveGlow's paper to understand the code.
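One pointer on the multi-channel output: WaveGlow "squeezes" the 1-D waveform by folding groups of consecutive samples into channels, so the output channels are themselves audio samples, and inference simply unfolds them back into a waveform. A minimal NumPy sketch of that fold/unfold (group size 4 and the toy waveform are illustrative, not the repo's actual config):

```python
import numpy as np

n_group = 4                          # samples per group, treated as channels
audio = np.arange(12, dtype=float)   # toy 1-D waveform of 12 samples

# Squeeze: fold consecutive samples into channels -> shape (n_group, T/n_group)
squeezed = audio.reshape(-1, n_group).T

# Unsqueeze: invert the fold to recover the original 1-D waveform
restored = squeezed.T.reshape(-1)
print(np.array_equal(audio, restored))  # -> True
```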