keunwoochoi/kapre

Output size with get_stft_magnitude_layer

mo-seph opened this issue · 3 comments

I am confused about the output sizes coming from get_stft_magnitude_layer.

In the composed docs (https://kapre.readthedocs.io/en/latest/composed.html), it says that an input shape of (2048, 1) and n_fft=1024 gives an output of (None, 1, 3, 513) - that is 1 channel with 3 hops and 513 points in each hop. This seems correct assuming it defaults to each hop overlapping 50%.
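The arithmetic behind that reading can be checked with a minimal sketch. It assumes no padding at either end and a hop of n_fft // 2 (the 50% overlap guessed above); `frame_spans` is a hypothetical helper written for illustration, not part of kapre.

```python
def frame_spans(n_samples, n_fft, hop_length):
    """Return (start, end) sample indices of every full frame, assuming no padding."""
    # A frame is only kept if it fits entirely inside the signal.
    return [(start, start + n_fft)
            for start in range(0, n_samples - n_fft + 1, hop_length)]


# Docs example: 2048 samples, n_fft=1024, hop assumed to be n_fft // 2 = 512.
spans = frame_spans(n_samples=2048, n_fft=1024, hop_length=512)
n_freq_bins = 1024 // 2 + 1  # one-sided spectrum of a 1024-point FFT

print(spans)        # [(0, 1024), (512, 1536), (1024, 2048)]
print(len(spans))   # 3
print(n_freq_bins)  # 513
```

Three full frames and 513 frequency bins match the (None, 1, 3, 513) shape quoted from the docs.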

In the examples (https://kapre.readthedocs.io/en/latest/quickstart.html), an input shape of (2, 32000) with n_fft=2048, win_length=2018, hop_length=1024 leads to an output shape of (None, 0, 1025, 320000).

So my questions are:

  • why has the number of channels disappeared?
  • why are there 320000 frames and not 320000 / hop_length frames?

Thanks!

Hi, thanks for the question. I spotted an error in the second example at https://kapre.readthedocs.io/en/latest/quickstart.html and will fix it soon. The issue there is that I passed channels_last where I should have set channels_first.

On further investigation, it seems that channels_last may not be working:

    import numpy as np
    from tensorflow import keras
    from kapre import composed

    src = np.ndarray(shape=(1,16384))
    print("Input shape: {}".format(src.shape))
    inputs = keras.Input(shape=src.shape, name='input')
    print(inputs)
    x = composed.get_stft_magnitude_layer(n_fft=1024,
                                name='static_stft',
                                input_data_format="channels_last",
                                return_decibel=True,)(inputs)
    print(x)

prints

Input shape: (1, 16384)
Tensor("input:0", shape=(None, 1, 16384), dtype=float32)
Tensor("static_stft/magnitude_to_decibel/Maximum_5:0", shape=(None, 0, 513, 16384), dtype=float32)

If I permute the input so it is channels first:

    src = np.ndarray(shape=(1,16384))
    print("Input shape: {}".format(src.shape))
    inputs = keras.Input(shape=src.shape, name='input')
    inputs = keras.layers.Permute((2, 1))(inputs)
    print(inputs)
    x = composed.get_stft_magnitude_layer(n_fft=1024,
                                name='static_stft',
                                input_data_format="channels_last",
                                return_decibel=True,)(inputs)
    print(x)

it gives:

Tensor("permute/transpose:0", shape=(None, 16384, 1), dtype=float32)
Tensor("static_stft/magnitude_to_decibel/Maximum_5:0", shape=(None, 61, 513, 1), dtype=float32)

So it looks like it is always assuming channels_first, unless I'm misunderstanding something.

Edit to add: version = 0.3.4, Python 3.8.5, tensorflow = 2.3.1, keras = 2.2.4

Input shape: (1, 16384) would be "channels_first" (mono channel), right? The STFT layer seems to be working as expected.

And that's why it worked after you permuted the input to (16384, 1): now it is "channels_last".
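A rough sketch of how the declared data format changes the predicted shape may help here. `stft_magnitude_shape` is a hypothetical helper, not a kapre function. It assumes no padding and a default hop of n_fft // 4, chosen only because it reproduces the 61 frames printed above. The channels_first output order (channel, time, frequency) is likewise an assumption, taken from the ordering in the composed docs example.

```python
def stft_magnitude_shape(input_shape, n_fft, hop_length=None,
                         data_format="channels_last"):
    """Predict the unbatched output shape of an unpadded STFT magnitude."""
    if hop_length is None:
        hop_length = n_fft // 4  # assumption: matches the 61 frames in this thread

    if data_format == "channels_last":
        n_samples, n_channels = input_shape   # (time, channel)
    elif data_format == "channels_first":
        n_channels, n_samples = input_shape   # (channel, time)
    else:
        raise ValueError("unknown data_format: {}".format(data_format))

    # Only full frames are counted; a signal shorter than n_fft yields 0 frames.
    n_frames = max(0, 1 + (n_samples - n_fft) // hop_length)
    n_freq = n_fft // 2 + 1

    if data_format == "channels_last":
        return (n_frames, n_freq, n_channels)
    return (n_channels, n_frames, n_freq)     # assumed channels_first layout


# (1, 16384) declared as channels_last: 1 sample across 16384 "channels".
print(stft_magnitude_shape((1, 16384), n_fft=1024))   # (0, 513, 16384)
# After Permute((2, 1)): 16384 samples, 1 channel.
print(stft_magnitude_shape((16384, 1), n_fft=1024))   # (61, 513, 1)
# The original array declared as channels_first instead.
print(stft_magnitude_shape((1, 16384), n_fft=1024,
                           data_format="channels_first"))  # (1, 61, 513)
```

The first two results match the shapes printed earlier in the thread, which is consistent with the layer honouring channels_last rather than ignoring it.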