Output size with get_stft_magnitude_layer
mo-seph opened this issue · 3 comments
I am confused about the output sizes coming from get_stft_magnitude_layer.
In the composed docs (https://kapre.readthedocs.io/en/latest/composed.html), it says that an input shape of (2048, 1) and n_fft=1024 gives an output of (None, 1, 3, 513): that is, 1 channel with 3 hops and 513 points in each hop. This seems correct, assuming it defaults to each hop overlapping 50%.
In the examples (https://kapre.readthedocs.io/en/latest/quickstart.html), an input shape of (2, 32000) with n_fft=2048, win_length=2018, hop_length=1024 leads to an output shape of (None, 0, 1025, 320000).
So my questions are:
- why has the number of channels disappeared?
- why are there 320000 frames and not 320000 / hop_length frames?
Thanks!
Hi, thanks for the question. I spotted an error in the second one at https://kapre.readthedocs.io/en/latest/quickstart.html, and I'll fix it soon. The issue here is that I passed channels_last where I should've set channels_first.
On further investigation, it seems that channels_last may not be working:
src = np.ndarray(shape=(1,16384))
print("Input shape: {}".format(src.shape))
inputs = keras.Input(shape=src.shape, name='input')
print(inputs)
x = composed.get_stft_magnitude_layer(n_fft=1024,
                                      name='static_stft',
                                      input_data_format="channels_last",
                                      return_decibel=True,)(inputs)
print(x)
prints
Input shape: (1, 16384)
Tensor("input:0", shape=(None, 1, 16384), dtype=float32)
Tensor("static_stft/magnitude_to_decibel/Maximum_5:0", shape=(None, 0, 513, 16384), dtype=float32)
If I permute the input so it is channels first:
src = np.ndarray(shape=(1,16384))
print("Input shape: {}".format(src.shape))
inputs = keras.Input(shape=src.shape, name='input')
inputs = keras.layers.Permute((2, 1))(inputs)
print(inputs)
x = composed.get_stft_magnitude_layer(n_fft=1024,
                                      name='static_stft',
                                      input_data_format="channels_last",
                                      return_decibel=True,)(inputs)
print(x)
it gives:
Tensor("permute/transpose:0", shape=(None, 16384, 1), dtype=float32)
Tensor("static_stft/magnitude_to_decibel/Maximum_5:0", shape=(None, 61, 513, 1), dtype=float32)
So it looks like it is always assuming channels_first, unless I'm misunderstanding something.
Edit to add: version = 0.3.4, Python 3.8.5, tensorflow = 2.3.1, keras = 2.2.4
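As a rough cross-check on the frame counts discussed in this thread, the sketch below applies the usual frame-count formula for an STFT without end padding. The helper name `expected_num_frames` is hypothetical, and the hop of 256 (n_fft // 4) is an assumption chosen because it reproduces the 61 frames printed for the permuted input; it is not a value stated in the thread.

```python
def expected_num_frames(n_samples, frame_length, hop_length):
    """Frame count for an unpadded STFT.

    Assumes frames that would run past the end of the signal are dropped,
    giving 1 + (n_samples - frame_length) // hop_length.
    """
    if n_samples < frame_length:
        return 0
    return 1 + (n_samples - frame_length) // hop_length


# 16384 samples, n_fft=1024, assumed hop of 256: matches the 61 frames above.
print(expected_num_frames(16384, 1024, 256))  # 61

# 2048 samples, n_fft=1024, 50% overlap (hop of 512): matches the 3 hops
# quoted from the composed docs.
print(expected_num_frames(2048, 1024, 512))  # 3
```

Under the same assumptions, a frame axis equal to the raw sample count (16384 or 320000) suggests that the time axis was read as something other than time, rather than that the hop length was ignored.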
Input shape: (1, 16384) would be "channels_first" (mono channel), right? The STFT layer seems to be working as expected. And that's why it worked after you permuted the input to (16384, 1), because now it is "channels_last".
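To make the two layouts concrete, here is a minimal NumPy-only sketch of the shapes involved. The helper name `to_channels_last` and the batch size of 4 are hypothetical and only for illustration; the sketch does not call Kapre itself.

```python
import numpy as np


def to_channels_last(batch):
    """Convert (batch, channel, time) audio to (batch, time, channel)."""
    return np.transpose(batch, (0, 2, 1))


# Mono audio stored as (batch, channel, time), i.e. "channels_first".
batch_cf = np.zeros((4, 1, 16384), dtype=np.float32)

# The same audio as (batch, time, channel), i.e. "channels_last".
batch_cl = to_channels_last(batch_cf)

print(batch_cf.shape)  # (4, 1, 16384)
print(batch_cl.shape)  # (4, 16384, 1)
```

This mirrors the `Permute((2, 1))` call earlier in the thread, which swaps the same two axes but leaves the batch axis implicit. A per-sample shape of (1, 16384) should be paired with "channels_first", and (16384, 1) with "channels_last".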