Question
mozsen opened this issue · 1 comments
Excuse me, I have a question.
context = [-11,0,5,7,10]
input_dim = 16
output_dim = 8
net = TDNN(context, input_dim, output_dim, full_context=False)
What do -11, 0, 5, 7, 10 mean?
For example, I have a speech dataset with 10000 frames * 2576 features.
That is 2576 features per frame, so the input is 1 * 2576.
I want to implement speech separation with a TDNN; batch_size is 200 and the target is the IRM (1 * 161).
What does "context = [-11,0,5,7,10]" mean? What do I get if I use it as my hidden layer?
Thank you very much. I need help to complete my graduation project. My English is poor, please forgive me. I appreciate your reply. Thanks a lot.
Sorry for the late reply!
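The context array specifies which context frames, as offsets relative to the current frame, are used to compute the convolution with the kernel. A negative offset such as -11 means 11 frames in the past, and a positive offset such as 10 means 10 frames in the future. So think of it as a weird dilated convolution. The output sequence length will depend on the context you provide. So based on what you said about your input: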
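To make the offsets concrete, here is a minimal pure-Python sketch of which frames each output step would look at. It is only an illustration under the assumption that no padding is applied; `gather_context` is a hypothetical helper, not a function from the TDNN code, and the toy "frames" are just integers standing in for feature vectors.

```python
def gather_context(frames, context):
    """Return, for each valid time step, the frames picked by the context offsets.

    Hypothetical helper for illustration only; assumes no padding.
    """
    left = max(0, -min(context))   # frames needed in the past
    right = max(0, max(context))   # frames needed in the future
    windows = []
    for t in range(left, len(frames) - right):
        # Pick only the frames at the listed offsets around time t.
        windows.append([frames[t + offset] for offset in context])
    return windows


frames = list(range(30))            # toy sequence: frame index stands in for features
context = [-11, 0, 5, 7, 10]
windows = gather_context(frames, context)

print(windows[0])    # frames seen by the first output step: [0, 11, 16, 18, 21]
print(len(windows))  # number of output steps: 9
```

Only the five listed offsets are used at each step, which is why it behaves like a convolution with irregular dilation.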
input size: 200 * 10000 * 2576
so input dim = 2576 and sequence length is 10000.
let's say output dim = 1024
The output sequence length will be 10000 - 11 - 10 = 9979, because the current frame looks at 11 frames in the past and 10 frames in the future.
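The same arithmetic can be written as a small helper. This is a sketch assuming no padding, so the sequence shrinks by the largest past offset plus the largest future offset; `output_length` is a hypothetical name, not part of the TDNN module.

```python
def output_length(seq_len, context):
    """Output sequence length for a given context, assuming no padding."""
    left = max(0, -min(context))   # furthest look-back, e.g. 11
    right = max(0, max(context))   # furthest look-ahead, e.g. 10
    return max(0, seq_len - left - right)


print(output_length(10000, [-11, 0, 5, 7, 10]))  # 9979
```

With batch size 200 and output dim 1024, the output would then be 200 * 9979 * 1024.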
I realized that you may need to use some form of subsampling after this, like max-pooling. The authors of the Peddinti paper implemented a subsampling scheme of their own.