hitachi-speech/EEND

Question about shuffle

takenori-y opened this issue · 2 comments

In the implementation, acoustic features rather than embeddings are shuffled during training. Is this okay? The positional encoding for the Transformer-based encoder seems to become a meaningless feature.

Thank you for your interest!
We created the positional encoding here, but it is not actually used in our model, so it is fine to shuffle the acoustic features in the dataloader.
We reported in our ASRU 2019 paper that we did not use positional encodings, as follows:

The architecture of the encoder block is depicted in Fig. 2. This configuration of the encoder block is almost the same as the one in the Speech-Transformer introduced in [44], but without positional encoding.
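To illustrate why this works: without positional encoding, a Transformer encoder is permutation-equivariant over input frames, so reordering frames (together with their labels) does not change what the model can learn. Below is a minimal, hypothetical sketch of such a frame-shuffling step as it might appear in a dataloader; the function name `shuffle_frames` and the toy shapes are assumptions for illustration, not code from this repository.

```python
import numpy as np

def shuffle_frames(features, labels, rng=None):
    """Shuffle acoustic feature frames and their labels along the time
    axis with the same random permutation.

    Because the encoder here uses no positional encoding, it is
    permutation-equivariant over frames, so this reordering is safe.
    """
    rng = np.random.default_rng() if rng is None else rng
    perm = rng.permutation(len(features))
    return features[perm], labels[perm]

# toy example: 5 frames of 3-dim features, 2-speaker activity labels
feats = np.arange(15, dtype=float).reshape(5, 3)
labels = np.array([[1, 0], [1, 0], [0, 1], [1, 1], [0, 0]])
shuf_feats, shuf_labels = shuffle_frames(feats, labels)
```

The key point is that the same permutation is applied to features and labels, so frame-label alignment is preserved even though temporal order is destroyed.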

I'm sorry for the confusion.

Ah, I see. Thank you for answering my question.