bentrevett/pytorch-seq2seq

Seq2seq: Input not matching Output (and big thanks)

s2458588 opened this issue · 0 comments

Hi, bentrevett. Thank you for providing these concise tutorials! 🥇
I was able to train an LSTM for a seq2seq application (similar to translation) in a short time, using a custom dataset and a custom tokenizer.

However, the input and output seem to mismatch significantly, especially in length.
The model is supposed to generate the transcription of a short sequence, where the input and the output have an almost character-to-character correspondence. Example:
Input: Doe, John
Expected output: doʊ, dʒɑn
Actual output: a very long string with > 30 tokens.
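
For reference, decoding at inference time roughly follows the tutorial's greedy loop. Below is a simplified sketch rather than my exact code; the encoder/decoder call signatures, the `sos_idx`/`eos_idx` names, and the `max_len` value are placeholders:

```python
import torch

def greedy_decode(model, src_tensor, sos_idx, eos_idx, max_len=50, device="cpu"):
    """Greedily decode one source sequence of shape [src_len, 1]."""
    model.eval()
    with torch.no_grad():
        # assuming a tutorial-style LSTM encoder that returns (hidden, cell)
        hidden, cell = model.encoder(src_tensor)

    trg_indices = [sos_idx]
    for _ in range(max_len):
        trg_token = torch.LongTensor([trg_indices[-1]]).to(device)
        with torch.no_grad():
            # the decoder consumes the previous token plus the running LSTM state
            output, hidden, cell = model.decoder(trg_token, hidden, cell)
        pred_token = output.argmax(1).item()
        trg_indices.append(pred_token)
        # without this stop condition, generation always runs to max_len
        if pred_token == eos_idx:
            break
    return trg_indices
```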

Training was done on ~450k parallel examples (75% / 15% / 10% train / test / eval split) with batch size 128 for 10 epochs; the model has ~17,000,000 parameters. I know there are many places where it could have gone wrong, but is there an obvious reason I am not seeing? The token-to-ID ratio seemed fine.
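
In case it helps, a quick length check on the target side of the training data looks roughly like the sketch below; `train_pairs` and its tuple layout are placeholder names, not my actual variables:

```python
from collections import Counter

# train_pairs: list of (src_tokens, trg_tokens) tuples (hypothetical name/structure)
trg_lengths = [len(trg) for _, trg in train_pairs]
print("max target length:", max(trg_lengths))
print("mean target length:", sum(trg_lengths) / len(trg_lengths))
print("most common lengths:", Counter(trg_lengths).most_common(5))
```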
Greatly appreciate any ideas!