word_language_model: is it a Transformer, encoder-only, or decoder-only?
efg001 opened this issue · 1 comment
📚 Documentation
The documentation says word_language_model uses an RNN/Transformer, but I am having trouble understanding exactly what kind of Transformer it is.
Looking at the input and target sequences, it seems like it is a generative model where the expected output is shifted by 1 (i.e. the model is trained to generate words based on a prefix), as in the batching code below:
https://github.com/pytorch/examples/blob/main/word_language_model/main.py#L140
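For context, the batching in main.py builds the target by shifting the input one token to the right. A rough sketch of that logic (my paraphrase, not the exact code; `bptt` is the sequence length the example uses):

```python
import torch

def get_batch(source: torch.Tensor, i: int, bptt: int = 35):
    # source: (num_steps, batch_size) tensor of token ids
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]                        # tokens at positions i .. i+seq_len-1
    target = source[i + 1:i + 1 + seq_len].reshape(-1)  # the same positions shifted by 1
    return data, target
```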
However, I see the output of the encoder is re-wired as the input to the decoder here:
https://github.com/pytorch/examples/blob/main/word_language_model/model.py#L143
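For reference, the forward pass I am asking about looks roughly like this (my paraphrase of model.py, not the exact code; attribute names approximate):

```python
import math
import torch
import torch.nn.functional as F

def forward(self, src, has_mask=True):
    if has_mask:
        # Causal mask: position t may only attend to positions <= t.
        sz = len(src)
        mask = torch.triu(torch.full((sz, sz), float('-inf'), device=src.device), diagonal=1)
    else:
        mask = None
    x = self.input_emb(src) * math.sqrt(self.ninp)  # token embedding
    x = self.pos_encoder(x)                         # positional encoding
    output = self.encoder(x, mask=mask)             # nn.Transformer's *encoder* stack, used alone
    output = self.decoder(output)                   # self.decoder is just a Linear to vocab size
    return F.log_softmax(output, dim=-1)
```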
As a reference, since the documentation says that word_language_model implements both an RNN and a Transformer model, I looked at PyTorch's implementation of the Transformer here:
https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/transformer.py#L273-L279
PyTorch's implementation aligns with what the paper proposed, where the input to the encoder is src (the input sequence) and the input to the decoder is tgt (the shifted target sequence).
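For comparison, a vanilla nn.Transformer takes both sequences, and its decoder attends to the encoder's output (the "memory") through cross-attention; a minimal usage sketch with made-up shapes:

```python
import torch
import torch.nn as nn

transformer = nn.Transformer(d_model=512, nhead=8)

src = torch.rand(10, 32, 512)  # (src_len, batch, d_model): the input sequence
tgt = torch.rand(20, 32, 512)  # (tgt_len, batch, d_model): the shifted target sequence

# Internally: memory = encoder(src); out = decoder(tgt, memory), where the decoder
# cross-attends to memory. This is the wiring the example does NOT use.
out = transformer(src, tgt)    # (tgt_len, batch, d_model)
```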
So, because of this rewiring, word_language_model is obviously not a vanilla (encoder-decoder) Transformer for generating text.
Since it uses the vanilla Transformer model and the built-in cross-attention in the decoder is not removed, it is not a decoder-only model either.
And since it is trained to generate text, I don't think it can be understood as an encoder-only model.
Can someone help me understand why the output of the encoder is re-wired into the decoder as its input, instead of going through cross-attention, and whether the docs need to be updated to reflect what the model is doing, or the code simplified to use a decoder-only model?
nvm, it's a decoder-only model.
The "encoder" here is effectively the decoder; what the code calls the decoder is just the output projection:
self.decoder = nn.Linear(nhid, ntoken)
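To spell that out: what the code calls decoder is only the output projection to the vocabulary, and an nn.TransformerEncoder stack run with a causal mask is exactly a decoder-only language model (a decoder block minus cross-attention). A minimal self-contained sketch of that equivalence (hyperparameters invented, positional encoding omitted for brevity):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoderOnlyLM(nn.Module):
    """Encoder layers + causal mask + linear head == decoder-only LM."""
    def __init__(self, ntoken=10000, d_model=200, nhead=2, nhid=200, nlayers=2):
        super().__init__()
        self.embed = nn.Embedding(ntoken, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=nhid)
        self.blocks = nn.TransformerEncoder(layer, nlayers)  # self-attention only, no cross-attention
        self.lm_head = nn.Linear(d_model, ntoken)            # plays the role of `self.decoder` above
        self.d_model = d_model

    def forward(self, src):
        # src: (seq_len, batch) of token ids
        sz = src.size(0)
        causal_mask = torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)
        x = self.embed(src) * math.sqrt(self.d_model)
        x = self.blocks(x, mask=causal_mask)
        return F.log_softmax(self.lm_head(x), dim=-1)

model = TinyDecoderOnlyLM()
tokens = torch.randint(0, 10000, (35, 16))  # (seq_len, batch)
log_probs = model(tokens)                   # (35, 16, 10000)
```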