Attention Is All You Need paper
Training:
- During training, the target sentence with its last word removed is fed to the decoder as input (teacher forcing).
Evaluation:
- At evaluation time, the decoder input starts with the start token (`<sos>`), and each generated word is appended to the decoder input. Generation continues until the end token (`<eos>`) appears or the translation reaches max_len.
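The evaluation loop above can be sketched as plain Python. The token ids (`SOS = 2`, `EOS = 3`) and `predict_next` are placeholders, not the repo's actual model; `predict_next` stands in for one decoder forward pass that returns the most likely next word.

```python
SOS, EOS, MAX_LEN = 2, 3, 10  # hypothetical token ids and length limit

def greedy_decode(predict_next, max_len=MAX_LEN):
    """Start from <sos>, append one predicted token per step,
    and stop on <eos> or when max_len is reached."""
    tokens = [SOS]
    for _ in range(max_len - 1):
        next_token = predict_next(tokens)  # model picks the next word from the prefix
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

# Toy "model": emits tokens 5, 6, 7, then <eos>.
script = iter([5, 6, 7, EOS])
print(greedy_decode(lambda prefix: next(script)))  # [2, 5, 6, 7, 3]
```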
Example:
- Example Sentence: Several women wait outside in a city. (English) -> Mehrere Frauen warten in einer Stadt im Freien. (German)
- Training:
- Source sentence: Several women wait outside in a city.
- Decoder input: Mehrere Frauen warten in einer Stadt im
- Target sentence: Mehrere Frauen warten in einer Stadt im Freien.
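The decoder input / target pair above is just the target sentence with the last word dropped. A minimal sketch of that split (whitespace tokenization here is a simplification):

```python
# Build the training pair from the example sentence: the decoder input is the
# target with its last token removed, so each position learns the next word.
target = "Mehrere Frauen warten in einer Stadt im Freien.".split()
decoder_input = target[:-1]  # drop the last word, as in the example above

print(decoder_input)  # ['Mehrere', 'Frauen', 'warten', 'in', 'einer', 'Stadt', 'im']
```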
- Evaluation:
- Input data shape: (batch_size, max_len)
- Input Embedding output shape: (batch_size, max_len, embedding_dim)
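The two shapes above can be verified with a toy embedding lookup. The sizes (`batch_size=2`, `max_len=5`, `embedding_dim=8`, `vocab_size=100`) are placeholder values, not the repo's configuration; NumPy fancy indexing stands in for an embedding layer.

```python
import numpy as np

batch_size, max_len, embedding_dim, vocab_size = 2, 5, 8, 100

token_ids = np.random.randint(0, vocab_size, size=(batch_size, max_len))
embedding_table = np.random.randn(vocab_size, embedding_dim)

embedded = embedding_table[token_ids]  # one vector looked up per token

print(token_ids.shape)  # (2, 5)    -> (batch_size, max_len)
print(embedded.shape)   # (2, 5, 8) -> (batch_size, max_len, embedding_dim)
```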
- Positional Encoding
- Positional encoding assigns each position a value determined only by the token position and the embedding dimension, independent of the input sentence's content.
- The encoding is computed for every position up to the sentence length.
- Formula
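For reference, the sinusoidal encoding from the paper, where `pos` is the token position, `i` the dimension index, and `d_model` the embedding dimension:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```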
- By matrix-multiplying the query with the key and applying softmax, we can see how strongly each word attends to every other word.
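A minimal NumPy sketch of that matrix-multiply-plus-softmax step (scaled dot-product attention); the shapes are toy values, and the scaling by `sqrt(d_k)` follows the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query word to every key word
    # Softmax over key positions: each row becomes a distribution over words.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

Q = np.random.randn(4, 8)  # 4 words, d_k = 8 (toy sizes)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.sum(axis=-1))  # each row sums to 1
```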
- Layer normalization was used in the paper.
- Formula
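For reference, layer normalization (Ba et al.) normalizes each token's features with a learned scale `γ` and shift `β`:

```latex
\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
\qquad
\mu = \frac{1}{d}\sum_{k=1}^{d} x_k,
\qquad
\sigma^2 = \frac{1}{d}\sum_{k=1}^{d} (x_k - \mu)^2
```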
- Layer Normalization Paper Reference
- Encoder
- Masking is also used in the encoder: sentences shorter than max_len are filled with the padding token (index 1), so those padding positions are masked out.
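A sketch of that padding mask, assuming (as stated above) that the pad index is 1; the token ids and `max_len = 5` are toy values:

```python
import numpy as np

PAD = 1  # padding index, as described above

# One toy sentence of 3 real words, padded out to max_len = 5.
token_ids = np.array([[5, 9, 12, PAD, PAD]])
pad_mask = token_ids != PAD  # True where a real word is, False at padding

print(pad_mask)  # [[ True  True  True False False]]
```

Positions where the mask is `False` get their attention scores set to a large negative value before the softmax, so padding receives (near-)zero attention weight.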
- Decoder
- The decoder masks future words, because the current word must not be predicted by looking at the words that come after it.
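That look-ahead mask is a lower-triangular matrix: position `i` may only attend to positions `<= i`. A sketch with a toy `max_len = 4`:

```python
import numpy as np

max_len = 4
# True = allowed to attend, False = masked-out future position.
look_ahead = np.tril(np.ones((max_len, max_len), dtype=bool))

print(look_ahead.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```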
- Execution example
- Code example
- To do: code refactoring & code translation