Attention Is All You Need paper
Training:
- During training, the target sentence with its last word removed is fed to the decoder as input (teacher forcing).
Evaluation:
- At evaluation time, the decoder input starts with the start token (`<sos>`), and each generated word is appended to the decoder input. Generation continues until the end token (`<eos>`) appears or the translation reaches max_len.
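The evaluation loop above can be sketched as plain Python. The token ids (`SOS = 2`, `EOS = 3`) and `predict_next` are placeholders, not the repo's actual model; `predict_next` stands in for one decoder forward pass that returns the most likely next word.

```python
SOS, EOS, MAX_LEN = 2, 3, 10  # hypothetical token ids and length limit

def greedy_decode(predict_next, max_len=MAX_LEN):
    """Start from <sos>, append one predicted token per step,
    and stop on <eos> or when max_len is reached."""
    tokens = [SOS]
    for _ in range(max_len - 1):
        next_token = predict_next(tokens)  # model picks the next word from the prefix
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

# Toy "model": emits tokens 5, 6, 7, then <eos>.
script = iter([5, 6, 7, EOS])
print(greedy_decode(lambda prefix: next(script)))  # [2, 5, 6, 7, 3]
```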
Example:
- Example Sentence: Several women wait outside in a city. (English) -> Mehrere Frauen warten in einer Stadt im Freien. (German)
- Training:
- Source sentence: Several women wait outside in a city.
- Decoder input: Mehrere Frauen warten in einer Stadt im
- Target sentence: Mehrere Frauen warten in einer Stadt im Freien.
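The decoder input / target pair above is just the target sentence with the last word dropped. A minimal sketch of that split (whitespace tokenization here is a simplification):

```python
# Build the training pair from the example sentence: the decoder input is the
# target with its last token removed, so each position learns the next word.
target = "Mehrere Frauen warten in einer Stadt im Freien.".split()
decoder_input = target[:-1]  # drop the last word, as in the example above

print(decoder_input)  # ['Mehrere', 'Frauen', 'warten', 'in', 'einer', 'Stadt', 'im']
```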
- Evaluation:
- Input data shape: (batch_size, max_len)
- Input Embedding output shape: (batch_size, max_len, embedding_dim)
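The two shapes above can be verified with a toy embedding lookup. The sizes (`batch_size=2`, `max_len=5`, `embedding_dim=8`, `vocab_size=100`) are placeholder values, not the repo's configuration; NumPy fancy indexing stands in for an embedding layer.

```python
import numpy as np

batch_size, max_len, embedding_dim, vocab_size = 2, 5, 8, 100

token_ids = np.random.randint(0, vocab_size, size=(batch_size, max_len))
embedding_table = np.random.randn(vocab_size, embedding_dim)

embedded = embedding_table[token_ids]  # one vector looked up per token

print(token_ids.shape)  # (2, 5)    -> (batch_size, max_len)
print(embedded.shape)   # (2, 5, 8) -> (batch_size, max_len, embedding_dim)
```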
- Positional Encoding
- Positional encoding assigns each position a value determined only by the token position and the embedding dimension, independent of the input sentence's content.
- The encoding is computed for every position up to the sentence length.
- Formula
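For reference, the sinusoidal encoding from the paper, where `pos` is the token position, `i` the dimension index, and `d_model` the embedding dimension:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),
\qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```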
- By matrix-multiplying the query with the key and applying softmax, we can see how strongly each word attends to every other word.
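A minimal NumPy sketch of that matrix-multiply-plus-softmax step (scaled dot-product attention); the shapes are toy values, and the scaling by `sqrt(d_k)` follows the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of every query word to every key word
    # Softmax over key positions: each row becomes a distribution over words.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

Q = np.random.randn(4, 8)  # 4 words, d_k = 8 (toy sizes)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.sum(axis=-1))  # each row sums to 1
```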
- Layer normalization was used in the paper.
- Formula
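For reference, layer normalization (Ba et al.) normalizes each token's features with a learned scale `γ` and shift `β`:

```latex
\mathrm{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,
\qquad
\mu = \frac{1}{d}\sum_{k=1}^{d} x_k,
\qquad
\sigma^2 = \frac{1}{d}\sum_{k=1}^{d} (x_k - \mu)^2
```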
- Layer Normalization Paper Reference
- Encoder
- Masking is also used in the encoder: sentences shorter than max_len are filled with the padding token (index 1), so those padding positions are masked out.
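A sketch of that padding mask, assuming (as stated above) that the pad index is 1; the token ids and `max_len = 5` are toy values:

```python
import numpy as np

PAD = 1  # padding index, as described above

# One toy sentence of 3 real words, padded out to max_len = 5.
token_ids = np.array([[5, 9, 12, PAD, PAD]])
pad_mask = token_ids != PAD  # True where a real word is, False at padding

print(pad_mask)  # [[ True  True  True False False]]
```

Positions where the mask is `False` get their attention scores set to a large negative value before the softmax, so padding receives (near-)zero attention weight.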
- Decoder
- The decoder masks future words, because the current word must not be predicted by looking at the words that come after it.
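That look-ahead mask is a lower-triangular matrix: position `i` may only attend to positions `<= i`. A sketch with a toy `max_len = 4`:

```python
import numpy as np

max_len = 4
# True = allowed to attend, False = masked-out future position.
look_ahead = np.tril(np.ones((max_len, max_len), dtype=bool))

print(look_ahead.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```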
- Execution example
- Code example
- To do: code refactoring & code translation