Machine Translation is one of the most important problems in natural language processing. In this project, we achieved a BLEU score of 49 on translating English sentences to Persian using a Transformer model.
Machine Translation is a sequence-to-sequence problem; in other words, it takes a sequence as input and returns a sequence as output. These kinds of problems are often solved with Encoder-Decoder models. The encoder takes a sequence as input and generates vectors that represent it; the decoder then decodes those vectors to generate the output.
In the encoder and decoder, LSTM units may be used to better handle longer sequences.
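A minimal sketch of such an LSTM encoder-decoder in PyTorch (the class names and layer sizes here are only illustrative, not the ones used in the project):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src):                        # src: (batch, src_len)
        embedded = self.embedding(src)             # (batch, src_len, emb_dim)
        outputs, (h, c) = self.lstm(embedded)      # outputs: one vector per time-step
        return outputs, (h, c)                     # (h, c) summarizes the whole sentence

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tgt, state):                 # tgt: (batch, tgt_len)
        embedded = self.embedding(tgt)
        outputs, state = self.lstm(embedded, state)  # decoding starts from the encoder's state
        return self.out(outputs), state            # logits over the target vocabulary
```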
One of the issues with simple Encoder-Decoder models that use LSTMs is that they need to represent any input in a fixed-size space. For example, say a sentence like "I really love AI" is given to the encoder: no matter how long the input is, the whole sentence has to be compressed into a single fixed-length vector.
To address this issue, we can return a vector at each time-step. In this case we have a representation of the whole sentence at every time-step, but one that focuses more on the current time-step. The decision about which parts the decoder should attend to at each time-step is made using the decoder's hidden state.
The attention mechanism gives a weight to each encoded time-step, indicating which parts the decoder should attend to more. The result is a weighted average of the encoder's outputs.
Attention is calculated by computing a similarity. The RNN with Attention part of the project uses the additive compatibility function, also known as additive attention, calculated as:

$$
\begin{equation}
e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)
\end{equation}
$$

where $s_{i-1}$ is the decoder's previous hidden state and $h_j$ is the encoder output at time-step $j$. Also, the scores are normalized with a softmax to obtain the attention weights:

$$
\begin{equation}
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}
\end{equation}
$$
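As a sketch, the additive score and its softmax can be computed like this in PyTorch, assuming hypothetical weight tensors `W_a`, `U_a`, and `v_a` named after the equation above (not the project's actual code):

```python
import torch
import torch.nn.functional as F

def additive_attention_weights(s_prev, enc_outputs, W_a, U_a, v_a):
    # s_prev:      (batch, dec_dim)          -- previous decoder hidden state s_{i-1}
    # enc_outputs: (batch, src_len, enc_dim) -- encoder outputs h_j
    # W_a: (attn_dim, dec_dim), U_a: (attn_dim, enc_dim), v_a: (attn_dim,)
    query = s_prev @ W_a.T                           # (batch, attn_dim)
    keys = enc_outputs @ U_a.T                       # (batch, src_len, attn_dim)
    e = torch.tanh(query.unsqueeze(1) + keys) @ v_a  # scores e_ij: (batch, src_len)
    return F.softmax(e, dim=-1)                      # attention weights alpha_ij
```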
The context given to the decoder is calculated as the weighted sum of the encoder outputs:
$$
\begin{equation}
c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j
\end{equation}
$$
where $\alpha_{ij}$ is the attention weight and $h_j$ is the encoder output at time-step $j$.
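Continuing the same sketch, the context vector is then just the weighted sum over the encoder outputs:

```python
# alphas:      (batch, src_len)          -- attention weights alpha_ij from the sketch above
# enc_outputs: (batch, src_len, enc_dim) -- encoder outputs h_j
# c_i = sum_j alpha_ij * h_j
context = (alphas.unsqueeze(-1) * enc_outputs).sum(dim=1)   # (batch, enc_dim)
```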
We can visualize how much attention the decoder gives to each encoder time-step via the attention weights $\alpha_{ij}$.
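For example, collecting the weights $\alpha_{ij}$ for every decoder step gives a matrix that can be drawn as a heatmap; a small sketch with matplotlib (the token lists are placeholders, not the project's plotting code):

```python
import matplotlib.pyplot as plt

def plot_attention(attn, src_tokens, tgt_tokens):
    # attn: (tgt_len, src_len) matrix of alpha_ij collected during decoding
    fig, ax = plt.subplots()
    ax.imshow(attn, cmap="viridis")
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=90)
    ax.set_yticks(range(len(tgt_tokens)))
    ax.set_yticklabels(tgt_tokens)
    ax.set_xlabel("encoder (source) time-steps")
    ax.set_ylabel("decoder (target) time-steps")
    plt.show()
```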
Although attention solves this issue with Encoder-Decoder models, we cannot benefit much from parallelism, since RNN units process the sequence step by step.
Here is how a GRU is calculated:
$$
\begin{align}
\tilde{h}_t &= \tanh(W_c x_t + U_c[h_{t-1} \odot r_t] + b_c)\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r)\\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z)\\
h_t &= z_t \odot \tilde{h}_t + (1 - z_t) \odot h_{t-1}
\end{align}
$$
It can be seen that the recurrent relation is embedded in these calculations, since $h_t$ depends on $h_{t-1}$.
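A sketch of a single GRU step following the equations above, with hypothetical weight tensors passed in explicitly (not the project's implementation):

```python
import torch

def gru_step(x_t, h_prev, W_r, U_r, b_r, W_z, U_z, b_z, W_c, U_c, b_c):
    # One GRU time-step following the equations above.
    r_t = torch.sigmoid(x_t @ W_r.T + h_prev @ U_r.T + b_r)           # reset gate
    z_t = torch.sigmoid(x_t @ W_z.T + h_prev @ U_z.T + b_z)           # update gate
    h_tilde = torch.tanh(x_t @ W_c.T + (h_prev * r_t) @ U_c.T + b_c)  # candidate state
    return z_t * h_tilde + (1 - z_t) * h_prev                         # new hidden state h_t
```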