The attention-based architecture performs better on longer sequences than the classical end-to-end encoder-decoder Seq2Seq network. Translation is performed from the English corpus to the Hindi corpus.
When working with data, the most important first step is to "standardise" it: the data has to be preprocessed so that no anomalies remain (which could otherwise lead to poor test results).
We take the following steps to "clean" the data (see the sketch after this list):
- Convert the English data to lower case.
- Replace English contractions with the help of a dictionary.
- Remove extra spaces, non-printable symbols, and numbers.
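A minimal sketch of this cleaning step in Python. The contraction dictionary shown is only a tiny illustrative subset, and the exact regular expressions used in the project may differ.

```python
import re

# Small illustrative contraction map; the real dictionary would be much larger.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
    "don't": "do not",
}

def clean_english(sentence: str) -> str:
    """Lower-case, expand contractions, and strip noise from an English sentence."""
    s = sentence.lower()
    # Replace contractions using the lookup dictionary.
    for contraction, expansion in CONTRACTIONS.items():
        s = s.replace(contraction, expansion)
    # Drop digits and non-printable symbols, keeping letters and basic punctuation.
    s = re.sub(r"[^a-z\s.,!?']", " ", s)
    # Collapse repeated whitespace into single spaces.
    s = re.sub(r"\s+", " ", s).strip()
    return s

print(clean_english("I can't  believe it's   2023!"))  # -> "i cannot believe it is !"
```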
spaCy was used for tokenisation, and each word was mapped to a 300-dimensional embedding.
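A rough sketch of how the tokenisation and embedding could be wired up. The special tokens, the vocabulary construction, and the `en_core_web_sm` model are illustrative assumptions rather than the project's exact pipeline.

```python
import spacy
import torch
import torch.nn as nn

# English tokeniser (a comparable tokeniser would be used for the Hindi side).
nlp = spacy.load("en_core_web_sm")

def tokenise(sentence: str) -> list[str]:
    return [tok.text for tok in nlp(sentence)]

# Toy vocabulary built from a cleaned corpus; index 0 is reserved for padding.
corpus = ["i cannot believe it is over", "the weather is nice"]
vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3}
for sent in corpus:
    for word in tokenise(sent):
        vocab.setdefault(word, len(vocab))

# Each token index is mapped to a trainable 300-dimensional vector.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=300, padding_idx=0)

ids = torch.tensor([vocab.get(w, vocab["<unk>"]) for w in tokenise("it is nice")])
print(embedding(ids).shape)  # torch.Size([3, 300])
```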
Both models were trained with the following settings (set up in the sketch after this list):
- Batch size = 32
- Adam optimizer
- CrossEntropyLoss as the loss function
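A runnable sketch of this training setup. The tensors and the placeholder network stand in for the project's own dataset and Seq2Seq models, and the learning rate is an assumption.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

PAD_IDX = 0  # index reserved for padding tokens

# Dummy stand-ins for the parallel English-Hindi tensors, used only so this
# sketch runs; real data would come from the cleaned, tokenised corpus.
src = torch.randint(1, 100, (256, 12))   # 256 source sentences, 12 token ids each
tgt = torch.randint(1, 100, (256, 12))   # parallel target-side token ids
train_loader = DataLoader(TensorDataset(src, tgt), batch_size=32, shuffle=True)

val_src = torch.randint(1, 100, (64, 12))
val_tgt = torch.randint(1, 100, (64, 12))
val_loader = DataLoader(TensorDataset(val_src, val_tgt), batch_size=32)

# Placeholder network; the real encoder-decoder / attention model goes here.
model = nn.Sequential(nn.Embedding(100, 300), nn.Linear(300, 100))

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)  # padding does not contribute to the loss
```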
The models were trained for a certain number of epochs (note that a single epoch took longer for the attention model, as it has more learnable parameters) and a checkpoint was saved only when the validation loss decreased. This prevents keeping weights from epochs where the model has started to overfit the training data.
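Continuing from the setup above, a sketch of this checkpointing logic. The epoch count, file name, and `run_epoch` helper are illustrative, not the project's exact code.

```python
def run_epoch(loader, train: bool) -> float:
    """One pass over `loader`; weights are updated only when train=True."""
    model.train(train)
    total = 0.0
    with torch.set_grad_enabled(train):
        for src_batch, tgt_batch in loader:
            logits = model(src_batch)                       # (batch, seq_len, vocab_size)
            loss = criterion(logits.reshape(-1, logits.size(-1)), tgt_batch.reshape(-1))
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item()
    return total / len(loader)

best_val_loss = float("inf")
for epoch in range(10):
    train_loss = run_epoch(train_loader, train=True)
    val_loss = run_epoch(val_loader, train=False)
    # Checkpoint only when validation loss improves, so the saved weights
    # never come from an epoch where the model has started to overfit.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_model.pt")
```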
While the translations obtained are not exact, the mistranslated words still preserve the context of the sentence. The encoder-decoder architecture did not perform too poorly, owing to its multi-layered design, whereas the attention model uses a single-layer decoder.
The models are not perfect, but they could be improved by using beam search instead of greedy decoding, as sketched below.
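A minimal, model-agnostic sketch of beam search. It assumes a `step_fn` that returns the decoder's log-probabilities for the next token given the tokens generated so far; `toy_step` and the token ids are purely illustrative stand-ins for the real decoder.

```python
import math
from typing import Callable

def beam_search(step_fn: Callable[[list[int]], dict[int, float]],
                sos: int, eos: int, beam_width: int = 3, max_len: int = 20) -> list[int]:
    """Keep the `beam_width` highest-scoring partial translations at every step,
    instead of committing to the single best token as greedy decoding does."""
    beams = [([sos], 0.0)]          # (token sequence, cumulative log-probability)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                completed.append((seq, score))
                continue
            for token, log_prob in step_fn(seq).items():
                candidates.append((seq + [token], score + log_prob))
        if not candidates:
            break
        # Keep only the top `beam_width` hypotheses for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    completed.extend(beams)
    return max(completed, key=lambda c: c[1])[0]

# Toy next-token distribution standing in for the decoder's softmax output.
def toy_step(prefix: list[int]) -> dict[int, float]:
    if len(prefix) >= 4:
        return {2: math.log(0.9), 5: math.log(0.1)}   # strongly prefer <eos> (id 2)
    return {4: math.log(0.5), 5: math.log(0.3), 2: math.log(0.2)}

print(beam_search(toy_step, sos=1, eos=2))
```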