This project was originally based on a-PyTorch-Tutorial-to-Transformers. I have since modified it heavily, reworking the hyperparameter configuration and implementing methods from other papers to improve the model's performance. Papers involved include:

Results are documented in a spreadsheet here.

Example TensorBoard graph of validation loss across the different runs: [image: tensorboard]