Introducing Sparsity in the Transformer model (Keras Implementation)

A proof-of-concept implementation of evolutionary sparsity in the Transformer model architecture.
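The core mechanism, in a nutshell: a sparse layer keeps a binary mask over its weight matrix and, between epochs, prunes the weakest active connections and regrows the same number at random positions (evolutionary, SET-style rewiring). The sketch below only illustrates this idea; the class name, API, and hyper-parameters are assumptions for illustration, not the repository's actual code.

```python
import numpy as np
import tensorflow as tf

class SparseDense(tf.keras.layers.Layer):
    """Dense layer whose kernel is multiplied by a binary sparsity mask (illustrative)."""

    def __init__(self, units, density=0.1, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.density = density

    def build(self, input_shape):
        in_dim = int(input_shape[-1])
        self.kernel = self.add_weight(
            name="kernel", shape=(in_dim, self.units), initializer="glorot_uniform")
        self.bias = self.add_weight(
            name="bias", shape=(self.units,), initializer="zeros")
        # Random initial mask keeping roughly `density` of all connections.
        init_mask = (np.random.rand(in_dim, self.units) < self.density).astype("float32")
        self.mask = tf.Variable(init_mask, trainable=False)

    def call(self, inputs):
        return tf.matmul(inputs, self.kernel * self.mask) + self.bias

    def evolve(self, prune_fraction=0.3):
        """Prune the weakest active connections and regrow as many at random positions."""
        weights = (self.kernel * self.mask).numpy()
        mask = self.mask.numpy()
        active = np.argwhere(mask > 0)
        n_prune = int(prune_fraction * len(active))
        if n_prune == 0:
            return
        # Drop the active connections with the smallest absolute weight.
        magnitudes = np.abs(weights[active[:, 0], active[:, 1]])
        weakest = active[np.argsort(magnitudes)[:n_prune]]
        mask[weakest[:, 0], weakest[:, 1]] = 0.0
        # Regrow the same number of connections at random inactive positions.
        inactive = np.argwhere(mask == 0)
        regrow = inactive[np.random.choice(len(inactive), n_prune, replace=False)]
        mask[regrow[:, 0], regrow[:, 1]] = 1.0
        self.mask.assign(mask)
```

Calling evolve() after each epoch rewires the sparse connectivity before training continues.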

How To Run:

Sparse Variant of Transformer

Sparse variant of the architecture, trained on the original data (29,000 samples in the training set, 1,024 samples in the test set)

python3 en2de_main.py sparse origdata
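The entry point presumably dispatches on two positional sys.argv values (model variant and dataset); a minimal sketch of that pattern, where the usage string and variable names are assumptions:

```python
import sys

def main():
    # Expected usage: python3 en2de_main.py <variant> <dataset> [load_existing_model]
    if len(sys.argv) < 3:
        sys.exit("usage: en2de_main.py {sparse|originalWithTransfer} {origdata|testdata} [load_existing_model]")
    variant, dataset = sys.argv[1], sys.argv[2]
    resume = "load_existing_model" in sys.argv[3:]
    print(f"variant={variant}, dataset={dataset}, resume={resume}")

if __name__ == "__main__":
    main()
```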

Original Transformer

*Original architecture with a rewritten training loop, using a custom transfer function, in order to validate the obtained results*

python3 en2de_main.py originalWithTransfer origdata
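Here, "transfer function" refers to the activation applied inside the layers. A hypothetical example of wiring a custom transfer function into a Keras layer (the actual function used in the repository may differ):

```python
import tensorflow as tf

def custom_transfer(x):
    # Hypothetical transfer (activation) function: a smooth, bounded nonlinearity.
    return tf.math.tanh(x) * tf.math.sigmoid(x)

# A custom transfer function plugs into a layer like any built-in activation.
layer = tf.keras.layers.Dense(512, activation=custom_transfer)
```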

Flags:

load_existing_model

Loads the model saved from previous training epochs and continues training it
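Resuming from a checkpoint typically looks like the sketch below; the checkpoint file name and the stand-in model are assumptions, not the repository's actual code:

```python
import os
import tensorflow as tf

CHECKPOINT = "en2de_model.h5"  # assumed checkpoint file name

def build_model():
    # Stand-in for the actual Transformer construction in en2de_main.py.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(512,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dense(512),
    ])

model = build_model()
if os.path.exists(CHECKPOINT):
    # With load_existing_model: restore the weights from the previous run.
    model.load_weights(CHECKPOINT)
model.compile(optimizer="adam", loss="mse")
# model.fit(...) then continues training; persist again with model.save_weights(CHECKPOINT).
```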

Datasets

Sets the dataset to be used for the training task (a dataset-selection sketch follows the list)

  • 'origdata': Use the WMT 2016 German-to-English dataset for training
  • 'testdata': Use a very small subset of the original training data for quick test runs
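Dataset selection from this argument might look like the following sketch; the file path, corpus format, and subset size are illustrative assumptions:

```python
def read_parallel_corpus(path):
    # Assumed format: one "english<TAB>german" sentence pair per line.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t") for line in f if "\t" in line]

def load_dataset(name):
    # 'origdata': the full WMT 2016 German-to-English data; 'testdata': a tiny subset.
    pairs = read_parallel_corpus("data/en2de.txt")  # assumed file path
    if name == "testdata":
        pairs = pairs[:1024]  # small subset for quick smoke tests
    return pairs
```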

Research papers:

Code based on / uses parts of:

F.A.Q.

  • Running with the test dataset argument raises: UnicodeEncodeError: 'ascii' codec can't encode character '\xe4' in position 6: ordinal not in range(128).
    Solution: run export LC_CTYPE=C.UTF-8 in the terminal.
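If changing the locale is not possible, forcing UTF-8 on Python's standard output avoids the same class of error (an alternative workaround, not part of the repository):

```python
import sys

# Force UTF-8 output so non-ASCII characters such as German umlauts print without errors.
sys.stdout.reconfigure(encoding="utf-8")
print("Beispiel: Mädchen")
```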