TMOE
==========================
A from-scratch PyTorch implementation of the Transformer architecture and a Mixture of Experts model.
This project aims to implement the Transformer architecture [1] and a Mixture of Experts (MoE) model [2] from scratch in PyTorch. The Transformer is a sequence-to-sequence model that has achieved state-of-the-art results in machine translation, question answering, and other tasks. MoE, in contrast, is a neural network design in which a gating network learns to select the best expert among multiple candidate networks.
The TMOE model consists of two main components:
- Transformer: A sequence-to-sequence Transformer encoder-decoder architecture.
- Mixture of Experts: An MoE module that selects the best expert among multiple candidates based on the input features.
The Transformer is built from multi-head self-attention and position-wise feed-forward sublayers (a minimal sketch of both follows the list below):
- Self-Attention Mechanism: The self-attention mechanism allows the model to attend to different parts of the input sequence simultaneously.
- Feed-Forward Neural Networks (FFNNs): FFNNs are applied position-wise, as two linear transformations with a non-linearity in between.
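
As a rough illustration only, the sketch below shows one common way these two sublayers are wired together in PyTorch; the class name TransformerSublayers and the hyperparameters d_model, num_heads, and d_ff are illustrative placeholders, not names taken from this repository.

    import torch
    import torch.nn as nn

    class TransformerSublayers(nn.Module):
        """Illustrative encoder layer: multi-head self-attention + position-wise FFN."""

        def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
            super().__init__()
            # Multi-head self-attention: every position attends to every position.
            self.self_attn = nn.MultiheadAttention(d_model, num_heads,
                                                   dropout=dropout, batch_first=True)
            # Position-wise FFN: two linear maps with a ReLU in between,
            # applied independently to each position.
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            # Residual connection around each sublayer, followed by layer normalization.
            attn_out, _ = self.self_attn(x, x, x)
            x = self.norm1(x + attn_out)
            x = self.norm2(x + self.ffn(x))
            return x

The residual connections and layer normalization around each sublayer follow the arrangement described in [1].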
As noted above, the MoE module selects the best expert among multiple candidates based on the input features; the experts are trained separately and then combined using a gating mechanism.
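
The following is a minimal sketch of one possible gating scheme, a softmax gate that mixes the outputs of a few expert feed-forward networks; the class MixtureOfExperts and its parameters are hypothetical and may differ from the module actually implemented in this repository.

    import torch
    import torch.nn as nn

    class MixtureOfExperts(nn.Module):
        """Illustrative MoE layer: a softmax gate mixes the outputs of several experts."""

        def __init__(self, d_model=512, d_ff=2048, num_experts=4):
            super().__init__()
            # Each expert is an independent position-wise feed-forward network.
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])
            # The gate scores every expert from the input features.
            self.gate = nn.Linear(d_model, num_experts)

        def forward(self, x):
            # Gating weights: (batch, seq, num_experts), summing to 1 per position.
            weights = torch.softmax(self.gate(x), dim=-1)
            # Expert outputs stacked along a new axis: (batch, seq, num_experts, d_model).
            expert_out = torch.stack([expert(x) for expert in self.experts], dim=2)
            # Weighted combination of the expert outputs.
            return (weights.unsqueeze(-1) * expert_out).sum(dim=2)

A sparse variant would route each input to only the top-scoring expert(s) rather than mixing all of them; the dense mixture above is simply the easiest to illustrate.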
This project is implemented in PyTorch 2.2.0. The code is well-documented and follows PEP 8 guidelines for naming conventions and coding style.
The training code is available in the train.py file. It uses the torch.optim.Adam optimizer with a learning rate of 0.001 and a batch size of 32.
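
For orientation, here is a minimal sketch of that optimizer configuration; the model, dataset, and loss function below are placeholders and are not the objects defined in train.py.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder model and data; train.py defines the real ones.
    model = torch.nn.Linear(512, 512)
    dataset = TensorDataset(torch.randn(256, 512), torch.randn(256, 512))
    loader = DataLoader(dataset, batch_size=32, shuffle=True)   # batch size of 32

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # learning rate of 0.001
    loss_fn = torch.nn.MSELoss()

    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()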
To run the code, simply execute the following command:
    python train.py

This will start the training process and save the results to the results.txt file.
This project is licensed under the MIT License. See LICENSE for details.
[1] Vaswani et al. (2017). Attention Is All You Need. arXiv:1706.03762.
[2] Mixture of Experts Explained. arXiv:1710.05155.