The goal of this project is threefold.
- Explore the capabilities of the Ignite, Lightning, Fast AI, and thinc.ai frameworks.
- Understand the attention mechanism.
- Study the state of the art in language modeling.

Guiding principles:
- We should strive to write as few lines of code as possible.
- We should solve the problem at hand first and generalize later.

Tasks:
- [ ] Choose appropriate datasets.
- [ ] Implement a DataLoader for the simplest dataset in Ignite.
- [ ] Implement a basic attention-based model.
- [ ] Document the results of the experiments.

The dataset should be well established. We should look at nlpprogress for ideas. We can later throw in a protein dataset.
First we should select a small dataset. I think we should focus on language modeling first; later we can investigate classification and translation.
For simplicity, I think we should start with character-level language modeling datasets. From nlpprogress we have two options: Hutter Prize and Text8.
Text8 is a single text file with 10^8 characters.
We will first write a simple RNN model.
Text8 seems simpler, so let's start with that.
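
Since Text8 is one long string of characters, the data preparation for a character-level language model can be sketched roughly as below. This is only a minimal, framework-agnostic illustration and not the final DataLoader: the helper names (`build_vocab`, `encode`, `split_ids`, `windows`) and the 90/5/5 split fractions are assumptions made for the example, and a short in-memory sample stands in for the real file. Plain Python lists are used for readability; for the full 10^8 characters an array or tensor type would likely be needed.

```python
from typing import Dict, Iterator, List, Tuple


def build_vocab(text: str) -> Dict[str, int]:
    """Map each distinct character to an integer id, in sorted order."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def encode(text: str, vocab: Dict[str, int]) -> List[int]:
    """Turn a string into a list of character ids."""
    return [vocab[ch] for ch in text]


def split_ids(
    ids: List[int], train_frac: float = 0.9, valid_frac: float = 0.05
) -> Tuple[List[int], List[int], List[int]]:
    """Split sequentially into train/valid/test (fractions are assumptions)."""
    n_train = int(len(ids) * train_frac)
    n_valid = int(len(ids) * valid_frac)
    return (
        ids[:n_train],
        ids[n_train:n_train + n_valid],
        ids[n_train + n_valid:],
    )


def windows(ids: List[int], seq_len: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield non-overlapping (input, target) pairs of length seq_len.

    The target is the input shifted one character to the right, which is
    the usual next-character prediction setup.
    """
    for start in range(0, len(ids) - seq_len, seq_len):
        chunk = ids[start:start + seq_len + 1]
        yield chunk[:-1], chunk[1:]


# Stand-in for the real data; in practice the text would be read from the
# downloaded Text8 file (its path is not specified here).
sample = "the quick brown fox jumps over the lazy dog " * 5

vocab = build_vocab(sample)
ids = encode(sample, vocab)
train_ids, valid_ids, test_ids = split_ids(ids)
first_input, first_target = next(windows(train_ids, seq_len=8))
```

Each `(input, target)` pair could later be wrapped in whatever dataset abstraction the chosen framework expects, so that the simple RNN only needs to predict the next character at every position.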