Protein Bigrams: a small bigram language model for protein sequences

This repository contains a small bigram language model that learns bigram statistics from protein sequence data and uses them to sample ("hallucinate") new protein sequences. The bigram statistics were computed from UniRef50.

Usage

  1. Install all requirements.
  2. Download the dataset using the download_dataset.py script (approx. 6.8 GB download, XX GB unzipped). For convenience, run the script in the background with tmux.
  3. Run main.py to train the model and generate sequences.
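
Conceptually, training and generation boil down to counting amino-acid bigrams and sampling from the resulting transition table. The following is a minimal sketch of that idea, not the actual main.py; the toy sequences, start/end tokens, and function names are illustrative assumptions:

```python
# Minimal bigram sketch: count transitions between consecutive residues,
# then sample new sequences by walking the transition table.
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
START, END = "^", "$"  # artificial start/end tokens (an assumption)

def bigram_counts(sequences):
    """Count transitions between consecutive residues (plus start/end)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        tokens = [START] + list(seq) + [END]
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts

def sample_sequence(counts, max_len=300, rng=random):
    """Sample a new sequence by following the bigram transition counts."""
    out, current = [], START
    while len(out) < max_len:
        next_tokens = list(counts[current].keys())
        weights = [counts[current][t] for t in next_tokens]
        current = rng.choices(next_tokens, weights=weights, k=1)[0]
        if current == END:
            break
        out.append(current)
    return "".join(out)

# Toy usage with made-up sequences; in the repo the data comes from UniRef50.
toy_data = ["MKTAYIAKQR", "MKLVINGKTL", "MKAILVVLLY"]
counts = bigram_counts(toy_data)
print(sample_sequence(counts, max_len=50))
```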

Additionally, the notebook visualizes the computed bigram statistics.
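
As a rough illustration of what such a visualization can look like, the bigram statistics can be rendered as a 20x20 heatmap of transition probabilities. This is a hedged sketch, not necessarily how the notebook plots them; plot_bigram_heatmap and the nested count dict format are assumptions building on the sketch above:

```python
# Render P(next residue | current residue) as a heatmap from nested bigram counts.
import numpy as np
import matplotlib.pyplot as plt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def plot_bigram_heatmap(counts):
    # Build a 20x20 matrix of row-normalized transition probabilities.
    matrix = np.zeros((len(AMINO_ACIDS), len(AMINO_ACIDS)))
    for i, a in enumerate(AMINO_ACIDS):
        row = counts.get(a, {})
        row_total = sum(row.get(b, 0) for b in AMINO_ACIDS) or 1
        for j, b in enumerate(AMINO_ACIDS):
            matrix[i, j] = row.get(b, 0) / row_total
    fig, ax = plt.subplots(figsize=(6, 6))
    im = ax.imshow(matrix, cmap="viridis")
    ax.set_xticks(range(len(AMINO_ACIDS)))
    ax.set_xticklabels(list(AMINO_ACIDS))
    ax.set_yticks(range(len(AMINO_ACIDS)))
    ax.set_yticklabels(list(AMINO_ACIDS))
    ax.set_xlabel("next residue")
    ax.set_ylabel("current residue")
    fig.colorbar(im, ax=ax, label="P(next | current)")
    plt.show()
```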

Happy coding! If you have ideas, questions, or bug reports, please reach out at mail@timonschneider.de

Further ideas

  • Allow for a prior distribution over amino acids (enables fine-tuning); see the sketch after this list
  • Provide computed bigram statistics with repo
  • n-gram statistics (n > 2)
  • Compute a bunch of structures and check how they look
  • Phenotype -> Genotype: build a function that maps sequence properties (e.g. organism) to bigram statistics by computing them only on sequences for which those properties hold
    • Evaluate generated sequences with a classifier that predicts the sequence properties
  • "token Multi layer bigrams" (?): i.e. tokenize and encode using BPE then do bigram statistics over all tokens

Acknowledgements

This repository was inspired by Andrej Karpathy's makemore, adapted to protein sequences.