This repository includes several Variational Autoencoder (VAE) variants that fit the probability distribution of T cell receptor beta chain CDR3 sequences and generate new sequences.
Our work is inspired by Vampire. Before running our VAE models, set up your environment as described in the Vampire repository.
- Amino acid sequences are one-hot encoded into a 30 × 21 matrix. This preprocessing step is performed by running `preprocess_adaptive.py` in Vampire.
- Nucleotide sequences are one-hot encoded into a 90 × 5 matrix. This preprocessing step is performed by running `preprocess_nt.py`.
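The two encodings above can be illustrated with a minimal sketch. It is not the code from `preprocess_adaptive.py` or `preprocess_nt.py`: the alphabets, the gap symbol `-`, the right-padding strategy, and the helper name `one_hot` are all assumptions chosen for illustration, and the actual scripts may order symbols or place gaps differently.

```python
# Hypothetical alphabets: 20 amino acids plus a gap (21 symbols),
# and 4 bases plus a gap (5 symbols). The real scripts may differ.
AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
NT_ALPHABET = "ACGT-"


def one_hot(seq, alphabet, max_len, gap="-"):
    """Return a max_len x len(alphabet) one-hot matrix as a list of rows.

    Sequences shorter than max_len are right-padded with the gap symbol
    (an assumption made for this sketch).
    """
    if len(seq) > max_len:
        raise ValueError("sequence is longer than max_len")
    padded = seq + gap * (max_len - len(seq))
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    matrix = []
    for symbol in padded:
        row = [0] * len(alphabet)
        row[index[symbol]] = 1  # exactly one active column per position
        matrix.append(row)
    return matrix


aa_matrix = one_hot("CASSLGTDTQYF", AA_ALPHABET, max_len=30)
nt_matrix = one_hot("TGTGCCAGCAGC", NT_ALPHABET, max_len=90)
print(len(aa_matrix), len(aa_matrix[0]))  # matrix shape for amino acids
print(len(nt_matrix), len(nt_matrix[0]))  # matrix shape for nucleotides
```

Each row corresponds to one sequence position and each column to one symbol of the alphabet, which gives the 30 × 21 and 90 × 5 shapes described above.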
The models were trained on cohort 2 from the study "Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T-cell repertoire". The processed data can be downloaded from Zenodo.
Before running any of the `.py` scripts, activate the Vampire conda environment with `conda activate vampire`.
- Input and output are CDR3 amino acid sequences.
- The encoder and decoder consist of dense layers.
- Input is the CDR3 amino acid sequence and the corresponding V and J genes.
- Output is the CDR3 amino acid sequence.
- The encoder and decoder consist of dense layers.
- Input is the CDR3 nucleotide sequence.
- Output is the CDR3 amino acid sequence.
- The encoder and decoder consist of dense layers.
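For reference, the relationship between a nucleotide input and an amino acid output is the standard genetic code: three nucleotides (one codon) map to one amino acid, which is consistent with 90 nucleotide positions and 30 amino acid positions. The sketch below is only an illustration of that mapping and is not part of the model, which learns the input-to-output relationship from data. The function name `translate` and the example sequence are hypothetical.

```python
# Standard genetic code, laid out in the conventional T, C, A, G order.
# "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(nt_seq):
    """Translate an in-frame nucleotide string into amino acids."""
    if len(nt_seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(
        CODON_TABLE[nt_seq[i:i + 3]] for i in range(0, len(nt_seq), 3)
    )


cdr3_aa = translate("TGTGCCAGCAGC")
print(cdr3_aa)  # CASS
```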
- Input and output are CDR3 amino acid sequences.
- The encoder consists of a bidirectional GRU layer; the decoder consists of a unidirectional GRU layer.
The scripts that plot the distance heatmap of the probability distributions of the generated sequences and the scatter plot of frequency estimates are `modeller_plot.py` and `plot_heatmap.py`.