This is an experimental model fine-tuned from `esm2_t6_8M_UR50D` for multi-label classification of protein sequences. Specifically, the model is fine-tuned on the Pfam database (available here), whose purpose is to provide a complete and accurate classification of protein families and domains.
In this project, fine-tuning refers to training a pre-trained transformer model on a new dataset for protein sequence classification. The pre-trained model, in this case the ESM-2 transformer, has already learned rich representations of biological sequences from a vast and diverse dataset. To make it useful for a specific classification task, we fine-tune it on a smaller dataset of protein sequences labelled with their Pfam families.
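As a rough illustration of what this setup looks like with the 🤗 `transformers` API (this is a minimal sketch, not the code in `finetune_esm2_to_pfam.py`; the label count, toy sequences, and hyperparameters are placeholders):

```python
# Minimal fine-tuning sketch. The label count, toy data, and hyperparameters are
# illustrative placeholders, not the values used in this repository.
import numpy as np
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, TrainingArguments, Trainer)

model_name = "facebook/esm2_t6_8M_UR50D"
num_labels = 290  # e.g. the "small" label set described below

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    problem_type="multi_label_classification",  # task framing used in this README
)

# Toy stand-in for the preprocessed Pfam data: raw sequences plus multi-hot label vectors.
toy = Dataset.from_dict({
    "sequence": ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MAHHHHHHVGTGSNG"],
    "labels": [np.eye(num_labels, dtype=np.float32)[0].tolist(),
               np.eye(num_labels, dtype=np.float32)[1].tolist()],
})

def tokenize(batch):
    # ESM-2 tokenizes plain amino-acid strings directly.
    return tokenizer(batch["sequence"], truncation=True, max_length=1024)

toy = toy.map(tokenize, batched=True).remove_columns("sequence")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="esm2-pfam-demo", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to="none"),
    train_dataset=toy,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```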
Two trained models are available for download: one trained on classes with at least 500 samples each ("small", 290 classes) and one trained on classes with at least 200 samples each ("big", 1158 classes). Outputs for each model are in the `SMALL_MODEL` and `BIG_MODEL` directories, respectively.
These are automatically pulled from 🤗 during the `Evaluate.sh` command:
- For the small model, see the HuggingFace model here.
- For the big model, see the HuggingFace model here.
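For reference, this is roughly what pulling a checkpoint from the Hub looks like; the repository id below is a placeholder, not the actual model id (follow the links above for those):

```python
# Sketch of loading a fine-tuned checkpoint from the 🤗 Hub. "user/esm2-pfam-small"
# is a placeholder id, not the real repository name.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "user/esm2-pfam-small"  # placeholder; see the links above for the real models
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
model.eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # example amino-acid sequence
with torch.no_grad():
    logits = model(**tokenizer(seq, return_tensors="pt")).logits
predicted = torch.sigmoid(logits) > 0.5  # multi-label thresholding, per the README's task framing
```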
The model architecture of the original ESM-2 transformer is written out as a PyTorch model in `architecture.txt`.
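Such a dump can presumably be reproduced by printing the instantiated PyTorch module, for example:

```python
# A plausible way to regenerate this file (an assumption, not necessarily how
# architecture.txt was produced): print the instantiated PyTorch module.
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")
with open("architecture.txt", "w") as fh:
    fh.write(str(model))  # the repr lists every layer of the module tree
```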
The performance on the test set for each model is as follows:
Small model:

|              | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| Accuracy     |           |        | 0.9948   | 25513   |
| Macro Avg    | 0.9949    | 0.9940 | 0.9944   | 25513   |
| Weighted Avg | 0.9949    | 0.9948 | 0.9948   | 25513   |
Big model:

|              | Precision | Recall | F1-Score | Support |
|--------------|-----------|--------|----------|---------|
| Accuracy     |           |        | 0.8360   | 58490   |
| Macro Avg    | 0.7924    | 0.7476 | 0.7327   | 58490   |
| Weighted Avg | 0.8276    | 0.8360 | 0.8024   | 58490   |
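The tables above follow the layout of scikit-learn's classification report. A minimal sketch of how numbers in this format (and a confusion matrix) can be produced, with toy labels standing in for the real test-set predictions:

```python
# Toy example of generating report/confusion-matrix output in the style shown above;
# y_true / y_pred are stand-ins for the real test-set labels and model predictions.
from sklearn.metrics import classification_report, confusion_matrix

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

print(classification_report(y_true, y_pred, digits=4))  # per-class rows plus accuracy/macro/weighted avg
print(confusion_matrix(y_true, y_pred))
```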
This project includes scripts for processing training data, fine-tuning a transformer model, and evaluating the model on a test set.
- Process Training Data:
  - Script: `Preprocess.py`
  - Bash Script: `Preprocess.sh`
  - Description: Reads raw data, removes classes with fewer than a specified number of samples, and preprocesses sequences for training (a minimal sketch of the filtering step appears after this list).
  - Usage: `./Preprocess.sh`
- Train Model:
  - Script: `finetune_esm2_to_pfam.py`
  - Bash Script: `Train.sh`
  - Description: Fine-tunes a transformer model (`facebook/esm2_t6_8M_UR50D`) on the preprocessed training data.
  - Usage: `./Train.sh`
- Evaluate Model:
  - Script: `Evaluate.py`
  - Bash Script: `Evaluate.sh`
  - Description: Downloads the fine-tuned model pushed during step 2 and evaluates it on the test set, writing a dataframe of predictions, a classification report, and a confusion matrix to the current working directory.
  - Usage: `./Evaluate.sh`
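The class-frequency filter described under Process Training Data can be sketched roughly as follows; the column names, file paths, and 500-sample threshold are assumptions for illustration, not necessarily the exact choices in `Preprocess.py`:

```python
# Rough sketch of the minimum-samples-per-class filter. The "family"/"sequence"
# column names, file paths, and threshold are assumptions for illustration.
import pandas as pd

MIN_SAMPLES = 500  # 500 -> "small" label set, 200 -> "big" label set

df = pd.read_csv("pfam_raw.csv")           # placeholder path to the raw Pfam-labelled data
counts = df["family"].value_counts()
keep = counts[counts >= MIN_SAMPLES].index
filtered = df[df["family"].isin(keep)].reset_index(drop=True)
filtered.to_csv("pfam_filtered.csv", index=False)
print(f"kept {filtered['family'].nunique()} classes, {len(filtered)} sequences")
```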
An environment file is provided for conda users. To create a conda environment with the required dependencies, run the following command:
`conda env create -f environment.yml`
- Dr. Mohamed Ali al-Badri - Research Fellow at UCL.
This work is based on the following publication:
@article{lin2023evolutionary,
title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
journal={Science},
volume={379},
number={6637},
pages={1123--1130},
year={2023},
publisher={American Association for the Advancement of Science}
}
This project is licensed under the MIT License - see the LICENSE file for details.