ConFit

Overview

ConFit is a protein language model (pLM)-based machine learning method for learning the protein fitness landscape from limited experimental fitness measurements. It uses a contrastive learning strategy to fine-tune the pre-trained pLM, tailoring it to protein-specific fitness prediction while avoiding overfitting.
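To make the idea concrete, below is a minimal sketch of a pairwise ranking-style contrastive loss of the kind used for fitness ranking; it is illustrative only, not the exact ConFit objective, and the names (pairwise_contrastive_loss, scores, fitness) are hypothetical:

import torch
import torch.nn.functional as F

def pairwise_contrastive_loss(scores, fitness):
    """Illustrative pairwise ranking loss: for every pair of mutants,
    push the pLM-derived score of the fitter sequence above the other.

    scores:  (B,) model-derived fitness scores (e.g., mutant-vs-wild-type
             log-likelihood ratios from the pLM)
    fitness: (B,) experimentally measured fitness values
    """
    # diff[i, j] = scores[j] - scores[i] for all ordered pairs (i, j)
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)
    # label[i, j] = 1 if sequence j is measured to be fitter than i
    label = (fitness.unsqueeze(0) > fitness.unsqueeze(1)).float()
    # ignore ties (including the diagonal)
    mask = (fitness.unsqueeze(0) != fitness.unsqueeze(1)).float()
    loss = F.binary_cross_entropy_with_logits(diff, label, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)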

Dependencies

This code is based on Python 3.9.18 and PyTorch 2.1.0 with CUDA 12.2. Please first install the correct PyTorch version (the conda channel ships a pytorch-cuda=12.1 build, which runs under a CUDA 12.2 driver) and then install the required packages as follows:

conda install pytorch==2.1.0 torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install -r requirements.txt
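Before proceeding, you can quickly verify that the installed PyTorch build sees your GPU (standard PyTorch calls, nothing project-specific):

import torch

print(torch.__version__)          # expect 2.1.0
print(torch.version.cuda)         # CUDA version PyTorch was built against
print(torch.cuda.is_available())  # should print True on a CUDA machine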

Data

Source Data

To use the source data from the paper, run scripts/download.sh to download it.

Running on custom data

To run our model on your own data, please follow the steps below (an end-to-end toy sketch follows the list):

  1. Collect the DMS dataset and save it as data/$dataset/data.csv. The CSV file must contain the following columns:

    seq: the mutant sequence.

    log_fitness: the ground truth fitness value for the sequence.

    mutated_position: the mutated position in the assay; note that position numbering starts from 0.

    PID: a unique number for each sequence (e.g., an auto-incremented integer).

  2. Generate a FASTA file for the wild-type sequence and save it as data/$dataset/wt.fasta.

  3. If you want to use retrieval augmentation, please follow the DeepSequence repo (https://github.com/debbiemarkslab/DeepSequence) to generate an ELBO for each mutant assay. Put the predicted ELBOs in data/$dataset/vae_elbo.csv, which must contain the following columns:

    seq: the mutant sequence.

    PID: a unique number for each sequence, which should be consistent with data.csv.

    elbo: the predicted ELBO value for each mutant assay.
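The end-to-end toy sketch mentioned above; the dataset name, sequences, IDs, and values are all placeholders:

import os
import pandas as pd

dataset = "my_dataset"  # hypothetical dataset name
os.makedirs(f"data/{dataset}", exist_ok=True)

# Step 1: DMS measurements with the required columns
pd.DataFrame({
    "PID": [0, 1],                      # unique, auto-incremented IDs
    "seq": ["MKSAYIAK", "MKTAYIAR"],    # full mutant sequences (placeholders)
    "log_fitness": [0.12, -0.85],       # measured fitness values
    "mutated_position": [2, 7],         # 0-indexed mutated positions
}).to_csv(f"data/{dataset}/data.csv", index=False)

# Step 2: wild-type sequence in FASTA format
with open(f"data/{dataset}/wt.fasta", "w") as f:
    f.write(">wt\nMKTAYIAK\n")

# Step 3 (optional, for retrieval augmentation): DeepSequence ELBOs,
# keyed by the same PIDs as data.csv
pd.DataFrame({
    "PID": [0, 1],
    "seq": ["MKSAYIAK", "MKTAYIAR"],
    "elbo": [-123.4, -130.9],           # placeholder ELBO values
}).to_csv(f"data/{dataset}/vae_elbo.csv", index=False)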

Train ConFit

We provide an example of training ConFit in scripts/train.sh, which lets you quickly try our model. For example, the following command trains the model on the GB1_Olson2014_ddg dataset using 48 shots of training data:

accelerate launch --config_file config/parallel_config.yaml confit/train.py \
    --config config/training_config.yaml \
    --dataset GB1_Olson2014_ddg \
    --sample_seed 0 \
    --model_seed 1

--config: (required) specifies the file containing training hyperparameters

--dataset: (required) specifies the dataset name

--sample_seed: (optional) specifies the random seed used when sampling the training and test data.

--model_seed: (optional) specifies the initialization seed for the pretrained ESM-1v model; please choose a value from 1 to 5.

After training, use the following script to run inference on the test set:

python confit/inference.py --dataset $dataset --shot $shot

--dataset: (required) specifies the dataset name

--shot: (required) specifies the training size

--no_retrieval: (optional) disables retrieval augmentation during inference

The test-set Spearman correlation will be written to results/$dataset/summary.csv.
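For reference, the reported metric is the Spearman rank correlation between predicted and measured fitness on the held-out test set; it can be reproduced from any table of predictions like so (the column names here are assumptions, not the actual summary.csv schema):

import pandas as pd
from scipy.stats import spearmanr

preds = pd.DataFrame({                        # stand-in for model predictions
    "pred_fitness": [0.9, 0.1, 0.5, -0.2],
    "log_fitness":  [1.2, -0.3, 0.4, -0.9],   # ground-truth fitness
})
rho, _ = spearmanr(preds["pred_fitness"], preds["log_fitness"])
print(f"test Spearman: {rho:.3f}")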

Customizing config files

Training size: For different training sizes, please modify shot in config/training_config.yaml and change the training hyperparameters in that file accordingly.
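If you prefer to change the setting programmatically, here is a small sketch with PyYAML (it assumes shot is a top-level key in training_config.yaml, which is an assumption about the file's layout):

import yaml

path = "config/training_config.yaml"
with open(path) as f:
    cfg = yaml.safe_load(f)

cfg["shot"] = 96  # hypothetical: switch from 48-shot to 96-shot training
# remember to adjust the other training hyperparameters to match

with open(path, "w") as f:
    yaml.safe_dump(cfg, f)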

GPU: We trained ConFit on four A40 GPUs. Depending on the number of GPUs you use, please modify num_processes in config/parallel_config.yaml and gpu_number in config/training_config.yaml accordingly.

pLM: We utilized ESM-1v as the pLM to be fine-tuned. Similar protein language models can also be used; please modify model in config/training_config.yaml to ESM-2 or ESM-1b to change the pLM.
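For orientation, the corresponding checkpoints can all be loaded with the fair-esm package; how ConFit's model key maps to these loaders internally is an assumption on our part:

import esm

# one of the five ESM-1v checkpoints (model_seed 1-5 selects among _1.._5)
model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()

# comparable alternatives mentioned above:
# model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()   # ESM-2 (650M)
# model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()  # ESM-1b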