conda env create -f env.yml
conda activate align
Download the following from Tranception:
- MSA_weights (download & unzip MSA_weights.zip)
- MSA_files (download & unzip MSA_ProteinGym.zip)
- Tranception checkpoints (Small and Large)

Then place under ckpt/
Download the DMS data from ProteinGym, then place it under proteingym/
Split the dataset.
python split_dataset.py
The memory allocated depends on the length of the protein, which varies widely. Batch size is therefore handled by the functions in core/utils/get_batch.py. The values are hard-coded for a GPU with 24GB of memory, so you may need to change them to fit your resource constraints.
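
As a rough illustration of length-dependent batching, the sketch below picks a batch size so that the number of residues per batch stays under a fixed budget. It is not the code in core/utils/get_batch.py: the function name `batch_size_for_length` and the parameters `token_budget` and `max_batch` are hypothetical, and their default values are placeholders rather than values tuned for any GPU. Attention memory can also grow faster than linearly with length, so treat the budget as a starting point only.

```python
def batch_size_for_length(seq_len: int, token_budget: int = 8192, max_batch: int = 64) -> int:
    """Return a batch size such that batch_size * seq_len stays within token_budget.

    All names and default values here are hypothetical placeholders.
    """
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    # Longer proteins get smaller batches; never go below 1 or above max_batch.
    return max(1, min(max_batch, token_budget // seq_len))


if __name__ == "__main__":
    for length in (100, 1024, 10000):
        print(length, batch_size_for_length(length))
```

With less GPU memory, lowering `token_budget` (or the equivalent hard-coded values in the repository) is the natural first adjustment.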
python pipeline.py --DMS_id IF1_ECOLI_Kelsic_2016
python 1_sft_all.py
python 2_ref_dist.py
python 3_generate_pairs.py
python 4_align_score.py
The scoring files will be saved under dms-results.
python performance.py --input_scoring_files_folder dms-results/$exp_name$ --performance_by_depth
This will evaluate the experiment in terms of correlation metrics.
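
For reference, a Spearman's rho between predicted scores and measured DMS scores can be computed as the Pearson correlation of their ranks. The sketch below is a standard-library illustration of that metric, not the implementation in performance.py; the function names `average_ranks`, `pearson`, and `spearman_rho` are assumptions made for this example.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman_rho(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(predicted), average_ranks(measured))
```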
See tasks/
See tasks/pipeline.py. You may:
- Set the target_seq value to the wild-type sequence
- Set the preference argument to the root directory containing the dataset
- Set the DMS_id argument to the directory containing your DMS assay
- All datasets should consist of the columns mutant, mutated_sequence, and DMS_score. mutant should be in the conventional format, e.g. R42Q, with multiple mutations joined by the : symbol. Due to the behavior of the pandas join functions in the Tranception scoring utils, I recommend deleting any other columns.
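
The dataset format described above can be checked before running the pipeline. The following is a minimal sketch, assuming that mutation positions are 1-based relative to target_seq and that rows are dictionaries such as those produced by `csv.DictReader`. The helper names `apply_mutant` and `clean_rows` are hypothetical and are not part of this repository.

```python
import re
from typing import Dict, Iterable, List

# One mutation token such as R42Q: wild-type residue, position, mutant residue.
MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")
REQUIRED_COLUMNS = ("mutant", "mutated_sequence", "DMS_score")


def apply_mutant(target_seq: str, mutant: str) -> str:
    """Apply mutations like 'R42Q' or 'R42Q:A10G' to the wild-type sequence.

    Assumes 1-based positions relative to target_seq.
    """
    seq = list(target_seq)
    for token in mutant.split(":"):
        match = MUTATION_RE.match(token)
        if match is None:
            raise ValueError(f"malformed mutation: {token!r}")
        wild, pos, new = match.group(1), int(match.group(2)), match.group(3)
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position out of range: {token!r}")
        if target_seq[pos - 1] != wild:
            raise ValueError(f"wild-type residue mismatch: {token!r}")
        seq[pos - 1] = new
    return "".join(seq)


def clean_rows(rows: Iterable[Dict[str, str]], target_seq: str) -> List[Dict[str, str]]:
    """Validate each row and keep only the three required columns."""
    cleaned = []
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise KeyError(f"missing columns: {missing}")
        if row["mutated_sequence"] != apply_mutant(target_seq, row["mutant"]):
            raise ValueError(f"mutated_sequence does not match {row['mutant']!r}")
        cleaned.append({c: row[c] for c in REQUIRED_COLUMNS})
    return cleaned
```

Dropping the extra columns at this stage follows the recommendation above about the join behavior in the scoring utilities.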
$preference$/
$DMS_id$/
dms_train.csv
dms_test.csv
dms_val.csv
sft.csv
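
A quick way to confirm the layout before launching a run is to check that the split files exist. This sketch assumes that $DMS_id$/ sits directly under $preference$/ and that the three dms_*.csv files sit inside $DMS_id$/. sft.csv is left out of the check because its nesting level is not stated here. The function name `missing_split_files` is hypothetical.

```python
from pathlib import Path
from typing import List

EXPECTED_SPLITS = ("dms_train.csv", "dms_test.csv", "dms_val.csv")


def missing_split_files(preference: str, dms_id: str) -> List[Path]:
    """Return the expected split files that are absent under preference/dms_id."""
    assay_dir = Path(preference) / dms_id
    return [assay_dir / name for name in EXPECTED_SPLITS if not (assay_dir / name).is_file()]
```

An empty list means all three split files were found.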
See draw_figure.ipynb. Note that total.csv contains the Spearman's rho correlations of Ours, Alpha-Missense, Tranception Large, Tranception Large (no retrieval), EVE, and ESM1v on the ProteinGym benchmark.
- Support LoRA