README
Download Cmai.
cd /Parent/Dir
git clone git@github.com:ice4prince/Cmai.git
In the folder scripts/rfold, Install RoseTTAFold from RoseTTAFold's git.
cd Cmai/scripts/rfold
git clone git@github.com:RosettaCommons/RoseTTAFold.git
cd RoseTTAFold
Follow the README in RoseTTAFold OR go through
1. create the environment
# If your NVIDIA driver compatible with cuda11
conda env create -f RoseTTAFold-linux.yml
# If not (but compatible with cuda10)
conda env create -f RoseTTAFold-linux-cu101.yml
# create conda environment for pyRosetta folding & running DeepAccNet
conda env create -f folding-linux.yml
2. download network weights
wget https://files.ipd.uw.edu/pub/RoseTTAFold/weights.tar.gz
tar xfz weights.tar.gz
3. download and install third-party software
./install_dependencies.sh
4. download sequence and structure databases
# uniref30 [46G]
wget http://wwwuser.gwdg.de/~compbiol/uniclust/2020_06/UniRef30_2020_06_hhsuite.tar.gz
mkdir -p UniRef30_2020_06
tar xfz UniRef30_2020_06_hhsuite.tar.gz -C ./UniRef30_2020_06
# BFD [272G]
wget https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz
mkdir -p bfd
tar xfz bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz -C ./bfd
# structure templates (including *_a3m.ffdata, *_a3m.ffindex) [over 100G]
wget https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2021Mar03.tar.gz
tar xfz pdb100_2021Mar03.tar.gz
# for CASP14 benchmarks, we used this one: https://files.ipd.uw.edu/pub/RoseTTAFold/pdb100_2020Mar11.tar.gz
NOTE: the path for sequence and structure databases (UniRef30_2020_06, bfd, and pdb100_2021Mar03) should be written down for downstream analysis.
Go back to Cmai and Install the environments:
cd /Parent/Dir/Cmai
conda env create -f models/runEmbed.yml
conda env create -f models/runBind.yml
If a path is specified to install the conda environments:
conda env create -f models/runEmbed.yml -p /path/to/cmai_envs/runEmbed
conda env create -f models/runBind.yml -p /path/to/cmai_envs/runBind
please remember to add the path to the conda env_dirs
conda config --append env_dirs /path/to/cmai_envs
and make sure that all three env names are listed in your conda environment list:
conda env list
# conda environments:
#
base * /path/to/python/3.8.x-anaconda
runBind /path/to/cmai_envs/runBind
runEmbed /path/to/cmai_envs/runEmbed
RoseTTAFold /path/to/.conda/envs/RoseTTAFold
(RoseTTAFold can also been installed in the /path/to/cmai_envs if you like)
Then run the script provided to save the environment path to paras/env_path.
./get_env_path.sh
python Cmai.py --input 'data/example/input.csv' --out '/path/to/Cmai/example/data/example/output' --rf_data 'path/to/RoseTTAFold_database'
Required columns:
Antigen_id | Antigen_seq | BCR_Vh | BCR_CDR3h |
---|
To be noticed:
If there is NO 'Antigen_seq' column, a fasta file MUST be provided.
Optional columns:
BCR_id | Score | test | ... |
---|
TO BE NOTICED: Rows with NA will be REMOVED!!!!
Test data
Antigen_id | Antigen_seq | BCR_id | BCR_Vh | BCR_CDR3h | BCR_species |
---|---|---|---|---|---|
H4Hubei | mlsivilfllvaenssqnytgnpvicmghhavangtmvktltddqvevvaaqelvesqnlpelcpsplrlvdgqtcdiingalgspgcdhlngaewdvfierpnamdtcypfdvpdyqslrsilasngkfefiaeefqwttvkqdgksgackranvndffnrlnwlvksdgnayplqnltkvnngdyarlyiwgvhhpstdteqtnlyknnpggvtvstktsqtsvvpniggrpwvrgqsgrisfywtivepgdlivfntignliaprghyklnnqkkstilntaipigscvskchtdkgslsttkpfqnisriaigncpkyvkqgslklatgmrnipekasrglfgaiagfiengwqglidgwygfrhqnaegtgtaadlkstqaaidqingklnrliektnekyhqiekefeqvegriqdlekyvedtkidlwsynaellvalenqhtidvtdsemnklfervrrqlrenaedkgngcfeifhkcdnnciesirngtydhdiyrdeainnrfqiqgvkltqgykdtilwisfsiscfllvalllafvlwacqngnircqici | BCR_6 | QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSYNAVWNWIRQSPSRGLEWLGRTYYRSGWYNDYAESVKSRITINPDTSKNQFSLQLNSVTPEDTAVYY | CARSGHITVFGVNVDAFDMW | |
H4Hubei | mlsivilfllvaenssqnytgnpvicmghhavangtmvktltddqvevvaaqelvesqnlpelcpsplrlvdgqtcdiingalgspgcdhlngaewdvfierpnamdtcypfdvpdyqslrsilasngkfefiaeefqwttvkqdgksgackranvndffnrlnwlvksdgnayplqnltkvnngdyarlyiwgvhhpstdteqtnlyknnpggvtvstktsqtsvvpniggrpwvrgqsgrisfywtivepgdlivfntignliaprghyklnnqkkstilntaipigscvskchtdkgslsttkpfqnisriaigncpkyvkqgslklatgmrnipekasrglfgaiagfiengwqglidgwygfrhqnaegtgtaadlkstqaaidqingklnrliektnekyhqiekefeqvegriqdlekyvedtkidlwsynaellvalenqhtidvtdsemnklfervrrqlrenaedkgngcfeifhkcdnnciesirngtydhdiyrdeainnrfqiqgvkltqgykdtilwisfsiscfllvalllafvlwacqngnircqici | BCR_7 | QVQLQQSGPGLVKPSQTLSLTCAISGFSVSSYNAVWNWIRQSPSRGLEWLGRTYYRSGWYNDYAESVKSRITINPDTSKNQFSLQLNSVTPEDTAVYY | CARSGHITVFGVNVDAFDMW | |
ova | mgsigaasmefcfdvfkelkvhhanenifycpiaimsalamvylgakdstrtqinkvvrfdklpgfgdsieaqcgtsvnvhsslrdilnqitkpndvysfslasrlyaeerypilpeylqcvkelyrgglepinfqtaadqarelinswvesqtngiirnvlqpssvdsqtamvlvnaivfkglwekafkdedtqampfrvteqeskpvqmmyqiglfrvasmasekmkilelpfasgtmsmlvllpdevsgleqlesiinfekltewtssnvmeerkikvylprmkmeekynltsvlmamgitdvfsssanlsgissaeslkisqavhaahaeineagrevvgsaeagvdaasvseefradhpflfcikhiatnavlffgrcvsp | BCR_11 | QVQLVQSGAEVKKPGASVKVSCKASGYTFTDYYMHWVRQAPGQGLEWMGKVNPNKRGTTYNQKFEGRVTMTTDTSTSTAYMELRSLRSDDTAVYY | CARSNALDAW | human |
ova | mgsigaasmefcfdvfkelkvhhanenifycpiaimsalamvylgakdstrtqinkvvrfdklpgfgdsieaqcgtsvnvhsslrdilnqitkpndvysfslasrlyaeerypilpeylqcvkelyrgglepinfqtaadqarelinswvesqtngiirnvlqpssvdsqtamvlvnaivfkglwekafkdedtqampfrvteqeskpvqmmyqiglfrvasmasekmkilelpfasgtmsmlvllpdevsgleqlesiinfekltewtssnvmeerkikvylprmkmeekynltsvlmamgitdvfsssanlsgissaeslkisqavhaahaeineagrevvgsaeagvdaasvseefradhpflfcikhiatnavlffgrcvsp | BCR_12 | QVQLVQSGAEVKKPGASVKVSCKASGYTFTDYYMHWVRQAPGQGLEWMGRVNPNGRGTTYNQKFEGRVTMTTDTSTSTAYMELRSLRSDDTAVYY | CARSNALDAW | human |
Preprocessing and Intermediate Data
The intermediate data will be generated and saved in the preprocessed directory defined by '--pre_dir' which will be the same with the output directory if not specified.
Preprocessed Data
Antigen_id | Antigen_seq | BCR_id | BCR_Vh | BCR_CDR3h | BCR_species | record_id | ||||
---|---|---|---|---|---|---|---|---|---|---|
H4Hubei | MLSIVILFLLVAENSSQNYTGNPVICMGHHAVANGTMVKTLTDDQVEVVAAQELVESQNLPELCPSPLRLVDGQTCDIINGALGSPGCDHLNGAEWDVFIERPNAMDTCYPFDVPDYQSLRSILASNGKFEFIAEEFQWTTVKQDGKSGACKRANVNDFFNRLNWLVKSDGNAYPLQNLTKVNNGDYARLYIWGVHHPSTDTEQTNLYKNNPGGVTVSTKTSQTSVVPNIGGRPWVRGQSGRISFYWTIVEPGDLIVFNTIGNLIAPRGHYKLNNQKKSTILNTAIPIGSCVSKCHTDKGSLSTTKPFQNISRIAIGNCPKYVKQGSLKLATGMRNIPEKASRGLFGAIAGFIENGWQGLIDGWYGFRHQNAEGTGTAADLKSTQAAIDQINGKLNRLIEKTNEKYHQIEKEFEQVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDVTDSEMNKLFERVRRQLRENAEDKGNGCFEIFHKCDNNCIESIRNGTYDHDIYRDEAINNRFQIQGVKLTQGYKDTILWISFSISCFLLVALLLAFVLWACQNGNIRCQICI | BCR_6 | QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSYNAVWNWIRQSPSRGLEWLGRTYYRSGWYNDYAESVKSRITINPDTSKNQFSLQLNSVTPEDTAVYY | CARSGHITVFGVNVDAFDMW | record_0 | |||||
H4Hubei | MLSIVILFLLVAENSSQNYTGNPVICMGHHAVANGTMVKTLTDDQVEVVAAQELVESQNLPELCPSPLRLVDGQTCDIINGALGSPGCDHLNGAEWDVFIERPNAMDTCYPFDVPDYQSLRSILASNGKFEFIAEEFQWTTVKQDGKSGACKRANVNDFFNRLNWLVKSDGNAYPLQNLTKVNNGDYARLYIWGVHHPSTDTEQTNLYKNNPGGVTVSTKTSQTSVVPNIGGRPWVRGQSGRISFYWTIVEPGDLIVFNTIGNLIAPRGHYKLNNQKKSTILNTAIPIGSCVSKCHTDKGSLSTTKPFQNISRIAIGNCPKYVKQGSLKLATGMRNIPEKASRGLFGAIAGFIENGWQGLIDGWYGFRHQNAEGTGTAADLKSTQAAIDQINGKLNRLIEKTNEKYHQIEKEFEQVEGRIQDLEKYVEDTKIDLWSYNAELLVALENQHTIDVTDSEMNKLFERVRRQLRENAEDKGNGCFEIFHKCDNNCIESIRNGTYDHDIYRDEAINNRFQIQGVKLTQGYKDTILWISFSISCFLLVALLLAFVLWACQNGNIRCQICI | BCR_7 | QVQLQQSGPGLVKPSQTLSLTCAISGFSVSSYNAVWNWIRQSPSRGLEWLGRTYYRSGWYNDYAESVKSRITINPDTSKNQFSLQLNSVTPEDTAVYY | CARSGHITVFGVNVDAFDMW | record_1 | |||||
ova | MGSIGAASMEFCFDVFKELKVHHANENIFYCPIAIMSALAMVYLGAKDSTRTQINKVVRFDKLPGFGDSIEAQCGTSVNVHSSLRDILNQITKPNDVYSFSLASRLYAEERYPILPEYLQCVKELYRGGLEPINFQTAADQARELINSWVESQTNGIIRNVLQPSSVDSQTAMVLVNAIVFKGLWEKAFKDEDTQAMPFRVTEQESKPVQMMYQIGLFRVASMASEKMKILELPFASGTMSMLVLLPDEVSGLEQLESIINFEKLTEWTSSNVMEERKIKVYLPRMKMEEKYNLTSVLMAMGITDVFSSSANLSGISSAESLKISQAVHAAHAEINEAGREVVGSAEAGVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP | BCR_11 | QVQLVQSGAEVKKPGASVKVSCKASGYTFTDYYMHWVRQAPGQGLEWMGKVNPNKRGTTYNQKFEGRVTMTTDTSTSTAYMELRSLRSDDTAVYY | CARSNALDAW | human | record_2 | ||||
ova | MGSIGAASMEFCFDVFKELKVHHANENIFYCPIAIMSALAMVYLGAKDSTRTQINKVVRFDKLPGFGDSIEAQCGTSVNVHSSLRDILNQITKPNDVYSFSLASRLYAEERYPILPEYLQCVKELYRGGLEPINFQTAADQARELINSWVESQTNGIIRNVLQPSSVDSQTAMVLVNAIVFKGLWEKAFKDEDTQAMPFRVTEQESKPVQMMYQIGLFRVASMASEKMKILELPFASGTMSMLVLLPDEVSGLEQLESIINFEKLTEWTSSNVMEERKIKVYLPRMKMEEKYNLTSVLMAMGITDVFSSSANLSGISSAESLKISQAVHAAHAEINEAGREVVGSAEAGVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP | BCR_12 | QVQLVQSGAEVKKPGASVKVSCKASGYTFTDYYMHWVRQAPGQGLEWMGRVNPNGRGTTYNQKFEGRVTMTTDTSTSTAYMELRSLRSDDTAVYY | CARSNALDAW | human | record_3 |
(processed_input.csv)
The output directory for the test data is 'data/example/output', where the intermediates for antigen embeddings should be saved in the folder 'RFoutputs' which can be found at RFoutputsLink.
If there is no 'antigen embedding directory' specified using '--npy_dir', the antigen embeddings will be extracted and saved in the folder 'NPY' in the 'preprocessed directory' as 'pred_dir/NPY', which is not included in the github but can be found at NPYLink.
Expected Output
The final output WITHOUT merging with the input data (using --no_merge):
record_id | Antigen | BCR_id_y | Score | Rank |
---|---|---|---|---|
record_0 | H4Hubei | BCR_6 | -3.97083 | 0.581942 |
record_1 | H4Hubei | BCR_7 | -3.97073 | 0.584042 |
record_2 | ova | BCR_11 | -4.16114 | 0.025797 |
record_3 | ova | BCR_12 | -4.16291 | 0.022598 |
If MERGED with the input data (Default):
Antigen_id | Antigen_seq | BCR_id_x | BCR_Vh | BCR_CDR3h | BCR_species | record_id | Antigen | BCR_id_y | Score | Rank |
---|---|---|---|---|---|---|---|---|---|---|
H4Hubei | mlsivilfllvaenssqnytgnpvicmghhavangtmvktltddqvevvaaqelvesqnlpelcpsplrlvdgqtcdiingalgspgcdhlngaewdvfierpnamdtcypfdvpdyqslrsilasngkfefiaeefqwttvkqdgksgackranvndffnrlnwlvksdgnayplqnltkvnngdyarlyiwgvhhpstdteqtnlyknnpggvtvstktsqtsvvpniggrpwvrgqsgrisfywtivepgdlivfntignliaprghyklnnqkkstilntaipigscvskchtdkgslsttkpfqnisriaigncpkyvkqgslklatgmrnipekasrglfgaiagfiengwqglidgwygfrhqnaegtgtaadlkstqaaidqingklnrliektnekyhqiekefeqvegriqdlekyvedtkidlwsynaellvalenqhtidvtdsemnklfervrrqlrenaedkgngcfeifhkcdnnciesirngtydhdiyrdeainnrfqiqgvkltqgykdtilwisfsiscfllvalllafvlwacqngnircqici | BCR_6 | QVQLQQSGPGLVKPSQTLSLTCAISGDSVSSYNAVWNWIRQSPSRGLEWLGRTYYRSGWYNDYAESVKSRITINPDTSKNQFSLQLNSVTPEDTAVYY | CARSGHITVFGVNVDAFDMW | record_0 | H4Hubei | BCR_6 | -3.97083 | 0.581942 | |
H4Hubei | mlsivilfllvaenssqnytgnpvicmghhavangtmvktltddqvevvaaqelvesqnlpelcpsplrlvdgqtcdiingalgspgcdhlngaewdvfierpnamdtcypfdvpdyqslrsilasngkfefiaeefqwttvkqdgksgackranvndffnrlnwlvksdgnayplqnltkvnngdyarlyiwgvhhpstdteqtnlyknnpggvtvstktsqtsvvpniggrpwvrgqsgrisfywtivepgdlivfntignliaprghyklnnqkkstilntaipigscvskchtdkgslsttkpfqnisriaigncpkyvkqgslklatgmrnipekasrglfgaiagfiengwqglidgwygfrhqnaegtgtaadlkstqaaidqingklnrliektnekyhqiekefeqvegriqdlekyvedtkidlwsynaellvalenqhtidvtdsemnklfervrrqlrenaedkgngcfeifhkcdnnciesirngtydhdiyrdeainnrfqiqgvkltqgykdtilwisfsiscfllvalllafvlwacqngnircqici | BCR_7 | QVQLQQSGPGLVKPSQTLSLTCAISGFSVSSYNAVWNWIRQSPSRGLEWLGRTYYRSGWYNDYAESVKSRITINPDTSKNQFSLQLNSVTPEDTAVYY | CARSGHITVFGVNVDAFDMW | record_1 | H4Hubei | BCR_7 | -3.97073 | 0.584042 | |
ova | mgsigaasmefcfdvfkelkvhhanenifycpiaimsalamvylgakdstrtqinkvvrfdklpgfgdsieaqcgtsvnvhsslrdilnqitkpndvysfslasrlyaeerypilpeylqcvkelyrgglepinfqtaadqarelinswvesqtngiirnvlqpssvdsqtamvlvnaivfkglwekafkdedtqampfrvteqeskpvqmmyqiglfrvasmasekmkilelpfasgtmsmlvllpdevsgleqlesiinfekltewtssnvmeerkikvylprmkmeekynltsvlmamgitdvfsssanlsgissaeslkisqavhaahaeineagrevvgsaeagvdaasvseefradhpflfcikhiatnavlffgrcvsp | BCR_11 | QVQLVQSGAEVKKPGASVKVSCKASGYTFTDYYMHWVRQAPGQGLEWMGKVNPNKRGTTYNQKFEGRVTMTTDTSTSTAYMELRSLRSDDTAVYY | CARSNALDAW | human | record_2 | ova | BCR_11 | -4.16114 | 0.025797 |
ova | mgsigaasmefcfdvfkelkvhhanenifycpiaimsalamvylgakdstrtqinkvvrfdklpgfgdsieaqcgtsvnvhsslrdilnqitkpndvysfslasrlyaeerypilpeylqcvkelyrgglepinfqtaadqarelinswvesqtngiirnvlqpssvdsqtamvlvnaivfkglwekafkdedtqampfrvteqeskpvqmmyqiglfrvasmasekmkilelpfasgtmsmlvllpdevsgleqlesiinfekltewtssnvmeerkikvylprmkmeekynltsvlmamgitdvfsssanlsgissaeslkisqavhaahaeineagrevvgsaeagvdaasvseefradhpflfcikhiatnavlffgrcvsp | BCR_12 | QVQLVQSGAEVKKPGASVKVSCKASGYTFTDYYMHWVRQAPGQGLEWMGRVNPNGRGTTYNQKFEGRVTMTTDTSTSTAYMELRSLRSDDTAVYY | CARSNALDAW | human | record_3 | ova | BCR_12 | -4.16291 | 0.022598 |
(merged_results.csv)
Expected Time
The expected time for processing the test data is 55m55.012s.
python Cmai.py --input 'data/example/input.csv' --out '/path/to/Cmai/example/data/example/output' --rf_data 'path/to/RoseTTAFold_database'
# Antigen Embedding:
# In 1-step:
python Cmai.py --code '/path/to/Cmai' --input 'data/example/binary_example.csv' --out 'data/example/output' --rf_data 'path/to/RoseTTAFold_database' --runEmbed
# In 2 steps:
python Cmai.py --code '/path/to/Cmai' --input 'data/example/binary_example.csv' --out 'data/example/output' --rf_data 'path/to/RoseTTAFold_database' --runEmbed --gen_msa --use_cpu
python Cmai.py --code '/path/to/Cmai' --input 'data/example/binary_example.csv' --out 'data/example/output' --rf_data 'path/to/RoseTTAFold_database' --runEmbed --run_rf
# Binding Predict:
python Cmai.py --code '/path/to/Cmai' --out 'data/example/output' --skip_check --runBind
Cmai.py is the main interface for users to execute Cmai after installation is successful.
usage: Cmai.py [-h] [--code CODE] [--input INPUT] [--out OUT] [--env_path ENV_PATH] [--rf_data RF_DATA]
[--fasta FASTA] [--pre_dir PRE_DIR] [--npy_dir NPY_DIR] [--cpu CPU] [--mem MEM] [--use_cpu]
[--seed SEED] [--min_size_background_bcr MIN_SIZE_BACKGROUND_BCR]
[--max_size_background_bcr MAX_SIZE_BACKGROUND_BCR] [--export_background] [--add_rank]
[--background_score BACKGROUND_SCORE] [--rf_para] [--gen_msa] [--run_rf] [--skip_preprocess]
[--skip_extract] [--runEmbed] [--runBind] [--skip_check] [--suffix] [--no_rank] [--verbose]
[--no_merge] [--move_npy] [--gen_npy] [--embedBCR] [--bcr_heatmap] [--debug] [--e_values]
Parameters for the interface script.
optional arguments:
-h, --help show this help message and exit
--code CODE the Cmai directory
--input INPUT the input files in csv which should include Antigen_id,BCR_Vh,BCR_CDR3h
--out OUT the directory for output files. An absolute path is required.
--env_path ENV_PATH the file saving the directory of the Conda environments- python of runEmbed, python of
runBind, and RoseTTAFold in order.
--rf_data RF_DATA the database folder for RoseTTAFold
--fasta FASTA The fasta file entering runEbed. When no sequence included in the input, the separate fasta
file of antigens is required
--pre_dir PRE_DIR the directory to save the preprocessed data. If not defiend, same with output directory.
--npy_dir NPY_DIR the npy folder if different with preprocess folder
--cpu CPU the maximum of cpus for antigen embedding. If not defined, use the value of paras/rf_para.txt
--mem MEM the maximum of memory in GB for antigen embedding. If not defined, use the value of
paras/rf_para.txt
--use_cpu the option to use cpu or gpu.
--seed SEED the seed for the first 100 background BCRs. To use the prepared embeded 100 BCRs, keep the
seed to default 1
--min_size_background_bcr MIN_SIZE_BACKGROUND_BCR
the initial and minimum sample size of background BCRs. The default is 100
--max_size_background_bcr MAX_SIZE_BACKGROUND_BCR
the maximum size for subsample of background BCRs, which should no more than 1000000. The
default is 10000
--export_background Only export the score dict for background BCRs of quantity of the max_size_background_bcr
number, default is False.
--add_rank Only add ranks from background BCR scores to no_ranked results, default is False.
--background_score BACKGROUND_SCORE
the pkl file of the score dictionary of background BCRs
--rf_para use the parameters from paras/rf_para.txt for antigen embedding. Default is False
--gen_msa only run generating msa and exit. Default is False
--run_rf skip generating msa and running embedding prediction. Default is False
--skip_preprocess skip preprocess of antigen_embedding. Default is False
--skip_extract skip extracting NPY for antigen embedding. Default is False
--runEmbed only run antigen embedding. Default is False
--runBind only run binding. Default is False
--skip_check skip check and preprocess of input data, only use when it has been done before. Default is
False
--suffix Adding suffix to antigen id. Only use to distinguish same-name antigens. The default is False.
--no_rank Only export the predicted score but no rank in background BCRs, default is False.
--verbose Enable verbose output, default is False.
--no_merge Unable merging output to input, default is False.
--move_npy only move npy files to the desired directory. Default is False
--gen_npy extract npy from npz files. Default is False
--embedBCR extract the bcr sequences and embeddings to the folder of preprocessed data. Default is False
--bcr_heatmap export full embedding results including the heatmap comparison. Default is False
--debug Switch to the debug mode and print output step by step. Default is False
--e_values E_VALUES E-value cutoff for inclusion in result alignment. Default
is '1e-30 1e-10 1e-6 1e-3'
Cmai is designed for large-scale inference on the binding properties between antigens and antibodies. Speed is thus an important factor in its scalability. We randomly sampled 50 BCR-antigen binding pairs (50 unique antigens) from SabDab and ran our model. The model took an average of ~12 minutes to finish computing for each pair, which included both the antigen embedding computation step by RosettaFold (average time=11.84 minutes), and also the binding prediction/rank percentile computation step against 1 million BCRs (average time=32.67 seconds). In practice, the most common scenario is to test the binding between many BCRs against a specific antigen of interest, and in this case, the embedding step needs to be run only once, enabling efficient inference on massive datasets.
In terms of computational hardware requirements, GPUs are required for the binding prediction phase (executed with --runBind
), and are strongly recommended for the antigen embedding phase (--run_rf
). For optimal performance, RoseTTAFold generally requires a GPU with at least 40 GB of memory to prevent out-of-memory errors. On the other hand, the MSA generation phase (--gen_msa
) runs on CPUs. If GPU resources are limited, it is advisable to run the MSA generation phase separately on CPUs before proceeding with the other phases.
Our training and prediction computations were executed on the A100 GPU nodes of our UT Southwestern BioHPC server (https://portal.biohpc.swmed.edu/content/).