ReactZyme: A Benchmark for Enzyme-Reaction Prediction [paper]
Official GitHub repository of ReactZyme (arxiv-link).
Check out our newest work, EnzymeFlow!
Raw data can be downloaded from zendo-reactzyme. Once downloaded, put the raw data files into the 'data' folder.
(1) Raw data
There should be four raw data files: (1) cleaned_uniprot_rhea.tsv; (2) uniprot_molecules.tsv; (3) uniprot_rhea.tsv; (4) rhea_molecules.tsv. In addition, saprot_seq.pt contains the structure-aware protein sequences for SaProt, obtained by running FoldSeek. Put all of these files under the 'data' folder.
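Before running any processing script, it can help to confirm that the download is complete. The following is a minimal sketch, not part of the repository: the helper names `missing_raw_files` and `peek_tsv` are hypothetical, and only the file names come from the list above. It makes no assumption about the column layout of the TSV files and simply returns their first rows for inspection.

```python
import csv
from pathlib import Path

# File names as listed above; everything else in this sketch is an assumption.
RAW_FILES = [
    "cleaned_uniprot_rhea.tsv",
    "uniprot_molecules.tsv",
    "uniprot_rhea.tsv",
    "rhea_molecules.tsv",
    "saprot_seq.pt",
]


def missing_raw_files(data_dir="data"):
    """Return the expected raw files that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in RAW_FILES if not (root / name).is_file()]


def peek_tsv(path, n_rows=3):
    """Return the first n_rows rows of a tab-separated file as lists of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # zip stops after n_rows without reading the rest of the file.
        return [row for _, row in zip(range(n_rows), reader)]


if __name__ == "__main__":
    print("Missing files:", missing_raw_files("data"))
```

Calling `peek_tsv("data/uniprot_rhea.tsv")` once the files are in place shows the actual columns, which is the safest way to learn the layout before writing further processing code.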
(2) Processed data
There should be three splits: (1) time-based; (2) seq-smi based; (3) mol-smi based. Put the time/seq-smi/mol-smi splits under the new_time/new_seq_smi/new_mol_smi folders, respectively. Note that we only provide positive enzyme-reaction pairs; the design of negative samples remains an open question. Nevertheless, we provide an example of negative sample generation in prepare_negative.py.
SaProt tips: To use SaProt, you need FoldSeek to obtain structure-aware sequence representations, which can be tedious. We therefore provide processed structure-aware sequences for our dataset (the 'saprot_seq.pt' file from zendo). If you would rather generate them yourself, use the function get_struc_seq from process_saprot.py.
(1) Processing Sequences
get_afdb.py: code example for fetching AFDB structures for the time-based split.
process_saprot.py: code example for processing SaProt features for AFDB structures.
process_esm.py: code example for processing ESM features for sequences.
(2) Processing Reactions
mat.py: MAT model code, used for loading the model.
process_mat.py: code example for processing MAT features for reactions.
prepare_graphs.py: code for processing molecular graphs.
(3) General dataloading
data_utils.py: dataloaders and related utilities.
(4) Negative samples
prepare_negative.py: code example for preparing negative samples based on reaction SMILES. Once you have the dictionary of negative pairs, 'data/negative_mol_dict.pt', you can prepare negative samples for training.
(5) Uni-Mol features
unimol.ipynb: code example for generating Uni-Mol features for reactions.
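As a rough illustration of what negative sampling can look like, the sketch below pairs each enzyme with reactions it is not annotated with. This is not the procedure implemented in prepare_negative.py, and the function name `sample_negatives`, its arguments, and the dictionary layout are all assumptions made for the example. Note also that an unannotated pair is only a presumed negative, which is part of why negative design remains open.

```python
import random


def sample_negatives(positive_pairs, n_per_enzyme=1, seed=0):
    """Pair each enzyme with reactions it is NOT annotated with.

    positive_pairs: iterable of (enzyme_id, reaction_smiles) tuples.
    Returns a dict mapping enzyme_id -> list of sampled negative reactions.
    The dict layout is an assumption, not the format of negative_mol_dict.pt.
    """
    rng = random.Random(seed)

    # Collect the known (positive) reactions of every enzyme.
    positives = {}
    for enzyme, reaction in positive_pairs:
        positives.setdefault(enzyme, set()).add(reaction)

    # Sorted so that sampling is reproducible for a given seed.
    all_reactions = sorted({r for rs in positives.values() for r in rs})

    negatives = {}
    for enzyme, known in positives.items():
        candidates = [r for r in all_reactions if r not in known]
        k = min(n_per_enzyme, len(candidates))
        negatives[enzyme] = rng.sample(candidates, k)
    return negatives


if __name__ == "__main__":
    pairs = [("E1", "A>>B"), ("E2", "C>>D"), ("E2", "A>>B"), ("E3", "E>>F")]
    print(sample_negatives(pairs, n_per_enzyme=2))
```

Uniform random sampling is the simplest choice; harder negatives (for example, reactions with similar SMILES) would need a similarity measure that this sketch does not include.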
(1) Train MLP
train.py: code for MLP training.
For example, time-based training with ESM protein embeddings and Uni-Mol molecule embeddings can be run as:
CUDA_VISIBLE_DEVICES=0 python train.py --split_type time --mol_embedding_type unimol --pro_embedding_type esm --batch_size 1000
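When launching several runs, the train.py command above can be assembled from Python rather than typed by hand. The sketch below is a hypothetical wrapper, not part of the repository: `build_train_command` and `run_training` are illustrative names, and only the flags and values that appear in the command above are used. Other accepted values for each flag are not listed here, so check train.py before passing anything else.

```python
import os
import subprocess
import sys


def build_train_command(split_type="time", mol_embedding_type="unimol",
                        pro_embedding_type="esm", batch_size=1000):
    """Assemble the train.py invocation shown above as an argument list."""
    return [
        sys.executable, "train.py",
        "--split_type", split_type,
        "--mol_embedding_type", mol_embedding_type,
        "--pro_embedding_type", pro_embedding_type,
        "--batch_size", str(batch_size),
    ]


def run_training(gpu="0", **kwargs):
    """Run train.py on the chosen GPU; raises CalledProcessError on failure."""
    # Copy the current environment and restrict the visible GPU.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    subprocess.run(build_train_command(**kwargs), env=env, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try anywhere.
    print(" ".join(build_train_command()))
```

Passing the arguments as a list avoids shell quoting issues, and setting `CUDA_VISIBLE_DEVICES` through `env` has the same effect as the prefix in the shell command.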
retrieval.py: code for MLP evaluation.
(2) Train Contrastive
train_contra.py: code for MLP-contrastive training.
retrieval.py: code for MLP-contrastive evaluation.
(3) Train Transformer
train_tfmr.py: code for Transformer training.
retrieval_tfmr.py: code for Transformer evaluation.
(4) Train Bi-RNN
train_rnn.py: code for Bi-RNN training.
retrieval_rnn.py: code for Bi-RNN evaluation.
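The evaluation scripts above are all named after retrieval. What they compute is not described here, so the following is only a generic sketch of one common retrieval metric, top-k accuracy, and not the repository's implementation. The function name `top_k_accuracy` and its input format (a score matrix as nested lists plus a set of correct candidate indices per query) are assumptions for illustration.

```python
def top_k_accuracy(scores, true_sets, k=1):
    """Fraction of queries with at least one true item among the k best-scored.

    scores: list of lists, scores[i][j] = score of candidate j for query i.
    true_sets: list of sets, true_sets[i] = indices of correct candidates.
    """
    if not scores:
        return 0.0
    hits = 0
    for row, truth in zip(scores, true_sets):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if truth.intersection(ranked[:k]):
            hits += 1
    return hits / len(scores)


if __name__ == "__main__":
    scores = [[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]]
    truth = [{0}, {2}]
    print(top_k_accuracy(scores, truth, k=1))
    print(top_k_accuracy(scores, truth, k=2))
```

In this toy example each query could be a reaction and each candidate an enzyme (or the reverse); the metric itself does not depend on which side is retrieved.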
@article{hua2024reactzyme,
title={Reactzyme: A Benchmark for Enzyme-Reaction Prediction},
author={Hua, Chenqing and Zhong, Bozitao and Luan, Sitao and Hong, Liang and Wolf, Guy and Precup, Doina and Zheng, Shuangjia},
journal={arXiv preprint arXiv:2408.13659},
year={2024}
}