Venom is a mixture of substances produced by a venomous organism aiming at preying, defending, or intraspecific competing resulting in certain unwanted conditions for the target organism. Venom sequences are a highly divergent class of proteins making their machine learning-based and homology-based identification challenging. Prominent applications in drug discovery and healthcare, while having scarcity of annotations in the protein databases, made automatic identification of venom an important protein informatics task. Most of the existing machine learning approaches rely on engineered features, where the predictive model is trained on top of those manually designed features. Recently, transfer learning and representation learning resulted in significant advancements in many machine learning problem settings by automatically learning the essential features. This paper proposes an approach, called ToxVec, for automatic representation learning of protein sequences for the task of venom identification. We show that pre-trained language model-based representation outperforms the existing approaches in terms of the F1 score of both positive and negative classes achieving a macro-F1 of 0.89. We also show that an ensemble classifier trained over multiple training sets constructed from multiple down-samplings of the negative class instances can substantially improve a macro-F1 score to 0.93, which is 7 percent higher than the state-of-the-art performance.
Please use python 3.x.x to run ToxVec. The requirement for using ToxVec can be installed using the following pip
command:
pip install -r installation/requirements.txt
Besfor running the script you need to download the model files from the following URL and copy them to the models directory
Replace YYY
with numbers from 1 to 10 to dowload 10 ToxVec models
http://7xr.ir:5000/static/toxvec_model_YYY.bin
or if you are using a linux\mac os machine call the following scripts to download the models:
For Linux:
bash dowload_linux.sh
For Mac OS:
bash dowload_macos.sh
python3 toxvec.py --in example_input/input.fasta --out example_input/input_annaotated_by_toxvec.fasta
The file example_input/input_annaotated_by_toxvec.fasta
is an example output of ToxVec where the ToxVec has annottated the fatsa.
We used the benchmarking dataset as an example where the actual_label
is the variable showing the ground truth.
toxvec_pred
shows the ToxVec prediction.
toxvec_venom_score
is the portion of models classify this peptide as venom. A peptide is classifed as Venom if and only if this score is 1.