This repository contains the code for the paper 'USPNet: unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model'.
You can use either USPNet or USPNet-fast to predict the signal peptide of a protein sequence.
We also provide MSA Transformer embeddings of the benchmark set as a demo.
First, download the repository and create the environment.
Requirements
git clone https://github.com/JunboShen/USPNet.git
cd ./USPNet
conda env create -f ./environment.yml
All the data mentioned above can also be obtained from our OSF project.
USPNet prediction head (without organism group information).
USPNet-fast prediction head (without organism group information).
A specialized model trained for higher accuracy on the major class (Sec/SPI). It emphasizes this class by assigning it a larger weight in the objective function.
USPNet-fast prediction head (focus on Sec/SPI, requires group information).
Put all the downloaded files into the same folder.
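The idea of increasing the weight of one class in the objective function, as described for the Sec/SPI-focused model, can be illustrated with a class-weighted cross-entropy. This is only a minimal sketch: the function name, the three-class layout, and the weight values are hypothetical and are not taken from the USPNet training code.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean class-weighted cross-entropy over a batch.

    probs: per-sample lists of predicted class probabilities.
    labels: integer class index of each sample.
    class_weights: one weight per class; a larger weight makes
        mistakes on that class contribute more to the loss.
    """
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        total += -w * math.log(p[y])
        weight_sum += w
    # Normalize by the total weight so the loss scale stays comparable.
    return total / weight_sum

# Hypothetical class layout: 0 = no signal peptide, 1 = Sec/SPI, 2 = other.
probs = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]]
labels = [0, 1]

uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0, 1.0])
# Assumed weight of 3.0 on Sec/SPI, chosen only for illustration.
emphasized = weighted_cross_entropy(probs, labels, [1.0, 3.0, 1.0])
```

With the larger weight, the error on the Sec/SPI sample dominates the loss, so training is pushed harder toward getting that class right.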
If you want to use USPNet on our benchmark set, please run:
# Data processing. The data_processed/ folder is created by default.
python data_processing.py
# Put the MSA embeddings into the data_processed/ folder before predicting.
python predict.py
# categorical benchmark data
unzip test_data.zip
python test.py
Demo of USPNet on benchmark data without organism group information:
python predict.py --group_info no_group_info
python test.py no_group_info
Demo of USPNet-fast on benchmark data:
python predict_fast.py
python test_fast.py
Demo of USPNet-fast on benchmark data without organism group information:
python predict_fast.py --group_info no_group_info
python test_fast.py no_group_info
To generate MSA embeddings on your own protein sequences and use USPNet to perform signal peptide prediction, please run:
# MSA embedding generation. <data_directory_path>: Directory where the processed data will be saved. <msa_directory_path>: Directory for storing MSA files (.a3m).
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path> --msa_dir <msa_directory_path>
# Prediction. If organism group information is unavailable, use 'python predict.py --data_dir <data_directory_path> --group_info no_group_info' instead.
python predict.py --data_dir <data_directory_path>
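Before passing your own sequences to data_processing.py, it may help to check the FASTA file first. The sketch below is a standalone helper and not part of USPNet: the function names and the choice to flag anything outside the 20 standard amino acids are assumptions, and USPNet's own parser may be stricter or more lenient.

```python
# The 20 standard amino acids; anything else is reported, not rejected.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def find_problems(records):
    """Return (header, message) pairs for empty or unusual sequences."""
    problems = []
    for header, seq in records:
        if not seq:
            problems.append((header, "empty sequence"))
            continue
        unexpected = sorted(set(seq) - VALID_RESIDUES)
        if unexpected:
            problems.append((header, "non-standard residues: " + "".join(unexpected)))
    return problems

# Small in-memory example; for a real file, read its text and pass it in.
example = ">p1 demo\nMKKT\nAIAI\n>p2\nMKX\n"
records = parse_fasta(example)
problems = find_problems(records)
```

Here `problems` lists the second record because it contains `X`. Whether such sequences should be dropped or kept is a decision for the user.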
If you want to use USPNet-fast to perform signal peptide prediction on your own protein sequences, please run:
# Data processing. Processed data is saved in data_processed/ by default.
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path>
# Prediction. If organism group information is unavailable, use 'python predict_fast.py --data_dir <data_directory_path> --group_info no_group_info' instead.
python predict_fast.py --data_dir <data_directory_path>
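The two-step workflows above (data processing, then prediction) can be assembled programmatically when processing many inputs. The following is a minimal sketch: `build_commands` is a hypothetical helper, it uses only the script names and flags shown in this README, and requiring `msa_dir` for the non-fast model is an assumption that mirrors the command given for USPNet.

```python
import shlex

def build_commands(fasta_file, data_dir, msa_dir=None, fast=False, group_info=True):
    """Return the data-processing and prediction commands as argument lists."""
    process = [
        "python", "data_processing.py",
        "--fasta_file", fasta_file,
        "--data_processed_dir", data_dir,
    ]
    if not fast:
        # Assumption: USPNet (MSA-based) needs a directory for .a3m files.
        if msa_dir is None:
            raise ValueError("msa_dir is needed when fast=False")
        process += ["--msa_dir", msa_dir]

    script = "predict_fast.py" if fast else "predict.py"
    predict = ["python", script, "--data_dir", data_dir]
    if not group_info:
        predict += ["--group_info", "no_group_info"]
    return [process, predict]

# Example with placeholder paths: USPNet-fast without organism group information.
commands = build_commands("seqs.fasta", "my_data", fast=True, group_info=False)
for cmd in commands:
    print(shlex.join(cmd))
    # To execute from the repository root instead of printing:
    # import subprocess; subprocess.run(cmd, check=True)
```

Keeping each command as a list of arguments avoids shell quoting problems when paths contain spaces.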