/USPNet

Primary LanguagePythonMIT LicenseMIT

USPNet

This repository contains code for the paper 'USPNet: unbiased organism-agnostic and highly sensitive signal peptide predictor with deep protein language model'

You can use either USPNet or USPNet-fast to predict the signal peptide of a protein sequence.
We also provide MSA Transformer embeddings of the benchmark set as a demo.

Local Environment Setup for running the test

First, download the repository and create the environment.

Create an environment with conda

requirement

git clone https://github.com/JunboShen/USPNet.git
cd ./USPNet
conda env create -f ./environment.yml

Download the benchmark set

Test set.

Download categorical benchmark data

Categorical test data.

Download embeddings for the benchmark set

MSA embedding for test set.

All the data mentioned above can also be obtained from our OSF project.

Download trained predictive models

USPNet prediction head

USPNet prediction head (without organism group information).

USPNet-fast prediction head.

USPNet-fast prediction head (without organism group information).

Trained predictive model targeting the major class (Sec/SPI)

Specialized trained model optimized with higher accuracy on the major class (Sec/SPI). The model emphasizes the major class through an increased weight on the major class (Sec/SPI) in the objective function.

USPNet-fast prediction head (focus on Sec/SPI, require group information).

Usage

Put all the downloaded files into the same folder.

If you want to use USPNet on our benchmark set, please run:

# data processing, data_processed/ folder is created by default
python data_processing.py 
#Please put MSA embedding into the data_processed/ folder
python predict.py

# categorical benchmark data
unzip test_data.zip
python test.py

Demo of USPNet on benchmark data without organism group information:

python predict.py --group_info no_group_info

python test.py no_group_info

Demo of USPNet-fast on benchmark data:

python predict_fast.py

python test_fast.py

Demo of USPNet on benchmark data without organism group information:

python predict.py --group_info no_group_info

python test_fast.py no_group_info

To generate MSA embeddings on your own protein sequences and use USPNet to perform signal peptide prediction, please run:

# MSA embedding generation. <data_directory_path>: Directory where the processed data will be saved. <msa_directory_path>: Directory for storing MSA files (.a3m).
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path> --msa_dir <msa_directory_path>

# Prediction. use 'python predict.py --data_dir <data_directory_path> --group_info no_group_info' if lack of organism group information.
python predict.py --data_dir <data_directory_path>

If you want to use USPNet-fast to perform signal peptide prediction on your own protein sequences, please run:

# Data processing. Processed data is saved in data_processed/ by default.
python data_processing.py --fasta_file <fasta_file_path> --data_processed_dir <data_directory_path>

# Prediction. use 'python predict_fast.py --data_dir <data_directory_path> --group_info no_group_info' if lack of organism group information.
python predict_fast.py --data_dir <data_directory_path>