DeepRT(+): ultra-precise peptide retention predictor

Overview:

Predict peptide retention time by capsule network with embedding.
Handle multiple LC conditions by transfer learning, and predict with limited data.
Support different LC types: RPLC, SCX, HILIC and more.
Extract retention-related properties of amino acids.
Current precision (R²): RPLC 0.995, SCX 0.996, and HILIC 0.993.
Discriminate between structurally similar peptides using RT.

Content:

Installation
Scripts to reproduce the results
- 2.1 RPLC datasets
- 2.2 Other datasets
Change to your own datasets
- 3.1 Datasets
- 3.2 Model parameters
Transfer learning using our pre-trained models
Make prediction using the trained models
Publication
Other models
CPU version
Questions

1 Installation

git clone https://github.com/horsepurve/DeepRTplus  
cd DeepRTplus

And then follow DeepRT_install.sh to install the prerequisites. Note that only PyTorch 0.3.0 and 0.4.0 are tested.

2 Scripts to reproduce the results

2.1 RPLC datasets

Let's see how to apply DeepRT on HeLa dataset (modifications included). Simply type:

python data_split.py data/mod.txt 9 1 2
python capsule_network_emb.py

The HeLa data is split with 9:1 ratio with random seed 2, 9 for training and 1 for testing, and then the capsule network begins training. You may check out the prediction result (about 0.985 Pearson) and log file in typically 3 minutes (on a laptop with GTX 1070, for example).

To reproduce the result in the paper, just run as:

cd work
sh ../pipeline_mod.sh

And then you may see the reports (predicted normalized RT, Pearson/Spearman correlation) in the work directory.

Please use the CPU versions (capsule_network_emb_cpu.py and ensemble_emb_cpu.py) in the scripts if you run on a CPU. For example:

cd work
sh ../pipeline_mod_cpu.sh # run the CPU version

2.2 Other datasets

See data/README_data.md for a summary and run the corresponding pipeline. All the necessary parameters for those datasets are stored in config_backup.py.

3 Change to your own datasets

3.1 Datasets

Prepare your dataset as the following format:

sequence	RT
4GSQEHPIGDK	2507.67
GDDLQAIK	2996.73
FA2FNAYENHLK	4681.428
AH3PLNTPDPSTK	2754.66
WDSE2NSERDVTK	2645.274
TEEGEIDY2AEEGENRR	3210.3959999999997
SQGD1QDLNGNNQSVTR	2468.946

Separate the peptide sequence and RT (in second) by tab (\t), encode the modified amino acides as digits (currently only four kinds of modification are included in the pre-trained models):

'M[16]' -> '1',
'S[80]' -> '2',
'T[80]' -> '3',
'Y[80]' -> '4'

You may use Excel (search and replace then export) to prepare your data.

3.2 Model parameters

There are only several parameters to specify in config.py, e.g. for HeLa data, which is self-explainable:

train_path = 'data/mod_train_2.txt' 
test_path = 'data/mod_test_2.txt' 
result_path = 'result/mod_test_2.pred.txt'
log_path = 'result/mod_test_2.log'
save_prefix = 'epochs' # this is where we store the models when training
pretrain_path = ''
dict_path = '' 

conv1_kernel = 10
conv2_kernel = 10

min_rt = 0
max_rt = 110 
time_scale = 60 # set at 60 if your retention time is in second
max_length = 50 # maximum length of the peptides

Then type as following:

python capsule_network_emb.py

4 Transfer learning using our pre-trained models

Training deep neural network models is time-consuming, especially for large dataset such as the Misc dataset here. However, the prediction accuracy is far from satisfactory without training dataset that big enough. The transfer leaning strategy used here can overcome this issue. You can use your small datasets in hand to fine-tune our pre-trained model in RPLC.

There are only three parameters to change while using transfer learning:

pretrain_path = 'param/dia_all_epo20_dim24_conv10/dia_all_epo20_dim24_conv10_filled.pt' # load pre-trained model
dict_path = 'data/mod.txt' # load amino acid alphabet including four kinds of modification
max_length = 66 # the max length in the pre-trained model

And run the same command again:

python capsule_network_emb.py

Please note that:

transfer learning can only be applied to datasets generated from the same type of LC, e.g. RPLC to RPLC, SCX to SCX, and HILIC to HILIC, etc.
provided the same LC type, the species, gradient, and modification status can all be different.
do not change max length or amino acid alphabet here or you need to pre-train the model again.
you have to use the GPU version to load the pre-trained models in param/. If you are using the CPU version, load model from param_cpu instead.
the pretrained models for all the datasets used in the paper (including RPLC, SCX, and HILIC) are provided. See Release page for details.

To reproduce the transfer learning result in the paper, just type:

cd work
sh ../pipeline_mod_trans_emb.sh

5 Make prediction using the trained models

Predicting unknown RT for a new peptide using a current model is easy to do, see below as a demo, the four parameters of which are maximum RT, saved RT model, convolutional filter size and testing file, respectively:

python prediction_emb.py max_rt param/dia_all_trans_mod_epo20_dim24_conv10.pt 10 test_path

The test_path stores the sequences and real RT values of the peptides you want to predict. However, if you actually don't know their real RT values, just let them be dummy values like 0 in this file, and the predicted RT will be written to test_path.pred. Note that before training, we firstly have normalized RTs for all peptides (rt_norm=(rt-min_rt)/(max_rt-min_rt)), so here we use max_rt to change them back to their previous RT scale (supposing min_rt is 0).

6 Publication

Please refer to DOI: 10.1021/acs.analchem.8b02386.

If you find the software useful please consider citing:

@article{ma2018deeprt,
    author = {Ma, Chunwei and Ren, Yan and Yang, Jiarui and Ren, Zhe and Yang, Huanming and Liu, Siqi},
    title = {Improved Peptide Retention Time Prediction in Liquid Chromatography through Deep Learning},
    journal = {Analytical Chemistry},
    volume = {90},
    number = {18},
    pages = {10881-10888},
    year = {2018}
}

7 Other models

As ResNet and LSTM (already been optimized) were less accurate then capsule network, the codes for ResNet and LSTM were deprecated, and DeepRT(+) (based on CapsNet) is recommended.

You can still use SVM for training. To do so, use data_adaption.py to change the data format, and then import it to Elude/GPTime.

8 CPU version

Running DeepRT on CPU is not recommended, because it is way too slow. However, if you have to, use capsule_network_emb_cpu.py instead of capsule_network_emb.py. You can set BATCH_SIZE to be very large if you have large enough memory.

Transfer learning on CPU is now supported. Load CPU pre-trained models from param_cpu.

If you are running the pipelines using CPU, please substitute ensemble_emb_cpu.py for ensemble_emb.py in the scripts.

9 Questions

contact