- Predict peptide retention time by capsule network with embedding.
- Handle multiple LC conditions by transfer learning, and predict with limited data.
- Support different LC types: RPLC, SCX, HILIC and more.
- Extract retention-related properties of amino acids.
- Current precision (R2): RPLC 0.995, SCX 0.996, and HILIC 0.993.
- Discriminate between structurally similar peptides using RT.
Contents:
- Installation
- Scripts to reproduce the results
- Change to your own datasets
- Transfer learning using our pre-trained models
- Make prediction using the trained models
- Publication
- Other models
- CPU version
- Questions
git clone https://github.com/horsepurve/DeepRTplus
cd DeepRTplus
Then follow DeepRT_install.sh to install the prerequisites. Note that only PyTorch 0.3.0 and 0.4.0 have been tested.
Let's see how to apply DeepRT to the HeLa dataset (modifications included). Simply type:
python data_split.py data/mod.txt 9 1 2
python capsule_network_emb.py
The HeLa data is split 9:1 with random seed 2 (9 parts for training, 1 part for testing), and then the capsule network begins training. The prediction result (about 0.985 ACC) and the log file are typically available in about 3 minutes (on a laptop with a GTX 1070, for example).
To reproduce the results in the paper, run:
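The internals of data_split.py are not shown here, so the following is only a minimal sketch of a seeded 9:1 split. The function name `split_rows` and the toy rows are hypothetical, and the real script may shuffle or name its output files differently.

```python
import random

def split_rows(rows, train_parts=9, test_parts=1, seed=2):
    """Shuffle rows with a fixed seed, then split them train_parts:test_parts."""
    rng = random.Random(seed)          # fixed seed makes the split repeatable
    shuffled = list(rows)              # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = len(shuffled) * test_parts // (train_parts + test_parts)
    return shuffled[n_test:], shuffled[:n_test]

# Toy data in the "sequence<TAB>RT" layout (made-up sequences and RTs).
rows = ["A" * (i + 1) + "K\t" + str(100.0 + i) for i in range(20)]
train, test = split_rows(rows, 9, 1, seed=2)
print(len(train), len(test))
```

Reusing the same seed yields the same partition, which is what keeps the reported numbers comparable between runs.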
cd work
sh ../pipeline_mod.sh
You can then find the reports (predicted normalized RT, Pearson/Spearman correlation) in the work directory.
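If you want to recompute the two correlations yourself from observed and predicted RTs, a standard-library sketch could look like the following. The helper names are hypothetical and this is not the code the pipeline uses; it only illustrates the two metrics named in the reports.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

observed = [0.10, 0.35, 0.50, 0.80]    # made-up normalized RTs
predicted = [0.12, 0.30, 0.55, 0.78]
print(pearson(observed, predicted), spearman(observed, predicted))
```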
Please use the CPU versions (capsule_network_emb_cpu.py and ensemble_emb_cpu.py) in the scripts if you run on a CPU. For example:
cd work
sh ../pipeline_mod_cpu.sh # run the CPU version
See data/README_data.md for a summary and run the corresponding pipeline. All the necessary parameters for those datasets are stored in config_backup.py.
Prepare your dataset in the following format:
sequence RT
4GSQEHPIGDK 2507.67
GDDLQAIK 2996.73
FA2FNAYENHLK 4681.428
AH3PLNTPDPSTK 2754.66
WDSE2NSERDVTK 2645.274
TEEGEIDY2AEEGENRR 3210.3959999999997
SQGD1QDLNGNNQSVTR 2468.946
Separate the peptide sequence and the RT (in seconds) with a tab (\t), and encode the modified amino acids as digits (currently only four kinds of modification are included in the pre-trained models):
'M[16]' -> '1',
'S[80]' -> '2',
'T[80]' -> '3',
'Y[80]' -> '4'
You may use Excel (search and replace, then export) to prepare your data.
There are only a few parameters to specify in config.py, and they are self-explanatory. For example, for the HeLa data:
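As an alternative to Excel, the same search-and-replace can be scripted. The sketch below assumes your input sequences use the bracket notation shown in the mapping above (for example `M[16]`); the function names are hypothetical.

```python
# Mapping taken from the list above: modified residue -> single digit.
MOD_CODES = {"M[16]": "1", "S[80]": "2", "T[80]": "3", "Y[80]": "4"}

def encode_sequence(seq):
    """Replace each supported modified residue with its digit code."""
    for mod, code in MOD_CODES.items():
        seq = seq.replace(mod, code)
    # Any bracket left over is a modification that has no digit code.
    if "[" in seq or "]" in seq:
        raise ValueError(f"unsupported modification in {seq!r}")
    return seq

def format_row(seq, rt_seconds):
    """Build one tab-separated 'sequence<TAB>RT' line."""
    return f"{encode_sequence(seq)}\t{rt_seconds}"

print(format_row("SQGDM[16]QDLNGNNQSVTR", 2468.946))
```

Raising an error on leftover brackets is a deliberate choice in this sketch: a modification without a code would otherwise reach training as stray characters.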
train_path = 'data/mod_train_2.txt'
test_path = 'data/mod_test_2.txt'
result_path = 'result/mod_test_2.pred.txt'
log_path = 'result/mod_test_2.log'
save_prefix = 'epochs' # this is where we store the models when training
pretrain_path = ''
dict_path = ''
conv1_kernel = 10
conv2_kernel = 10
min_rt = 0
max_rt = 110
time_scale = 60 # set to 60 if your retention times are in seconds
max_length = 50 # maximum length of the peptides
Then run:
python capsule_network_emb.py
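Before training, it can help to check a dataset against these settings. The sketch below is only an illustration: it assumes that RT values are divided by time_scale before being compared with min_rt and max_rt, which is an interpretation of the config comments rather than something confirmed by the code. The function name `check_rows` is hypothetical, and a header line (if present) should be skipped by the caller.

```python
def check_rows(rows, min_rt=0, max_rt=110, time_scale=60, max_length=50):
    """Return (line_number, message) pairs for rows that do not fit the config."""
    problems = []
    for line_no, row in enumerate(rows, start=1):
        parts = row.rstrip("\n").split("\t")
        if len(parts) != 2:
            problems.append((line_no, "expected two tab-separated fields"))
            continue
        seq, rt_text = parts
        try:
            rt = float(rt_text) / time_scale   # assumed unit conversion
        except ValueError:
            problems.append((line_no, "RT is not a number"))
            continue
        if len(seq) > max_length:
            problems.append((line_no, "sequence longer than max_length"))
        if not min_rt <= rt <= max_rt:
            problems.append((line_no, "RT outside [min_rt, max_rt]"))
    return problems

rows = ["GDDLQAIK\t2996.73", "FA2FNAYENHLK\t4681.428", "GDDLQAIK\t7000"]
print(check_rows(rows))
```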
Training deep neural network models is time-consuming, especially for large datasets such as the Misc dataset here. However, prediction accuracy is far from satisfactory when the training dataset is not big enough. The transfer learning strategy used here can overcome this issue: you can use the small datasets you have at hand to fine-tune our pre-trained RPLC model.
There are only three parameters to change while using transfer learning:
pretrain_path = 'param/dia_all_epo20_dim24_conv10/dia_all_epo20_dim24_conv10_filled.pt' # load pre-trained model
dict_path = 'data/mod.txt' # load amino acid alphabet including four kinds of modification
max_length = 66 # the max length in the pre-trained model
And run the same command again:
python capsule_network_emb.py
Please note that:
- transfer learning can only be applied to datasets generated from the same type of LC, e.g. RPLC to RPLC, SCX to SCX, and HILIC to HILIC, etc.
- provided the same LC type, the species, gradient, and modification status can all be different.
- do not change max_length or the amino acid alphabet here; otherwise you need to pre-train the model again.
- you have to use the GPU version to load the pre-trained models in param/. If you are using the CPU version, load the model from param_cpu instead.
- the pre-trained models for all the datasets used in the paper (including RPLC, SCX, and HILIC) are provided. See the Release page for details.
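The alphabet and length constraints above can be checked before fine-tuning. How capsule_network_emb.py derives the alphabet from dict_path is not described here; this sketch assumes it is simply the set of characters in the sequence column of that file. Both function names are hypothetical.

```python
def build_alphabet(dict_rows):
    """Collect every character in the sequence column (header row excluded by caller)."""
    alphabet = set()
    for row in dict_rows:
        seq = row.split("\t")[0].strip()
        alphabet.update(seq)
    return alphabet

def incompatible_sequences(sequences, alphabet, max_length=66):
    """List sequences that use unknown characters or exceed max_length."""
    bad = []
    for seq in sequences:
        unknown = sorted(set(seq) - alphabet)
        if unknown or len(seq) > max_length:
            bad.append((seq, unknown, len(seq)))
    return bad

# Toy stand-in for the rows of the dictionary file.
dict_rows = ["GDDLQAIK\t2996.73", "FA2FNAYENHLK\t4681.428"]
alphabet = build_alphabet(dict_rows)
print(incompatible_sequences(["GDDLQAIK", "GDDLQAIX"], alphabet))
```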
To reproduce the transfer learning result in the paper, just type:
cd work
sh ../pipeline_mod_trans_emb.sh
Predicting the RT of a new peptide with a trained model is straightforward. In the demo below, the four arguments are the maximum RT, the saved RT model, the convolutional filter size, and the test file, respectively:
python prediction_emb.py max_rt param/dia_all_trans_mod_epo20_dim24_conv10.pt 10 test_path
The file at test_path stores the sequences and the observed RT values of the peptides you want to predict. If you do not know their observed RT values, set them to a dummy value such as 0 in this file; the predicted RTs are written to test_path.pred. Note that before training, the RTs of all peptides are normalized (rt_norm=(rt-min_rt)/(max_rt-min_rt)), so max_rt is used here to map the predictions back to the original RT scale (assuming min_rt is 0).
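The normalization formula and its inverse can be written out as follows, together with a helper that builds test rows with dummy RT values. This is a minimal sketch with hypothetical function names; prediction_emb.py performs its own rescaling, and the RT units must match those used for min_rt and max_rt.

```python
def normalize_rt(rt, min_rt, max_rt):
    """rt_norm = (rt - min_rt) / (max_rt - min_rt)"""
    return (rt - min_rt) / (max_rt - min_rt)

def denormalize_rt(rt_norm, min_rt, max_rt):
    """Inverse of normalize_rt: map a normalized RT back to the original scale."""
    return rt_norm * (max_rt - min_rt) + min_rt

def dummy_rows(sequences, dummy_rt=0):
    """Build 'sequence<TAB>RT' lines for peptides whose RT is unknown."""
    return [f"{seq}\t{dummy_rt}" for seq in sequences]

rt_norm = normalize_rt(55, min_rt=0, max_rt=110)
print(rt_norm, denormalize_rt(rt_norm, min_rt=0, max_rt=110))
print(dummy_rows(["GDDLQAIK", "FA2FNAYENHLK"]))
```

With min_rt equal to 0, the inverse reduces to multiplying the normalized prediction by max_rt, which is why max_rt is the only scale argument the prediction command needs.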
Please refer to DOI: 10.1021/acs.analchem.8b02386.
Because ResNet and LSTM (both already optimized) were less accurate than the capsule network, the code for ResNet and LSTM has been deprecated, and DeepRT(+) (based on CapsNet) is recommended.
You can still use SVM for training. To do so, use data_adaption.py to change the data format, and then import the data into Elude/GPTime.
Running DeepRT on a CPU is not recommended because it is very slow. However, if you have to, use capsule_network_emb_cpu.py instead of capsule_network_emb.py. You can set BATCH_SIZE to a very large value if you have enough memory.
Transfer learning on CPU is now supported. Load CPU pre-trained models from param_cpu.
If you are running the pipelines on a CPU, please substitute ensemble_emb_cpu.py for ensemble_emb.py in the scripts.