Shen-Lab/DeepAffinity

Generating the protein SPS representation

psuriana opened this issue · 12 comments

Hi, I am trying to run DeepAffinity on the PDBbind dataset and am running into issues when using SSpro/ACCpro to get the secondary structure and solvent exposure predictions for the proteins in PDBbind. I followed the installation instructions and was able to run the software on some proteins, but it failed for the majority of them. The error message was pretty cryptic, and I don't think it's a problem with the binaries since the software did run on some of the PDBbind protein sequences. Do you have any insight into this? Or do you perhaps have the processed data for PDBbind? Thanks!

Thank you for your interest in DeepAffinity. Given the new data, you may want to re-train the model(s). Note that we removed four protein classes, including GPCRs, from our training data in order to test the transfer learning strategy for potentially new classes of drug targets. You may want to use our data before the split, or use your own training set, and remove sequence homology to PDBbind (the new test set), for a typical ML setting.

For the protein SPS format, we first use Scratch, specifically SSpro/ACCpro, which take protein sequences as input and predict secondary structure (SS) and solvent accessibility (ACC). We then use our own program, which takes the amino-acid, predicted SS, and predicted ACC sequence files as input and generates the SPS file: https://github.com/Shen-Lab/DeepAffinity/blob/master/data/script/split_data_script/pfam/our_format/group.py
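To give a rough sense of what that grouping step involves (the authoritative logic is in group.py above), here is a minimal, purely illustrative Python sketch that run-length groups residues by their predicted SS label; the actual SPS encoding in group.py additionally folds in predicted ACC and other residue properties and uses its own alphabet, so treat this only as an illustration.

```python
from itertools import groupby

def group_by_ss(aa_seq, ss_seq):
    """Illustrative only: collapse consecutive residues that share the same
    predicted secondary-structure label into segments. The real SPS encoding
    in group.py also uses predicted ACC and other residue properties."""
    assert len(aa_seq) == len(ss_seq)
    segments, idx = [], 0
    for ss_label, run in groupby(ss_seq):
        length = len(list(run))
        segments.append((ss_label, aa_seq[idx:idx + length]))
        idx += length
    return segments

# Example: a short sequence with coil (C), helix (H), and strand (E) predictions
print(group_by_ss("MKTAYIAKQR", "CHHHHCCEEE"))
# [('C', 'M'), ('H', 'KTAY'), ('C', 'IA'), ('E', 'KQR')]
```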

What you described seems to be an issue with Scratch, although it is not clear from your description what the issue actually is. This excellent program is not ours; it was developed by the Baldi group at UCI, so you may want to check with them about the Scratch issue.

Meanwhile, Dr. Tomas Babak at Queen's University has successfully run Scratch on all human protein sequences (UniProt canonical sequences) and used our program (bugs spotted and addressed for super short/long sequences) to get their SPS files, as described in the latest update of the readme. We hope that these files might be useful to you as well.

Lastly, since you are using PDBbind, which provides both sequences and structures, you may also want to consider converting DSSP-annotated actual SS and ACC, rather than predicted ones, into SPS files.

Hi, thank you for your response. I have another question. I was able to run DSSP to get the SS and accessible surface area (ASA) values, but I noticed that ACCpro gives a binary prediction (exposed vs. non-exposed) instead. Do you perhaps know how to convert DSSP ASA to ACCpro ACC? For now, I used a >0.25 threshold to convert the DSSP ASA values into binary exposedness, since that seems to be the threshold used according to the website description.

Also, what's the command you use to retrain the DeepAffinity model?

Thanks.

We did not use any structures for DeepAffinity. But what you did, using relative solvent accessibility > 0.25 as the threshold, is consistent with Scratch's practice: https://onlinelibrary.wiley.com/doi/abs/10.1002/prot.10069 (Page 144)
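For reference, a minimal sketch of that conversion using Biopython's DSSP wrapper, which reports relative solvent accessibility directly (the file names and the 'e'/'-' output labels are placeholders; adjust them to whatever your downstream scripts expect):

```python
from Bio.PDB import PDBParser
from Bio.PDB.DSSP import DSSP

# Illustrative only: paths are placeholders; requires the dssp/mkdssp executable.
structure = PDBParser(QUIET=True).get_structure("prot", "protein.pdb")
dssp = DSSP(structure[0], "protein.pdb")

ss_seq, acc_seq = [], []
for key in dssp.keys():
    ss, rel_asa = dssp[key][2], dssp[key][3]  # secondary structure, relative ASA
    ss_seq.append(ss)
    # Binary exposedness with the >0.25 relative-accessibility cutoff discussed above;
    # rel_asa can be 'NA' for residues DSSP could not assign.
    exposed = rel_asa != "NA" and rel_asa > 0.25
    acc_seq.append("e" if exposed else "-")

print("".join(ss_seq))
print("".join(acc_seq))
```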

The source code and example data for training DeepAffinity with joint attention and a seq2seq warm start can be found under Joint_models/joint_attention/joint_warm_start/. We recommend using an ensemble of models with various hyperparameter combinations over a single model with optimized hyperparameters.
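As an illustration of the ensemble recommendation (not a script from the repo), one could simply average the affinity predictions produced by checkpoints trained with different hyperparameter combinations; the file names below are placeholders:

```python
import numpy as np

# Hypothetical: each file holds one predicted affinity per compound-protein pair,
# produced by a model trained with a different hyperparameter combination.
prediction_files = ["preds_run1.txt", "preds_run2.txt", "preds_run3.txt"]

per_model = np.stack([np.loadtxt(f) for f in prediction_files])
ensemble_prediction = per_model.mean(axis=0)  # simple unweighted average
np.savetxt("preds_ensemble.txt", ensemble_prediction)
```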

If you run ./DeepAffinity_inference.sh without specifying the checkpoint_file, which model will it use?

Do you perhaps have the script file to run the training with a custom dataset?

For training the model with a custom dataset, you can use https://github.com/Shen-Lab/DeepAffinity/blob/master/Joint_models/joint_attention/joint_warm_start/joint-Model.py Note that lines 281-313 load six sets of our data from a data folder in the same parent folder (https://github.com/Shen-Lab/DeepAffinity/blob/master/Joint_models/joint_attention/joint_warm_start/data): one training set, one test set, and four generalization sets (ion channels, GPCRs, etc.). Each set corresponds to a trio of data files (protein SPS, compound SMILES, and label). So you can simply replace the training and test set data with your own and disregard the generalization sets (as they are no longer relevant for your custom training data).
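Not part of the repo, but before swapping in custom data a small sanity check like the following (placeholder file names) can confirm that each trio lines up, with one entry per compound-protein pair:

```python
# Hypothetical sanity check for a custom trio of data files.
# The actual file names/paths expected by joint-Model.py are those loaded
# around lines 281-313; substitute yours accordingly.
def check_trio(sps_path, smiles_path, label_path):
    with open(sps_path) as f_sps, open(smiles_path) as f_smi, open(label_path) as f_lab:
        sps = f_sps.read().splitlines()
        smi = f_smi.read().splitlines()
        lab = f_lab.read().splitlines()
    assert len(sps) == len(smi) == len(lab), (
        f"Mismatched counts: {len(sps)} SPS, {len(smi)} SMILES, {len(lab)} labels")
    print(f"OK: {len(sps)} compound-protein pairs")

check_trio("train_sps.txt", "train_smiles.txt", "train_labels.txt")
```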

DeepAffinity_inference.sh, as a wrapper, is mainly designed for inference, i.e., applying learned DeepAffinity models (checkpoints) to any compound-protein pair of interest. If no checkpoint is provided, it will actually call the aforementioned joint-Model.py (see lines 134-136) and train a model using our IC50 data in the aforementioned data folder. In hindsight, we should have just loaded a default IC50 checkpoint. Nevertheless, if you would like to use the wrapper for training instead, you can again replace the corresponding data files with custom ones.

@psuriana Google Scholar recommended your latest preprint to me today. Congrats!

Here are a few points related to DeepAffinity for your consideration:

  1. The DeepAffinity model you ran is probably RNN/RNN-CNN rather than RNN/GCNN-CNN as stated in your appendix, since the compound is represented as SMILES strings and modeled through an RNN, as seen in https://github.com/Shen-Lab/DeepAffinity/blob/master/Joint_models/joint_attention/joint_warm_start/joint-Model.py
  2. Since the protein and compound max lengths have been changed in your dataset, our pre-trained decoders from unsupervised learning won't be able to serve as the warm start needed for the semi-supervised DeepAffinity. So I take it that DeepAffinity was trained with a cold start? Or did you pre-train the two decoders with the new max lengths?
  3. Was an ensemble of RNN/RNN-CNN models used as in the DeepAffinity paper?

Thanks for the open access to the data. We are intrigued by the results and will perform our own assessments.

@Shen-Lab Thanks!

  1. Regarding the model, I am not sure which version it is. It would be nice if you could clarify, and we will update the preprint accordingly. But yes, we are feeding the SMILES strings to the model, so if joint-Model.py doesn't do anything else, it's probably the RNN as you said.
  2. I trained the model per your instructions in the previous message using the setup in https://github.com/Shen-Lab/DeepAffinity/blob/master/DeepAffinity_inference.sh with some modifications to make it take in our training/validation/test sets. I didn't look at the code too closely, so correct me if I am wrong, but I was under the impression that this should have trained everything from scratch, including the encoder/decoder?
  3. The results in our preprint are the average of 3 replicates. We didn't have time to train a large ensemble at the time.

Thank you @psuriana for the very informative comments!

Sure, based on your description, the DeepAffinity model you used was RNN/RNN-CNN. We name the various DeepAffinity models as ProteinEncoder/CompoundEncoder-AffinityPredictor. In this case, proteins are input as SPS and encoded by an RNN (specifically a GRU), whereas compounds are input as SMILES and encoded by an RNN as well. The predictive module for affinity has been a CNN across all models.

When no checkpoint is loaded, DeepAffinity_inference.sh (which was originally written only for loading pre-trained checkpoints and making inferences) would try to re-train a model with a warm start (see line 136 of https://github.com/Shen-Lab/DeepAffinity/blob/master/DeepAffinity_inference.sh) by calling https://github.com/Shen-Lab/DeepAffinity/blob/master/Joint_models/joint_attention/joint_warm_start/joint-Model.py
This joint-Model.py loads unsupervised pre-trained seq2seq encoders for proteins and compounds to initialize the joint supervised training of the encoders and the CNN. Since our pre-trained seq2seq was for a max protein SPS length of 152 and a max compound canonical SMILES length of 100, it wouldn't be a fit as a warm start for supervised training on your data, whose max SPS and SMILES lengths are 168 and 160, respectively.

BTW, we are very interested in the nicely made ATOM3D datasets, including LBA. The download at https://www.atom3d.ai/lba.html seems to only provide the PDB IDs of the protein-compound co-crystal structures as well as the binding affinities. Any chance that the paired protein sequences and compound SMILES could be provided as well?

@Shen-Lab Thanks for the explanations!

If I were to re-train the seq2seq encoders for the proteins and compounds as well, how do I do that? What command should I run to re-train the model including the encoders?

Thanks for the feedback on the ATOM3D datasets! We plan to include the protein sequences and the compound SMILES in a later release.

We may not have a convenient command ready for re-training seq2seq. I imagine that one version under https://github.com/Shen-Lab/DeepAffinity/tree/master/seq2seq_models needs to be re-trained, with some length values changed in the script and the data updated accordingly. @mostafakarimi71 Could you please help us out - what is the easiest way to retrain the seq2seq for new data with SPS and SMILES lengths of 168 and 160, respectively?

@psuriana While we wait for more inputs from @mostafakarimi71, two sanity tests can be done in parallel: one is to train the supervised model with a cold start of seq2seq instead (without unsupervised pre-training, thus random initialization such as Xavier) for the longer lengths; the other is to train and test the original model with the original lengths only on the subset of your data whose max protein SPS length is 152 and max compound canonical SMILES length is 100.
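For the second test, a minimal, hypothetical sketch of the filtering step (it assumes one entry per line and whitespace-separated SPS words; whether the 152 limit counts SPS words or characters should be checked against the original data):

```python
# Hypothetical filter for the second sanity test: keep only pairs whose protein
# SPS length is <= 152 and whose canonical SMILES length is <= 100, so the
# original pre-trained seq2seq encoders (and their max lengths) still apply.
MAX_SPS_LEN, MAX_SMILES_LEN = 152, 100

def filter_pairs(sps_lines, smiles_lines, labels):
    kept = []
    for sps, smi, lab in zip(sps_lines, smiles_lines, labels):
        if len(sps.split()) <= MAX_SPS_LEN and len(smi.strip()) <= MAX_SMILES_LEN:
            kept.append((sps, smi, lab))
    return kept
```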

@psuriana Thanks for your interest in DeepAffinity. I have added a new section to our readme on how to re-train the seq2seq models on a new dataset with the desired lengths. Please follow that, and if you have any questions please let us know.

@mostafakarimi71 Thanks. I'll check it out!