hkmztrk/DeepDTA

Issue regarding the datasets

Closed this issue · 3 comments

Hello,

I am having trouble finding the compound SMILES and protein sequences for the datasets. You have already mentioned in previous issues, and also in the README.md, that the data was obtained from https://staff.cs.utu.fi/~aatapa/data/DrugTarget/ and https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z.
Unfortunately, I cannot find the SMILES dataset or the protein sequence dataset there. It seems to me that only the similarity matrices are available.
Do you have any information regarding this? Thank you.

Hi @matija-marijan, the drug SMILES and protein sequences were extracted from databases. If you look at each dataset in the repo, you will find the id:sequence dictionaries, e.g. Davis.
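
For readers following along, here is a minimal sketch of reading such an id:sequence dictionary. It assumes the dictionaries are stored as JSON text files; that format, the helper name `load_id_sequence_dict`, and the example paths in the comments are assumptions for illustration, so check the dataset folder for the actual file names.

```python
import json
from collections import OrderedDict


def load_id_sequence_dict(path):
    """Load a JSON-formatted {id: sequence} mapping, keeping the file's key order.

    Assumption: the file holds one JSON object whose keys are compound or
    protein IDs and whose values are SMILES strings or amino-acid sequences.
    """
    with open(path, encoding="utf-8") as handle:
        return json.load(handle, object_pairs_hook=OrderedDict)


# Hypothetical usage (paths are illustrative, not confirmed by this thread):
#   ligands = load_id_sequence_dict("data/davis/ligands_can.txt")
#   proteins = load_id_sequence_dict("data/davis/proteins.txt")
#   first_id, first_smiles = next(iter(ligands.items()))
```

Keeping the key order matters if the rows and columns of an affinity matrix are meant to line up with the order of the IDs in these files.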

Hello @hkmztrk, thank you for your reply and clarification.

I am also interested in how the SMILES and protein sequences were extracted from the sources you linked (https://staff.cs.utu.fi/~aatapa/data/DrugTarget/ and https://jcheminf.biomedcentral.com/articles/10.1186/s13321-017-0209-z), as I cannot find the SMILES or protein sequences on those websites. Do you have any information regarding this? Thank you.

Hi @matija-marijan, the original sources provide the corresponding IDs. For Davis, for example, you can look up the compound IDs in PubChem to get the SMILES and use UniProt to get the protein sequences.
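
As a rough illustration of the ID lookup described above, the sketch below builds request URLs for the public REST interfaces of PubChem (PUG REST) and UniProt and parses a FASTA response. This is not necessarily how the repository authors extracted the data. The helper names are hypothetical, and the URL layout and the `CanonicalSMILES` property name reflect my understanding of those services and should be verified against their current documentation before use.

```python
import urllib.request
from urllib.parse import quote

PUBCHEM_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
UNIPROT_BASE = "https://rest.uniprot.org/uniprotkb"


def pubchem_smiles_url(cid):
    """Build a PUG REST URL asking for the SMILES of one PubChem CID as plain text."""
    # Assumption: the property is exposed under the name "CanonicalSMILES".
    return f"{PUBCHEM_BASE}/compound/cid/{int(cid)}/property/CanonicalSMILES/TXT"


def uniprot_fasta_url(accession):
    """Build a UniProt REST URL returning one entry in FASTA format."""
    return f"{UNIPROT_BASE}/{quote(str(accession))}.fasta"


def parse_fasta_sequence(fasta_text):
    """Join the sequence lines of a single-record FASTA string, skipping the header."""
    lines = fasta_text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def fetch_text(url, timeout=30):
    """Download a URL and decode the body as UTF-8 (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def fetch_smiles(cid):
    """Fetch the SMILES string for a PubChem CID."""
    return fetch_text(pubchem_smiles_url(cid)).strip()


def fetch_sequence(accession):
    """Fetch the amino-acid sequence for a UniProt accession."""
    return parse_fasta_sequence(fetch_text(uniprot_fasta_url(accession)))
```

Calling `fetch_smiles(cid)` for each compound ID and `fetch_sequence(accession)` for each protein ID would then rebuild id:sequence dictionaries like the ones in the repo. If a dataset lists gene or protein names rather than UniProt accessions, those names would first need to be mapped to accessions, a step this sketch does not cover.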