Where is the training data in example.ipynb
I tried to run example.ipynb, but I got
FileNotFoundError: [Errno 2] File /n/scratch3/users/k/kh278/bindingdb/fold1/train.csv does not exist: '/n/scratch3/users/k/kh278/bindingdb/fold1/train.csv'
In addition, when I run python train.py --task ${task_name} to run the experiments, the only task I can choose is 'biosnap'. Even then, I got
Traceback (most recent call last):
File "train.py", line 206, in <module>
model_max, loss_history = main()
File "train.py", line 156, in main
auc, auprc, f1, logits, loss = test(testing_generator, model_max)
File "train.py", line 64, in test
loss = loss_fct(logits, label)
File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 530, in forward
return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/functional.py", line 2519, in binary_cross_entropy
"Please ensure they have the same size.".format(target.size(), input.size()))
ValueError: Using a target size (torch.Size([32])) that is different to the input size (torch.Size([16])) is deprecated. Please ensure they have the same size.
Thanks
Hi, can you try now? I fixed the bugs.
Thank you! Now it works.
By the way, I want to modify the DAVIS and KIBA datasets. Could you please tell me how to do that? Where is the original dataset? I would like to have the drug names and protein names, not only the SMILES strings and protein sequences.
In addition, what is ESPF?
Thank you for your hard work and happy holidays!
Hi there,
Actually, you can modify the dataset directly with Pandas, that is, by manipulating the CSV file.
The original dataset is provided at https://zitniklab.hms.harvard.edu/TDC/multi_pred_tasks/dti/
Thank you very much. I will try it later.
@limberc
I am sorry to ask you again.
Where are the .csv files? I checked the link that you gave me, and I used
from tdc.multi_pred import DTI
data = DTI(name = 'DAVIS')
split = data.get_split()
I got data, but data is a DTI object, so I can't apply Pandas to it.
Thanks
Hi, you can use data.get_data() to get the pandas DataFrame. The dataset is also downloaded to a local folder named data.
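A minimal sketch of working with that DataFrame once you have it. It assumes the column layout Drug_ID, Drug, Target_ID, Target, Y; the rows below are made-up stand-ins so the snippet runs offline, and filter_by_max_y and the threshold are hypothetical names and values chosen for illustration, not part of TDC.

```python
import pandas as pd

# Made-up stand-in rows with the assumed DTI column layout.
# In practice: from tdc.multi_pred import DTI; df = DTI(name='DAVIS').get_data()
df = pd.DataFrame({
    "Drug_ID": ["d1", "d1", "d2"],
    "Drug": ["CCO", "CCO", "c1ccccc1"],
    "Target_ID": ["t1", "t2", "t1"],
    "Target": ["MKK", "MVL", "MKK"],
    "Y": [43.0, 10000.0, 250.0],
})


def filter_by_max_y(frame: pd.DataFrame, max_y: float) -> pd.DataFrame:
    """Return only the rows whose Y value is at most max_y (hypothetical helper)."""
    return frame[frame["Y"] <= max_y].reset_index(drop=True)


subset = filter_by_max_y(df, max_y=1000.0)

# to_csv() without a path returns the CSV text; pass a file name to write to disk.
csv_text = subset.to_csv(index=False)
print(csv_text)
```

Any other pandas operation (adding columns, merging with a name table, re-splitting) can be applied to the frame in the same way before saving it.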
Thank you so much. I got the data using data.get_data(), but I could not find the files that map Drug xxx to the drug's real identifier (e.g. CID) and Target xxx to the real protein name (PID or UniProt).
Thank you very much
The Davis Dataset can be found in http://staff.cs.utu.fi/~aatapa/data/DrugTarget/. There are CID files
Thank you!
I used data.get_data for the KIBA dataset and got the following:
Drug_ID | Drug | Target_ID | Target | Y
---|---|---|---|---
Drug 1916 | COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl | Target 219 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 0.000000
Drug 1916 | COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl | Target 217 | MSWSPSLTTQTCGAWEMKERLGTGGFGNVIRWHNQETGEQIAIKQC... | 1.100000
I don't know how you ordered your drugs and proteins to get Drug 1916, Target 219, and Target 217. I mean: which chemical identifier (CID) does Drug 1916 correspond to, and which protein does Target 219 correspond to?
By the way, how do you explain Y=0? I don't think it is correct. The figures of Y_pred vs. Y_true that I have seen always start from 8.0 or 9.0. Did the authors delete the data with Y<9 before training their models, or did they just plot from 9.0? What do you think? I think you included these smaller values when training your models. Is that correct?
Drug 1916 is just an index, not a CID.
I think Y=0 is normal. Note that the KIBA score is not an affinity score; it is a combined score. So I don't think it is comparable with the binding affinity scores that people normally plot, and points below 8-9 should not be removed from those either. What people sometimes do is plot the binding affinity on a log scale, which can put all values in the 0-10 range. But I haven't found any plot that removes the values below 8.
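For reference, the log-scale transform commonly applied to dissociation constants can be sketched as below. It assumes the raw values are Kd in nanomolar, which is an assumption about the dataset rather than something stated in this thread, and kd_nm_to_pkd is a hypothetical helper name. It does not apply to KIBA scores, which are a combined score rather than a raw affinity.

```python
import math


def kd_nm_to_pkd(kd_nm: float) -> float:
    """Convert Kd in nM to pKd = -log10(Kd in M) = 9 - log10(Kd in nM)."""
    if kd_nm <= 0:
        raise ValueError("Kd must be positive to take a logarithm")
    return 9.0 - math.log10(kd_nm)


# Illustrative values only: a strong binder and a weak one.
for kd in (43.0, 10000.0):
    print(kd, "nM ->", round(kd_nm_to_pkd(kd), 2))
```

Under this assumption a Kd of 10000 nM maps to a pKd of 5, so a plot on this scale can contain low values as well as high ones.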
Thank you for your fast response.
Maybe you didn't fully understand my question. I want to know how you made this index. When your results show a pair such as Drug 1916 + Target 219, I want to know which chemical and which protein that is. I think you have a lookup table for this, but I can't find it.
on fig.4 for KIBA data.
This is resolved in the new TDC version. After you update, the IDs are UniProt IDs.
Thank you so much. However, maybe you forgot to change the KIBA dataset.
It is also updated, with the ChEMBL IDs.
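Once the ID columns hold real identifiers, a lookup table from ID to structure or sequence can be derived from the DataFrame itself. The sketch below assumes the columns Drug_ID, Drug, Target_ID, and Target; the rows are made-up stand-ins, and build_lookup is a hypothetical helper, not a TDC function.

```python
import pandas as pd

# Made-up stand-in rows; in practice df would come from data.get_data().
df = pd.DataFrame({
    "Drug_ID": ["D1", "D1", "D2"],
    "Drug": ["CCO", "CCO", "c1ccccc1"],
    "Target_ID": ["T1", "T2", "T1"],
    "Target": ["MKKF", "MVLG", "MKKF"],
    "Y": [11.1, 11.1, 12.0],
})


def build_lookup(frame: pd.DataFrame, id_col: str, value_col: str) -> dict:
    """Map each unique ID to the first value seen for it (hypothetical helper)."""
    unique = frame.drop_duplicates(subset=id_col)
    return dict(zip(unique[id_col], unique[value_col]))


drug_lookup = build_lookup(df, "Drug_ID", "Drug")        # ID -> SMILES
target_lookup = build_lookup(df, "Target_ID", "Target")  # ID -> sequence
print(drug_lookup)
print(target_lookup)
```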
I used the following lines:
from tdc.multi_pred import DTI
data = DTI(name = 'KIBA')
df = data.get_data(format='df')
df.head()
Drug_ID | Drug | Target_ID | Target | Y
---|---|---|---|---
Drug 1916 | COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl | Target 219 | MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... | 11.1
Drug 1916 | COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl | Target 217 | MSWSPSLTTQTCGAWEMKERLGTGGFGNVIRWHNQETGEQIAIKQC... | 11.1
Drug 1916 | COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl | Target 85 | MERPPGLRPGAGGPWEMRERLGTGGFGNVCLYQHRELDLKIAIKSC... | 11.1
Drug 1916 | COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl | Target 170 | MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED... | 11.1
Drug 1916 | COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl | Target 68 | MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDML... | 11.1
However, when I changed the dataset from KIBA to DAVIS:
from tdc.multi_pred import DTI
data = DTI(name = 'DAVIS')
df = data.get_data(format='df')
df.head()
Drug_ID | Drug | Target_ID | Target | Y
---|---|---|---|---
11314340 | Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 | AAK1 | MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQV... | 43.0
11314340 | Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 | ABL1p | PFWKILNPLLERGTYYYFMGQQPGKVLGDQRRPSLPALHFIKGAGK... | 10000.0
11314340 | Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 | ABL2 | MVLGTVLLPPNSYGRDQDTSLCCLCTEASESALPDLTDHFASCVED... | 10000.0
11314340 | Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 | ACVR1 | MVDGVMILPVLIMIALPSPSMEDEKPKVNPKLYMCVCEGLSCGNED... | 10000.0
11314340 | Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 | ACVR1B | MAESAGASSFFPLVVLLLAGSGGSGPRGVQALLCACTSCLQANYTC... | 10000.0
I think it detects the local KIBA file, so it didn't download the new version. Could you try removing the local file first and then loading KIBA again?
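A minimal sketch of clearing the cached copy so that the next load downloads a fresh one. It assumes the cache lives in a local folder named data and that the cached file name starts with the dataset name; the exact file name is not stated in this thread, so check the folder contents before deleting anything. clear_cached_dataset is a hypothetical helper, not part of TDC.

```python
from pathlib import Path


def clear_cached_dataset(name: str, data_dir: str = "./data") -> list:
    """Delete files in data_dir whose names start with `name` (case-insensitive).

    Returns the sorted names of the files that were removed.
    """
    removed = []
    folder = Path(data_dir)
    if not folder.is_dir():
        return removed  # nothing cached, nothing to do
    for path in folder.iterdir():
        # Assumption: the cached file is named after the dataset, e.g. "kiba.*".
        if path.is_file() and path.name.lower().startswith(name.lower()):
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


print(clear_cached_dataset("KIBA"))
```

After this, calling DTI(name = 'KIBA') again should trigger a new download if no cached file remains.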
It works now. Thanks a lot.