kexinhuang12345/MolTrans

Where is the training data in example.ipynb

Closed this issue · 18 comments

I tried to run example.ipynb, but I got

FileNotFoundError: [Errno 2] File /n/scratch3/users/k/kh278/bindingdb/fold1/train.csv does not exist: '/n/scratch3/users/k/kh278/bindingdb/fold1/train.csv'

In addition, when I run python train.py --task ${task_name} to run the experiments, the only task I can choose is 'biosnap'. Even then, I got

Traceback (most recent call last):
File "train.py", line 206, in <module>
model_max, loss_history = main()
File "train.py", line 156, in main
auc, auprc, f1, logits, loss = test(testing_generator, model_max)
File "train.py", line 64, in test
loss = loss_fct(logits, label)
File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 530, in forward
return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/functional.py", line 2519, in binary_cross_entropy
"Please ensure they have the same size.".format(target.size(), input.size()))
ValueError: Using a target size (torch.Size([32])) that is different to the input size (torch.Size([16])) is deprecated. Please ensure they have the same size.

Thanks

Hi, can you try now? I fixed the bugs.

Thank you! Now it works.
By the way, I want to modify the DAVIS and KIBA datasets. Could you please tell me how to do that? Where is the original dataset? I want the drugs' and proteins' names, not only the SMILES strings and the protein sequences.
In addition, what is ESPF?

Thank you for your hard work and happy holidays!

Hi there,

Actually, you can modify the dataset with Pandas by loading and editing the csv file directly.

The original dataset is available at https://zitniklab.hms.harvard.edu/TDC/multi_pred_tasks/dti/
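
The Pandas approach mentioned above could look roughly like the sketch below. It is only an illustration: the column names (SMILES, Target Sequence, Label) are assumed, the lookup table of drug names is hypothetical, and an in-memory string stands in for a real split file such as train.csv.

```python
import io

import pandas as pd

# Stand-in for a split file such as train.csv (column names are assumed).
csv_text = """SMILES,Target Sequence,Label
CCO,MKKFFDSRREQ,1
CCN,MVLGTVLLPPN,0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical lookup table that maps each SMILES string to a drug name.
drug_names = pd.DataFrame(
    {
        "SMILES": ["CCO", "CCN"],
        "Drug_Name": ["ethanol", "ethylamine"],
    }
)

# A left merge keeps every original row, even when no name is known.
annotated = df.merge(drug_names, on="SMILES", how="left")

# Write the result back out as csv (a buffer here; use a file path in practice).
out = io.StringIO()
annotated.to_csv(out, index=False)
```

The same merge pattern would apply to protein names, given a lookup table keyed on the sequence column.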

Thank you very much. I will try it later.

@limberc
I am sorry to ask you again.
Where are the .csv files? I checked the link that you gave me and used
from tdc.multi_pred import DTI
data = DTI(name = 'DAVIS')
split = data.get_split()
I got data, but it is a DTI object, so I can't apply Pandas to it.

Thanks

Hi, you can use data.get_data() to get the pandas dataframe. Also, the dataset is downloaded to a local folder named data.

Thank you so much. I got the data using data.get_data(), but I could not find any file that maps Drug xxx to the drug's real identifier (e.g. CID) or Target xxx to the protein's real name (PID or UniProt ID).

Thank you very much

The Davis dataset can be found at http://staff.cs.utu.fi/~aatapa/data/DrugTarget/. There are CID files there.

Thank you!

I used data.get_data() for the KIBA dataset and got the following:

  Drug_ID Drug Target_ID Target Y
Drug 1916 COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl Target 219 MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... 0.000000
Drug 1916 COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl Target 217 MSWSPSLTTQTCGAWEMKERLGTGGFGNVIRWHNQETGEQIAIKQC... 1.100000

I don't know how you ordered your drugs and proteins to get Drug 1916, Target 219, and Target 217. That is, which chemical identifier (CID) does Drug 1916 correspond to, and which protein does Target 219 correspond to?

By the way, how do you explain Y=0? I don't think it is correct. The plots of Y_pred vs. Y_true that I have seen always start from 8.0 or 9.0. Did those authors delete the data with Y<9 before training their models, or did they just plot from 9.0? What do you think? I think you included these smaller values when you trained your models. Is that correct?

Drug 1916 is just an index, not a CID.

I think y=0 is normal. Note that the KIBA score is not an affinity score; it is a combined score. So I don't think it is comparable with the binding affinity scores that people normally plot, and points below 8-9 should not be removed from those plots either. What people sometimes do is plot the binding affinity on a log scale, which can put all values in the 0-10 range. But I haven't found any plot that removes the values below 8.
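
To illustrate the log-scale point, here is a minimal sketch of the common pKd transform, pKd = -log10(Kd in molar). It assumes the raw affinities are Kd values in nM, which this thread does not state; kd_nm_to_pkd is a hypothetical helper, and the transform is not meant for KIBA scores, which are not Kd values.

```python
import math


def kd_nm_to_pkd(kd_nm: float) -> float:
    """Convert a dissociation constant given in nM to pKd = -log10(Kd in M)."""
    if kd_nm <= 0:
        raise ValueError("Kd must be positive to take a logarithm")
    return -math.log10(kd_nm * 1e-9)


# Illustrative values only: a strong binder and a weak binder.
for kd in (43.0, 10000.0):
    print(f"Kd = {kd} nM -> pKd = {kd_nm_to_pkd(kd):.2f}")
```

On this scale, nanomolar-to-micromolar affinities fall in single digits, so values spanning several orders of magnitude become easy to plot together.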

This is resolved in the new TDC version. You can update, and the IDs are now UniProt IDs.

Thank you so much. However, maybe you forgot to change the KIBA dataset.

It is also updated, with the ChEMBL IDs.

I used the following lines:
from tdc.multi_pred import DTI
data = DTI(name = 'KIBA')
df = data.get_data(format='df')
df.head()

  Drug_ID Drug Target_ID Target Y
Drug 1916 COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl Target 219 MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS... 11.1
Drug 1916 COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl Target 217 MSWSPSLTTQTCGAWEMKERLGTGGFGNVIRWHNQETGEQIAIKQC... 11.1
Drug 1916 COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl Target 85 MERPPGLRPGAGGPWEMRERLGTGGFGNVCLYQHRELDLKIAIKSC... 11.1
Drug 1916 COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl Target 170 MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED... 11.1
Drug 1916 COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl Target 68 MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDML... 11.1

However, when I changed the dataset from KIBA to DAVIS:
from tdc.multi_pred import DTI
data = DTI(name = 'DAVIS')
df = data.get_data(format='df')
df.head()

  Drug_ID Drug Target_ID Target Y
11314340 Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 AAK1 MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQV... 43.0
11314340 Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 ABL1p PFWKILNPLLERGTYYYFMGQQPGKVLGDQRRPSLPALHFIKGAGK... 10000.0
11314340 Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 ABL2 MVLGTVLLPPNSYGRDQDTSLCCLCTEASESALPDLTDHFASCVED... 10000.0
11314340 Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 ACVR1 MVDGVMILPVLIMIALPSPSMEDEKPKVNPKLYMCVCEGLSCGNED... 10000.0
11314340 Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12 ACVR1B MAESAGASSFFPLVVLLLAGSGGSGPRGVQALLCACTSCLQANYTC... 10000.0

I think it detects the local KIBA file, so it didn't download the new version. Could you try removing the local file first and then loading KIBA again?
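
Removing the local file could be sketched as below. This is not part of TDC: remove_cached_dataset is a hypothetical helper, the folder name data comes from the earlier comment in this thread, and matching cached files by their file stem is an assumption about how the files are named.

```python
from pathlib import Path


def remove_cached_dataset(name: str, data_dir: str = "data") -> list:
    """Delete files in data_dir whose stem equals name (case-insensitive).

    Returns the sorted names of the files that were removed.
    """
    removed = []
    folder = Path(data_dir)
    if not folder.is_dir():
        # Nothing cached yet, so there is nothing to remove.
        return removed
    for path in folder.iterdir():
        if path.is_file() and path.stem.lower() == name.lower():
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

After calling remove_cached_dataset("KIBA"), loading the dataset again with DTI(name = 'KIBA') should download a fresh copy, if the stale local file was indeed the cause.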

It works now. Thanks a lot.