Where is the training data in example.ipynb

Question

Closed this issue 4 years ago · 18 comments

I tried to run example.ipynb, but I got

FileNotFoundError: [Errno 2] File /n/scratch3/users/k/kh278/bindingdb/fold1/train.csv does not exist: '/n/scratch3/users/k/kh278/bindingdb/fold1/train.csv'

In addition, when I run python train.py --task ${task_name} to run the experiments, the only task I can choose is 'biosnap', and even then I get

Traceback (most recent call last):
  File "train.py", line 206, in <module>
    model_max, loss_history = main()
  File "train.py", line 156, in main
    auc, auprc, f1, logits, loss = test(testing_generator, model_max)
  File "train.py", line 64, in test
    loss = loss_fct(logits, label)
  File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 530, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/functional.py", line 2519, in binary_cross_entropy
    "Please ensure they have the same size.".format(target.size(), input.size()))
ValueError: Using a target size (torch.Size([32])) that is different to the input size (torch.Size([16])) is deprecated. Please ensure they have the same size.

Thanks

Answer 1 · 2020-12-30T23:59:45.000Z

Hi, can you try now? I fixed the bugs.

Answer 2 · 2020-12-31T00:58:27.000Z

Thank you! Now it works.
By the way, I want to modify the DAVIS and KIBA datasets. Could you please tell me how to do that, and where the original datasets are? I would like to have the drug names and protein names, not only the SMILES strings and protein sequences.
In addition, what is ESPF?

Thank you for your hard work and happy holidays!

Answer 3 · 2020-12-31T01:36:01.000Z

Hi there,

You can modify the dataset directly with pandas, that is, load the CSV file into a DataFrame and edit it there.

The original dataset is available at https://zitniklab.hms.harvard.edu/TDC/multi_pred_tasks/dti/
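A minimal sketch of that kind of edit, assuming the column layout Drug_ID, Drug, Target_ID, Target, Y. The rows, the Drug_Name column, and the drug_names mapping are made-up placeholders for illustration, not part of the real dataset; an in-memory string stands in for the CSV file so the snippet runs on its own.

```python
import io

import pandas as pd

# Toy stand-in for a dataset CSV. The rows are placeholders, not real
# DAVIS/KIBA entries.
csv_text = """Drug_ID,Drug,Target_ID,Target,Y
D1,CCO,T1,MKV,5.0
D2,CCN,T1,MKV,7.4
"""

# With a real file this would be pd.read_csv("path/to/file.csv").
df = pd.read_csv(io.StringIO(csv_text))

# Hypothetical lookup table from Drug_ID to a readable name.
drug_names = {"D1": "ethanol", "D2": "ethylamine"}
df["Drug_Name"] = df["Drug_ID"].map(drug_names)

# Example filter: keep only pairs whose label exceeds a chosen threshold.
strong = df[df["Y"] > 6.0]

# Write the modified table back out as CSV (here into a string buffer).
out = io.StringIO()
df.to_csv(out, index=False)
```

With a file on disk, the same steps apply with a path in place of the StringIO objects.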

Answer 4 · 2020-12-31T02:05:37.000Z

Thank you very much. I will try it later.

Answer 5 · 2021-01-04T01:18:10.000Z

@limberc
Sorry to bother you again.
Where are the .csv files? I checked the link you gave me and ran
from tdc.multi_pred import DTI
data = DTI(name = 'DAVIS')
split = data.get_split()
I got data, but it is a DTI object, so I can't apply pandas to it.

Thanks

Answer 6 · 2021-01-04T04:35:59.000Z

Hi, you can use data.get_data() to get the pandas DataFrame. The dataset is also downloaded to a local folder named data.

Answer 7 · 2021-01-04T07:31:24.000Z

Thank you so much. I got the data using data.get_data(), but I could not find any file that maps Drug xxx to the drug's real identifier (e.g. CID) or Target xxx to the protein's real name (PID or UniProt).

Thank you very much

Answer 8 · 2021-01-04T07:57:06.000Z

The Davis dataset can be found at http://staff.cs.utu.fi/~aatapa/data/DrugTarget/ and it includes CID files.

Answer 9 · 2021-01-04T08:03:30.000Z

Thank you!

Answer 10 · 2021-01-09T07:32:25.000Z

I used data.get_data() for the KIBA dataset and got the following:

Drug_ID	Drug	Target_ID	Target	Y
Drug 1916	COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl	Target 219	MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS...	0.000000
Drug 1916	COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl	Target 217	MSWSPSLTTQTCGAWEMKERLGTGGFGNVIRWHNQETGEQIAIKQC...	1.100000

I don't understand how you index the drugs and proteins to get Drug 1916, Target 219 and Target 217. That is, which chemical (CID) does Drug 1916 correspond to, and which protein does Target 219 correspond to?

By the way, how do you explain Y=0? I don't think it is correct. The plots of Y_pred vs. Y_true that I have seen always start from 8.0 or 9.0. Did the authors delete the data with Y<9 before training their models, or did they just plot from 9.0? I think you included these smaller values when training your models. Is that correct?

Answer 11 · 2021-01-09T07:43:03.000Z

Drug 1916 is just an index, not a CID.

I think Y=0 is normal. Note that the KIBA score is not an affinity score; it is a combined score. So I don't think it is comparable with the binding affinity values people normally plot, and points below 8-9 should not be removed from those either. What people sometimes do is plot the binding affinity on a log scale, which can put all values in the 0-10 range. But I haven't found any plot that removes values below 8.
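For reference, a minimal sketch of one common log-scale conversion for Kd values. The formula pKd = -log10(Kd in molar) and the assumption that the raw values are in nM are not stated in this thread; the function name kd_nm_to_pkd is made up for illustration. It does not apply to KIBA scores, which are not Kd values.

```python
import math


def kd_nm_to_pkd(kd_nm: float) -> float:
    """Convert a dissociation constant given in nM to pKd = -log10(Kd in M).

    Assumes the input is a Kd in nanomolar; this is an illustration, not
    the preprocessing used by any particular library.
    """
    if kd_nm <= 0:
        raise ValueError("Kd must be positive")
    return -math.log10(kd_nm * 1e-9)


# A weak binder (large Kd) gets a small pKd, a strong binder a large one.
for kd in (10000.0, 43.0):
    print(kd, round(kd_nm_to_pkd(kd), 2))
```

Under this assumption, a Kd of 10000 nM corresponds to a pKd of about 5, so values on this scale stay in a narrow single-digit range.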

Answer 12 · 2021-01-09T07:59:50.000Z

Thank you for your fast response.
Maybe you didn't fully understand my question. I want to know how you built this index. When your results show a pair such as Drug 1916 + Target 219, I want to know which chemical and which protein that is. I assume you have a lookup table for this, but I can't find it.

Please see the DeepDTA paper:
https://watermark.silverchair.com/bty593.pdf?token=AQECAHi208BE49Ooan9kkhW_Ercy7Dm3ZL_9Cf3qfKAc485ysgAAAr8wggK7BgkqhkiG9w0BBwagggKsMIICqAIBADCCAqEGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMwdnRROlf5YK6C1ivAgEQgIICco2YpNaMhjJh0WiWUXhvqff0iy-Psz5iUcpxFobOQ09hl0JweRmi0BmbZyrtDGiYzwQehQ286rcu9ZTn-ReftYGYQQC7oCpcdCfaZrrtVrhG06K_8IKwrhMVno4lY5raQR1jU94CBx_ES85oskWB-oOmX0zIJf5xR1FKwWq5RnvVdL1sBTypi01dJlcy2HzCBxoxbziqR69TZniy-IA6BQLiLDx8Rgw3bYW6qZ54bTJRnJUvlMcFHGGNsYIOKmcp3CSxxKuJLX5X4TC_QFg0sIaIe5LVUUA_wcvNhPDYhBEQcaJP0AVYM5rdNR9pXZtwttcv77V4xF1jOER0XafZX-ocEWvMFz-UM-Mt3SCW4ZKaGUHQh3iT2LiawH6iKSiVJF8MVw6IB8e94n6mIbbwUuWXPGG18S2b5tQUTOfT_wRT1QBayg8TV0zyIaOcyIf-zuZUfzHqOikbS2ntDl9zyohMdXfMgvyARx_i1j135g2Db3w4QdrasDm-HbzUV92eytFvI2CrTFr62Yg4LlTLBx9GbfAwj29nFtr3iINDHhtDy-ytwI9m5NxpTY7todeyH4aORz8B6gHLjAIoFi4X9-kiRM2sNq3aGz2PFHWhW5SYMF68sFv90RVPJO9qWegVIVPzfD9lqMD1RPGXPJZFOPVlVimKSxGMbnUgPgGUqp0z6T7Uc0QhMLjAXOHlQLIu9r1G12miIe8Kaz7mGP3Ymn7pxygOPWbei-gm7jLfrLiw5bhRecS4sX1r3Hq2Yd2eI4xXiNAYOBk7I4VhRQcf9aqdNVMWF3WbISHvNdfjQlRORwy9H3uZcGFEIXQYGMpqYuqi

See Fig. 4, which shows the KIBA data.

Answer 13 · 2021-01-14T18:14:06.000Z

This is resolved in the new TDC version. If you update, the IDs are now UniProt IDs.

Answer 14 · 2021-01-14T22:53:48.000Z

Thank you so much. However, it seems the KIBA dataset was not updated.

Answer 15 · 2021-01-14T22:56:09.000Z

It is also updated, with ChEMBL IDs.

Answer 16 · 2021-01-15T03:48:31.000Z

I used the following lines:
from tdc.multi_pred import DTI
data = DTI(name = 'KIBA')
df = data.get_data(format='df')
df.head()

Drug_ID	Drug	Target_ID	Target	Y
Drug 1916	COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl	Target 219	MTVKTEAAKGTLTYSRMRGMVAILIAFMKQRRMGLNDFIQKIANNS...	11.1
Drug 1916	COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl	Target 217	MSWSPSLTTQTCGAWEMKERLGTGGFGNVIRWHNQETGEQIAIKQC...	11.1
Drug 1916	COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl	Target 85	MERPPGLRPGAGGPWEMRERLGTGGFGNVCLYQHRELDLKIAIKSC...	11.1
Drug 1916	COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl	Target 170	MRPSGTAGAALLALLAALCPASRALEEKKVCQGTSNKLTQLGTFED...	11.1
Drug 1916	COC1=C(C=C2C(=C1)CCN=C2C3=CC(=C(C=C3)Cl)Cl)Cl	Target 68	MELAALCRWGLLLALLPPGAASTQVCTGTDMKLRLPASPETHLDML...	11.1

However, when I changed the dataset from KIBA to DAVIS, the IDs were updated:
from tdc.multi_pred import DTI
data = DTI(name = 'DAVIS')
df = data.get_data(format='df')
df.head()

Drug_ID	Drug	Target_ID	Target	Y
11314340	Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12	AAK1	MKKFFDSRREQGGSGLGSGSSGGGGSTSGLGSGYIGRVFGIGRQQV...	43.0
11314340	Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12	ABL1p	PFWKILNPLLERGTYYYFMGQQPGKVLGDQRRPSLPALHFIKGAGK...	10000.0
11314340	Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12	ABL2	MVLGTVLLPPNSYGRDQDTSLCCLCTEASESALPDLTDHFASCVED...	10000.0
11314340	Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12	ACVR1	MVDGVMILPVLIMIALPSPSMEDEKPKVNPKLYMCVCEGLSCGNED...	10000.0
11314340	Cc1[nH]nc2ccc(-c3cncc(OCC(N)Cc4ccccc4)c3)cc12	ACVR1B	MAESAGASSFFPLVVLLLAGSGGSGPRGVQALLCACTSCLQANYTC...	10000.0

Answer 17 · 2021-01-15T03:50:16.000Z

I think it detects the local KIBA file, so it didn't download the new version. Could you try removing the local file first and then loading KIBA again?
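A minimal sketch of clearing the cached copy, assuming the files live in the local folder named data mentioned earlier in this thread. The exact file names TDC writes are not given here, so this hypothetical helper (remove_cached_dataset is not a TDC function) matches on the file stem, ignoring case, rather than on a specific name.

```python
from pathlib import Path


def remove_cached_dataset(name: str, data_dir: str = "data") -> list:
    """Delete files in data_dir whose stem equals name (case-insensitive).

    Returns the sorted names of the files that were removed. The folder
    name and the stem-matching rule are assumptions for illustration.
    """
    folder = Path(data_dir)
    removed = []
    if not folder.is_dir():
        return removed
    for path in folder.iterdir():
        if path.is_file() and path.stem.lower() == name.lower():
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

After calling remove_cached_dataset("KIBA"), running DTI(name = 'KIBA') again should trigger a fresh download.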

Answer 18 · 2021-01-15T04:03:06.000Z

It works now. Thanks a lot.