About preparing DAVIS and KIBA data folds.

Question

Closed this issue 9 months ago · 3 comments

Hello,

Can you share more details about how you prepared DAVIS and KIBA datasets?

I downloaded these datasets from https://github.com/chao1224/GraphMVP/tree/main/datasets and preprocessed them as instructed there. I then combined the resulting train.csv and test.csv files to create the full dataset, and used scaffold splitting to split it into train, valid and test sets. For DAVIS, I trained the model on the transformed affinities, -np.log10(y / 1e9). Is this the approach you used as well?
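For reference, the DAVIS transform mentioned above can be sketched as follows. This is a minimal illustration, assuming the raw labels are Kd values in nM; the helper name kd_nm_to_pkd is hypothetical and is not taken from either repository. It uses math.log10 on single values, which computes the same quantity as np.log10 does element-wise on an array.

```python
import math


def kd_nm_to_pkd(kd_nm):
    """Convert a dissociation constant given in nM to pKd = -log10(Kd in molar).

    Hypothetical helper for illustration; assumes kd_nm is a positive float.
    """
    if kd_nm <= 0:
        raise ValueError("Kd must be positive to take a logarithm")
    # Dividing by 1e9 converts nM to M, matching -np.log10(y / 1e9).
    return -math.log10(kd_nm / 1e9)


# Illustrative values only (not taken from the dataset files).
example_kd_nm = [10000.0, 1000.0, 1.0]
example_pkd = [kd_nm_to_pkd(v) for v in example_kd_nm]
```

With this transform, weaker binders (larger Kd) map to smaller pKd values, so the regression target becomes roughly linear in log-affinity.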

It would be great if you could add your preprocessed train, valid and test folds to the repository.

Answer 1 · 2024-03-23T01:59:30.000Z

Thanks for your interest.

Unfortunately I can't find the preprocessed train/valid/test files.

For the two DTA datasets, we randomly split them into train/valid/test sets, following the GraphMVP setting. The GraphMVP paper describes it as follows:

Table 5: Results for four molecular property prediction tasks (regression) and two DTA tasks
(regression). We report the mean RMSE of 3 seeds with scaffold splitting for molecular property
downstream tasks, and mean MSE for 3 seeds with random splitting on DTA tasks. For GraphMVP,
we set M = 0.15 and C = 5. The best performance for each task is marked in bold. We omit the std
here since they are very small and indistinguishable. For complete results, please check Appendix G.4.

We did not perform any preprocessing beyond GraphMVP's preprocess.py. However, we applied normalization to the labels in the tuning stage (see train_dta in SimSGT/regression/tuning_dta.py, line 246).
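As a rough illustration of label normalization, the sketch below standardizes labels with the training-set mean and standard deviation and maps predictions back to the original scale before computing MSE. It assumes z-score normalization, which this thread does not confirm; the function names are hypothetical, and the exact scheme is whatever tuning_dta.py implements.

```python
import statistics


def fit_label_scaler(train_labels):
    """Return (mean, std) computed from the training labels only."""
    mean = statistics.fmean(train_labels)
    std = statistics.pstdev(train_labels)
    if std == 0:
        raise ValueError("labels are constant; cannot normalize")
    return mean, std


def normalize(labels, mean, std):
    """Map labels to zero mean and unit variance (assumed z-score scheme)."""
    return [(y - mean) / std for y in labels]


def denormalize(values, mean, std):
    """Map normalized predictions back to the original label scale."""
    return [v * std + mean for v in values]


def mse(predictions, targets):
    """Mean squared error on the original label scale."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
```

Fitting the statistics on the training labels alone keeps information from the valid and test sets out of training, and reporting MSE after denormalizing keeps the numbers comparable with results on the raw labels.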

Answer 2 · 2024-03-23T15:22:42.000Z

Thank you very much for the details! Just for clarification, did you use the same test.csv as prepared by preprocess.py in the GraphMVP repository (https://github.com/chao1224/GraphMVP/blob/main/datasets/dti_datasets/davis/preprocess.py)?

Answer 3 · 2024-03-24T03:01:48.000Z

Yes. As shown in lines 177-187 of regression/tuning_dta.py, we use the original train.csv and test.csv files produced by GraphMVP's preprocess.py.
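The thread does not say how the validation set is drawn or what fraction is used. As a sketch only, assuming the validation rows are held out at random from train.csv while test.csv is kept unchanged, a seeded index split could look like the following; random_holdout and its default arguments are hypothetical and are not taken from tuning_dta.py.

```python
import random


def random_holdout(n_rows, valid_fraction=0.1, seed=0):
    """Split row indices 0..n_rows-1 into (train_idx, valid_idx) at random.

    Hypothetical helper: the fraction and seed are illustrative defaults.
    """
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must be strictly between 0 and 1")
    indices = list(range(n_rows))
    # A dedicated Random instance makes the split reproducible per seed
    # without touching the global random state.
    random.Random(seed).shuffle(indices)
    n_valid = int(round(n_rows * valid_fraction))
    valid_idx = sorted(indices[:n_valid])
    train_idx = sorted(indices[n_valid:])
    return train_idx, valid_idx


# Example: split 100 training rows, repeated for 3 seeds.
splits = [random_holdout(100, valid_fraction=0.1, seed=s) for s in range(3)]
```

Repeating the split with several seeds mirrors the "3 seeds" reporting described in the quoted caption, while the test rows stay the same across runs.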