AlexanderKroll/ESP

Dataset for the results in Table 1

Closed this issue · 4 comments

Dear AlexanderKroll:
Thanks for the solid work and dataset construction. I'm working on related research in the area of enzyme-drug interaction. Are the two CSV files df_UID_MID_train_exp_1_1.csv and df_UID_MID_test_exp_phylo_1_1.csv the original data behind the results in Table 1? And is it OK to use these files directly as a benchmark for comparison experiments after preprocessing? Looking forward to your reply!

Dear Hong-yu-Zhang,

That sounds interesting! Yes, that is correct; we trained the model on this train file (and also performed hyperparameter optimization using this data). The test set was used only at the end to validate the final model performance and was not previously used for any design or hyperparameter choices of the model.

Best,
Alex

@AlexanderKroll
Dear Alex:
Thank you for clarifying! We will follow your settings, partitioning a portion of the training set to use as a validation set, and employing the test set for the final evaluation. Thank you once again for your contribution to the dataset.
Best,
Hongyu
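
For readers following the same protocol, here is a minimal sketch of holding out part of the training set for validation. It is not the authors' procedure: the function `split_by_group` and the idea of grouping rows by enzyme are assumptions made for illustration, and the `"enzyme"` key in the usage example is hypothetical rather than a column name taken from the repository files.

```python
import random
from collections import defaultdict


def split_by_group(rows, group_of, validation_fraction=0.1, seed=0):
    """Split rows into (train, validation) so that no group spans both parts.

    rows: iterable of records (e.g. dicts read from a CSV file).
    group_of: function mapping a record to its group key (hypothetical,
        e.g. an enzyme identifier); keys must be sortable.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")

    # Collect records by group key.
    groups = defaultdict(list)
    for row in rows:
        groups[group_of(row)].append(row)

    # Sort first so the shuffle is reproducible for a given seed.
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    # Hold out at least one whole group for validation.
    n_val = max(1, round(len(keys) * validation_fraction))
    validation = [row for key in keys[:n_val] for row in groups[key]]
    train = [row for key in keys[n_val:] for row in groups[key]]
    return train, validation


# Usage with toy records; "enzyme" is a made-up field name.
toy_rows = [{"enzyme": f"E{i % 10}", "pair": i} for i in range(100)]
toy_train, toy_val = split_by_group(toy_rows, lambda r: r["enzyme"], 0.2, seed=1)
```

Splitting by group rather than by row keeps all pairs of one enzyme on the same side, which is one way to avoid the validation score being inflated by near-duplicates. The test file stays untouched until the final evaluation.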

Dear Alex:
I found that the two aforementioned CSV files contain approximately 27,000 entries for training and 7,000 entries for testing, respectively, which is different from the number reported in the paper, i.e., "69,365 entries". Is this because the data was pruned to ensure that the sequence similarity between the test set and the training set is less than 80%? Or did I miss some important files?

By the way, after using the script to filter out invalid molecules, the final training set contains 23,402 entries, and the test set contains 5,685 entries.

Looking forward to your reply!

Hongyu
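
A quick way to reproduce this kind of size check is to count the data rows in each CSV file. The sketch below uses only the Python standard library; the helper name `count_entries` is made up for illustration, and it assumes each file starts with a single header row, which has not been verified against the repository files.

```python
import csv


def count_entries(csv_path):
    """Return the number of data rows in a CSV file, excluding one header row."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumed to be present)
        return sum(1 for _ in reader)
```

Calling `count_entries` on the train and test files and adding the two results gives a number that can be compared directly with the entry count reported in the paper.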

Dear Hongyu,

You are right, and I am very sorry for pointing you to the wrong files! Sorry for the confusion! The correct files are "df_train_with_ESM1b_ts_GNN.pkl" / "df_test_with_ESM1b_ts_GNN.pkl" in the data/splits folder.

Best,
Alex