Question about the dataset split

Closed this issue 2 years ago · 3 comments

Hi, this is really great work! But I'm a little confused about the PDBbind v2020 dataset split.
In the arXiv paper you wrote

We followed the same time split as defined in EquiBind paper ........we had 17787 structures for training, 968 for validation and 363 for testing

but the EquiBind paper says

From the remaining complexes that are older than 2019, we remove those with ligands contained in the test set, giving 17 347 complexes for training and validation. These are divided into 968 validation complexes, which share no ligands with the remaining 16 379 train complexes

So, compared with the EquiBind training set, does your training set contain 1408 more samples because you did not remove the complexes whose ligands also appear in the test set?

If that's the case, I'm concerned this may lead to data leakage...

Answer 1 · 2022-06-15T04:50:49.000Z

Hello, thanks for the kind words.
This is a very good point. I will redo the training with those complexes removed and get back to you. A quick check shows that 48 test-set SMILES also appear in the training set (14% of the 363 total), so I expect it won't affect the conclusions much.
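For readers who want to run a similar quick check on their own split, here is a minimal sketch. It is not the script used for the numbers above. It assumes each ligand is already represented by a canonical SMILES string (so that string equality means the same ligand), and the function name `ligand_overlap` and the toy data are hypothetical.

```python
def ligand_overlap(train_smiles, test_smiles):
    """Return the set of test-set ligands that also appear in the training set.

    Assumption: both inputs contain canonical SMILES strings, so two entries
    are the same ligand exactly when the strings are equal.
    """
    seen_in_train = set(train_smiles)
    return {s for s in set(test_smiles) if s in seen_in_train}


# Toy data for illustration only (not taken from PDBbind).
train = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
test = ["CCO", "CC(=O)O", "CCCC"]

shared = ligand_overlap(train, test)
fraction = len(shared) / len(set(test))
print(f"{len(shared)} of {len(set(test))} unique test ligands also occur in training ({fraction:.1%})")
```

Note that this counts unique ligands. Counting test complexes instead (several complexes can share one ligand) may give a different percentage.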

Answer 2 · 2022-06-17T15:07:38.000Z

Hello,
Here is what I get after re-training with exactly the same dataset as the EquiBind paper.

Ligand RMSD Percentiles 25%, 50%, 75%, mean, % below threshold 2A, 5A
2.7, 4.2, 7.6, 7.7, 18.7, 56.4
Centroid Distance Percentiles 25%, 50%, 75%, mean, % below threshold 2A, 5A
0.8, 1.8, 3.9, 5.5, 54.5, 80.1
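For reference, a summary row in this format (25th, 50th and 75th percentiles, mean, and the percentage of test complexes below 2 Å and 5 Å) can be computed roughly as sketched below. This is an illustration, not the evaluation code behind the numbers above. The function name `summarize`, the linear-interpolation percentile method, and the strict "less than" threshold comparison are all assumptions.

```python
from statistics import mean, quantiles


def summarize(values, thresholds=(2.0, 5.0)):
    """Return [25%, 50%, 75%, mean, % below each threshold] for a list of errors.

    Assumptions: percentiles use linear interpolation between data points
    (method="inclusive"), and "below threshold" means strictly less than.
    """
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    below = [100.0 * sum(v < t for v in values) / len(values) for t in thresholds]
    return [q25, q50, q75, mean(values), *below]


# Toy per-complex ligand RMSD values in angstroms (illustrative only).
rmsd = [1.0, 2.0, 3.0, 4.0, 10.0]
summary = summarize(rmsd)
print(", ".join(f"{x:.1f}" for x in summary))
```

The same function applies to the centroid distances, since both rows report the same six statistics.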

The difference seems small: some metrics are slightly worse, some are slightly better. I think the reason is that only 6 exact protein-ligand pairs from the test set existed in the previous training set. For the other removed training PDBs, the ligands are the same but the proteins are different. When such a protein has low similarity to the corresponding test-set protein, this extra information might not help much in predicting the test-set protein-ligand interaction, and may even make the prediction harder.
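The distinction between exact protein-ligand pairs and ligand-only matches can be sketched as follows. This is a hypothetical illustration, not the analysis script used here. It assumes each complex is described by a `(protein_id, ligand_smiles)` tuple, where `protein_id` is some stable protein identifier and the SMILES are canonical. The function name `split_overlap` and the toy identifiers are made up.

```python
def split_overlap(train, test):
    """Separate training complexes that overlap with the test set into two groups.

    train, test: iterables of (protein_id, ligand_smiles) tuples (assumed format).
    Returns (exact, ligand_only):
      exact       - same protein and same ligand as some test complex
      ligand_only - ligand appears in the test set, but never with this protein
    """
    test_pairs = set(test)
    test_ligands = {ligand for _, ligand in test_pairs}
    exact, ligand_only = [], []
    for protein, ligand in train:
        if (protein, ligand) in test_pairs:
            exact.append((protein, ligand))
        elif ligand in test_ligands:
            ligand_only.append((protein, ligand))
    return exact, ligand_only


# Toy data for illustration only.
train = [("P1", "CCO"), ("P2", "CCO"), ("P3", "CCN"), ("P4", "c1ccccc1")]
test = [("P1", "CCO"), ("P9", "c1ccccc1")]

exact, ligand_only = split_overlap(train, test)
print(len(exact), "exact pairs;", len(ligand_only), "ligand-only matches")
```

A further step, not shown, would be to measure sequence similarity between each ligand-only training protein and its test counterpart.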

Since changing the dataset also affects our other results, such as the ablation study, we will redo all of them and report the numbers in the next arXiv update, probably before the paper is officially accepted.

Also, spoiler alert: we are working on a few tweaks to the model to improve its general performance. Please come back in 1 or 2 months and check out our latest model.
The first tweak we made is adding the local frame orientation to the prediction.
The result with the new dataset (the same as the EquiBind dataset) is here:

Ligand RMSD Percentiles 25%, 50%, 75%, mean, % below threshold 2A, 5A
2.4, 4.0, 8.0, 7.4, 18.0, 57.3
Centroid Distance Percentiles 25%, 50%, 75%, mean, % below threshold 2A, 5A
0.8, 1.7, 4.2, 5.3, 55.1, 76.9

Answer 3 · 2022-06-18T04:53:05.000Z

Thanks for the feedback! Looking forward to your next arXiv update!