chao1224/GraphMVP

Questions about pretraining datasets

koalaaaaaaaaa opened this issue · 2 comments

Hi @chao1224, thank you very much for your code. I have some questions about the pretraining datasets; I hope you can help me with them when it's convenient~
After reading your code, I'm wondering why GEOM_drugs is used for pretraining while GEOM_qm9 is not. As far as I know, the GEOM dataset consists of GEOM_drugs and GEOM_qm9.
Thank you for your kind reply!

Hi @koalaaaaaaaaa,

Thank you for the question. There are two reasons.

  1. The main reason is that GEOM_Drugs contains more diverse drug-like molecules (e.g., in terms of atom types and atom counts), while QM9 is limited to 4 heavy-atom types (C, N, O, F) and at most 9 heavy atoms per molecule. For pretraining, we want the data distribution to be as diverse as possible, so we prefer GEOM_Drugs. (A small sketch of how one might check this diversity is given after this list.)
  2. Another reason is that initially we were considering whether to use QM9 tasks for downstream evaluation, so we excluded QM9 when preparing the pretraining dataset to avoid overlap between pretraining and fine-tuning data.
    • Later, we did not pursue QM9 downstream tasks in the GraphMVP paper because of the dataset sizes: the initial version of GraphMVP pretrains on only 250K conformations, while QM9 alone has about 130K conformations, so it would be hard to claim this is a reasonable pretraining-finetuning setting.
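
A minimal sketch (not from the GraphMVP codebase; the SMILES lists below are toy stand-ins for QM9 and GEOM_Drugs entries) of how one might compare atom-type diversity between the two sets with RDKit:

```python
# Minimal sketch: comparing atom-type diversity between two molecule sets.
# The SMILES lists are hypothetical toy examples, not the actual datasets.
from collections import Counter
from rdkit import Chem

def atom_stats(smiles_list):
    """Return element counts and the maximum heavy-atom count."""
    elements, max_heavy = Counter(), 0
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip unparsable SMILES
        for atom in mol.GetAtoms():
            elements[atom.GetSymbol()] += 1
        max_heavy = max(max_heavy, mol.GetNumHeavyAtoms())
    return elements, max_heavy

# QM9-style molecules: only C, N, O, F heavy atoms, at most 9 per molecule.
qm9_like = ["CCO", "c1ccncc1", "FC(F)F"]
# GEOM_Drugs-style molecules: larger, with more element types (e.g., S, Cl).
drugs_like = ["CC(=O)Oc1ccccc1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O", "Clc1ccccc1S"]

for name, smis in [("QM9-like", qm9_like), ("GEOM_Drugs-like", drugs_like)]:
    elems, max_heavy = atom_stats(smis)
    print(f"{name}: elements={sorted(elems)}, max heavy atoms={max_heavy}")
```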

However, I also want to highlight that there are larger conformation datasets now, like Molecule3D and PCQM4Mv2. GraphMVP was done in July 2021, and these two datasets were released after that. Both are extracted from the PubChemQC project, with around 4M conformations each (Molecule3D has slightly more). In our recent work, MoleculeSDE / GraphMVPv2 (code will be released in one week), we redo the GraphMVP pretraining on PCQM4Mv2, where QM9 can be taken as a downstream task this time.
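
For reference, a minimal sketch of loading PCQM4Mv2 through the `ogb` package (this assumes a recent `ogb` release; the root path is a placeholder, and the first call downloads the dataset):

```python
# Minimal sketch: loading PCQM4Mv2 via the OGB-LSC package.
# Assumes `pip install ogb`; "dataset/" is a placeholder download root.
from ogb.lsc import PCQM4Mv2Dataset

# only_smiles=True returns (SMILES, HOMO-LUMO gap) pairs without
# building graph objects, which is enough for inspecting the data.
dataset = PCQM4Mv2Dataset(root="dataset/", only_smiles=True)
print(len(dataset))  # roughly 3.7M molecules
smiles, gap = dataset[0]
print(smiles, gap)
```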

Hope this answers your question.

Thank you for your detailed and kind reply!! It helped me a lot.