About dataset_info in datasets_config.py

Hi! Thanks for your impressive work on the molecular diffusion model! I'm wondering how to use EDM on a custom dataset. The keys n_nodes and distances in datasets_config.py confuse me. How can I obtain these items from my custom dataset? I would really appreciate it if you could help. Thanks!

Hi,
n_nodes is simply how many atoms a molecule (or points a point cloud) contains. It is used to build a histogram (which is a categorical distribution) so we can sample the number of atoms. So {1: 10000, 2: 13000} means that there are 10000 molecules with 1 atom and 13000 with 2.
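As a rough illustration of the description above, here is a minimal sketch of how such a histogram could be built from a custom dataset and then used to sample molecule sizes. The helper names count_n_nodes and sample_n_nodes, and the toy dataset layout (one list of 3D positions per molecule), are assumptions made for this example and are not taken from the repository.

```python
from collections import Counter
import random


def count_n_nodes(molecules):
    """Return {number_of_atoms: number_of_molecules} for an iterable of molecules.

    Each molecule is assumed to be a sequence with one entry per atom.
    """
    counts = Counter(len(mol) for mol in molecules)
    return dict(sorted(counts.items()))


def sample_n_nodes(n_nodes, k, seed=None):
    """Draw k molecule sizes from the categorical distribution implied by the counts."""
    rng = random.Random(seed)
    sizes = list(n_nodes)
    weights = [n_nodes[size] for size in sizes]
    return rng.choices(sizes, weights=weights, k=k)


# Hypothetical toy dataset: each molecule is a list of (x, y, z) positions.
toy_dataset = [
    [(0.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (0.0, 1.5, 0.0)],
]

n_nodes = count_n_nodes(toy_dataset)
print(n_nodes)  # {1: 1, 2: 2}
print(sample_n_nodes(n_nodes, k=5, seed=0))
```

The resulting dictionary has the same shape as the {1: 10000, 2: 13000} example, so it could presumably be pasted into the n_nodes entry for a custom dataset.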

About "distances": I am not 100% sure, but I think it is used for analysis of samples after training. I don't think it is necessary to train or sample from a model. @vgsatorras might know this better.

distances is calculated in this function:

e3_diffusion_for_molecules/qm9/analyze.py

Line 173 in fce07d7

    
           hist_dist = Histogram_cont(name='Histogram relative distances', ignore_zeros=True)

It is just the histogram of relative distances between atoms. It is not needed to train the model; it is only used for analysis, when comparing the distribution of relative distances between generated and sampled molecules.
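For a custom dataset, a histogram of relative distances could be computed along these lines. This is only a sketch with NumPy: the function name distance_histogram, the number of bins, and the maximum distance are assumptions, and the binning used by Histogram_cont in analyze.py may differ (for example in its range, or in whether each pair is counted once or twice).

```python
import numpy as np


def distance_histogram(molecules, num_bins=100, max_dist=15.0):
    """Accumulate a histogram of pairwise atom distances over all molecules.

    num_bins and max_dist are illustrative defaults, not the repository's values.
    Distances larger than max_dist are ignored by np.histogram.
    """
    counts = np.zeros(num_bins, dtype=np.int64)
    for positions in molecules:
        pos = np.asarray(positions, dtype=float)       # shape (n_atoms, 3)
        diff = pos[:, None, :] - pos[None, :, :]       # shape (n_atoms, n_atoms, 3)
        dist = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise distance matrix
        # Upper triangle without the diagonal: each unordered pair once,
        # and the zero self-distances are skipped.
        i, j = np.triu_indices(len(pos), k=1)
        hist, _ = np.histogram(dist[i, j], bins=num_bins, range=(0.0, max_dist))
        counts += hist
    return counts.tolist()


# Hypothetical toy dataset: one two-atom molecule and one single-atom molecule.
toy_molecules = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(0.0, 0.0, 0.0)],
]
toy_hist = distance_histogram(toy_molecules, num_bins=10, max_dist=10.0)
print(toy_hist)
```

Since this histogram is only used for analysis, leaving it out should not affect training, as noted above.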

And n_nodes is the histogram of the number of nodes per molecule.

Best,
Victor