harryjo97/GDSS

Variance of the training/testing results

qiyan98 opened this issue · 4 comments

Hi there,

Thanks for sharing the code for your wonderful project. I have a question about the variance of the sampling results. I ran the training on the grid dataset using the default config file (with an arbitrary seed).

The test-time performance metrics I got are:
MMD_full {'degree': 0.460601, 'cluster': 0.008495, 'orbit': 0.126024, 'spectral': 0.681714}

On the other hand, the MMD results reported in the paper are: deg: 0.111, clus: 0.005, orbit: 0.070 for GDSS and deg: 0.171, clus: 0.011, orbit: 0.223 for GDSS-seq.

The MMD results of the samples generated by the provided checkpoint model are:
MMD_full {'degree': 0.093013, 'cluster': 0.00718, 'orbit': 0.101709, 'spectral': 0.793645}

I understand that the random seed can affect the sampling results, but this variance seems a bit large from my perspective (especially for the network I trained myself). Do you have any insights about this? The earlier EDP-GNN baseline also seems to show large variance when the number of generated samples is small. Do you think this could be attributed to intrinsic properties of score-based models?

Best,
Qi

Hi Qi,

Thanks for reaching out. Although we did not encounter such a large variance problem with our trained models, the random seeds used for training and sampling can affect the results, since the training and sampling of score-based models depend heavily on the sampled noise. This effect could be intensified for larger graphs such as those in the grid dataset.

Furthermore, we report the generation performance using 1024 generated samples in Section D.1 of our paper, and it is similar to the performance obtained with a smaller number of samples. Thus, evaluating with a small number of samples (which is in fact the same as the number of graphs in the test set) is unlikely to account for the large variance.

Hi Jaehyeong,

Thanks for your explanation of the randomness. By the way, your paper presents an interesting variant, GDSS-seq, whose modeling of the interaction between the adjacency matrix and the node features is not as good as that of GDSS. I wonder what would happen if the node features were not generated jointly, for example, generating the grid dataset without the one-hot encoded degree embedding. What's your take on this?

Many thanks,
Qi

Hi Qi,

Thanks for your interest. You could modify grid.yaml by changing data.init from "deg" to "zeros" or "ones" to test the effect of different node features.
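
For example, the relevant block of grid.yaml would change roughly like this (a sketch; the surrounding keys are assumptions based on a typical config layout, and only the `init` field matters here):

```yaml
data:
  data: grid
  init: zeros   # was: deg  (or try "ones")
```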

For the community_small dataset, using the one-hot encoded degree embedding resulted in better performance than using all-ones or all-zeros node features, since exploiting the degree information makes it easier to learn the node-edge dependency.
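
To make the difference concrete, here is a minimal sketch of the three initializations (not the repo's actual code; the function name, the clipping of large degrees, and the feature dimension are my own assumptions for illustration):

```python
import networkx as nx
import torch

def init_node_features(g: nx.Graph, max_feat_num: int, init: str = "deg") -> torch.Tensor:
    """Build an (n_nodes, max_feat_num) node feature matrix for graph g."""
    n = g.number_of_nodes()
    if init == "deg":
        # One-hot encode each node's degree, clipping degrees that exceed
        # the feature dimension into the last channel.
        feat = torch.zeros(n, max_feat_num)
        for i, (_, d) in enumerate(g.degree()):
            feat[i, min(d, max_feat_num - 1)] = 1.0
        return feat
    if init == "ones":
        return torch.ones(n, max_feat_num)
    if init == "zeros":
        return torch.zeros(n, max_feat_num)
    raise ValueError(f"unknown init: {init}")

# Example: a 3x3 grid graph with a hypothetical feature dimension of 5.
x = init_node_features(nx.grid_2d_graph(3, 3), max_feat_num=5)
```

With "deg", the features carry structural information the model can correlate with edges; with "ones" or "zeros", the node features are uninformative and the model must learn the structure from the adjacency matrix alone.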

Hi Jaehyeong,

Thanks for your helpful comments! Have a wonderful day.

Best,
Qi