shehzaidi/pre-training-via-denoising

Finetuning on QM9: some tasks' loss suddenly becomes NaN

Closed this issue · 4 comments

Hi, thanks for your great work! I want to reproduce the QM9 finetuning results, but I find that some subtasks' loss suddenly becomes NaN during finetuning,
such as Task dipole_moment:
[screenshot: dipole_moment training loss turning to NaN]
and Task gap:
[screenshot: gap training loss turning to NaN]
I use the same config (ET-QM9-FT.yaml) for all subtasks of QM9, and I am wondering whether it is appropriate to apply the same hyperparameters to every subtask.
Looking forward to your reply, thanks!

Hi @fengshikun, thanks for your message! We only benchmarked this model on the HOMO and LUMO tasks in QM9, where we also initially ran into unstable training with this architecture in its default configuration. We've only observed this issue with the TorchMD-NET architecture, so it's likely architecture-specific. This is why we added the --layernorm-on-vec whitened option (more on this hopefully coming soon!) to the original model, which stabilized training for HOMO/LUMO.

We haven't benchmarked this on the other tasks, but I would recommend starting by adding --layernorm-on-vec whitened (potentially at every layer instead of only at the end, as it currently is) to improve training stability. Also, note that the original model parameterization by Thölke & De Fabritiis (2022) changes for different tasks ([see here](torchmd/torchmd-net#64)); this includes dipole_moment, which uses a special output module, so you might have to do some minor parameter surgery to use the pre-trained weights (see the sketch below).
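
For concreteness, here is a minimal sketch of that kind of parameter surgery, assuming a standard PyTorch state_dict; the toy `Backbone` module and the `output_model.` key prefix are illustrative placeholders, not necessarily the names used in this repo:

```python
from torch import nn

# Toy stand-in for a pre-trained model whose task-specific head differs
# from the fine-tuning target (e.g. dipole_moment's special output module).
class Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(16, 32)        # shared representation weights
        self.output_model = nn.Linear(32, 1)  # task-specific output head

pretrained = Backbone()
state_dict = pretrained.state_dict()

# Drop the task-specific head: its shape/semantics differ per task, so it
# cannot be transferred and should be re-initialized for fine-tuning.
filtered = {k: v for k, v in state_dict.items()
            if not k.startswith("output_model.")}

finetune_model = Backbone()
missing, unexpected = finetune_model.load_state_dict(filtered, strict=False)
print("re-initialized keys:", missing)  # the output head's weight and bias
print("ignored keys:", unexpected)      # []
```

Loading with strict=False transfers the shared backbone weights while leaving the task-specific head to be trained from scratch.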

Thanks for your reply! I also found that the NaNs coincide with infinite intermediate activations in the network, and adding the normalization layer helps! I also just noticed that the QM9 results with denoising pre-training in the paper are based on the GNS-TAT architecture; is that part of the code available in this repo?
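
In case it helps others, here is a minimal sketch of one way to localize such blow-ups with forward hooks; the toy nn.Sequential is just a stand-in for the actual network:

```python
import torch
from torch import nn

def check_finite(name):
    # Raise as soon as a layer emits inf/NaN, naming the offending layer.
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            raise RuntimeError(f"non-finite activation after layer {name!r}")
    return hook

# Placeholder model; substitute the real network here.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
for name, module in model.named_modules():
    if name:  # skip the root module itself
        module.register_forward_hook(check_finite(name))

model(torch.randn(4, 8))  # raises at the first layer that blows up
```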

Great -- feel free to send PRs if you have any suggestions! The code for GNS-TAT is not yet open source, and unfortunately I'm not sure if or when it will be, but I'll let you know if there are any updates on this!

Thanks for your help again; everything is clear now, so I'm closing this issue.