How can I improve the training and the force loss for my LGPO system, especially for Li atoms?
ElhamPisheh opened this issue · 6 comments
Dear all,
I recently used the NequIP-Allegro framework to retrain a model on DFT data for LGPO systems, using 30000 configurations. I then used the ASE calculator to generate ML forces and compared them with the DFT forces. Unfortunately, the force loss did not improve at all. So far I have changed the cutoff radius from 5 to 7 and 14, the maximum number of epochs from 100 to 200, and batch_size from 1 to 4 and 6. I have tried different training/validation splits (80-20 and 70-30), and training with and without shuffling the training data. I have checked the mathematical expression for the overall loss and changed the force loss coefficient from 1.0 to 100. I have tried different seeds to obtain different training and validation sets, and I have tried l_max = 1 and 2.
I have tried everything I could think of to improve the ML forces (loss_f, loss_e, and loss), especially for Li atoms.
With a force coefficient of 100, the total loss remains near 23 and loss_f near 0.23. With a force coefficient of 1, the total loss drops to about 0.23, but loss_f is unchanged at about 0.23. So these changes have not helped, and the errors are still large.
Do you have any new ideas to help me improve forces and overall results?
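Since the question singles out Li, it may help to look at the force error per element rather than only the global f_mae. The following is a minimal sketch, not part of NequIP or Allegro: `per_species_force_mae` is a hypothetical helper, and the numbers are made up for illustration. With ASE, the symbols would typically come from `atoms.get_chemical_symbols()` and the forces from `atoms.get_forces()` on the DFT-labelled and ML-evaluated structures.

```python
from collections import defaultdict


def per_species_force_mae(symbols, f_ref, f_pred):
    """Mean absolute force-component error for each chemical species.

    symbols: element symbol per atom
    f_ref, f_pred: (fx, fy, fz) per atom, same atom order and same units
    """
    abs_err_sum = defaultdict(float)
    n_components = defaultdict(int)
    for sym, ref, pred in zip(symbols, f_ref, f_pred):
        for r, p in zip(ref, pred):
            abs_err_sum[sym] += abs(r - p)
            n_components[sym] += 1
    return {sym: abs_err_sum[sym] / n_components[sym] for sym in abs_err_sum}


# Made-up values purely for illustration; not data from this issue.
symbols = ["Li", "Li", "P", "O"]
f_dft = [(0.10, 0.00, -0.20), (0.00, 0.30, 0.00), (0.50, 0.50, 0.00), (-0.10, 0.00, 0.00)]
f_ml = [(0.40, 0.00, -0.20), (0.00, 0.00, 0.00), (0.50, 0.50, 0.00), (-0.10, 0.00, 0.10)]

mae = per_species_force_mae(symbols, f_dft, f_ml)
for sym, value in sorted(mae.items()):
    print(f"{sym}: {value:.3f}")
```

If the Li entry stands out from the other elements, that would suggest the problem is specific to Li environments rather than to the model as a whole.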
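| Name | Epoch | wall (hours) | LR | loss_f | loss_e | loss | f_mae | f_rmse | e_mae | e/N_mae |
|---|---|---|---|---|---|---|---|---|---|---|
| Train | 200 | 9.969141667 | 0.002 | 0.23 | 0.0368 | 23.0 | 0.213 | 0.485 | 0.867 | 0.0173 |
| Validation | 200 | 9.969141667 | 0.002 | 0.204 | 0.000193 | 20.4 | 0.200 | 0.456 | 0.699 | 0.014 |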
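The reported numbers are consistent with the total loss being a weighted sum of loss_f and loss_e. The sketch below checks that arithmetic. It assumes an energy coefficient of 1, which is not stated in the post, and `total_loss` is an illustrative helper rather than the framework's own function.

```python
def total_loss(loss_f, loss_e, coeff_f, coeff_e=1.0):
    """Weighted sum of force and energy loss terms (assumed form)."""
    return coeff_f * loss_f + coeff_e * loss_e


# Values taken from the epoch-200 results reported in this issue.
train_total = total_loss(loss_f=0.23, loss_e=0.0368, coeff_f=100.0)
val_total = total_loss(loss_f=0.204, loss_e=0.000193, coeff_f=100.0)

print(f"train: {train_total:.1f}")       # reported total loss: 23.0
print(f"validation: {val_total:.1f}")    # reported total loss: 20.4
```

Under this assumption, the total loss is dominated by the force coefficient times loss_f, so totals from runs with different coefficients are not directly comparable. loss_f, f_mae, and f_rmse are the more useful quantities to compare between runs.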
Are there no suggestions for addressing my issue?
From my personal experience, keeping the batch size small (1 or 4) is good practice in this framework, and I have seen that increasing the batch size decreases performance. I would suggest keeping l_max = 2 or 3, as more angular resolution provides better accuracy.
You may then try tuning the architecture by:
- increasing num_layers: 2, 4;
- increasing num_tensor_features: 32, 64, and also adjusting two_body_latent_mlp_latent_dimensions and latent_mlp_latent_dimensions accordingly; this will provide more channels;
- adjusting learning_rate: 0.005, 0.001.
Hope this helps!
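The values suggested above can be enumerated as a small search grid. This is only a sketch of how the combinations could be listed before writing one config file per run. `expand_grid` is a hypothetical helper, and the two latent-dimension keys are left out because the reply does not give concrete values for them.

```python
from itertools import product

# Candidate values taken from the suggestions above.
search_space = {
    "l_max": [2, 3],
    "num_layers": [2, 4],
    "num_tensor_features": [32, 64],
    "learning_rate": [0.005, 0.001],
}


def expand_grid(space):
    """Return one dict per combination of the candidate values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


candidates = expand_grid(search_space)
print(len(candidates), "combinations")
print(candidates[0])
```

Each resulting dict would override the matching keys in the training config. Given long training times, trying a handful of these combinations is probably more practical than running the full grid.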
Dear David,
I will check all of them and let you all know the outcome.
Thanks a lot,
Best Regards,
Elham
Dear David,
I have a problem with the wall-time limit on my system: it is 2 days, after which the server stops our job.
For example, I could only complete 80 of the 200 epochs in 2 days. Is there any way to restart the job from where it left off, for example from epoch 81?
Thanks for your time,
Elham
Hi Elham,
Thanks for your question! Allegro will restart from the best model saved from the previous run when you keep the same run_name in your config file. In your case, the best model should be the result at the 80th epoch, as I assume the loss has not plateaued yet. You can also set append: true, as in the example.yaml, so the log will be appended.
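In terms of the config, the restart described above changes very little. The sketch below represents the config as a plain Python dict to make that explicit. `restart_config` and the run name are hypothetical, and the real settings live in the YAML config file.

```python
def restart_config(previous):
    """Return a config for resubmitting an interrupted run.

    run_name is deliberately left unchanged so the run is picked up again;
    append is switched on so the existing log is extended.
    """
    cfg = dict(previous)   # shallow copy; the original dict is not modified
    cfg["append"] = True
    return cfg


previous = {"run_name": "lgpo-allegro"}   # hypothetical run name
resumed = restart_config(previous)
print(resumed)
```

The resubmitted job would then use the same config file as before, with only append: true added.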
Dear David,
Thanks for your guidance and your time.