[BUG] Checkpointing Causes Memory Crashes On Large Systems
Describe the bug
Currently, streaming to serialized checkpoint (.s3) files causes crashes when running FEP on large systems because the machine runs out of memory. For example, this particular system of around 770 residues causes memory crashes on a machine with 32 GB of RAM after around 1ns of dynamics when the frame saving frequency is 100ps. The memory usage of the SOMD2 process can also be seen to climb steadily as the simulation progresses. While it is possible to circumvent this issue by either severely decreasing the frame saving frequency or disabling checkpointing entirely, neither serves as a good long-term solution to the problem.
To reproduce
Extract the provided tar.gz file and run the SOMD2 input file with:
somd2 perturbablemin_system.bss --timestep 1fs --cutoff-type rf --equilibration-timestep 1fs --equilibration-time 500ps --checkpoint-frequency 500ps --frame-frequency 100ps
Given the timestep used and the size of the system, it might take a while to run the FEP simulation to the point where it crashes. I also believe that running this on multiple GPUs at once is more problematic than running on a single one, since with multi-GPU runs there will be multiple trajectories held in memory at any given time.
Input files
Environment information:
SOMD2 version: 0.1.dev266+ge2e7a97
Sire version: 2024.1.0.dev+d971cfd
Thanks @akalpokas. I've tried writing the trajectory at each checkpoint, then using system.delete_all_frames() to clear the buffer. However, this simply overwrites the trajectory each time with the frames between checkpoints, rather than appending new frames to the existing trajectory file. Perhaps there is another way to flush the memory. I've also had a look at the DCD and DCDFile classes in Sire, but don't see an obvious way to append frames to an existing file.
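
To illustrate the behaviour I'm after, here is a minimal, generic sketch of "flush the buffer at each checkpoint and append to the file on disk". It deliberately does not use Sire or the DCD format: the raw float32 file layout and the helper names (append_frames, count_frames, read_frame) are assumptions made up for this example, and the frames are dummy data. The only point is that opening the file in append mode keeps the earlier frames, so the in-memory buffer never holds more than one checkpoint block.

```python
import os
import tempfile
from array import array

ITEM_SIZE = array("f").itemsize  # bytes per stored coordinate (float32)


def append_frames(path, frames):
    """Append frames to a raw file; each frame is a flat list of x, y, z values."""
    with open(path, "ab") as fh:  # "ab" appends instead of overwriting
        for frame in frames:
            array("f", frame).tofile(fh)


def count_frames(path, n_atoms):
    """Number of complete frames currently stored in the file."""
    return os.path.getsize(path) // (n_atoms * 3 * ITEM_SIZE)


def read_frame(path, n_atoms, index):
    """Read a single frame back without loading the whole trajectory."""
    n_values = n_atoms * 3
    data = array("f")
    with open(path, "rb") as fh:
        fh.seek(index * n_values * ITEM_SIZE)
        data.fromfile(fh, n_values)
    return data.tolist()


# Hypothetical run: 10 frames in total, flushed to disk every 5 frames
# (e.g. 100ps frames with 500ps checkpoints).
n_atoms = 4
frames_per_checkpoint = 5
path = os.path.join(tempfile.mkdtemp(), "traj.raw")

buffer = []
max_buffered = 0
for step in range(10):
    buffer.append([float(step)] * (n_atoms * 3))  # dummy coordinates
    max_buffered = max(max_buffered, len(buffer))
    if len(buffer) == frames_per_checkpoint:
        append_frames(path, buffer)
        buffer.clear()  # memory use stays bounded by one checkpoint block

total_frames = count_frames(path, n_atoms)
```

With something like this, the trajectory on disk grows across checkpoints while the buffer is reset each time. Whether the same can be done with the existing trajectory writers is the open question.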