[BUG] Trajectories are not properly stored after `mols.delete_all_frames()` is called.

Question

[BUG] Trajectories are not properly stored after `mols.delete_all_frames()` is called.

mb2055 opened this issue 4 months ago · 16 comments

mb2055 commented 4 months ago

Describe the bug
When periodically deleting the frames contained in a system (as is done in SOMD2 when checkpointing), any trajectories contained in a system after calling mols.delete_all_frames() will only contain a single frame. This issue also seems to cause a segmentation fault, but only around 50% of the time.

To Reproduce
Run the following script:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    mols.delete_all_frames()

the output will be the following:

mols currently has 5 frames
mols currently has 1 frames
mols currently has 1 frames
FATAL: exception not rethrown
Aborted (core dumped)

Expected behavior
The output should be

mols currently has 5 frames
mols currently has 5 frames
mols currently has 5 frames

(please complete the following information):

OS: Ubuntu 22.04.4 LTS
Version of Python: 3.12.4
Version of sire: 2024.3.0.dev
I confirm that I have checked this bug still exists in the latest released version of sire: yes

Additional context
Note that everything works as expected if mols.delete_all_frames() is not called:

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

Answer 1 · 2024-07-25T14:29:37.000Z

Just to add that I have also been seeing...

FATAL: exception not rethrown
Aborted (core dumped)

about 50% of the time on interpreter shutdown on my feature_emle branch having merged in devel following the 2024.2.0 release. In this case I am not calling .delete_all_frames() anywhere. I just run a short block of dynamics and get the energy trajectory.

Answer 2 · 2024-07-25T16:33:00.000Z

That is interesting - the bug is likely happening because the cache is being improperly shared between the copy of mols in d, and the copy of mols that was output by d.commit(). In theory, it should not be shared, and so running mols.delete_all_frames() should not be deleting all frames for the mols in d, and so the right output from your script should be

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

Let me think about this and investigate further.

Answer 3 · 2024-10-16T13:41:13.000Z

If you use the return_as_system=True kwarg to d.commit() then you can see that the underlying system (for the dynamics object) does still contain all of the frames, i.e.:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit(return_as_system=True)
    print(f"mols currently has {mols.num_frames()} frames")
    mols.delete_all_frames()

Gives:

mols currently has 5 frames
mols currently has 10 frames
mols currently has 15 frames

This still has issues with random crashes on interpreter exit, though.

Looking at the logic in the d.commit() it seems that the issue must be during the update of the original system:

    def commit(self, return_as_system: bool = False):
        if self.is_null():
            return

        self._update_from(
            state=self._get_current_state(include_coords=True, include_velocities=True),
            state_has_cv=(True, True),
            nsteps_completed=self._current_step,
        )

        self._sire_mols.set_energy_trajectory(self._energy_trajectory, map=self._map)

        self._sire_mols.set_ensemble(self.ensemble())

        if return_as_system:
            return self._sire_mols.clone()

        elif self._orig_mols is not None:
            from ..system import System

            if System.is_system(self._orig_mols):
                return self._sire_mols
            else:
                r = self._orig_mols.clone()
                r.update(self._sire_mols.molecules())
                return r
        else:
            return self._sire_mols.clone()

I'll take a look to see what's going wrong.

Answer 4 · 2024-10-16T13:46:20.000Z

Okay, this seems to fix it:

diff --git a/src/sire/mol/_dynamics.py b/src/sire/mol/_dynamics.py
index cdb8a56b..965a67f9 100644
--- a/src/sire/mol/_dynamics.py
+++ b/src/sire/mol/_dynamics.py
@@ -1144,7 +1144,7 @@ class DynamicsData:
             from ..system import System

             if System.is_system(self._orig_mols):
-                return self._sire_mols
+                return self._sire_mols.clone()
             else:
                 r = self._orig_mols.clone()
                 r.update(self._sire_mols.molecules())

@chryswoods: Let me know if this is the correct solution.

I'll look at how to actually delete the frames for the purposes of chunking trajectories in somd2.

Answer 5 · 2024-10-16T14:45:55.000Z

Hmm, but if you delete the frames from the underlying system object, then it doesn't work as expected again:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()

Gives:

mols currently has 5 frames
mols currently has 0 frames
mols currently has 0 frames

Answer 6 · 2024-10-16T14:59:20.000Z

Yeah, there's weirdness going on. For example, if you delete a single frame (the first frame in this case, but it doesn't matter):

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_frame(0)

Gives:

mols currently has 5 frames
mols currently has 4 frames
mols currently has 3 frames

Deleting two frames:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_frame(0)
    d._d._sire_mols.delete_frame(0)

Gives:

mols currently has 5 frames
mols currently has 3 frames
mols currently has 1 frames

I think deleting all frames must close the trajectory file handle somehow, so you are only ever left with one frame. That said, the coordinates of the molecule do appear to be updated, e.g.:

import sire as sr

mols = sr.load_test_files("ala.top", "ala.crd")
mols = mols.minimisation().run().commit()

d = mols.dynamics(timestep="4fs", temperature="25oC")

for i in range(3):
    d.run("1ps", frame_frequency="0.2ps")
    mols = d.commit()
    print(f"mols currently has {mols.num_frames()} frames")
    d._d._sire_mols.delete_all_frames()
    print(mols.coordinates()

Gives:

mols currently has 5 frames
( 13.3499 Å, 14.5876 Å, 12.5228 Å )
mols currently has 1 frames
( 13.4061 Å, 14.5797 Å, 12.6688 Å )
mols currently has 1 frames
( 13.2908 Å, 14.5102 Å, 12.5739 Å )

Answer 7 · 2024-10-16T15:33:36.000Z

This is what happens when I print out the call to System::saveFrame:

System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 5 frames
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 0 frames
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
System::saveFrame
mols currently has 0 frames

So the function to save the frames is being called the correct number of times. I think the number of frames must be inconsistent somewhere.

Answer 8 · 2024-10-16T15:56:41.000Z

Ah, I see, there's a SystemFrames object that stores snapshots from the live trajectory. I think this isn't in sync with the trajectory property in the system somehow. In any case, it's easy to adapt the chunking since, I believe, the trajectory is no longer streamed to file. I can just track the current number of frames in the system and only write the new frames to the chunk. I'll just do this and we can figure out the underlying issue here at a later point if it's too fiddly.

Answer 9 · 2024-10-16T16:37:28.000Z

Yes, I've updated somd2 and it works perfectly. Can revisit this at a later date if needed.

Answer 10 · 2024-10-17T01:16:09.000Z

Yes, this looks like the issue. There's a lot of challenge and complexity in this code because it mixes objects which are copy on write (System, Molecule) with objects that appear to be copy on write, but are actually explicitly shared (Trajectory contents) and hidden objects that are explicitly shared and (almost) globally cached (the SystemFrames that are stored in the cache. To make it more complex, the cache auto-offloads to disk as each cache frame fills up.

The code was tested well for the normal case (generating a trajectory and saving it) but there's a lot of edge case bugs I guess when doing things like deleting individual trajectory frames from on implicitly shared copy, while the other copy needs a trajectory with all the frames in their original order. In these cases the code has to do a lot of juggling and memory movement.

It would be worth at a future point looking at what somd2 really needs in terms of trajectory management and adding some custom code that implements that scheme (e.g. in C++, so that the trajectory doesn't need to be copied or excessively moved around).

Answer 11 · 2024-10-17T01:20:17.000Z

Also, correct, the trajectory isn't now streamed to a file. The frames are held in a cache that auto-offloads to disk as needed. You only write to a file when you explicitly save the trajectory. The cache keeps about 512MB of trajectory in memory, if I remember correctly, with the rest going to a page cache on disk. The cache is very efficient, memory mapping pages and giving you good random access performance to individual frames (each frame is paged individually). There's also an intermediary cache and a little look ahead so that frames are pulled from disk to memory efficiently, and stay there as long as needed. In general, it should mean that you don't need to do manual trajectory management now during a simulation or have to worry about filling memory. Instead, you can choose what frames to save after the simulation is complete.

Answer 12 · 2024-10-17T05:24:11.000Z

Great. This is what I'm now doing, i.e. writing the last N frames as a chunk at a checkpoint, then assembling the whole trajectory at the end. I guess I could probably keep the file open throughout and append, but this works well for now and avoids a partially written or corrupted file.