nomad-coe/atomistic-parsers

Problems with Gromacs parser from umbrella sampling simulation

JFRudzinski opened this issue · 27 comments

A user posted some issues with the Gromacs parser for a pair of umbrella sampling simulations. It is unclear what the issue is upon first glance (probably there are several issues). Will debug here...

First problem: str_to_energies function is not properly parsing the log file. str_to_energies function has this "P" in the regex matcher, what is this for? What happens is that in the example data #2 it is leaving out "P" whenever it begins a key word (possibly only follow by a period, I am not sure). If I remove the P and just require 2 spaces after the end of the string, I get the expected behavior.

@ladinesa Is this correct? I don't know what the original intention was here. (You might want to look at the rest of the str_to_energies function

def str_to_energies(val_in):
  energy_keys_re = re.compile(r'(.+?)(?:  |\Z| P)')  # CURRENT
  energy_keys_re = re.compile(r'(.+?)(?:  |\Z|  )')  # SUGGESTED
  
 Example data #1:
 
    Energies (kJ/mol)
        G96Bond       G96Angle  Improper Dih.        LJ (SR)        LJ (LR)
    1.34229e+01    1.35642e+01    5.09136e+00    4.40070e+03   -7.67386e+01
  Disper. corr.   Coulomb (SR)   Coul. recip.      Potential    Kinetic En.
   -5.50745e+01   -2.50242e+04   -2.67009e+03   -2.33933e+04    3.69409e+03
   Total Energy    Temperature Pres. DC (bar) Pressure (bar)    dVremain/dl
   -1.96992e+04    2.91821e+02   -1.21477e+02   -2.75978e+02   -9.93820e+01
   
 Example data #2:
 
    Energies (kJ/mol)
           Bond            U-B    Proper Dih.  Improper Dih.      CMAP Dih.
    3.95020e+03    1.15459e+04    1.35009e+04    6.79906e+02   -3.09901e+02
          LJ-14     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)
    3.17579e+03    7.45808e+04    2.23587e+05   -1.21064e+04   -2.46824e+06
   Coul. recip. Position Rest.   COM Pull En.      Potential    Kinetic En.
    6.83743e+03    6.02819e-04    3.23664e-12   -2.14279e+06    4.13386e+05
   Total Energy  Conserved En.    Temperature Pres. DC (bar) Pressure (bar)
   -1.72941e+06   -1.72950e+06    3.14117e+02   -1.31770e+02    2.11987e+02
   Constr. rmsd
    3.63345e-06

Above suggestion was not sufficient, fix by Alvin:
energy_keys_re = re.compile(r'(.+?)(?: |\Z| (?=P))')

Ok, this was actually more complicated then we initially thought. The difficulty is dealing with single spaces between keys when the following key reaches the maximum character limit (14). For the record, here is the new code which addresses the issue:

        def str_to_energies(energies_block):

            max_char_name = 14
            rows = energies_block.strip().split("\n")
            names = [rows[i_row] for i_row in range(len(rows)) if i_row % 2 == 0]
            values = [rows[i_row] for i_row in range(len(rows)) if i_row % 2 != 0]

            names_re = r"(?<!\S)[A-Za-z0-9.()\/-]{1,}(?: [\S][A-Za-z0-9.()\/-]{1,})*(?!\S)"
            values_re = r"[-+]?\d+\.\d+e[+-]?\d+"
            names = [re.findall(names_re, name) for name in names]
            names = [name for group in names for name in group]
            names_split = []
            for name in names:
                if len(name) <= max_char_name:
                    names_split.append(name.strip())
                else:
                    remainder_len = len(name) % max_char_name
                    divisor = int(len(name) / max_char_name)
                    names_split.append(name[:remainder_len - divisor].strip())
                    for i in range(int(len(name) / max_char_name)):
                        start = remainder_len - divisor + (i + 1) + i * max_char_name
                        end = start + max_char_name
                        names_split.append(name[start:end].strip())

            values = [re.findall(values_re, val) for val in values]
            values = [val for group in values for val in group]

            if len(names_split) != len(values):
                self.logger.error('Error matching energy values.')

            energies = {}
            for key, val in zip(names_split, values):
                if key == 'Temperature':
                    energies[key] = float(val) * ureg.kelvin
                elif key.startswith('Pres'):
                    key = key.rstrip(' (bar)')
                    energies[key] = float(val) * ureg.bar
                else:
                    energies[key] = float(val) / MOL * ureg.kJ

            return energies

Note: I also had to change the step_info regex slightly, because it was falsely matching another instance of step in my example log file. New code:

            Quantity(
                'step_info',
                r'[\r\n]\s*(Step.+\n[\d\.\- ]+)',
                str_operation=str_to_step_info, convert=False)]

I am now continuing the debugging process...

Ok, this was actually more complicated then we initially thought. The difficulty is dealing with single spaces between keys when the following key reaches the maximum character limit (14). For the record, here is the new code which addresses the issue:

        def str_to_energies(energies_block):

            max_char_name = 14
            rows = energies_block.strip().split("\n")
            names = [rows[i_row] for i_row in range(len(rows)) if i_row % 2 == 0]
            values = [rows[i_row] for i_row in range(len(rows)) if i_row % 2 != 0]

            names_re = r"(?<!\S)[A-Za-z0-9.()\/-]{1,}(?: [\S][A-Za-z0-9.()\/-]{1,})*(?!\S)"
            values_re = r"[-+]?\d+\.\d+e[+-]?\d+"
            names = [re.findall(names_re, name) for name in names]
            names = [name for group in names for name in group]
            names_split = []
            for name in names:
                if len(name) <= max_char_name:
                    names_split.append(name.strip())
                else:
                    remainder_len = len(name) % max_char_name
                    divisor = int(len(name) / max_char_name)
                    names_split.append(name[:remainder_len - divisor].strip())
                    for i in range(int(len(name) / max_char_name)):
                        start = remainder_len - divisor + (i + 1) + i * max_char_name
                        end = start + max_char_name
                        names_split.append(name[start:end].strip())

            values = [re.findall(values_re, val) for val in values]
            values = [val for group in values for val in group]

            if len(names_split) != len(values):
                self.logger.error('Error matching energy values.')

            energies = {}
            for key, val in zip(names_split, values):
                if key == 'Temperature':
                    energies[key] = float(val) * ureg.kelvin
                elif key.startswith('Pres'):
                    key = key.rstrip(' (bar)')
                    energies[key] = float(val) * ureg.bar
                else:
                    energies[key] = float(val) / MOL * ureg.kJ

            return energies

ok I will have a look when pr is ready.

@ladinesa You can continue to ignore these for now, I just want to document the issues I experience.

Next problem:

Something is wrong with the trr file provided, even though it is present. We used to only check xtc file if trr file was missing. So, now I check to see if the positions are present in the MDA universe made from the trr file, and if not, we try the xtc file if it exists. This is in parse() function:

        # I have no idea if output trajectory file can be specified in input
        trr_file = self.get_gromacs_file('trr')
        trr_file_nopath = trr_file.rsplit('.', 1)[0]
        trr_file_nopath = trr_file_nopath.rsplit('/')[-1]
        xtc_file = self.get_gromacs_file('xtc')
        xtc_file_nopath = xtc_file.rsplit('.', 1)[0]
        xtc_file_nopath = xtc_file_nopath.rsplit('/')[-1]
        if not trr_file_nopath.startswith(self._basename):
            trajectory_file = xtc_file if xtc_file_nopath.startswith(self._basename) else trr_file
        else:
            trajectory_file = trr_file

        self.traj_parser.mainfile = topology_file
        self.traj_parser.auxilliary_files = [trajectory_file]
        # check to see if the trr file can be read properly (and has positions), otherwise try xtc file instead
        positions = getattr(self.traj_parser, 'universe', None)
        positions = getattr(positions, 'atoms', None) if positions else None
        positions = getattr(positions, 'positions', None) if positions else None
        if positions is None:
            self.traj_parser.auxilliary_files = [xtc_file] if xtc_file else [trr_file]

A few more things that I corrected so far:

  1. Another issue with the energy regex. Sometimes there is no empty line after the energies block, some error in writing to the file I guess. Instead there is a line about the load imbalance. So, I changed the regex of grabbing the energy block to r'Energies \(kJ/mol\)\s*([\s\S]+?)\nD.* step.* load imb.*|\n\n',. Not sure how generalizable this is, but we can see if other cases occur.

  2. There was some error occurring in the storage of the Rg values within the archive. This probably had to do with varying lengths of the calculation and system lists. I re-coded this function within the workflow normalizer, so that it now properly deals with the system references within calculation.

  3. While addressing item 2 above, I realized that the archive_to_universe function in atom_utils did not properly set the trajectory times. I now added some code to import the spacing between system frames (in terms of physical times) within this function.

The above fixing resulted in the first example file provided by the user to be parsed properly. I am now continuing to test the remaining examples...

@ladinesa Could you take a look at some point at the loop over atoms (for n in range(n_atoms):) in the parse_method() function of the gromacs parser?

The data that the user supplied is a relatively large system (~150k atoms). I put some restrictions in place in terms of calculating rdfs and msds, so that this is skipped for molecule types that have more than some cutoff number of molecules, since in many cases for protein simulations there are tens of thousands of water molecules, and calculating these quantities is a waste. (I guess in general it is questionable how useful these things are anyway, this is something we will assess in the future as custom observables are supported).

In any case, once I made these restrictions, I timed the various functions to see where the bottleneck of processing was:
done with initial parsing: 0.036649219195048016 min
done with method parsing: 3.9186079184214275 min
done with atoms loop in parse_method: 3.5282757679621377 min
done with system parsing: 0.2758171478907267 min
done with calculation parsing: 0.049816056092580156 min
done with input parsing: 4.337231318155924e-05 min
done with workflow parsing: 1.7323389212290445 min
done with calculating rdfs: 0.05297237237294515 min
done with calculating msds: 1.6793344179789225 min

A decent portion of the parsing time is taken in this atoms loop in parse_method, ~3.5 min out of ~7 min for complete processing (parsing + normalization). I was just wondering if we could make this more efficient, especially since parse_system() is also dealing with the atoms list and creates the entire topology but only takes ~0.3 min.

@ladinesa Could you take a look at some point at the loop over atoms (for n in range(n_atoms):) in the parse_method() function of the gromacs parser?

The data that the user supplied is a relatively large system (~150k atoms). I put some restrictions in place in terms of calculating rdfs and msds, so that this is skipped for molecule types that have more than some cutoff number of molecules, since in many cases for protein simulations there are tens of thousands of water molecules, and calculating these quantities is a waste. (I guess in general it is questionable how useful these things are anyway, this is something we will assess in the future as custom observables are supported).

In any case, once I made these restrictions, I timed the various functions to see where the bottleneck of processing was: done with initial parsing: 0.036649219195048016 min done with method parsing: 3.9186079184214275 min done with atoms loop in parse_method: 3.5282757679621377 min done with system parsing: 0.2758171478907267 min done with calculation parsing: 0.049816056092580156 min done with input parsing: 4.337231318155924e-05 min done with workflow parsing: 1.7323389212290445 min done with calculating rdfs: 0.05297237237294515 min done with calculating msds: 1.6793344179789225 min

A decent portion of the parsing time is taken in this atoms loop in parse_method, ~3.5 min out of ~7 min for complete processing (parsing + normalization). I was just wondering if we could make this more efficient, especially since parse_system() is also dealing with the atoms list and creates the entire topology but only takes ~0.3 min.

can you send me the file so I can benchmark it reliably? Have to tried commenting out the loop to see if this is really the bottle neck?

Yes, if you comment out the loop, the entire parsing takes 3.5 min less.

I have included 2 tests in HU box, since the zip is ~1.5GB: https://box.hu-berlin.de/f/fbde8aa6b21c4f6e9140/?dl=1
There are 2 test systems there, main files = npt2.log, umbrella2.log. Each have ~150k atoms, but umbrella2.log has a longer trajectory.

NOTE: For testing these files you should either:

  1. use the current version of this branch (I just pushed my latest changes which also include some simple print statements with the timings of each part of the code) and also the corresponding version of NOMAD - https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1643 -- These include my changes which avoid calculating the rdfs and msds for the >100k water molecules. Otherwise the parsing will take forever.
  2. If you want to just use the dev version, simply comment out parse_workflow() and only parse (i.e., don't run the normalization)

Let me know if you need anything else or want to discuss. I want to make a few more changes to the code, and then I will probably be ready for the PR

Yes, if you comment out the loop, the entire parsing takes 3.5 min less.

I have included 2 tests in HU box, since the zip is ~1.5GB: https://box.hu-berlin.de/f/fbde8aa6b21c4f6e9140/?dl=1 There are 2 test systems there, main files = npt2.log, umbrella2.log. Each have ~150k atoms, but umbrella2.log has a longer trajectory.

NOTE: For testing these files you should either:

  1. use the current version of this branch (I just pushed my latest changes which also include some simple print statements with the timings of each part of the code) and also the corresponding version of NOMAD - https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1643 -- These include my changes which avoid calculating the rdfs and msds for the >100k water molecules. Otherwise the parsing will take forever.
  2. If you want to just use the dev version, simply comment out parse_workflow() and only parse (i.e., don't run the normalization)

Let me know if you need anything else or want to discuss. I want to make a few more changes to the code, and then I will probably be ready for the PR

I cannot think of any way to address this rather than creating AtomParameters only for unique species. Is this okay?

Yes, if you comment out the loop, the entire parsing takes 3.5 min less.
I have included 2 tests in HU box, since the zip is ~1.5GB: https://box.hu-berlin.de/f/fbde8aa6b21c4f6e9140/?dl=1 There are 2 test systems there, main files = npt2.log, umbrella2.log. Each have ~150k atoms, but umbrella2.log has a longer trajectory.
NOTE: For testing these files you should either:

  1. use the current version of this branch (I just pushed my latest changes which also include some simple print statements with the timings of each part of the code) and also the corresponding version of NOMAD - https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/issues/1643 -- These include my changes which avoid calculating the rdfs and msds for the >100k water molecules. Otherwise the parsing will take forever.
  2. If you want to just use the dev version, simply comment out parse_workflow() and only parse (i.e., don't run the normalization)

Let me know if you need anything else or want to discuss. I want to make a few more changes to the code, and then I will probably be ready for the PR

I cannot think of any way to address this rather than creating AtomParameters only for unique species. Is this okay?

Hmmm, I think it would be ok for many cases. Typically mass and charge is set via atomtypes in gromacs, so that would be consistent, however it is possible to overwrite these for individual atoms. Can we default to doing it by species type (and by this I mean the label within atom parameters not the atomic element, the labels are more specific generally), and then do some sort of check for inconsistencies? Or will this slow your approach down anyway?

default

do some sort of check for inconsistencies

to check for inconsistency you mean to make sure charge, resid, motype etc is the same for the same label? If not, I should also write it as an AtomParameter?

default

do some sort of check for inconsistencies

to check for inconsistency you mean to make sure charge, resid, motype etc is the same for the same label? If not, I should also write it as an AtomParameter?

I meant to check if each identical label has all the same attributes and if not then create a new entry. I would suggest a grouping hierarchy where atoms with all same AtomParameters are grouped together into a single type with the corresponding attributes and then simple a list of atom indices.

AtomParameters

yes this is what i meant. I will create a new branch to implement this from develop or from your working branch?

In my example, charges vary for the same label. should I include it in the comparison?

AtomParameters

yes this is what i meant. I will create a new branch to implement this from develop or from your working branch?

Let's create a different branch I think.

In my example, charges vary for the same label. should I include it in the comparison?

I think if any of the attributes differ then we need to create a new group, unless you have some other way to keep tracking of the varying attributes. It may end up being too tedious or defaulting to the old way. One thing that may help this is to improve the storage of the FF, which will define these things more clearly...but that is out of scope for now.

have a look at the changes in a9e769c

have a look at the changes in a9e769c

Yeah this looks good to me! Have you considered downstream effects though? I know, e.g., that in the archive_to_universe function in atomutils we use the masses, etc., to build the universe. This is then used to calculate COM for the rdf and msd calculation. We need to check any usage of the atom parameters throughout the code.

@ladinesa FYI - I had to revert the MDAnalysis requirement to version 2.1.0. Version 2.5.0 has some errors when reading some versions of trr files. All the tests within the atomistic parsers pass, but just wanted to let you know in case you had some other usage somewhere that needed the newer version (I presume not and I am the one who put 2.5.0, but I can't remember).

On a slightly related note. I always get this warning:

WARNING  nomad.client         2023-08-03T07:28:33 Incompatible version of MDAnalysis.
  - nomad.client.parser: parsers/gromacs
  - nomad.commit: 
  - nomad.deployment: devel
  - nomad.service: unknown nomad service
  - nomad.version: 1.2.1.dev45+g1514d2286.d20230803

Even when I am using the version of MDA specified in pyproject.toml. Do you know where this is coming from? I would like to correct this.

@ladinesa FYI - I had to revert the MDAnalysis requirement to version 2.1.0. Version 2.5.0 has some errors when reading some versions of trr files. All the tests within the atomistic parsers pass, but just wanted to let you know in case you had some other usage somewhere that needed the newer version (I presume not and I am the one who put 2.5.0, but I can't remember).

On a slightly related note. I always get this warning:

WARNING  nomad.client         2023-08-03T07:28:33 Incompatible version of MDAnalysis.
  - nomad.client.parser: parsers/gromacs
  - nomad.commit: 
  - nomad.deployment: devel
  - nomad.service: unknown nomad service
  - nomad.version: 1.2.1.dev45+g1514d2286.d20230803

Even when I am using the version of MDA specified in pyproject.toml. Do you know where this is coming from? I would like to correct this.

This is coming from get_force_field_parameters, where I used the reader from version 2.0.0. I am really happy about this function. Maybe we can remove this or not use it for now until we find a more robust solution.

have a look at the changes in a9e769c

Yeah this looks good to me! Have you considered downstream effects though? I know, e.g., that in the archive_to_universe function in atomutils we use the masses, etc., to build the universe. This is then used to calculate COM for the rdf and msd calculation. We need to check any usage of the atom parameters throughout the code.

Actually, I did have a warning about the atomtypes not having a length not equal to number of atoms during normalization. Not sure though how to deal with it without digging a bit.

have a look at the changes in a9e769c

Yeah this looks good to me! Have you considered downstream effects though? I know, e.g., that in the archive_to_universe function in atomutils we use the masses, etc., to build the universe. This is then used to calculate COM for the rdf and msd calculation. We need to check any usage of the atom parameters throughout the code.

Actually, I did have a warning about the atomtypes not having a length not equal to number of atoms during normalization. Not sure though how to deal with it without digging a bit.

I can try to go through and address this if you would like. I have just been digging through the Gromacs parser, so I may be able to find these things quickly.

I guess the question is how we deal with this more generally. For example, are we going to make equivalent changes for the Lammps parser as well? How will a user know whether the AtomParameters are grouped or not in general?

rameters are

It seems that we cannot sample AtomParameters like frames right? In that case, we cannot do anything about it, if it takes an additional 5 mins to write 150k entries then so be it.

Let me think about this more once I am done with this other PR...