openforcefield/openff-toolkit

`Topology.identical_molecule_groups` is slow and order-dependent with some `_SimpleMoleucles`

mattwthompson opened this issue · 2 comments

Describe the bug

Repeated calls to _SimpleAtom.molecule_atom_index result in Topology.identical_molecule_groups being quite slow for some systems, at least when

  • Molecules are represented, at least in part, by _SimpleMolecules
  • There is a single large molecule and many copies of a smaller molecule (i.e. a protein with waters)
  • The large molecule is listed first

On its face (i.e. just reading the reproduction without context) this sounds like a wild use case, but it's only the basis by which Interchange.from_openmm is able to process externally-prepared systems (at least now 1). OpenMM infrastructure in particularly strong with protein force fields in ways that OpenFF's is not yet, so this is a common state for users to end up at.

To Reproduce

import time

from openff.toolkit.topology._mm_molecule import _SimpleMolecule
from openff.toolkit import Molecule, Topology


# wget -O protein.pdb \
# https://raw.githubusercontent.com/openforcefield/protein-ligand-benchmark/a0b67b93c4f12f63a8bb417a1f6e4df4dc10c6c5/data/shp2/01_protein/crd/protein.pdb
protein = _SimpleMolecule.from_molecule(Topology.from_pdb("protein.pdb").molecule(0))
water = _SimpleMolecule.from_molecule(Molecule.from_smiles("O"))

n = 100

for order in ("before", "after", "middle"):
    match order:
        case "before":
            topology = Topology.from_molecules([protein] + n * [water])
        case "after":
            topology = Topology.from_molecules(n * [water] + [protein])
        case "middle":
            topology = Topology.from_molecules(
                int(n / 2) * [water] + [protein] + int(n / 2) * [water]
            )

    start = time.time()

    topology.identical_molecule_groups

    print(order, round(time.time() - start, 3))

    del topology

Output

$ sudo py-spy record -o profile.svg -- python protein-check.py                                          8:06:11  ☁  fix-1022 ☂
py-spy> Sampling process 100 times a second. Press Control-C to exit.

LICENSE: Could not open license file "oe_license.txt" in local directory
LICENSE: N.B. OE_LICENSE environment variable is not set
LICENSE: N.B. OE_DIR environment variable is not set
LICENSE: No product keys!
LICENSE: No product keys!
LICENSE: No product keys!
LICENSE: No product keys!
The OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
The OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
before 66.432
after 0.638
middle 0.758

py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'profile.svg'. Samples: 7901 Errors: 0

(It seems like I haven't locally updated my OpenEye license, but I don't think that should matter here.)

Computing environment (please complete the following information):

Additional context

The flamegraph tells a pretty clear story:

image

If using Molecules only, the result is pretty quick:

before 0.126
after 0.006
middle 0.006

Footnotes

  1. Maybe in the future we'll be able to just create Molecule objects for many chemistries with just a bond graph, element information, and explicit hydrogens ... I forget which combination of things are and are not possible to make chemically well-defined molecules out of

A pretty naive solution does the trick:

diff --git a/openff/toolkit/topology/_mm_molecule.py b/openff/toolkit/topology/_mm_molecule.py
index 99eb0728..3d98e6c1 100644
--- a/openff/toolkit/topology/_mm_molecule.py
+++ b/openff/toolkit/topology/_mm_molecule.py
@@ -10,6 +10,7 @@ TypedMolecule TODOs
 
 """
 
+import functools
 from typing import TYPE_CHECKING, Generator, Iterable, NoReturn, Optional, Union
 
 from openff.units.elements import MASSES, SYMBOLS
@@ -618,7 +619,7 @@ class _SimpleAtom:
                 if atom is not self:
                     yield atom
 
-    @property
+    @functools.cached_property
     def molecule_atom_index(self) -> int:
         return self.molecule.atoms.index(self)
 
before 1.358
after 0.016
middle 0.015
image

This is now in the development branch, after bb9d97f

before 1.419
after 0.055
middle 0.018

It's a little strange that the order matters, but right now I can't prioritize finding out why or possibly fixing it. Happy to see somebody else take a stab at it