`Topology.identical_molecule_groups` is slow and order-dependent with some `_SimpleMoleucles`
mattwthompson opened this issue · 2 comments
Describe the bug
Repeated calls to _SimpleAtom.molecule_atom_index
result in Topology.identical_molecule_groups
being quite slow for some systems, at least when
- Molecules are represented, at least in part, by
_SimpleMolecule
s - There is a single large molecule and many copies of a smaller molecule (i.e. a protein with waters)
- The large molecule is listed first
On its face (i.e. just reading the reproduction without context) this sounds like a wild use case, but it's only the basis by which Interchange.from_openmm
is able to process externally-prepared systems (at least now 1). OpenMM infrastructure in particularly strong with protein force fields in ways that OpenFF's is not yet, so this is a common state for users to end up at.
To Reproduce
import time
from openff.toolkit.topology._mm_molecule import _SimpleMolecule
from openff.toolkit import Molecule, Topology
# wget -O protein.pdb \
# https://raw.githubusercontent.com/openforcefield/protein-ligand-benchmark/a0b67b93c4f12f63a8bb417a1f6e4df4dc10c6c5/data/shp2/01_protein/crd/protein.pdb
protein = _SimpleMolecule.from_molecule(Topology.from_pdb("protein.pdb").molecule(0))
water = _SimpleMolecule.from_molecule(Molecule.from_smiles("O"))
n = 100
for order in ("before", "after", "middle"):
match order:
case "before":
topology = Topology.from_molecules([protein] + n * [water])
case "after":
topology = Topology.from_molecules(n * [water] + [protein])
case "middle":
topology = Topology.from_molecules(
int(n / 2) * [water] + [protein] + int(n / 2) * [water]
)
start = time.time()
topology.identical_molecule_groups
print(order, round(time.time() - start, 3))
del topology
Output
$ sudo py-spy record -o profile.svg -- python protein-check.py 8:06:11 ☁ fix-1022 ☂
py-spy> Sampling process 100 times a second. Press Control-C to exit.
LICENSE: Could not open license file "oe_license.txt" in local directory
LICENSE: N.B. OE_LICENSE environment variable is not set
LICENSE: N.B. OE_DIR environment variable is not set
LICENSE: No product keys!
LICENSE: No product keys!
LICENSE: No product keys!
LICENSE: No product keys!
The OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
The OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
before 66.432
after 0.638
middle 0.758
py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'profile.svg'. Samples: 7901 Errors: 0
(It seems like I haven't locally updated my OpenEye license, but I don't think that should matter here.)
Computing environment (please complete the following information):
- Operating system: macOS 14.4
- Output of running
conda list
: https://gist.github.com/mattwthompson/fc7d5c67b2953aff1d07ddffa31b980f
Additional context
The flamegraph tells a pretty clear story:
If using Molecule
s only, the result is pretty quick:
before 0.126
after 0.006
middle 0.006
Footnotes
-
Maybe in the future we'll be able to just create
Molecule
objects for many chemistries with just a bond graph, element information, and explicit hydrogens ... I forget which combination of things are and are not possible to make chemically well-defined molecules out of ↩
A pretty naive solution does the trick:
diff --git a/openff/toolkit/topology/_mm_molecule.py b/openff/toolkit/topology/_mm_molecule.py
index 99eb0728..3d98e6c1 100644
--- a/openff/toolkit/topology/_mm_molecule.py
+++ b/openff/toolkit/topology/_mm_molecule.py
@@ -10,6 +10,7 @@ TypedMolecule TODOs
"""
+import functools
from typing import TYPE_CHECKING, Generator, Iterable, NoReturn, Optional, Union
from openff.units.elements import MASSES, SYMBOLS
@@ -618,7 +619,7 @@ class _SimpleAtom:
if atom is not self:
yield atom
- @property
+ @functools.cached_property
def molecule_atom_index(self) -> int:
return self.molecule.atoms.index(self)
before 1.358
after 0.016
middle 0.015
This is now in the development branch, after bb9d97f
before 1.419
after 0.055
middle 0.018
It's a little strange that the order matters, but right now I can't prioritize finding out why or possibly fixing it. Happy to see somebody else take a stab at it