Using neural net potentials to calculate energies and nonequilibrium monte carlo to sample tautomeric states.
Accurate calculations of tautomer ratios are hard. Quantum aspects of the proton transfer reaction and entropy changes introduced by the change in double bond positions present a particular theoretical challenge. Most of the published work on tautomer ratio calculations have phenomenological character and closer inspection reveal unjustified assumptions and/or cancelation of errors [1].
A reliable method for the determination of tautomer ratios would ideally include conformational sampling in explicit water and the calculation of the potential energy with QM level accuracy. These two goals seem reasonable and within reach with the development of neural net potentials like ANI-1 that enable energy calculation with DFT level accuracy and force field like time costs [2]. This allows the exploration of the conformational degrees of freedom while ensuring high level QM energy calculations.
This package contains code to run monte carlo (MC) and non-equilibrium candidate monte carlo simulations [3] (NCMC) using the ANI-1ccx potential to calculate work distributions for forward and reverse transformation between a set of tautomers and utilizes bennet’s acceptance ratio (BAR) to estimate the free energy difference between tautomer states.
NCMC constructs a non equilibrium protocol that achieves high acceptance rates with short correlation times. Instead of instantaneous proposing a new tautomer state from an equilibrium sample it uses a coupling parameter to construct a non equilibrium process with incremental changes to slowly transform tautomer 1 to tautomer 2. After each incremental perturbation the system is allowed to relax -- in such a way highly efficient proposals can be constructed.
The current implementation of the protocol does not actually accept any of these proposals. The work is recorded and used to calculate the free energy difference between the two states.
The propagation along the coupling parameter for the nonequilibrium protocol has three stages that contribute to the final work value:
- Decoupling of the hydrogen
- Moving the hydrogen to a new position
- Coupling the hydrogen
An instantaneous MC protocol would perform all three described steps in a single propagation step while the nonequilibrium protocol uses small incremental perturbations for (1) and (3).
The (de)coupling is performed by linearly scaling the energy contribution of the hydrogen that changes its binding partner to the total energy of the system. This is possible because the total energy of a molecule (Et) is computed as the sum over the output of neural net potential for each individual atom:
There are still contributions of a specific atom i to all other atoms if removed from the total sum of neural net potentials, as eqn. (2), (3) and eqn (4) in [2] contain sums over neighboring atoms. There are two ways to scale these ‘indirect’ contributions of a given atom i to its neighbors: either linearly scaling the atomic environment vectors between an environment with atom i and an environment without atom i. The downside of such an approach is that a linear scaling of the atomic environment vector does not necessarily result in a linear scaling of the energy. Another possibility is to scale the energy directly between an atomic environment vector with atom i and without atom i which would guarantee that the energy change of the system is linear in respect to a coupling parameter.
We decided on linearly scaling the total energy. That means at any given lambda there are two values: is calculated with the unmodified atomic environment vector and contains the sum over all atoms. is calculated with the modified atomic environment vector for which the atom that will be modified is removed. To scale the energies depending on lambda the following equation is used for the final energy:
The work values for step (1) and step (3) of the protocol (decoupling and coupling) is the sum over the dE along the protocol. For step (2) the work calculations will be discussed in more detail.
The acceptance ratio for proposing to move from configuration x (initial coordinates) in thermodynamic state A (tautomeric state 1) to configuration x’ (proposed coordinates) in thermodynamic state B (tautomeric state 2) is calculated as follows.
The first ratio describes the probability of in thermodynamic state B divided by the probability of in thermodynamic state A. This ratio is simply the difference in energy of the system in the initial and proposed position. The second ratio describes the probability density functions for the proposal process that generates configurations in tautomer state B given a configuration in tautomer state A and vice versa. The proposal density function and depends on the equilibrium bond length between the heavy atom and hydrogen. Since the hydrogen can move between acceptor and donor atoms with different elements and the equilibrium bond length can change the proposal density is not symmetric and the calculation can not be omitted.
There are two types of restrains (flat bottom restraint and harmonic restraint) added to the heavy atom - hydrogen bond during the nonequilibrium protocol in order to keep the hydrogen near its bonded partner while not fully coupled to the environment. These restraints are not contributing to the energy of the endpoints and only to the 'alchemical' part of the protocol. At the endpoints only the flat bottom restraint is active. The two restraint are combined in the following way.
We generated equilibrium samples (using openMM and gaff2) for the two tautomeric forms of mol298 and used openMM, ANI-1ccx and psi4 (wB97X/6-31g*) to generate energy histograms and plot the energy as a function of the timestep.
Results of the MC protocol that switches tautomer states in a single step from an equilibrium sample are shown for mol298 in both directions. It samples the hydrogen positions from a gaussian shell around the hydrogen acceptor heavy atom. The gaussian shell is defined by the equilibrium bond length between the hydrogen and the hydrogen acceptor heavy atom element and the standard deviation (currently set to 0.1 Angstrom for all hydrogen - heavy element pairs).
Results of the MC protocol that switches tautomer states in a single step from an equilibrium sample.
Bar estimate of the work: 2.53 +- 0.73 kcal/mol.
For the NCMC protocol result plots will only be shown for mol298. A typical protocol involves 5000 steps of Langevin dynamics (0.5 fs stepsize) in the beginning from which conformations are randomly drawn to start a single NCMC protocol run. A NCMC protocol run involves 1000 +1 perturbation steps. 500 steps to decouple the hydrogen, 1 step to move the hydrogen to its new position (the MC move is the same as in the instantaneous protocol) and 500 to couple the hydrogen to its environment. For each (de)couple increment 20 Langevin dynamics steps are performed. A single protocol takes on 2 CPUs around 30 minutes.
The distance between the hydrogen acceptor (in green) and the hydrogen donor (in blue) and the tautomer-hydrogen as well as the applied restraints (in read) are shown as a function of the protocol length.
The forward/reverse work distribution of NCMC protocol is:
The BAR estimate of the work is -0.33 +- 0.11 kcal/mol. Seperate QM calculations of the same molecule using orca, B3LYP/aug-cc-pVTZ resulted in a reference energy difference of -3.36 kcal/mol.
The cumulative standard deviation for the NCMC protocol is shown below.
The work values and standard deviation for each protocol step for the transformation of tautomer 1 to tautomer 2:
The work values and standard deviation for each protocol step for the transformation of tautomer 2 to tautomer 1:
A trajectory without restraints around the hydrogen - heavy atom bond: https://drive.google.com/file/d/1_yFzTneNdiWb2LD5AMgjOYOboKaPgZ9N/view?usp=sharing
A trajectory from the current protocol, with restraints: https://drive.google.com/file/d/1BieyQ7odaljaOQGVHGV7LAP7VrLMeqoQ/view?usp=sharing
The jupyter notebook notebooks/NCMC-tautomers.ipynb can be used to start the NCMC protocol (1000 steps for the perturbation, 150 repetitions of the protocol) for a set of tautomer SMILES. So all you need for starting the protocl are a set of SMILES strings that can be interconverted by moving a single hydrogen. As long as the two SMILES describe two tautomeric states of the same small molecule the hydrogen that needs to move, the heavy atom donor that looses the hydrogen and the heavy atom acceptor that receives the hydorgen are automatically detected and the protocol needs no further input.
The calculation of the energies in the current torchANI implementation (https://github.com/aiqm/torchani) is in float32. This led initially to numerical underflow and other precision loss bugs with increased NCMC protocol length, since the work of a protocol is a sum of many small increments. To mitigate this issue, we sum the energies in float64.
class DoubleAniModel(torchani.nn.ANIModel):
def forward(self, species_aev):
# change dtype
species, aev = species_aev
species_ = species.flatten()
present_species = torchani.utils.present_species(species)
aev = aev.flatten(0, 1)
output = torch.full_like(species_, self.padding_fill,
dtype=torch.float32)
for i in present_species:
mask = (species_ == i)
input_ = aev.index_select(0, mask.nonzero().squeeze())
output.masked_scatter_(mask, self[i](input_).squeeze())
output = output.view_as(species)
return species, self.reducer(output.double(), dim=1)
Equilibrium samples are generated using Langevin Dynamics in a pure python implementation. A starting position is provided and equilibrium samples are generated and returned.
class LangevinDynamics(object):
def __init__(self, atom_list:str, temperature:int, force:ANI1_force_and_energy):
self.force = force
self.temperature = temperature
self.atom_list = atom_list
def run_dynamics(self, x0:np.ndarray,
n_steps:int = 100,
stepsize:unit.quantity.Quantity = 1 * unit.femtosecond,
collision_rate:unit.quantity.Quantity = 10/unit.picoseconds,
progress_bar:bool = False,
):
"""Unadjusted Langevin dynamics.
Parameters
----------
x0 : array of floats, unit'd (distance unit)
initial configuration
force : callable, accepts a unit'd array and returns a unit'd array
assumes input is in units of distance
output is in units of energy / distance
n_steps : integer
number of Langevin steps
stepsize : float > 0, in units of time
finite timestep parameter
collision_rate : float > 0, in units of 1/time
controls the rate of interaction with the heat bath
progress_bar : bool
use tqdm to show progress bar
Returns
-------
traj : [n_steps + 1 x dim] array of floats, unit'd
trajectory of samples generated by Langevin dynamics
"""
assert(type(x0) == unit.Quantity)
assert(type(stepsize) == unit.Quantity)
assert(type(collision_rate) == unit.Quantity)
assert(type(self.temperature) == unit.Quantity)
# generate mass arrays
mass_dict_in_daltons = {'H': 1.0, 'C': 12.0, 'N': 14.0, 'O': 16.0}
masses = np.array([mass_dict_in_daltons[a] for a in self.atom_list]) * unit.daltons
sigma_v = np.array([unit.sqrt(kB * self.temperature / m) / speed_unit for m in masses]) * speed_unit
v0 = np.random.randn(len(sigma_v),3) * sigma_v[:,None]
# convert initial state numpy arrays with correct attached units
x = np.array(x0.value_in_unit(distance_unit)) * distance_unit
v = np.array(v0.value_in_unit(speed_unit)) * speed_unit
# traj is accumulated as a list of arrays with attached units
traj = [x]
# dimensionless scalars
a = np.exp(- collision_rate * stepsize)
b = np.sqrt(1 - np.exp(-2 * collision_rate * stepsize))
# compute force on initial configuration
F = self.force.calculate_force(x)
trange = range(n_steps)
if progress_bar:
trange = tqdm(trange)
for _ in trange:
# v
v += (stepsize * 0.5) * F / masses[:,None]
# r
x += (stepsize * 0.5) * v
# o
v = (a * v) + (b * sigma_v[:,None] * np.random.randn(*x.shape))
# r
x += (stepsize * 0.5) * v
F = self.force.calculate_force(x)
# v
v += (stepsize * 0.5) * F / masses[:,None]
norm_F = np.linalg.norm(F)
# report gradient norm
if progress_bar:
trange.set_postfix({'|force|': norm_F})
# check positions and forces are finite
if (not np.isfinite(x).all()) or (not np.isfinite(norm_F)):
print("Numerical instability encountered!")
return traj
traj.append(x)
return traj
The real workhorse of the NCMC protocoll is the AlchemicalANI class. It enables the linear scaling of the energies between two atomic environment vectors as a function of a coupling parameter.
class LinearAlchemicalANI(AlchemicalANI):
def __init__(self, alchemical_atom):
"""Scale the indirect contributions of alchemical atoms to the energy sum by
linearly interpolating, for other atom i, between the energy E_i^0 it would compute
in the _complete absence_ of the alchemical atoms, and the energy E_i^1 it would compute
in the _presence_ of the alchemical atoms.
(Also scale direct contributions, as in DirectAlchemicalANI)
"""
super().__init__(alchemical_atom)
self.neural_networks = self._load_model_ensemble(self.species, self.ensemble_prefix, self.ensemble_size)
def forward(self, species_coordinates):
# for now only allow one alchemical atom
# LAMBDA = 1: fully interacting
# species, AEVs of fully interacting system
species, coordinates, lam = species_coordinates
species, aevs = self.aev_computer(species_coordinates[:-1])
# neural net output given these AEVs
nn_1 = self.neural_networks((species, aevs))[1]
E_1 = self.energy_shifter((species, nn_1))[1]
if float(lam) == 1.0:
E = E_1
else:
# LAMBDA = 0: fully removed
# species, AEVs of all other atoms, in absence of alchemical atoms
mod_species = torch.cat((species[:, :self.alchemical_atom], species[:, self.alchemical_atom+1:]), dim=1)
mod_coordinates = torch.cat((coordinates[:, :self.alchemical_atom], coordinates[:, self.alchemical_atom+1:]), dim=1)
mod_aevs = self.aev_computer((mod_species, mod_coordinates))[1]
# neural net output given these modified AEVs
nn_0 = self.neural_networks((mod_species, mod_aevs))[1]
E_0 = self.energy_shifter((species, nn_0))[1]
E = (lam * E_1) + ((1 - lam) * E_0)
return species, E
Copyright (c) 2019, Marcus Wieder & Josh Fass // MSKCC
Project based on the Computational Molecular Science Python Cookiecutter version 1.0.