neutromeratio

Using neural net potentials to calculate energies and nonequilibrium monte carlo to sample tautomeric states.

Introduction

Accurate calculations of tautomer ratios are hard. Quantum aspects of the proton transfer reaction and entropy changes introduced by the change in double bond positions present a particular theoretical challenge. Most of the published work on tautomer ratio calculations have phenomenological character and closer inspection reveal unjustified assumptions and/or cancelation of errors [1].

A reliable method for the determination of tautomer ratios would ideally include conformational sampling in explicit water and the calculation of the potential energy with QM level accuracy. These two goals seem reasonable and within reach with the development of neural net potentials like ANI-1 that enable energy calculation with DFT level accuracy and force field like time costs [2]. This allows the exploration of the conformational degrees of freedom while ensuring high level QM energy calculations.

This package contains code to run monte carlo (MC) and non-equilibrium candidate monte carlo simulations [3] (NCMC) using the ANI-1ccx potential to calculate work distributions for forward and reverse transformation between a set of tautomers and utilizes bennet’s acceptance ratio (BAR) to estimate the free energy difference between tautomer states.

Theoretical background

NCMC constructs a non equilibrium protocol that achieves high acceptance rates with short correlation times. Instead of instantaneous proposing a new tautomer state from an equilibrium sample it uses a coupling parameter to construct a non equilibrium process with incremental changes to slowly transform tautomer 1 to tautomer 2. After each incremental perturbation the system is allowed to relax -- in such a way highly efficient proposals can be constructed.

The current implementation of the protocol does not actually accept any of these proposals. The work is recorded and used to calculate the free energy difference between the two states.

The propagation along the coupling parameter for the nonequilibrium protocol has three stages that contribute to the final work value:

Decoupling of the hydrogen
Moving the hydrogen to a new position
Coupling the hydrogen

An instantaneous MC protocol would perform all three described steps in a single propagation step while the nonequilibrium protocol uses small incremental perturbations for (1) and (3).

The (de)coupling is performed by linearly scaling the energy contribution of the hydrogen that changes its binding partner to the total energy of the system. This is possible because the total energy of a molecule (Et) is computed as the sum over the output of neural net potential for each individual atom:

$E_{t} = \sum^{\text{all atoms}}_{i} E_{i}$

There are still contributions of a specific atom i to all other atoms if removed from the total sum of neural net potentials, as eqn. (2), (3) and eqn (4) in [2] contain sums over neighboring atoms. There are two ways to scale these ‘indirect’ contributions of a given atom i to its neighbors: either linearly scaling the atomic environment vectors between an environment with atom i and an environment without atom i. The downside of such an approach is that a linear scaling of the atomic environment vector does not necessarily result in a linear scaling of the energy. Another possibility is to scale the energy directly between an atomic environment vector with atom i and without atom i which would guarantee that the energy change of the system is linear in respect to a coupling parameter.

We decided on linearly scaling the total energy. That means at any given lambda there are two $E_{t}$ values: $E_{t, \lambda=0}$ is calculated with the unmodified atomic environment vector and contains the sum over all atoms. $E_{t, \lambda=1}$ is calculated with the modified atomic environment vector for which the atom that will be modified is removed. To scale the energies depending on lambda the following equation is used for the final energy:

$E_{t} = (\lambda * E_{t, \lambda=0}) + ((1 - \lambda) * E_{t, \lambda=1})$

The work values for step (1) and step (3) of the protocol (decoupling and coupling) is the sum over the dE along the protocol. For step (2) the work calculations will be discussed in more detail.

The acceptance ratio for proposing to move from configuration x (initial coordinates) in thermodynamic state A (tautomeric state 1) to configuration x’ (proposed coordinates) in thermodynamic state B (tautomeric state 2) is calculated as follows.

$R_{A\rightarrow B}(x\rightarrow {x}') = \frac{p_{B}({x}')}{p_{A}({x})} \frac{g_{B\rightarrow A}{({x}'\rightarrow x})}{g_{A\rightarrow B}{(x\rightarrow {x}'})}$

The first ratio describes the probability of in thermodynamic state B divided by the probability of in thermodynamic state A. This ratio is simply the difference in energy of the system in the initial and proposed position. The second ratio describes the probability density functions for the proposal process that generates configurations in tautomer state B given a configuration in tautomer state A and vice versa. The proposal density function $g(B\rightarrow A)$ and $g(A\rightarrow B)$ depends on the equilibrium bond length between the heavy atom and hydrogen. Since the hydrogen can move between acceptor and donor atoms with different elements and the equilibrium bond length can change the proposal density is not symmetric and the calculation can not be omitted.

There are two types of restrains (flat bottom restraint and harmonic restraint) added to the heavy atom - hydrogen bond during the nonequilibrium protocol in order to keep the hydrogen near its bonded partner while not fully coupled to the environment. These restraints are not contributing to the energy of the endpoints and only to the 'alchemical' part of the protocol. At the endpoints only the flat bottom restraint is active. The two restraint are combined in the following way.

$\text{restraint} = (1-\lambda) * \text{flat\_bottom\_restraint} + \lambda * \text{harmonic\_restraint}$

Preliminary results

Running Langevin dynamics using ANI-1ccx

We generated equilibrium samples (using openMM and gaff2) for the two tautomeric forms of mol298 and used openMM, ANI-1ccx and psi4 (wB97X/6-31g*) to generate energy histograms and plot the energy as a function of the timestep.

Instantaneous MC

Results of the MC protocol that switches tautomer states in a single step from an equilibrium sample are shown for mol298 in both directions. It samples the hydrogen positions from a gaussian shell around the hydrogen acceptor heavy atom. The gaussian shell is defined by the equilibrium bond length between the hydrogen and the hydrogen acceptor heavy atom element and the standard deviation (currently set to 0.1 Angstrom for all hydrogen - heavy element pairs).

Results of the MC protocol that switches tautomer states in a single step from an equilibrium sample.

Bar estimate of the work: 2.53 +- 0.73 kcal/mol.

NCMC

For the NCMC protocol result plots will only be shown for mol298. A typical protocol involves 5000 steps of Langevin dynamics (0.5 fs stepsize) in the beginning from which conformations are randomly drawn to start a single NCMC protocol run. A NCMC protocol run involves 1000 +1 perturbation steps. 500 steps to decouple the hydrogen, 1 step to move the hydrogen to its new position (the MC move is the same as in the instantaneous protocol) and 500 to couple the hydrogen to its environment. For each (de)couple increment 20 Langevin dynamics steps are performed. A single protocol takes on 2 CPUs around 30 minutes.

The distance between the hydrogen acceptor (in green) and the hydrogen donor (in blue) and the tautomer-hydrogen as well as the applied restraints (in read) are shown as a function of the protocol length.

The forward/reverse work distribution of NCMC protocol is:

The BAR estimate of the work is -0.33 +- 0.11 kcal/mol. Seperate QM calculations of the same molecule using orca, B3LYP/aug-cc-pVTZ resulted in a reference energy difference of -3.36 kcal/mol.

The cumulative standard deviation for the NCMC protocol is shown below.

The work values and standard deviation for each protocol step for the transformation of tautomer 1 to tautomer 2:

The work values and standard deviation for each protocol step for the transformation of tautomer 2 to tautomer 1:

Some movies:

A trajectory without restraints around the hydrogen - heavy atom bond: https://drive.google.com/file/d/1_yFzTneNdiWb2LD5AMgjOYOboKaPgZ9N/view?usp=sharing

A trajectory from the current protocol, with restraints: https://drive.google.com/file/d/1BieyQ7odaljaOQGVHGV7LAP7VrLMeqoQ/view?usp=sharing

What do you need to start

The jupyter notebook notebooks/NCMC-tautomers.ipynb can be used to start the NCMC protocol (1000 steps for the perturbation, 150 repetitions of the protocol) for a set of tautomer SMILES. So all you need for starting the protocl are a set of SMILES strings that can be interconverted by moving a single hydrogen. As long as the two SMILES describe two tautomeric states of the same small molecule the hydrogen that needs to move, the heavy atom donor that looses the hydrogen and the heavy atom acceptor that receives the hydorgen are automatically detected and the protocol needs no further input.

Implementation details

The calculation of the energies in the current torchANI implementation (https://github.com/aiqm/torchani) is in float32. This led initially to numerical underflow and other precision loss bugs with increased NCMC protocol length, since the work of a protocol is a sum of many small increments. To mitigate this issue, we sum the energies in float64.

class DoubleAniModel(torchani.nn.ANIModel):

    def forward(self, species_aev):
        # change dtype
        species, aev = species_aev
        species_ = species.flatten()
        present_species = torchani.utils.present_species(species)
        aev = aev.flatten(0, 1)

        output = torch.full_like(species_, self.padding_fill,
                                dtype=torch.float32)
        for i in present_species:
            mask = (species_ == i)
            input_ = aev.index_select(0, mask.nonzero().squeeze())
            output.masked_scatter_(mask, self[i](input_).squeeze())
        
        output = output.view_as(species)
        return species, self.reducer(output.double(), dim=1)

Equilibrium samples are generated using Langevin Dynamics in a pure python implementation. A starting position is provided and equilibrium samples are generated and returned.

class LangevinDynamics(object):

    def __init__(self, atom_list:str, temperature:int, force:ANI1_force_and_energy):
        self.force = force
        self.temperature = temperature
        self.atom_list = atom_list


    def run_dynamics(self, x0:np.ndarray, 
                    n_steps:int = 100,
                    stepsize:unit.quantity.Quantity = 1 * unit.femtosecond,
                    collision_rate:unit.quantity.Quantity = 10/unit.picoseconds,
                     progress_bar:bool = False,
            ):
        """Unadjusted Langevin dynamics.

        Parameters
        ----------
        x0 : array of floats, unit'd (distance unit)
            initial configuration
        force : callable, accepts a unit'd array and returns a unit'd array
            assumes input is in units of distance
            output is in units of energy / distance
        n_steps : integer
            number of Langevin steps
        stepsize : float > 0, in units of time
            finite timestep parameter
        collision_rate : float > 0, in units of 1/time
            controls the rate of interaction with the heat bath
        progress_bar : bool
            use tqdm to show progress bar

        Returns
        -------
        traj : [n_steps + 1 x dim] array of floats, unit'd
            trajectory of samples generated by Langevin dynamics

        """

        assert(type(x0) == unit.Quantity)
        assert(type(stepsize) == unit.Quantity)
        assert(type(collision_rate) == unit.Quantity)
        assert(type(self.temperature) == unit.Quantity)

        # generate mass arrays
        mass_dict_in_daltons = {'H': 1.0, 'C': 12.0, 'N': 14.0, 'O': 16.0}
        masses = np.array([mass_dict_in_daltons[a] for a in self.atom_list]) * unit.daltons
        sigma_v = np.array([unit.sqrt(kB * self.temperature / m) / speed_unit for m in masses]) * speed_unit
        v0 = np.random.randn(len(sigma_v),3) * sigma_v[:,None]
        # convert initial state numpy arrays with correct attached units
        x = np.array(x0.value_in_unit(distance_unit)) * distance_unit
        v = np.array(v0.value_in_unit(speed_unit)) * speed_unit

        # traj is accumulated as a list of arrays with attached units
        traj = [x]
        # dimensionless scalars
        a = np.exp(- collision_rate * stepsize)
        b = np.sqrt(1 - np.exp(-2 * collision_rate * stepsize))

        # compute force on initial configuration
        F = self.force.calculate_force(x)
        trange = range(n_steps)
        if progress_bar:
            trange = tqdm(trange)
        for _ in trange:

            # v
            v += (stepsize * 0.5) * F / masses[:,None]
            # r
            x += (stepsize * 0.5) * v
            # o
            v = (a * v) + (b * sigma_v[:,None] * np.random.randn(*x.shape))
            # r
            x += (stepsize * 0.5) * v
            F = self.force.calculate_force(x)
            # v
            v += (stepsize * 0.5) * F / masses[:,None]

            norm_F = np.linalg.norm(F)
            # report gradient norm
            if progress_bar:
                trange.set_postfix({'|force|': norm_F})
            # check positions and forces are finite
            if (not np.isfinite(x).all()) or (not np.isfinite(norm_F)):
                print("Numerical instability encountered!")
                return traj
            traj.append(x)
        return traj

The real workhorse of the NCMC protocoll is the AlchemicalANI class. It enables the linear scaling of the energies between two atomic environment vectors as a function of a coupling parameter.

class LinearAlchemicalANI(AlchemicalANI):
    def __init__(self, alchemical_atom):
        """Scale the indirect contributions of alchemical atoms to the energy sum by
        linearly interpolating, for other atom i, between the energy E_i^0 it would compute
        in the _complete absence_ of the alchemical atoms, and the energy E_i^1 it would compute
        in the _presence_ of the alchemical atoms.
        (Also scale direct contributions, as in DirectAlchemicalANI)
        """

        super().__init__(alchemical_atom)      
        self.neural_networks = self._load_model_ensemble(self.species, self.ensemble_prefix, self.ensemble_size)

    def forward(self, species_coordinates):
        # for now only allow one alchemical atom

        # LAMBDA = 1: fully interacting
        # species, AEVs of fully interacting system
        species, coordinates, lam = species_coordinates
        species, aevs = self.aev_computer(species_coordinates[:-1])
        # neural net output given these AEVs
        nn_1 = self.neural_networks((species, aevs))[1]
        E_1 = self.energy_shifter((species, nn_1))[1]
        
        if float(lam) == 1.0:
            E = E_1
        else:
            # LAMBDA = 0: fully removed
            # species, AEVs of all other atoms, in absence of alchemical atoms
            mod_species = torch.cat((species[:, :self.alchemical_atom],  species[:, self.alchemical_atom+1:]), dim=1)
            mod_coordinates = torch.cat((coordinates[:, :self.alchemical_atom],  coordinates[:, self.alchemical_atom+1:]), dim=1) 
            mod_aevs = self.aev_computer((mod_species, mod_coordinates))[1]
            # neural net output given these modified AEVs
            nn_0 = self.neural_networks((mod_species, mod_aevs))[1]
            E_0 = self.energy_shifter((species, nn_0))[1]
            E = (lam * E_1) + ((1 - lam) * E_0)
        return species, E

Copyright

Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.0.

hainm/neutromeratio