A simple interface to allow electrostatic embedding of machine learning potentials using an ORCA-like interface. Based on code by Kirill Zinovjev. An example sander implementation is provided. This works by reusing the existing interface between sander and ORCA, meaning that no modifications to sander are needed. The embedding model currently supports the HCNOS elements. We plan to add support for further elements in the near future.
Further details can be found in our paper, available here. Please cite this work
if you use emle-engine in your research. Supplementary information and data can
be found here. For the original theory behind EMLE, please refer to this
publication.
First create a conda environment with all of the required dependencies:
conda env create -f environment.yaml
conda activate emle
If this fails, try using mamba as a replacement for conda.
For GPU functionality, you will need to install appropriate CUDA drivers on
your host system along with NVCC, the CUDA compiler driver. (This doesn't come
with cudatoolkit from conda-forge.)
(Depending on your CUDA setup, you might need to prefix the environment creation
command above with something like CONDA_OVERRIDE_CUDA="11.2" to resolve an
environment that is compatible with your CUDA driver.)
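For example, on a host with a CUDA 11.2 driver the environment could be created
as follows (the version shown is illustrative; match it to your own driver):
CONDA_OVERRIDE_CUDA="11.2" conda env create -f environment.yaml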
Finally, install emle-engine:
python setup.py install
If you are developing and want an editable install, use:
python setup.py develop
To start an EMLE calculation server:
emle-server
For usage information, run:
emle-server --help
(By default, an emle_settings.yaml file will be written to the working directory
containing the settings used to configure the server. This can be used to re-run
an existing simulation using the --config option or EMLE_CONFIG environment
variable. Additional emle_pid.txt and emle_port.txt files contain the process
ID and port of the server.)
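For example, a previous simulation could be re-run by pointing the server at the
settings file it wrote (the file name below is the default mentioned above):
emle-server --config emle_settings.yaml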
To launch a client to send a job to the server:
orca orca_input
Where orca_input is the path to a fully specified ORCA input file. In the
examples given here, the orca executable will be called by sander when
performing QM/MM, i.e. we are using a fake ORCA as the QM backend.
(Alternatively, just running orca orca_input will try to connect to an existing
server and start one for you if a connection is not found.)
The server and client should both connect to the same hostname and port. This
can be specified in a script using the environment variables EMLE_HOST and
EMLE_PORT. If not specified, then the same default values will be used for
both the client and server.
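For example, both processes could be pointed at the same endpoint by exporting
the variables before launching them (the hostname and port below are purely
illustrative):
export EMLE_HOST=localhost
export EMLE_PORT=10000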
To stop the server:
emle-stop
The embedding method relies on in vacuo energies and gradients, to which
corrections are added based on the predictions of the model. At present we
support the use of TorchANI, DeePMD-kit, ORCA, SQM, XTB, or PySander for the
backend, providing reference MM or QM with EMLE embedding, as well as pure EMLE
implementations. To specify a backend, use the --backend argument when launching
emle-server, e.g.:
emle-server --backend torchani
(The default backend is torchani.)
When using the orca backend, you will also need to specify the path to the
real orca executable using the --orca-path command-line argument, or the
EMLE_ORCA_PATH environment variable. (To check that EMLE is running, look for
a log or settings file in the working directory.) The input for orca will
be taken from the &orc section of the sander configuration file, so use this
to specify the method, etc.
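For example, the server might be launched as follows (the executable path is
illustrative):
emle-server --backend orca --orca-path /path/to/real/orca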
When using deepmd as the backend you will also need to specify a model file to
use. This can be passed with the --deepmd-model command-line argument, or using
the EMLE_DEEPMD_MODEL environment variable. This can be a single file, a set of
model files specified using wildcards, or a comma-separated list. When multiple
files are specified, energies and gradients will be averaged over the models.
The model files need to be visible to the emle-server, so we recommend the use
of absolute paths.
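For example, two models could be averaged as follows (the file names and
extension are illustrative):
emle-server --backend deepmd --deepmd-model /path/to/model_1.pb,/path/to/model_2.pb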
When using pysander or sqm as the backend you will also need to specify the
path to an AMBER parm7 topology file for the QM region. This can be specified
using the --parm7 command-line argument, or via the EMLE_PARM7 environment
variable.
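For example (the topology path is illustrative):
emle-server --backend sqm --parm7 /path/to/qm_region.parm7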
We also provide a flexible way of supporting external backends via a callback
function that can be specified via:
emle-server --external-backend module.function
The function should take a single argument, an ase.Atoms object for the QM
region, and return the energy in Hartree as a float along with the gradients in
Hartree/Bohr as a numpy.ndarray. The external backend can also be supplied using
the EMLE_EXTERNAL_BACKEND environment variable. When set, the external backend
will take precedence over any other backend. If the callback is a function
within a local module, then make sure that the directory containing the module
is in sys.path, or is visible to the emle-server, e.g. the server is launched
from the same directory as the module. Alternatively, use --plugin-path to
specify the path to a directory containing the module. This can also be
specified using the EMLE_PLUGIN_PATH environment variable. Make sure that this
is an absolute path so that it is visible to the server regardless of where it
is launched.
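As a minimal sketch, a module mybackend.py containing the following
(hypothetical) function could be used with --external-backend mybackend.run;
the zero-valued energy and gradients are placeholders for whatever external
code you wish to call:
import numpy as np
from ase import Atoms

def run(atoms: Atoms):
    """Toy external backend: return the in vacuo energy (Hartree) and
    gradients (Hartree/Bohr) for the QM region passed in by emle-server."""
    # Replace the placeholders below with a call to your own code.
    energy = 0.0
    gradients = np.zeros((len(atoms), 3))
    return energy, gradients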
We also support the use of Rascal for the calculation of delta-learning
corrections to the in vacuo energies and gradients. To use this, you will need
to specify a model file using the --rascal-model command-line argument, or via
the EMLE_RASCAL_MODEL environment variable. Note that the chosen backend must
match the one used to train the model. At present this is left to the user to
specify. In future we aim to encode the backend in the model file so that it
can be selected automatically.
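For example, a delta-learning correction might be applied on top of the default
backend as follows (the model file name and extension are illustrative):
emle-server --backend torchani --rascal-model /path/to/delta_model.json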
We currently support CPU and CUDA as the device for PyTorch. This can be
configured using the EMLE_DEVICE environment variable, or by using the --device
argument when launching emle-server, e.g.:
emle-server --device cuda
When no device is specified, we will preferentially try to use CUDA if it is
available. By default, the first CUDA device index will be used. If you want
to specify the index, e.g. when running on a multi-GPU setup, then you can use
the following syntax:
emle-server --device cuda:1
This would tell PyTorch that we want to use device index 1. The same formatting
works for the environment variable, e.g. EMLE_DEVICE=cuda:1.
We support electrostatic, mechanical, non-polarisable, and MM embedding.
Here non-polarisable embedding uses the EMLE model to predict charges for the
QM region, but ignores the induced component of the potential. MM embedding
allows the user to specify fixed MM charges for the QM atoms, with induction
once again disabled. Obviously we are advocating our electrostatic embedding
scheme, but the use of different embedding schemes provides a useful reference
for determining the benefit of using electrostatic embedding for a given system.
The embedding method can be specified using the EMLE_METHOD environment
variable, or when launching the server, e.g.:
emle-server --method mechanical
The default option is (unsurprisingly) electrostatic. When using MM embedding
you will also need to specify MM charges for the atoms within the QM region.
This can be done using the --mm-charges option, or via the EMLE_MM_CHARGES
environment variable. The charges can be specified as a list of floats
(space-separated from the command-line, comma-separated when using the
environment variable) or a path to a file. When using a file, this should be
formatted as a single column, with one line per QM atom. The units are electron
charge.
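For example, MM embedding with explicit charges might be requested as follows
(the method keyword mm and the charge values are illustrative; a path to a file
of charges can be passed in place of the list):
emle-server --method mm --mm-charges -0.834 0.417 0.417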
Energies can be written to a file using the --energy-file command-line argument
or the EMLE_ENERGY_FILE environment variable. The frequency of logging can be
specified using --energy-frequency or EMLE_ENERGY_FREQUENCY. This should be an
integer specifying the frequency, in integration steps, at which energies are
written. (The default is 0, which means that energies aren't logged.) The output
will look something like the following, where the columns specify the current
step, the in vacuo energy, and the total energy.
General log messages are written to the file specified by the --log-file or
EMLE_LOG_FILE options. (When using the Python API, by default, no log file is
used and diagnostic messages are written to sys.stderr. When using emle-server,
logs are by default redirected to emle_log.txt.) The log level can be adjusted
using the --log-level or EMLE_LOG_LEVEL options. For performance, the default
log level is set to ERROR.
# Step E_vac (Eh) E_tot (Eh)
0 -495.724193647246 -495.720214843750
1 -495.724193662147 -495.720214843750
2 -495.722049429755 -495.718475341797
3 -495.717705026011 -495.714660644531
4 -495.714381769041 -495.711761474609
5 -495.712389051656 -495.710021972656
6 -495.710483833889 -495.707977294922
7 -495.708991110067 -495.706909179688
8 -495.708890005688 -495.707183837891
9 -495.711066677908 -495.709045410156
10 -495.714580371718 -495.712799072266
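For example, per-step energy logging together with more verbose diagnostics
could be enabled as follows (the file names are illustrative, and INFO is
assumed to be an accepted log level):
emle-server --energy-file emle_energy.txt --energy-frequency 1 --log-file emle_log.txt --log-level INFO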
The EMLE implementation uses several ML frameworks to predict energies and
gradients. DeePMD-kit or TorchANI can be used for the in vacuo predictions and
custom PyTorch code is used to predict corrections to the in vacuo values in
the presence of point charges. The frameworks make heavy use of just-in-time
compilation. This compilation is performed during the first EMLE call, hence
subsequent calculations are much faster. By using a long-lived server process
to handle EMLE calls from sander we retain the performance gain of JIT
compilation across calls.
A demo showing how to run EMLE on a solvated alanine dipeptide system can be found in the demo directory. To run:
cd demo
./demo.sh
Output will be written to the demo/output directory.
It is possible to use emle-engine to perform end-state correction (ESC) for
alchemical free-energy calculations. Here a λ value is used to interpolate
between the full MM (λ = 0) and EMLE-modified (λ = 1) potentials. To use this
feature, specify the λ value from the command-line, e.g.:
emle-server --lambda-interpolate 0.5
or via the ESC_LAMBDA_INTERPOLATE environment variable. When performing
interpolation it is also necessary to specify the path to a topology file for
the QM region. This can be specified using the --parm7 command-line argument,
or via the EMLE_PARM7 environment variable. You will also need to specify the
(zero-based) indices of the atoms within the QM region. To do so, use the
--qm-indices command-line argument, or the EMLE_QM_INDICES environment
variable. Finally, you will need to specify MM charges for the QM atoms using
the --mm-charges command-line argument or the EMLE_MM_CHARGES environment
variable. These are used to calculate the electrostatic interactions between
point charges on the QM and MM regions.
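For example, a full interpolation setup might look like the following (the
topology path, atom indices, and charges are illustrative values for a
three-atom QM region):
emle-server --lambda-interpolate 0.5 --parm7 /path/to/qm_region.parm7 --qm-indices 0 1 2 --mm-charges -0.834 0.417 0.417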
It is possible to pass one or two values for λ. If a single value is used, then
the calculator will always use that value for interpolation, unless it is
updated externally using the --set-lambda-interpolate command-line option,
e.g.:
emle-server --set-lambda-interpolate 1
Alternatively, if two values are passed then these will be used as initial and
final values of λ, with the additional --interpolate-steps option specifying
the number of steps (calls to the server) over which λ will be linearly
interpolated. (This can also be specified using the EMLE_INTERPOLATE_STEPS
environment variable.) In this case the energy file (if written) will contain
output similar to that shown below. The columns specify the current step, the
current λ value, the energy at the current λ value, and the pure MM and EMLE
energies.
# Step λ E(λ) (Eh) E(λ=0) (Eh) E(λ=1) (Eh)
0 0.000000000000 -0.031915396452 -0.031915396452 -495.735900878906
5 0.100000000000 -49.588279724121 -0.017992891371 -495.720855712891
10 0.200000000000 -99.163040161133 -0.023267691955 -495.722106933594
15 0.300000000000 -148.726318359375 -0.015972195193 -495.717071533203
20 0.400000000000 -198.299896240234 -0.020024012774 -495.719726562500
25 0.500000000000 -247.870407104492 -0.019878614694 -495.720947265625
30 0.600000000000 -297.434417724609 -0.013046705164 -495.715332031250
35 0.700000000000 -347.003417968750 -0.008571878076 -495.715515136719
40 0.800000000000 -396.570098876953 -0.006970465649 -495.710876464844
45 0.900000000000 -446.150207519531 -0.019694851711 -495.720275878906
50 1.000000000000 -495.725952148438 -0.020683377981 -495.725952148438
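For example, an invocation ramping λ linearly from 0 to 1 over 50 server calls
(matching the sample output above) might look like this; the two-value syntax
shown is an assumption, so check emle-server --help for the exact form:
emle-server --lambda-interpolate 0 1 --interpolate-steps 50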
We provide an interface between emle-engine and OpenMM via the Sire molecular
simulation framework. This allows QM/MM simulations to be run with OpenMM using
EMLE for the embedding model, providing improved performance and flexibility in
comparison to the sander interface, although the implementation should currently
be treated as experimental.
To use, first create an emle-sire conda environment:
conda env create -f environment_sire.yaml
conda activate emle-sire
Next install emle-engine into the environment:
python setup.py install
For instructions on how to use the emle-sire interface, see the tutorial
documentation here.
When performing end-state correction simulations using the emle-sire interface
there is no need to specify the lambda_interpolate keyword when creating an
EMLECalculator instance. Instead, interpolation can be enabled when creating a
Sire dynamics object via the same keyword. (See the tutorial for details.)
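As a minimal sketch, a calculator might be created in Python as follows
(assuming the class is importable as emle.calculator.EMLECalculator; consult
the tutorial for the keyword arguments supported by your version):
from emle.calculator import EMLECalculator

# Create an EMLE calculator using the default (electrostatic) embedding.
# The device keyword is an assumption here; see the tutorial for options.
calculator = EMLECalculator(device="cpu")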
The DeePMD-kit conda package pulls in a version of MPI which may cause problems
if using ORCA as the in vacuo backend, particularly when running on HPC
resources that might enforce a specific MPI setup. (ORCA will internally call
mpirun to parallelise work.) Since we don't need any of the MPI functionality
from DeePMD-kit, the problematic packages can be safely removed from the
environment with:
conda remove --force mpi mpich
Alternatively, if performance isn't an issue, simply set the number of threads
to 1 in the sander input file, e.g.:
&orc
method='XTB2',
num_threads=1
/
When running on an HPC resource it can often take a while for the emle-server
to start. As such, the client will try reconnecting on failure a specified
number of times before raising an exception, sleeping 2 seconds between
retries. By default, the client will try to connect 100 times. If this is
unsuitable for your setup, then the number of attempts can be configured using
the EMLE_RETRIES environment variable.
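For example, to allow up to 500 connection attempts (the value is illustrative):
export EMLE_RETRIES=500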
When performing interpolation it is currently not possible to use AMBER force
fields with CMAP terms due to a memory deallocation bug in pysander.