Using librascal for thousands of structures of 250 atoms each
wilsonbill opened this issue · 8 comments
I generated 3000 structures of 250 atoms each, with corresponding energy and force data, to train a kernel ridge regression model. I selected 1000 "sparse points" from the training set to build the kernel matrix using SOAP features. By "sparse points" I mean the feature vectors of sites in the structures. How many "sparse points" or "inducing points" were used to create the GAP_2017_6_17_60_4_3_56_165 potential mentioned in [1]?
[1] Deringer, V. L., Bernstein, N., Bartók, A. P., Cliffe, M. J., Kerber, R. N., Marbella, L. E., ... & Csányi, G. (2018). Realistic atomistic structure of amorphous silicon from machine-learning-driven molecular dynamics. The Journal of Physical Chemistry Letters, 9(11), 2879-2885.
Does the following code look right for generating the kernel matrix and the kernel gradient matrix? I'm only using a single sparse frame and 2 or 3 training frames, since I run out of memory otherwise.
# Note: iCut, radial, angular, gSigma, sCut and zeta are defined elsewhere;
# the actual values are given in a later comment.
hypers = dict(soap_type="PowerSpectrum",
              interaction_cutoff=iCut,
              max_radial=radial,
              max_angular=angular,
              gaussian_sigma_constant=gSigma,
              gaussian_sigma_type="Constant",
              cutoff_smooth_width=sCut,
              normalize=False,
              radial_basis="GTO",
              compute_gradients=True,
              expansion_by_species_method='structure wise',
              )
from rascal.representations import SphericalInvariants
from rascal.models import Kernel

soap = SphericalInvariants(**hypers)
managSparse = soap.transform(sparseFrames)
soap1 = SphericalInvariants(**hypers)
managers_train = soap1.transform(trainFrames)

from rascal.models.sparse_points import SparsePoints
spPts = SparsePoints(soap)
spPts.extend(managSparse, [[siteIdx]])

kernel = Kernel(soap, name='GAP', zeta=zeta, target_type='Structure', kernel_type='Sparse')
X_sparse = spPts
KNM = kernel(managers_train, X_sparse)
KNM_down = kernel(managers_train, X_sparse, grad=(True, False))
Assuming you mean this model:
https://github.com/libAtoms/silicon-testing-framework/blob/master/models/GAP/gp_iter6_sparse9k.xml
it uses 9000 sparse points, as far as I understand QUIP model files.
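If you want to check such files programmatically, a quick sketch; it assumes n_sparse appears in plain text in the XML descriptor line, which it does in this file:
import re

# Pull n_sparse out of the GAP XML descriptor string.
with open("gp_iter6_sparse9k.xml") as f:
    match = re.search(r"n_sparse=(\d+)", f.read())
print(match.group(1) if match else "n_sparse not found")  # expect 9000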
I am not sure why there is such massive RAM usage for just 3 training frames and 1 sparse frame, even with 250 atoms. How many atomic species does your system have? For such large structures it could reduce the RAM usage to set expansion_by_species_method='environment wise', but I don't think that is the problem here. The RAM usage depends strongly on max_radial and max_angular, and, once gradients are computed, also on the cutoff; see the rough estimate sketched below.
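A back-of-the-envelope estimate of where the memory goes (a sketch, not librascal's exact storage layout: the per-environment feature length n_species**2 * n_max**2 * (l_max + 1) and the neighbor count are assumptions):
def feat_memory_gb(n_structures, n_atoms, n_species, n_max, l_max, avg_neighbors=0):
    """Rough float64 footprint of the gradient-augmented feature matrix."""
    feat_len = n_species**2 * n_max**2 * (l_max + 1)  # assumed per-environment length
    n_env = n_structures * n_atoms
    n_rows = n_env * (1 + 3 * avg_neighbors)  # plus 3 gradient rows per neighbor
    return n_rows * feat_len * 8 / 1e9

# 4 frames of 250 Si atoms, n_max = l_max = 16, guessing ~30 neighbors in a 5 A cutoff:
print(feat_memory_gb(4, 250, 1, 16, 16, avg_neighbors=30))     # ~3 GB
print(feat_memory_gb(3000, 250, 1, 16, 16, avg_neighbors=30))  # ~2400 GB for the full set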
It would be helpful if your code snippet contained the actual parameter values, or how they were computed, to get a feeling for your memory consumption. Maybe even a snippet of your dataset (for example the 4 frames you used) so we can analyze the memory consumption, since I don't see anything obviously wrong in your code.
Hello, the feature matrix for your whole training set is quite heavy (there are as many gradient features as there are atoms, times 3). My advice is to compute the sparse points (without gradients) and then compute the representation (with gradients) and the kernel matrix one structure at a time.
Then your memory footprint will be mostly tied to the size of the kernel matrix and not to the feature matrix.
This is implemented in the feat/gaptools branch if you want a snippet.
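A minimal sketch of that per-structure strategy, reusing the names from the snippet above (hypers, trainFrames, spPts, zeta); the stacking order of energy and gradient rows below is an assumption to verify against train_gap_model:
import numpy as np
from rascal.representations import SphericalInvariants
from rascal.models import Kernel

# Gradients enabled, but frames are transformed one at a time so only a
# single structure's feature matrix is held in memory at any moment.
soap_grad = SphericalInvariants(**{**hypers, "compute_gradients": True})
kernel = Kernel(soap_grad, name='GAP', zeta=zeta,
                target_type='Structure', kernel_type='Sparse')

K_rows, K_grad_rows = [], []
for frame in trainFrames:
    manager = soap_grad.transform([frame])  # features for ONE frame
    K_rows.append(kernel(manager, spPts))   # 1 x M energy row
    K_grad_rows.append(kernel(manager, spPts, grad=(True, False)))  # 3N x M rows
# Energy rows stacked above gradient rows (assumed ordering; verify).
KNM = np.vstack(K_rows + K_grad_rows)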
Thank you for the responses.
I found feat/gaptools. Can you suggest a particular Python file that computes the kernel one structure at a time?
If there is a workflow that has been used with such heavy loads, that would be most helpful.
The following computes the inducing points without gradients and the training structures with gradients, which is what I understand Felix to be suggesting.
zeta = 1
iterat1 = 0
iterat = 6
perIter1 = 3
radial = 16
angular = 16
interaction_cutoff = 5.0
gaussian_sigma_constant = 0.4
cutoff_smooth_width = 0.5

import numpy as np
import createFeat3  # local helper module (not used in this snippet)
from rascal.models import Kernel, train_gap_model
import pymatgen.core.structure as pm

exCoord = np.load('coord29.npy')
exBox = np.load('box29.npy')
natom = int(exCoord.shape[1] / 3)
j = 28
siteIndex = 236

# Build the pymatgen Structure holding the sparse point.
if True:
    h2oCell = exBox[j].reshape([3, 3])
    h2oLatt = pm.Lattice(h2oCell)
    h2oFrac = h2oLatt.get_fractional_coords(exCoord[j].reshape((3, natom)).T)
    # h2oFrac = h2oLatt.get_fractional_coords(exCoord[j].reshape((natom, 3)))
    # hFrac = the fractional coordinates of the H atoms
    # hFrac = h2oFrac[typat == 1]
    rStruct = pm.Structure(pm.Lattice(h2oCell), ['Si'] * natom, h2oFrac, to_unit_cell=True)

from rascal.utils import from_dict, to_dict, CURFilter, dump_obj, load_obj, get_score, print_score

iCut = interaction_cutoff
gSigma = gaussian_sigma_constant
sCut = cutoff_smooth_width

# The sparse point is computed WITHOUT gradients.
hypers = dict(soap_type="PowerSpectrum",
              interaction_cutoff=iCut,
              max_radial=radial,
              max_angular=angular,
              gaussian_sigma_constant=gSigma,
              gaussian_sigma_type="Constant",
              cutoff_smooth_width=sCut,
              normalize=False,
              radial_basis="GTO",
              compute_gradients=False,
              expansion_by_species_method='structure wise',
              )
from rascal.representations import SphericalInvariants
soap = SphericalInvariants(**hypers)

from pymatgen.io.ase import AseAtomsAdaptor as AAA
frames = [AAA.get_atoms(rStruct)]
managSparse = soap.transform(frames)

from rascal.models.sparse_points import SparsePoints
spPts = SparsePoints(soap)
spPts.extend(managSparse, [[siteIndex]])

# Build the training structures.
structs = []
for j in range(0, 4):
    h2oCell = exBox[j].reshape([3, 3])
    h2oLatt = pm.Lattice(h2oCell)
    h2oFrac = h2oLatt.get_fractional_coords(exCoord[j].reshape((3, natom)).T)
    rStruct = pm.Structure(pm.Lattice(h2oCell), ['Si'] * natom, h2oFrac, to_unit_cell=True)
    structs.append(rStruct)

# The training frames are computed WITH gradients.
hypers = dict(soap_type="PowerSpectrum",
              interaction_cutoff=iCut,
              max_radial=radial,
              max_angular=angular,
              gaussian_sigma_constant=gSigma,
              gaussian_sigma_type="Constant",
              cutoff_smooth_width=sCut,
              normalize=False,
              radial_basis="GTO",
              compute_gradients=True,
              expansion_by_species_method='structure wise',
              )
frames = [AAA.get_atoms(struct) for struct in structs]
soap = SphericalInvariants(**hypers)
managers_train = soap.transform(frames)

kernel = Kernel(soap, name='GAP', zeta=zeta, target_type='Structure', kernel_type='Sparse')
KNM = kernel(managers_train, spPts)
KNM_down = kernel(managers_train, spPts, grad=(True, False))
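To use these with train_gap_model, I assume the energy and gradient blocks get stacked into a single matrix; the row order here is a guess to verify against the function:
import numpy as np

# Energy rows (one per structure) above the gradient rows; check this
# ordering against what train_gap_model expects.
KNM_full = np.vstack([KNM, KNM_down])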
Sorry for the delay. You can have a look at the example notebook librascal/examples/zundel_i-PI.ipynb, which shows how to fit a potential. Internally, it computes the kernel matrix progressively, so you should not have memory problems anymore.
What @felixmusil mentioned is the compute_KNM function, used in this part of the notebook (frames_train, X_sparse, y_train, f_train and energy_baseline are defined earlier in the notebook):
kernel = Kernel(soap, name='GAP', zeta=zeta, target_type='Structure', kernel_type='Sparse')
KNM = compute_KNM(tqdm(frames_train, leave=True, desc="Computing kernel matrix"),
                  X_sparse, kernel, soap)
model = train_gap_model(kernel, frames_train, KNM, X_sparse, y_train, energy_baseline,
                        grad_train=-f_train, lambdas=[1e-12, 1e-12], jitter=1e-13)
It could be that you still run out of memory, since your parameters and structures are large. In that case, try reducing your max_radial parameter to maybe 10 (the model you mention uses max_angular=12 and max_radial=10). The parameters you use should be no problem for an HPC node with 100 GB of RAM, but for a personal computer they might be at the limit.
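For scale, assuming the PowerSpectrum feature length grows roughly as n_max**2 * (l_max + 1) per species pair (the same rough scaling as in the estimate above), that reduction shrinks the features by over a factor of 3:
# Rough feature-length ratio under the assumed n_max**2 * (l_max + 1) scaling.
size = lambda n_max, l_max: n_max**2 * (l_max + 1)
print(size(16, 16) / size(10, 12))  # 4352 / 1300, about 3.3x smaller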
Thank you both again for the very helpful replies.
The file librascal/examples/zundel_i-PI.ipynb looks like exactly what I need. However, I can't find functions like sparsify_environments() or calculate_representation(), even though I am looking at the feat/gaptools branch.
I see the following line in gp_iter6_sparse9k.xml, which has many of the SOAP parameters.
<descriptor>soap l_max=12 n_max=10 atom_sigma=0.5 zeta=4 cutoff=5.0 cutoff_transition_width=1.0 central_weight=1.0 n_sparse=9000...
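Translated to the librascal hyperparameters used earlier in this thread (this name mapping between QUIP and librascal is my understanding, so worth double-checking):
# Assumed QUIP descriptor -> librascal SphericalInvariants mapping
hypers_gap2017 = dict(soap_type="PowerSpectrum",
                      max_angular=12,                # l_max
                      max_radial=10,                 # n_max
                      gaussian_sigma_constant=0.5,   # atom_sigma
                      gaussian_sigma_type="Constant",
                      interaction_cutoff=5.0,        # cutoff
                      cutoff_smooth_width=1.0,       # cutoff_transition_width
                      radial_basis="GTO",
                      )
zeta = 4         # kernel exponent from the descriptor line
n_sparse = 9000  # number of sparse points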
Thank you Michele for the pointer to the function for choosing parameters.
I found sparsify_environments in librascal/bindings/rascal/models/gaptools.py. I was using the GitHub search, which doesn't seem to search non-master branches.
Creating the kernel matrix is embarrassingly parallel, so I've written some code using Singularity, MPI, and PBS batch scripts to run it on an HPC. I will see if changing to l_max=12 and n_max=10 reduces the load enough that an HPC isn't necessary.
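Not my exact code, but a minimal sketch of how such a split might look with mpi4py (soap, kernel, X_sparse and frames_train assumed to be set up as in the snippets above):
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

# Contiguous chunk of frame indices for this rank; gathering in rank order
# then preserves the original frame order.
idx = np.array_split(np.arange(len(frames_train)), nranks)[rank]

# Each rank transforms and computes kernel rows for its own frames only.
my_blocks = [kernel(soap.transform([frames_train[i]]), X_sparse) for i in idx]

# Collect the per-rank lists of blocks on rank 0 and stack the full KNM.
gathered = comm.gather(my_blocks, root=0)
if rank == 0:
    KNM = np.vstack([block for blocks in gathered for block in blocks])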