Clone from Xie
This software implements the Crystal Diffusion Variational AutoEncoder (CDVAE), which generates the periodic structure of materials.
It has several main functionalities:
- Generate novel, stable materials by learning from a dataset containing existing material structures.
- Generate materials by optimizing a specific property in the latent space, i.e. inverse design.
It is suggested to use conda (by conda or miniconda) to create a python>=3.8 environment first (3.11 is suggested), then run the following pip commands in this environment (torch2.0.1+cu118 is used as an example).
pip install torch -i https://download.pytorch.org/whl/cu118
pip install pyg_lib torch_scatter torch_sparse torch_cluster torch_spline_conv -f https://data.pyg.org/whl/torch-2.0.0+cu118.html
pip install -r requirements.txt
pip install -e .
Modify the following environment variables in the file .env:
- PROJECT_ROOT: path to the folder that contains this repo
- HYDRA_JOBS: path to a folder to store hydra outputs
PROJECT_ROOT="<project root>" # `pwd` for example
HYDRA_JOBS="<project root>/log" # in project root for example
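To check that the two variables are set as intended, the file can be read back with a few lines of Python. This is a minimal sketch that uses only the standard library: the helper name `read_env_file` is hypothetical (it is not part of CDVAE), the paths are made-up examples, and only simple `KEY="value" # comment` lines like the ones above are handled.

```python
import shlex


def read_env_file(text):
    """Parse simple KEY="value"  # comment lines into a dict.

    Hypothetical helper for illustration; not part of the CDVAE codebase.
    """
    env = {}
    for line in text.splitlines():
        # shlex strips the quotes and drops a trailing "# ..." comment.
        tokens = shlex.split(line, comments=True)
        if not tokens or "=" not in tokens[0]:
            continue
        key, _, value = tokens[0].partition("=")
        env[key] = value
    return env


# Example content with made-up paths.
example = '''
PROJECT_ROOT="/home/user/cdvae"  # `pwd` for example
HYDRA_JOBS="/home/user/cdvae/log"  # in project root for example
'''
env = read_env_file(example)
print(env["PROJECT_ROOT"])
print(env["HYDRA_JOBS"])
```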
All datasets are directly available in data/ with train/validation/test splits. You don't need to download them again. If you use these datasets, please consider citing the original papers from which we curated them.
Find more about these datasets by going to our Datasets page.
To train a CDVAE, run the following command:
# Notes on the options below:
#   model: vae is the default
#   data: file name without the .yml suffix in ./conf/data/
#   model.hidden_dim, model.lattice_dropout: MLP part
python cdvae/run.py \
model=vae/vae_nocond \
project=... group=... expname=... \
data=... \
optim.optimizer.lr=1e-4 optim.lr_scheduler.min_lr=1e-5 \
data.teacher_forcing_max_epoch=250 data.train_max_epochs=4000 \
model.beta=0.01 \
model.fc_num_layers=1 model.latent_dim=... \
model.hidden_dim=... model.lattice_dropout=... \
model.hidden_dim=... model.latent_dim=... \
[model.conditions.cond_dim=...]
For more control options see ./conf
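Because every option is a `key=value` override, a training command can also be assembled programmatically, for example when launching several runs. The following is only a sketch: `build_train_command` is a hypothetical helper that is not part of CDVAE, and the override values are taken from the example command or from dataset names mentioned in this README, not recommended settings.

```python
import shlex


def build_train_command(overrides, script="cdvae/run.py"):
    """Return the argv list for a training run with key=value overrides.

    Hypothetical helper for illustration; not part of the CDVAE codebase.
    """
    args = ["python", script]
    for key, value in overrides.items():
        args.append(f"{key}={value}")
    return args


overrides = {
    "model": "vae/vae_nocond",
    "data": "mp_20",
    "optim.optimizer.lr": "1e-4",
    "optim.lr_scheduler.min_lr": "1e-5",
    "model.beta": "0.01",
}
cmd = build_train_command(overrides)
# Print the command as a single shell-safe string for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the project root to start the run.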
To train with multiple GPUs:
# "..." can take the same options as before
CUDA_VISIBLE_DEVICES=0,1 python cdvae/run.py \
... \
train.pl_trainer.devices=2 \
+train.pl_trainer.strategy=ddp_find_unused_parameters_true
To use other datasets, use data=carbon or data=mp_20 instead.
CDVAE uses hydra to configure hyperparameters, and users can modify them on the command line or through the configuration files in the conf/ folder.
After training, model checkpoints can be found in $HYDRA_JOBS/singlerun/project/group/expname.
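The run folder can be located from Python by joining the same path components. This is a sketch under stated assumptions: `run_directory` is a hypothetical helper, the project, group, and experiment names are placeholders, and the `.ckpt` extension mentioned in the comment is assumed from PyTorch Lightning defaults rather than confirmed by this README.

```python
import os
from pathlib import Path


def run_directory(project, group, expname, hydra_jobs=None):
    """Return $HYDRA_JOBS/singlerun/<project>/<group>/<expname> as a Path.

    Hypothetical helper for illustration; not part of the CDVAE codebase.
    """
    root = hydra_jobs if hydra_jobs is not None else os.environ["HYDRA_JOBS"]
    return Path(root) / "singlerun" / project / group / expname


# Placeholder names; pass hydra_jobs explicitly or rely on $HYDRA_JOBS.
run_dir = run_directory("my_project", "my_group", "my_exp", hydra_jobs="/tmp/log")
print(run_dir)
# Checkpoint files, if present, could then be listed with
# sorted(run_dir.glob("*.ckpt")); the extension is an assumption.
```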
To generate materials, first run the reconstruction task (this step can be skipped):
python scripts/evaluate.py --model_path MODEL_PATH --tasks recon
then
# [--formula=.../--train_data=...]: if composition condition
# [--energy=.../--energy_per_atom=...]: if energy condition
python scripts/evaluate.py --model_path MODEL_PATH --tasks gen \
[--formula=H2O/--train_data=*.pkl] \
[--energy=-1/--energy_per_atom=-1] \
--batch_size=50
MODEL_PATH is the path to the trained model. Users can choose one or several of the 3 tasks:
- recon: reconstruction, reconstructs all materials in the test data. Outputs can be found in eval_recon.pt.
- gen: generate new material structures by sampling from the latent space. Outputs can be found in eval_gen.pt.
- opt: generate new material structures by minimizing the trained property in the latent space (requires model.predict_property=True). Outputs can be found in eval_opt.pt.
eval_recon.pt, eval_gen.pt, and eval_opt.pt are PyTorch pickle files containing multiple tensors that describe the structures of M materials batched together. Each material can have a different number of atoms, and we assume there are N atoms in total. num_evals denotes the number of Langevin dynamics runs we perform for each material.
- frac_coords: fractional coordinates of each atom, shape (num_evals, N, 3)
- atom_types: atomic number of each atom, shape (num_evals, N)
- lengths: the lengths of the lattice, shape (num_evals, M, 3)
- angles: the angles of the lattice, shape (num_evals, M, 3)
- num_atoms: the number of atoms in each material, shape (num_evals, M)
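Since the atoms of all M materials are concatenated along the N dimension, num_atoms is what allows them to be separated again. The sketch below shows the slicing for a single index along num_evals, using plain nested lists and toy values so it runs without PyTorch. The helper name `split_materials` is hypothetical; with the real files, one would presumably load the data with `torch.load`, pick one evaluation index, and convert each tensor with `.tolist()`, but how the tensors are keyed inside the file is an assumption not confirmed here.

```python
def split_materials(frac_coords, atom_types, lengths, angles, num_atoms):
    """Split one evaluation sample into a list of per-material dicts.

    Inputs are nested lists for a single index along num_evals:
    frac_coords (N, 3), atom_types (N,), lengths (M, 3), angles (M, 3),
    num_atoms (M,). Hypothetical helper, not part of the CDVAE codebase.
    """
    if sum(num_atoms) != len(atom_types) or len(frac_coords) != len(atom_types):
        raise ValueError("num_atoms must sum to the total number of atoms N")
    materials = []
    start = 0
    for i, n in enumerate(num_atoms):
        # Atoms of material i occupy the slice [start, start + n).
        materials.append({
            "frac_coords": frac_coords[start:start + n],
            "atom_types": atom_types[start:start + n],
            "lengths": lengths[i],
            "angles": angles[i],
        })
        start += n
    return materials


# Toy data: M = 2 materials with 1 and 2 atoms, so N = 3.
frac_coords = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
atom_types = [6, 11, 17]
lengths = [[2.5, 2.5, 2.5], [4.0, 4.0, 4.0]]
angles = [[90.0, 90.0, 90.0], [90.0, 90.0, 90.0]]
num_atoms = [1, 2]

materials = split_materials(frac_coords, atom_types, lengths, angles, num_atoms)
print(len(materials), materials[1]["atom_types"])
```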
To compute evaluation metrics, run the following command:
python scripts/compute_metrics.py --root_path MODEL_PATH --tasks recon gen opt
MODEL_PATH will be the path to the trained model. All evaluation metrics will be saved in eval_metrics.json.
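The metrics file is plain JSON, so it can be inspected with the standard library. This is a sketch only: `load_metrics` is a hypothetical helper, it assumes eval_metrics.json is written inside MODEL_PATH, and the metric name in the demonstration is made up rather than one produced by compute_metrics.py.

```python
import json
import tempfile
from pathlib import Path


def load_metrics(model_path):
    """Load eval_metrics.json from model_path (file location is an assumption)."""
    with open(Path(model_path) / "eval_metrics.json") as f:
        return json.load(f)


# Demonstration with a temporary folder and a made-up metric name.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "eval_metrics.json").write_text(json.dumps({"example_metric": 0.5}))
    metrics = load_metrics(tmp)

for name, value in metrics.items():
    print(f"{name}: {value}")
```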
The software is primarily written by Tian Xie, with significant contributions from Xiang Fu.
The GNN codebase and many utility functions are adapted from the ocp-models by the Open Catalyst Project. In particular, the GNN implementations of DimeNet++ and GemNet are used.
The main structure of the codebase is built from NN Template.
For the datasets, Perov-5 is curated from Perovskite water-splitting, Carbon-24 is curated from AIRSS data for carbon at 10GPa, and MP-20 is curated from Materials Project.
Please consider citing the following paper if you find our code & data useful.
@article{xie2021crystal,
title={Crystal Diffusion Variational Autoencoder for Periodic Material Generation},
author={Xie, Tian and Fu, Xiang and Ganea, Octavian-Eugen and Barzilay, Regina and Jaakkola, Tommi},
journal={arXiv preprint arXiv:2110.06197},
year={2021}
}
Please leave an issue or reach out to Tian Xie (txie AT csail DOT mit DOT edu) if you have any questions.