This project aims at predicting both the energy accuracy and the simulation duration of a DFT simulation for a given set of input parameters and a prescribed chemical structure. Furthermore, we have implemented an approach to solve the inverse problem, i.e. to generate a set of computationally optimal input parameters for a prescribed chemical structure and energy accuracy.
This repository contains all the code used during the course of the project, including:
- scripts to prepare and launch DFT simulations on the EPFL cluster Fidis (note that access credentials and a computing time budget are required to launch them)
- scripts to parse the data from the JSON file and assemble it to the final raw dataset, i.e. with the chemical structures only given as strings
- scripts to assemble the final dataset for different encodings of the chemical structures. These datasets are not saved but assembled ad hoc when they are needed. However, the data loading routines are factored out into a Python module (see `code/tools/`), so that the datasets are easy to retrieve for other purposes.
- scripts to train our models
- notebooks to explore the data and analyse our models
- scripts to create the plots in the report
All the requirements to run this project can be found in `requirements.txt`.
To install them using pip, run the following command:
pip3 install -r requirements.txt
Please make sure you are at the root of the project before running the following commands.
This command generates a single `data.csv` from all the `data.json` files in the subfolders of `data`:
python3 code/data_preprocessing/parsing_utils.py
Note: since all the data are already parsed, running the previous command is not necessary.
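For orientation, the parsing step can be pictured roughly as follows. This is only a minimal sketch, not the contents of `parsing_utils.py`: it assumes that every `data.json` holds a list of flat records (one dictionary per simulation), and the function name `merge_json_to_csv` and the added `structure` column are hypothetical.

```python
import csv
import json
from pathlib import Path


def merge_json_to_csv(data_dir, csv_path):
    """Collect records from <data_dir>/<structure>/data.json into one CSV.

    Assumption: each data.json contains a list of flat dictionaries.
    Returns the number of rows written.
    """
    rows = []
    for json_file in sorted(Path(data_dir).glob("*/data.json")):
        with open(json_file) as f:
            records = json.load(f)
        for record in records:
            # Remember which structure folder the record came from.
            rows.append({"structure": json_file.parent.name, **record})

    # Use the union of all keys so that no field is silently dropped.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Under these assumptions, calling `merge_json_to_csv("data", "data/data.csv")` from the project root would mirror the layout described above.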
After the following commands are run, the trained models and the datasets they were trained and tested on are saved in the `models` folder.
python3 code/regression/delta_E.py
python3 code/regression/log_delta_E.py
python3 code/regression/sim_time.py
python3 code/classification/delta_E.py
Note: the terminal output of all these commands is automatically saved in the `baselines` folder in HTML format.
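The script names refer to three kinds of target: $\Delta E$ itself, its logarithm, and (for classification) its order of magnitude. As a rough illustration of how the latter two could be derived from a raw energy error, here is a minimal sketch; the base-10 logarithm, the use of the absolute value and the function names are assumptions, not taken from the project code.

```python
import math


def log_delta_e(delta_e):
    """Regression target: base-10 logarithm of the absolute energy error.

    Raises ValueError if delta_e is exactly zero.
    """
    return math.log10(abs(delta_e))


def magnitude_class(delta_e):
    """Classification target: order of magnitude of the energy error.

    For example, an error of 5e-3 falls into class -3.
    """
    return math.floor(math.log10(abs(delta_e)))
```

With this convention, each class covers one decade of $\Delta E$, so neighbouring classes differ by a factor of ten in accuracy.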
python3 code/hyperparameter_tuning/delta_E_class.py
python3 code/hyperparameter_tuning/delta_E_reg.py
python3 code/hyperparameter_tuning/log_delta_E.py
python3 code/hyperparameter_tuning/sim_time.py
Note: the terminal output of all these commands is automatically saved in the `hyperparameter_tuning` folder in HTML format.
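According to the folder overview, the tuning scripts rely on `RandomizedSearchCV` from scikit-learn. The snippet below is a generic sketch of that technique on synthetic data; the estimator, the parameter ranges and the scoring metric are placeholders chosen for illustration and do not reflect the project's actual configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real features and target.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Placeholder search space (assumption, not the project's settings).
param_distributions = {
    "n_estimators": [10, 25, 50],
    "max_depth": [3, 5, None],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # number of sampled parameter combinations
    cv=3,      # 3-fold cross-validation
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated mean absolute error of the best candidate
```

Increasing `n_iter` explores more of the search space at a proportionally higher cost, which is why many iterations can become too expensive for a standard machine.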
Before executing any commands in this section, please make sure you have trained and saved the models using the commands in the training models section.
The results of the optimization are saved in the `code/optimization/optimization_results.json` file.
python3 code/optimization/optimization.py
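The inverse problem can be read as a constrained search: among candidate input parameters, keep those whose predicted $\Delta E$ meets the prescribed accuracy and pick the one with the shortest predicted simulation time. The following is a toy sketch of that idea using an exhaustive grid. It is not the procedure from the report, and the parameter names (`cutoff`, `k_points`), the candidate values and the two stand-in predictors are invented purely for illustration.

```python
from itertools import product


def predicted_delta_e(cutoff, k_points):
    """Toy stand-in for a trained accuracy model (hypothetical)."""
    return 10.0 / (cutoff * k_points ** 2)


def predicted_sim_time(cutoff, k_points):
    """Toy stand-in for a trained runtime model (hypothetical)."""
    return 0.01 * cutoff * k_points ** 3


def cheapest_parameters(target_delta_e, cutoffs, k_point_grid):
    """Return the (cutoff, k_points) pair with the lowest predicted time
    among those whose predicted error does not exceed target_delta_e,
    or None if no candidate is accurate enough."""
    feasible = [
        (c, k)
        for c, k in product(cutoffs, k_point_grid)
        if predicted_delta_e(c, k) <= target_delta_e
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda p: predicted_sim_time(*p))


best = cheapest_parameters(0.01, cutoffs=[20, 40, 60, 80], k_point_grid=[2, 4, 6, 8])
print(best)
```

In practice the two toy functions would be replaced by the trained models saved in the training step, which is why those models must exist before the optimization is run.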
- `baselines`: HTML files with training results for different target models and different structure encoding methods.
- `code`
  - `code/data_scraping`: Julia scripts for launching simulations.
  - `code/data_preprocessing`: scripts to preprocess the data.
    - Julia scripts: parse simulation results to JSON files.
    - Python scripts: parse simulation results in JSON files to one CSV file, `data/data.csv`.
  - `code/eda`: notebooks for exploratory data analysis.
  - `code/tools`: toolbox containing base methods for data preprocessing, model training and evaluation.
    - `code/tools/encoding_periodic_table.ipynb`: generates the `code/tools/periodic_table_info.json` file, which contains information about the periodic table and is used in some encoding methods.
    - `code/tools/data_loader.py`: methods to load data from `data/data.csv`.
  - `code/sandbox`: notebooks for testing new ideas and debugging.
  - `code/regression`: scripts for training, evaluating and saving models for the different regression targets defined for this project.
  - `code/classification`: scripts for training, evaluating and saving models for the $\Delta E$ classification target (order of magnitude).
  - `code/hyperparameter_tuning`: scripts for hyperparameter tuning of the different models. They use RandomizedSearchCV from the sklearn library to find the best hyperparameters for the different targets. Note that a standard machine might not be able to handle the computational effort of many iterations.
  - `code/model_analysis`: notebooks for analyzing the predictions of the different models.
    - `code/model_analysis/baseline_analysis.ipynb`: analysis of the predictions of the different regression models.
    - `code/model_analysis/classification_decision_boundaries.ipynb`: generates figures displaying the decision boundaries of the $\Delta E$ classifier.
    - `code/model_analysis/decision_boundaries.py`: script that efficiently plots the decision boundary of the classifier for a given structure (cf. `classification_decision_boundaries.ipynb`).
  - `code/optimization`: contains a script implementing the simulation parameter optimization procedure described in the report.
- `data`: contains all the data files. In this folder you may find subfolders, named after the structures, which contain simulation results. You may also find 3 CSV files, including:
  - `data/data.csv`: contains the data used for the project. This dataset is built using all the data from the structure folders.
  - `data/ref_energy.csv`: contains the reference energy for each structure.
- `models`: folder in which trained models are saved (the Python scripts from `code/regression` save their models in this folder).
- `plots`: some plots are saved here.
You may contact us about the project via the following e-mail addresses:
- Martin Uhrin: martin.uhrin@epfl.ch (supervisor)
- Louis Ponet: louis.ponet@epfl.ch (co-supervisor)
- Auguste Poiroux: auguste.poiroux@epfl.ch
- Nataliya Paulish: nataliya.paulish@epfl.ch
- Philipp Weder: philipp.weder@epfl.ch
Licensed under the MIT License
© 2021 anp