We present the networks, weights and stacking algorithms for our third-place solution in IDAO 2022.
- Team members
- Citation
- Overview
- Data
- Method
- Results
- Installation and Dependencies
- Structure
Empty here, hope not for long.
Two-dimensional transition metal dichalcogenides (TMDCs) are relatively new and still largely unexplored materials. These materials can contain naturally occurring defects that are extremely important for material properties and performance. Predicting the band gap (the energy difference between the valence band and the conduction band) is extremely important for understanding the conducting properties of a material. In the semi-finals of the International Data Analysis Olympiad (IDAO) 2022, participants were asked to create algorithms that are both accurate and fast at predicting a material's band gap from its molecular structure. Our team finished third, and in this work we share a description of our approach, the network weights, and code sufficient for inference.
Two-dimensional transition metal dichalcogenides (TMDCs) are relatively new types of materials that have remarkable properties ranging from semiconducting, metallic, magnetic, and superconducting to optical. The chemical composition of TMDCs is MX₂, where M is a transition metal (most commonly molybdenum or tungsten) and X is usually sulfur or selenium. Atomically thin TMDCs usually contain various defects, which enrich the lattice structure and give rise to many intriguing properties. Engineered point defects in two-dimensional (2D) materials offer an attractive platform for solid-state devices that exploit tailored optoelectronic, quantum emission, and resistive properties. Naturally occurring defects are also unavoidably important contributors to material properties and performance. The immense variety and complexity of possible defects make it challenging to experimentally control, probe, or understand atomic-scale defect-property relationships. The figure above shows vacancy and substitution defects in an 8x8 MoS₂ crystal lattice.
The band gap is one of the important physical attributes of a material: it helps derive material qualities such as electric conductivity, catalytic power, and photo-optical properties. The band gap is the energy difference between the valence band and the conduction band, and is closely related to the energy difference between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO). Materials with overlapping bands or a very small band gap are conductors, materials with a small band gap are semiconductors, and materials with a large band gap are insulators.
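In formula form (the standard textbook definition; the notation is ours, not competition-specific):

$$E_g = E_{\mathrm{CBM}} - E_{\mathrm{VBM}}$$

where $E_{\mathrm{CBM}}$ is the conduction band minimum and $E_{\mathrm{VBM}}$ is the valence band maximum; the prediction target is $E_g$ in eV.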
The task is to predict the band gap energy for each crystal structure.
Energy within Threshold (EwT) is designed to measure the practical usefulness of a model for replacing DFT by evaluating whether the predicted energy is close to the ground truth (DFT energy). EwT is defined as the fraction of structures in which the predicted energy is within 0.02 eV (electronvolts) of the ground truth energy.
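For reference, here is a minimal sketch of the metric in Python (the function name and signature are ours, not the organizers' code):

```python
import numpy as np

def energy_within_threshold(y_true: np.ndarray, y_pred: np.ndarray,
                            threshold: float = 0.02) -> float:
    """Fraction of structures whose predicted band gap lies within
    `threshold` eV of the ground-truth (DFT) value."""
    return float(np.mean(np.abs(y_true - y_pred) < threshold))
```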
The training dataset is in the `data` directory of the baseline and is structured as a directory called `structures` containing 2967 crystal structures, each stored as a JSON file named with a unique identifier. Each file holds a pymatgen structure (check the pymatgen documentation for reference) with the crystal parameters, the Cartesian coordinates of each atom, the atom types, and other information. The targets are stored in a CSV file named `targets.csv` with two columns: the first is the unique identifier of the structure and the other is the band gap value for that structure. The train and test sets are constructed by sampling the corresponding subset without replacement.
Train/test samples:
- The training sample contains 1796 examples.
- The public test sample contains 1484 examples.
- The private test sample contains 1483 examples.
```
data
├── dichalcogenides_private
│   └── structures
│       ├── 6149087231cf3ef3d4a9f848.json
│       ├── 6149c48031cf3ef3d4a9f84a.json
│       └── ...
└── dichalcogenides_public
    ├── structures
    │   ├── 6146dd853ac25c70a5c6cdeb.json
    │   ├── 6146e9103ac25c70a5c6cded.json
    │   └── ...
    └── targets.csv
```
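A minimal sketch of loading this layout with pymatgen and pandas (paths follow the tree above; the exact columns of `targets.csv` are an assumption):

```python
import json
from pathlib import Path

import pandas as pd
from pymatgen.core import Structure

data_root = Path("data/dichalcogenides_public")

# Band gap targets: an identifier column and a band gap column (names assumed).
targets = pd.read_csv(data_root / "targets.csv")

# Each JSON file is a serialized pymatgen Structure named by its identifier.
structures = {
    path.stem: Structure.from_dict(json.loads(path.read_text()))
    for path in (data_root / "structures").glob("*.json")
}
```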
Our converted dataset is in the `data` directory and contains `eval` (private) and `train` (public) directories. Inside them one can find `defects` and `no_defects` folders: the original structures in different formats are stored in the `no_defects` directory, while the `defects` directory contains the complement of the original crystal lattice (a sketch of the idea follows below). Inside the `defects` and `no_defects` directories one can find the original data in pymatgen format, jarvis-adapted structures, CFID-described structures (for more info check out its source code) and graph features (only for the `no_defects` directories; for more info check out `adhoc/scripts/graph_features.py:19`).
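To illustrate what the `defects` complements are (a simplified sketch of the idea, not the exact `adhoc/scripts/atoms_to_defects.py` implementation), the complement keeps the sites of a pristine reference lattice that are vacant or substituted in a given sample:

```python
from pymatgen.core import Structure

def complement_sites(reference: Structure, sample: Structure, tol: float = 0.1):
    """Return reference sites that are vacant or substituted in `sample`.

    Assumes both structures share the same supercell lattice, so site
    distances are directly comparable.
    """
    defect_sites = []
    for ref_site in reference:
        matches = [s for s in sample if s.distance(ref_site) < tol]
        # No atom near the reference site -> vacancy;
        # a different species there -> substitution.
        if not matches or matches[0].species != ref_site.species:
            defect_sites.append(ref_site)
    return defect_sites
```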
You can use the data in Kaggle dataset format:
Structure samples:
Data structure:
```
data
├── eval
│   ├── defects
│   │   ├── cfid
│   │   │   └── eval.csv
│   │   ├── cifs
│   │   │   ├── 6149087231cf3ef3d4a9f848.cif
│   │   │   ├── 6149c48031cf3ef3d4a9f84a.cif
│   │   │   ├── ...
│   │   │   └── atom_init.json
│   │   ├── jarvis
│   │   │   ├── 6149c48031cf3ef3d4a9f84a.vasp
│   │   │   ├── 6149f3853ac25c70a5c6ce01.vasp
│   │   │   └── ...
│   │   └── pymatgen
│   │       ├── 6149087231cf3ef3d4a9f848.json
│   │       ├── 6149c48031cf3ef3d4a9f84a.json
│   │       └── ...
│   └── no_defects
│       ├── cfid
│       │   └── eval.csv
│       ├── cifs
│       │   ├── 6149087231cf3ef3d4a9f848.cif
│       │   ├── 6149c48031cf3ef3d4a9f84a.cif
│       │   ├── ...
│       │   └── atom_init.json
│       ├── graph
│       │   └── eval.csv
│       ├── jarvis
│       │   ├── 6149c48031cf3ef3d4a9f84a.vasp
│       │   ├── 6149f3853ac25c70a5c6ce01.vasp
│       │   └── ...
│       └── pymatgen
│           ├── 6149087231cf3ef3d4a9f848.json
│           ├── 6149c48031cf3ef3d4a9f84a.json
│           └── ...
└── train
    ├── defects
    │   ├── cfid
    │   │   └── train.csv
    │   ├── cifs
    │   │   ├── 6146dd853ac25c70a5c6cdeb.cif
    │   │   ├── 6146e9103ac25c70a5c6cded.cif
    │   │   ├── ...
    │   │   └── atom_init.json
    │   ├── graph
    │   │   └── train.csv
    │   ├── jarvis
    │   │   ├── 6146dd853ac25c70a5c6cdeb.vasp
    │   │   ├── 6146e9103ac25c70a5c6cded.vasp
    │   │   └── ...
    │   └── pymatgen
    │       ├── 6146dd853ac25c70a5c6cdeb.json
    │       ├── 6146e9103ac25c70a5c6cded.json
    │       └── ...
    └── no_defects
        ├── cfid
        │   └── eval.csv
        ├── cifs
        │   ├── 6146dd853ac25c70a5c6cdeb.cif
        │   ├── 6146e9103ac25c70a5c6cded.cif
        │   ├── ...
        │   └── atom_init.json
        ├── jarvis
        │   ├── 6146dd853ac25c70a5c6cdeb.vasp
        │   ├── 6146e9103ac25c70a5c6cded.vasp
        │   └── ...
        └── pymatgen
            ├── 6146dd853ac25c70a5c6cdeb.json
            ├── 6146e9103ac25c70a5c6cded.json
            └── ...
```
The original ALIGNN and MEGNet frameworks were fine-tuned and used in the following way:
- First of all, EDA was conducted: EDA source code.
- We described the data using the CFID descriptor from the JARVIS-ML package; the source code is here.
- Secondly, we used complements of our structures (Schottky defects) for predictions. We computed the complement structures (code in `adhoc/scripts/atoms_to_defects.py`) for the following steps. Structure examples are here.
- Next, some graph features from the `networkx.algorithms` package were computed; see `adhoc/scripts/graph_features.py` and the functions' documentation for more info. A sketch follows below.

All these steps are performed in `adhoc/datasets_converter.ipynb`.
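As an illustration of the graph-feature step (the real features live in `adhoc/scripts/graph_features.py`; the distance cutoff and the particular descriptors below are our assumptions):

```python
import networkx as nx
from pymatgen.core import Structure

def graph_features(structure: Structure, cutoff: float = 3.5) -> dict:
    """Connect atoms closer than `cutoff` Å into a crystal graph and
    compute a few scalar descriptors with networkx.algorithms."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(structure)))
    centers, neighbors, _, _ = structure.get_neighbor_list(cutoff)
    graph.add_edges_from(zip(centers.tolist(), neighbors.tolist()))
    return {
        "avg_degree": 2 * graph.number_of_edges() / graph.number_of_nodes(),
        "avg_clustering": nx.average_clustering(graph),
        "components": nx.number_connected_components(graph),
    }
```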
- Afterwards we fine-tuned 4 pre-trained ALIGNN nets twice on pymatgen-adapted data using Google Colab; the starting models can be found in the `models/ALIGNN/pretrained` directory. The resulting models are in the `models/ALIGNN/fine-fine-tuned` directory, and ALIGNN was forked for better usability.
- Then we tuned 4 pre-trained ALIGNN models on the complementary-structure data. The resulting models are in `models/ALIGNN/defects`.

These steps are performed in `adhoc/ALIGNN_train_inference.ipynb`; here's the Colab notebook.
- Then MEGNet was trained on complement structures only, and CFID descriptors for the complement structures were computed too.

This step is in `adhoc/MEGNet_train_inference.ipynb`.

- Predictions of the 8 ALIGNN nets (for structures and complements) were computed, mixed with the MEGNet predictions, and fed to a `CatBoostRegressor` together with the graph features and descriptors; see the stacking sketch below.

This step can be found in `adhoc/CatBoost_train_inference.ipynb`; here's the Kaggle notebook.

...PROFIT!
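A minimal sketch of that final stacking step (the file names and columns below are illustrative assumptions; the actual code lives in `adhoc/CatBoost_train_inference.ipynb`):

```python
import pandas as pd
from catboost import CatBoostRegressor

# Hypothetical per-structure inputs, all indexed by the structure identifier:
# base-model predictions plus the handcrafted features.
alignn = pd.read_csv("predictions/alignn_train.csv", index_col=0)  # 8 columns
megnet = pd.read_csv("predictions/megnet_train.csv", index_col=0)
features = pd.read_csv("data/train/no_defects/graph/train.csv", index_col=0)
targets = pd.read_csv("data/targets.csv", index_col=0)

# The meta-model stacks base predictions with graph/CFID features.
X = pd.concat([alignn, megnet, features], axis=1)
y = targets.iloc[:, 0]

model = CatBoostRegressor(loss_function="MAE", verbose=False)
model.fit(X, y)
```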
In case you want to test our model, you can use two notebooks:
- `adhoc/datasets_converter.ipynb`
- `adhoc/All_inferences.ipynb`

⚠️ The last one has not been tested properly.
- Google Colab Pro+ with a Tesla V100 for ALIGNN training and inference,
- MacBook Pro 2019 CPU (1.4 GHz Quad-Core Intel Core i5) for MEGNet training and inference, and for CatBoost inference,
- Kaggle CPU for CatBoost training.
- Colab notebook with ALIGNN fine-tuning and inference.
- Colab notebook with ALIGNN inference.
- Colab notebook with dataset download scripts (from jarvis), unused in the final submission.
- Big dataset for Track 1 on Kaggle with features and predictions for gradient boosting.
- Notebook for Track 1 on Kaggle: stacking ALIGNN and MEGNet regressors using self-made graph features by @sunruslan and the jarvis CFID descriptor.
- Downsized dataset for Track 2 on Kaggle with graph features and CFID descriptors for mixing with MEGNet, used for training the final model on Track 2.
- Notebook for Track 2 on Kaggle for boosting training over MEGNet predictions and Track 2 dataset features. For more information about the stacking method, check the Track 2 branch.
Results on the public and private leaderboards, with respect to the metric used by the organizers of IDAO 2022.

LB results:

| A score | A_pub | A_private | B score | B_pub | B_private | Total |
|---------|-------|-----------|---------|-------|-----------|-------|
| 0.923   | 0.933 | 0.923     | 0.929   | 0.930 | 0.929     | 1.952 |

This is TOP-3 overall, TOP-3 for Track 2 and TOP-5 for Track 1.
You will need:
- `python` 3.9.6, which can be installed with `pyenv` (from here)
- `poetry`: `pip install poetry`
- all the needed packages from `pyproject.toml` in your own `venv`

To set everything up:
- install and select the Python version: `pyenv install 3.9.6 && pyenv local 3.9.6`
- install the dependencies with `poetry` (instructions can be found here): `poetry update` uses the `pyproject.toml` (or a `requirements.txt`) with the needed dependencies
- initialize the pre-commit hook with `poetry run pre-commit install`
- to get the data, you can find the `get_data.sh` script in the `data/` folder: `cd data/ && /bin/bash get_data.sh`
- `ad-hoc`: a directory for notebooks and ad-hoc scripts.
  - Contains everybody's sandboxes.
- `scripts`: a directory for models and training scripts.
  - Please create separate branches for any hypothesis you have.
- `edas`: Exploratory Data Analysis notebooks.
  - `trainPreds4EDA`: useful predictions on train for further research purposes.
- `data`: a directory for datasets.
- `configs`: unused directory with sample ALIGNN and MEGNet configuration files.
- `images`: service directory for READMEs.
- `models`: directory with ALIGNN and MEGNet model weights.
- `predictions`: stored predictions from ALIGNN, MEGNet and CatBoost.