We present the networks, weights and stacking algorithms for our third-place solution in IDAO 2022.
- Team members
- Citation
- Overview
- Data
- Method
- Results
- Installation and Dependencies
- Structure
Empty here, hope not for long.
Two-dimensional transition metal dichalcogenides (TMDCs) are relatively new and still largely unexplored materials. These materials can contain naturally occurring defects that are extremely important for material properties and performance. Predicting the band gap (the energy difference between the valence band and the conduction band) is extremely important for understanding the conducting properties of a material. In the semi-finals of the International Data Analysis Olympiad (IDAO) 2022, participants were asked to create algorithms that are both accurate and fast at predicting a material's band gap from its molecular structure. Our team finished third, and in this work we share a description of our approach, the network weights, and code sufficient for inference.
Two-dimensional transition metal dichalcogenides (TMDCs) are relatively new types of materials that have remarkable properties ranging from semiconducting, metallic, magnetic, and superconducting to optical. The chemical composition of TMDCs is MX₂, where M is a transition metal (most commonly molybdenum or tungsten) and X is usually sulfur or selenium. Atomically thin TMDCs usually contain various defects, which enrich the lattice structure and give rise to many intriguing properties. Engineered point defects in two-dimensional (2D) materials offer an attractive platform for solid-state devices that exploit tailored optoelectronic, quantum emission, and resistive properties. Naturally occurring defects are also unavoidably important contributors to material properties and performance. The immense variety and complexity of possible defects make it challenging to experimentally control, probe, or understand atomic-scale defect-property relationships. The figure above shows vacancy and substitution defects in an 8x8 MoS₂ crystal lattice.
The band gap is one of the important physical attributes of a material: it helps derive material qualities such as electric conductivity, catalytic power, and photo-optical properties. The band gap is the energy difference between the valence band and the conduction band, and is closely related to the energy difference between the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO). Materials with overlapping bands or a very small band gap are conductors, materials with a small band gap are semiconductors, and materials with a large band gap are insulators.
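In formula form (the standard textbook definition; the notation is ours, not competition-specific):

$$E_g = E_{\mathrm{CBM}} - E_{\mathrm{VBM}}$$

where $E_{\mathrm{CBM}}$ is the conduction band minimum and $E_{\mathrm{VBM}}$ is the valence band maximum; the prediction target is $E_g$ in eV.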
The task is to predict the band gap energy for each crystal structure.
Energy within Threshold (EwT) is designed to measure the practical usefulness of a model for replacing DFT by evaluating whether the predicted energy is close to the ground truth (DFT energy). EwT is defined as the fraction of structures in which the predicted energy is within 0.02 eV (electronvolts) of the ground truth energy.
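For reference, here is a minimal sketch of the metric in Python (the function name and signature are ours, not the organizers' code):

```python
import numpy as np

def energy_within_threshold(y_true: np.ndarray, y_pred: np.ndarray,
                            threshold: float = 0.02) -> float:
    """Fraction of structures whose predicted band gap lies within
    `threshold` eV of the ground-truth (DFT) value."""
    return float(np.mean(np.abs(y_true - y_pred) < threshold))
```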
The training dataset is in the `data` directory of the baseline and is structured as a directory called `structures` containing 2967 crystal structures, each stored as a JSON file named with a unique identifier. Each file holds a pymatgen structure (check the pymatgen documentation for reference) with the crystal parameters, the Cartesian coordinates of each atom, the atom types, and other information. The targets are stored in a CSV file named `targets.csv` with two columns: the first is the unique identifier of the structure and the other is the band gap value for that structure. The train and test sets are constructed by sampling the corresponding subset without replacement.
Train/test samples:
- The training sample contains 1796 examples.
- The public test sample contains 1484 examples.
- The private test sample contains 1483 examples.
```
data
├── dichalcogenides_private
│   └── structures
│       ├── 6149087231cf3ef3d4a9f848.json
│       ├── 6149c48031cf3ef3d4a9f84a.json
│       └── ...
└── dichalcogenides_public
    ├── structures
    │   ├── 6146dd853ac25c70a5c6cdeb.json
    │   ├── 6146e9103ac25c70a5c6cded.json
    │   └── ...
    └── targets.csv
```
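A minimal sketch of loading this layout with pymatgen and pandas (paths follow the tree above; the exact columns of `targets.csv` are an assumption):

```python
import json
from pathlib import Path

import pandas as pd
from pymatgen.core import Structure

data_root = Path("data/dichalcogenides_public")

# Band gap targets: an identifier column and a band gap column (names assumed).
targets = pd.read_csv(data_root / "targets.csv")

# Each JSON file is a serialized pymatgen Structure named by its identifier.
structures = {
    path.stem: Structure.from_dict(json.loads(path.read_text()))
    for path in (data_root / "structures").glob("*.json")
}
```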
Our converted dataset is in the `data` directory and contains `eval` (private) and `train` (public) directories. Inside them one can find `defects` and `no_defects` folders: the original structures in different formats are stored in the `no_defects` directory, while the `defects` directory contains the complement of the original crystal lattice (a sketch of the idea follows below). Inside the `defects` and `no_defects` directories one can find the original data in pymatgen format, jarvis-adapted structures, CFID-described structures (for more info check out its source code) and graph features (only for the `no_defects` directories; for more info check out `adhoc/scripts/graph_features.py:19`).
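To illustrate what the `defects` complements are (a simplified sketch of the idea, not the exact `adhoc/scripts/atoms_to_defects.py` implementation), the complement keeps the sites of a pristine reference lattice that are vacant or substituted in a given sample:

```python
from pymatgen.core import Structure

def complement_sites(reference: Structure, sample: Structure, tol: float = 0.1):
    """Return reference sites that are vacant or substituted in `sample`.

    Assumes both structures share the same supercell lattice, so site
    distances are directly comparable.
    """
    defect_sites = []
    for ref_site in reference:
        matches = [s for s in sample if s.distance(ref_site) < tol]
        # No atom near the reference site -> vacancy;
        # a different species there -> substitution.
        if not matches or matches[0].species != ref_site.species:
            defect_sites.append(ref_site)
    return defect_sites
```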
You can use the data in Kaggle dataset format:
Structure samples:
Data structure:
```
data
├── eval
│   ├── defects
│   │   ├── cfid
│   │   │   └── eval.csv
│   │   ├── cifs
│   │   │   ├── 6149087231cf3ef3d4a9f848.cif
│   │   │   ├── 6149c48031cf3ef3d4a9f84a.cif
│   │   │   ├── ...
│   │   │   └── atom_init.json
│   │   ├── jarvis
│   │   │   ├── 6149c48031cf3ef3d4a9f84a.vasp
│   │   │   ├── 6149f3853ac25c70a5c6ce01.vasp
│   │   │   └── ...
│   │   └── pymatgen
│   │       ├── 6149087231cf3ef3d4a9f848.json
│   │       ├── 6149c48031cf3ef3d4a9f84a.json
│   │       └── ...
│   └── no_defects
│       ├── cfid
│       │   └── eval.csv
│       ├── cifs
│       │   ├── 6149087231cf3ef3d4a9f848.cif
│       │   ├── 6149c48031cf3ef3d4a9f84a.cif
│       │   ├── ...
│       │   └── atom_init.json
│       ├── graph
│       │   └── eval.csv
│       ├── jarvis
│       │   ├── 6149c48031cf3ef3d4a9f84a.vasp
│       │   ├── 6149f3853ac25c70a5c6ce01.vasp
│       │   └── ...
│       └── pymatgen
│           ├── 6149087231cf3ef3d4a9f848.json
│           ├── 6149c48031cf3ef3d4a9f84a.json
│           └── ...
└── train
    ├── defects
    │   ├── cfid
    │   │   └── train.csv
    │   ├── cifs
    │   │   ├── 6146dd853ac25c70a5c6cdeb.cif
    │   │   ├── 6146e9103ac25c70a5c6cded.cif
    │   │   ├── ...
    │   │   └── atom_init.json
    │   ├── graph
    │   │   └── train.csv
    │   ├── jarvis
    │   │   ├── 6146dd853ac25c70a5c6cdeb.vasp
    │   │   ├── 6146e9103ac25c70a5c6cded.vasp
    │   │   └── ...
    │   └── pymatgen
    │       ├── 6146dd853ac25c70a5c6cdeb.json
    │       ├── 6146e9103ac25c70a5c6cded.json
    │       └── ...
    └── no_defects
        ├── cfid
        │   └── eval.csv
        ├── cifs
        │   ├── 6146dd853ac25c70a5c6cdeb.cif
        │   ├── 6146e9103ac25c70a5c6cded.cif
        │   ├── ...
        │   └── atom_init.json
        ├── jarvis
        │   ├── 6146dd853ac25c70a5c6cdeb.vasp
        │   ├── 6146e9103ac25c70a5c6cded.vasp
        │   └── ...
        └── pymatgen
            ├── 6146dd853ac25c70a5c6cdeb.json
            ├── 6146e9103ac25c70a5c6cded.json
            └── ...
```
The original ALIGNN and MEGNet frameworks were fine-tuned and used in the following way:
- First of all, EDA was conducted: EDA source code.
- We described the data using the CFID descriptor from the JARVIS-ML package; the source code is here.
- Secondly, we used complements of our structures (Schottky defects) for predictions. We computed the complement structures (code in `adhoc/scripts/atoms_to_defects.py`) for the following steps. Structure examples are here.
- Next, some graph features from the `networkx.algorithms` package were computed; see `adhoc/scripts/graph_features.py` and the functions' documentation for more info. A sketch follows below.

All these steps are performed in `adhoc/datasets_converter.ipynb`.
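As an illustration of the graph-feature step (the real features live in `adhoc/scripts/graph_features.py`; the distance cutoff and the particular descriptors below are our assumptions):

```python
import networkx as nx
from pymatgen.core import Structure

def graph_features(structure: Structure, cutoff: float = 3.5) -> dict:
    """Connect atoms closer than `cutoff` Å into a crystal graph and
    compute a few scalar descriptors with networkx.algorithms."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(structure)))
    centers, neighbors, _, _ = structure.get_neighbor_list(cutoff)
    graph.add_edges_from(zip(centers.tolist(), neighbors.tolist()))
    return {
        "avg_degree": 2 * graph.number_of_edges() / graph.number_of_nodes(),
        "avg_clustering": nx.average_clustering(graph),
        "components": nx.number_connected_components(graph),
    }
```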
- Afterwards we fine-tuned 4 pre-trained ALIGNN nets twice on pymatgen-adapted data using Google Colab; the starting models can be found in the `models/ALIGNN/pretrained` directory. The resulting models are in the `models/ALIGNN/fine-fine-tuned` directory, and ALIGNN was forked for better usability.
- Then we tuned 4 pre-trained ALIGNN models on the complementary-structure data. The resulting models are in `models/ALIGNN/defects`.

These steps are performed in `adhoc/ALIGNN_train_inference.ipynb`; here's the Colab notebook.
- Then MEGNet was trained on complement structures only, and CFID descriptors for the complement structures were computed too.

This step is in `adhoc/MEGNet_train_inference.ipynb`.

- Predictions of the 8 ALIGNN nets (for structures and complements) were computed, mixed with the MEGNet predictions, and fed to a `CatBoostRegressor` together with the graph features and descriptors; see the stacking sketch below.

This step can be found in `adhoc/CatBoost_train_inference.ipynb`; here's the Kaggle notebook.

...PROFIT!
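A minimal sketch of that final stacking step (the file names and columns below are illustrative assumptions; the actual code lives in `adhoc/CatBoost_train_inference.ipynb`):

```python
import pandas as pd
from catboost import CatBoostRegressor

# Hypothetical per-structure inputs, all indexed by the structure identifier:
# base-model predictions plus the handcrafted features.
alignn = pd.read_csv("predictions/alignn_train.csv", index_col=0)  # 8 columns
megnet = pd.read_csv("predictions/megnet_train.csv", index_col=0)
features = pd.read_csv("data/train/no_defects/graph/train.csv", index_col=0)
targets = pd.read_csv("data/targets.csv", index_col=0)

# The meta-model stacks base predictions with graph/CFID features.
X = pd.concat([alignn, megnet, features], axis=1)
y = targets.iloc[:, 0]

model = CatBoostRegressor(loss_function="MAE", verbose=False)
model.fit(X, y)
```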
In case you want to test our model, you can use two notebooks:
- `adhoc/datasets_converter.ipynb`
- `adhoc/All_inferences.ipynb`

⚠️ The last one has not been tested properly.
- Google Colab Pro+ with a Tesla V100 for ALIGNN training and inference,
- MacBook Pro 2019 CPU (1.4 GHz Quad-Core Intel Core i5) for MEGNet training and inference, and for CatBoost inference,
- Kaggle CPU for CatBoost training.
- Colab notebook with ALIGNN fine-tuning and inference.
- Colab notebook with ALIGNN inference.
- Colab notebook with dataset download scripts (from jarvis), unused in the final submission.
- Big dataset for Track 1 on Kaggle with features and predictions for gradient boosting.
- Notebook for Track 1 on Kaggle: stacking ALIGNN and MEGNet regressors using self-made graph features by @sunruslan and the jarvis CFID descriptor.
- Downsized dataset for Track 2 on Kaggle with graph features and CFID descriptors for mixing with MEGNet, used for training the final model on Track 2.
- Notebook for Track 2 on Kaggle for boosting training over MEGNet predictions and Track 2 dataset features. For more information about the stacking method, check the Track 2 branch.
Results on the public and private leaderboards, with respect to the metric used by the organizers of IDAO 2022.

LB results:

| A score | A_pub | A_private | B score | B_pub | B_private | Total |
|---------|-------|-----------|---------|-------|-----------|-------|
| 0.923   | 0.933 | 0.923     | 0.929   | 0.930 | 0.929     | 1.952 |

This is TOP-3 overall, TOP-3 for Track 2 and TOP-5 for Track 1.
You will need:
- `python` 3.9.6, which can be installed with `pyenv` (from here)
- `poetry`: `pip install poetry`
- all the needed packages from `pyproject.toml` in your own `venv`

To set everything up:
- install and select the Python version: `pyenv install 3.9.6 && pyenv local 3.9.6`
- install the dependencies with `poetry` (instructions can be found here): `poetry update` uses the `pyproject.toml` (or a `requirements.txt`) with the needed dependencies
- initialize the pre-commit hook with `poetry run pre-commit install`
- to get the data, you can find the `get_data.sh` script in the `data/` folder: `cd data/ && /bin/bash get_data.sh`
- `ad-hoc`: a directory for notebooks and ad-hoc scripts.
  - Contains everybody's sandboxes.
- `scripts`: a directory for models and training scripts.
  - Please create separate branches for any hypothesis you have.
- `edas`: Exploratory Data Analysis notebooks.
  - `trainPreds4EDA`: useful predictions on train for further research purposes.
- `data`: a directory for datasets.
- `configs`: unused directory with sample ALIGNN and MEGNet configuration files.
- `images`: service directory for READMEs.
- `models`: directory with ALIGNN and MEGNet model weights.
- `predictions`: stored predictions from ALIGNN, MEGNet and CatBoost.