Structure-based out-of-distribution (OOD) material property prediction: a benchmark study

OOD_Materials_Benchmark

GitHub repository for our manuscript, "Structure-based out-of-distribution (OOD) material property prediction: a benchmark study" [arXiv].

Authors: Sadman Sadeed Omee, Nihang Fu, Ming Hu, and Jianjun Hu.

Machine Learning and Evolution Laboratory,
Department of Computer Science and Engineering,
University of South Carolina,
SC, 29201, United States.

Evaluating material property prediction models by randomly splitting the dataset frequently yields artificially high performance estimates, because typical materials datasets are inherently redundant. In real-world scenarios, machine learning (ML) models are usually employed to predict properties of novel, exceptional materials that deviate from the training set distribution. An objective evaluation of ML models for property prediction of out-of-distribution (OOD) samples is therefore a pressing need. Here we present a comprehensive benchmark study of structure-based graph neural networks (GNNs) for OOD materials property prediction. We formulate five different categories of OOD problems (LOCO, SparseXcluster, SparseYcluster, SparseXsingle, and SparseYsingle) for three benchmark datasets from the MatBench study, and perform extensive experiments.

Algorithms

We chose the following graph neural network (GNN) algorithms for our benchmark study. Several implementations of these algorithms exist in different repositories; the links below point to the original implementations by the respective authors. For more details, please email the corresponding author at jianjunh@cse.sc.edu.

  1. CGCNN
  2. MEGNet
  3. SchNet
  4. DimeNet++
  5. ALIGNN
  6. DeeperGATGNN
  7. coGN
  8. coNGN

Dataset

Datasets used in our work can be found in the data.zip file. The targets.csv files contain the ground truth properties. We used three datasets from the MatBench study. For simplicity, we refer to the matbench_dielectric dataset as the ‘dielectric dataset’, the matbench_log_gvrh dataset as the ‘elasticity dataset’, and the matbench_perovskites dataset as the ‘perovskites dataset’.

Target generation

In this work, we specifically concentrate on instances where the target set comprises no labeled samples. Accordingly, we propose the following target set generation methods to simulate real-world conditions for materials property prediction by creating 50 different folds for each method, where the test set for each fold differs in distribution from the train set.
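To make the idea of a distribution-shifted fold concrete, here is a minimal sketch of a generic leave-one-cluster-out split, assuming that LOCO refers to the usual leave-one-cluster-out scheme. The function name `leave_one_cluster_out` and the toy labels are hypothetical; this is not the code used to build the folds in this repository, and it does not carve out a validation set.

```python
from collections import defaultdict


def leave_one_cluster_out(cluster_labels):
    """Return one (train_indices, test_indices) pair per cluster.

    cluster_labels[i] is the cluster assigned to sample i. How the clusters
    are obtained (features, clustering algorithm) is outside this sketch.
    """
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)

    folds = []
    for label in sorted(members):
        held_out = set(members[label])
        # Every sample outside the held-out cluster goes to training.
        train = [i for i in range(len(cluster_labels)) if i not in held_out]
        folds.append((train, members[label]))
    return folds


# Toy example with three clusters (hypothetical labels).
labels = [0, 1, 0, 2, 1]
folds = leave_one_cluster_out(labels)
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each fold holds out one whole cluster, so the test samples come from a region that the training set does not cover. With 50 clusters, this scheme would yield 50 folds.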

The train, validation, and test data for each fold of each dataset can be found in the folds.zip file. The path to the fold files for each category of targets is given below (the possible options are written in curly braces):

folds/{dielectric,elasticity,perovskites}_folds/{train,val,test}/OFM_dielectric_{LOCO,SparseXcluster,SparseYcluster,SparseXsingle,SparseYsingle}_target_clusters50_{train,val,test}.json
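The brace template above can be expanded into concrete paths with a short script. This is a sketch under two assumptions: the split named in the directory matches the split named in the file, and the `OFM_dielectric_` prefix applies as written in the template. The helper name `fold_paths` is hypothetical.

```python
from itertools import product

DATASETS = ["dielectric", "elasticity", "perovskites"]
METHODS = ["LOCO", "SparseXcluster", "SparseYcluster",
           "SparseXsingle", "SparseYsingle"]
SPLITS = ["train", "val", "test"]


def fold_paths(root="folds"):
    """Yield (dataset, method, split, path) for each file the template describes."""
    for dataset, method, split in product(DATASETS, METHODS, SPLITS):
        # Assumption: directory split and file-name split are the same.
        # The "OFM_dielectric_" prefix is copied literally from the template.
        name = f"OFM_dielectric_{method}_target_clusters50_{split}.json"
        yield dataset, method, split, f"{root}/{dataset}_folds/{split}/{name}"


paths = list(fold_paths())
print(len(paths))   # 3 datasets x 5 methods x 3 splits
print(paths[0][3])  # first expanded path
```

If the extracted archive uses a different layout, adjust the format strings rather than the lists of options.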

These JSON files contain the IDs of the materials to be used in each of the 50 folds, for each target generation method and each dataset. To see how the five target generation methods were implemented, unzip the target_generation.zip file.
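The internal layout of the fold files is not described here, so a reasonable first step is to inspect one before writing a loader. The sketch below makes no assumption about the layout and only reports the top-level type, length, and (for objects) the first few keys. The helper name `summarize_json` is hypothetical.

```python
import json
import sys


def summarize_json(path):
    """Report the top-level structure of a JSON file without assuming its layout."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)

    summary = {"type": type(data).__name__}
    if isinstance(data, (list, dict)):
        summary["length"] = len(data)
    if isinstance(data, dict):
        # Show only a handful of keys so large files stay readable.
        summary["first_keys"] = list(data)[:5]
    return summary


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_fold.py <path-to-fold-json>
    print(summarize_json(sys.argv[1]))
```

Running it on one train file and the matching test file should show whether the 50 folds are stored as a list or keyed by fold identifier.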

Contributors

  1. Sadman Sadeed Omee (https://www.sadmanomee.com/)
  2. Dr. Jianjun Hu (http://www.cse.sc.edu/~jianjunh/)