GitHub repository for our manuscript "Structure-based out-of-distribution (OOD) material property prediction: a benchmark study" [arxiv]
Authors: Sadman Sadeed Omee, Nihang Fu, Ming Hu, and Jianjun Hu.
Machine Learning and Evolution Laboratory,
Department of Computer Science and Engineering,
University of South Carolina,
SC, 29201, United States.
Traditional performance evaluation of material property prediction models through random splitting of the dataset frequently results in artificially high performance assessments due to inherent redundancy of typical material datasets. In real-world scenarios, machine learning (ML) models are usually employed to predict properties of novel exceptional materials that deviate from the training set distribution. It is thus a pressing question to provide an objective evaluation of ML models for property prediction of out-of-distribution (OOD) samples. Here we present a comprehensive benchmark study of structure-based graph neural networks (GNNs) for OOD materials property prediction. We formulate five different categories (LOCO, SparseXcluster, SparseYcluster, SparseXsingle, and SparseYsingle) of OOD problems for three benchmark datasets from the MatBench study, and perform extensive experiments.
We chose the following graph neural network (GNN) algorithms for our benchmark study. These algorithms have multiple implementations in different repositories; the links below point to the original implementations by the corresponding authors. For more details, please email the corresponding author at jianjunh@cse.sc.edu.
Datasets used in our work can be found in the data.zip file. The targets.csv files contain the ground truth properties. We used three datasets from the MatBench study. For simplicity, we refer to the matbench_dielectric dataset as the ‘dielectric dataset’, the matbench_log_gvrh dataset as the ‘elasticity dataset’, and the matbench_perovskites dataset as the ‘perovskites dataset’.
In this work, we specifically concentrate on instances where the target set comprises no labeled samples. Accordingly, we propose the following target set generation methods to simulate real-world conditions for materials property prediction by creating 50 different folds for each method, where the test set for each fold differs in distribution from the train set.
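The idea of a test set that differs in distribution from the training set can be illustrated with a leave-one-cluster-out split, which is what the LOCO name is assumed to denote here. The sketch below is not the code used in this work. It takes hypothetical, precomputed cluster labels (one per material) and holds out one cluster at a time as the test set, so 50 clusters would give 50 folds. The validation split and the other four methods are omitted.

```python
from collections import defaultdict


def leave_one_cluster_out(cluster_labels):
    """Yield (held_out_cluster, train_indices, test_indices) for each cluster.

    cluster_labels is a hypothetical input: one cluster label per material,
    produced by any clustering of the materials.
    """
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    for label in sorted(members):
        test = members[label]
        held_out = set(test)
        # Every material outside the held-out cluster goes to training.
        train = [i for i in range(len(cluster_labels)) if i not in held_out]
        yield label, train, test


# Toy labels for six materials in three clusters (illustrative only).
labels = [0, 0, 1, 2, 2, 1]
folds = list(leave_one_cluster_out(labels))
```

Each fold's test set comes entirely from one cluster that the model never sees during training, which is the property the target set generation methods aim for.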
The train, validation, and test data for each fold of each dataset can be found in the folds.zip file. The paths to the fold files for each target category are given below (the possible options are written in curly braces):
folds/{dielectric,elasticity,perovskites}_folds/{train,val,test}/OFM_dielectric_{LOCO,SparseXcluster,SparseYcluster,SparseXsingle,SparseYsingle}_target_clusters50_{train,val,test}.json
These JSON files contain the IDs of the materials to be used in each of the 50 folds, for each target generation method and for each dataset. To see how the five target generation methods were implemented, unzip the target_generation.zip file.
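As a convenience, the path pattern above can be expanded programmatically. The following is a minimal sketch, assuming folds.zip has been extracted into a local folds/ directory. The helper names (fold_file, all_fold_files, load_fold_file) are hypothetical and not part of this repository. The "OFM_dielectric_" file name prefix is copied verbatim from the pattern, and the internal layout of each JSON file is not assumed, so the loader simply returns the parsed object.

```python
import json
from itertools import product
from pathlib import Path

DATASETS = ("dielectric", "elasticity", "perovskites")
SPLITS = ("train", "val", "test")
METHODS = (
    "LOCO",
    "SparseXcluster",
    "SparseYcluster",
    "SparseXsingle",
    "SparseYsingle",
)


def fold_file(dataset, method, split, root="folds"):
    """Build one path by substituting choices into the pattern above."""
    if dataset not in DATASETS or method not in METHODS or split not in SPLITS:
        raise ValueError(f"unknown option: {dataset!r}, {method!r}, {split!r}")
    # The prefix is kept exactly as written in the pattern (an assumption
    # for the non-dielectric datasets; check the extracted archive).
    name = f"OFM_dielectric_{method}_target_clusters50_{split}.json"
    return Path(root) / f"{dataset}_folds" / split / name


def all_fold_files(root="folds"):
    """List every path the pattern expands to (3 x 5 x 3 combinations)."""
    return [fold_file(d, m, s, root) for d, m, s in product(DATASETS, METHODS, SPLITS)]


def load_fold_file(path):
    """Parse one fold file and return its contents unchanged."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, fold_file("perovskites", "SparseYsingle", "test") gives the path of the test IDs for the SparseYsingle folds of the perovskites dataset, which can then be passed to load_fold_file.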
- Sadman Sadeed Omee (https://www.sadmanomee.com/)
- Dr. Jianjun Hu (http://www.cse.sc.edu/~jianjunh/)