Sorption MOFs SL

This repository contains codes related to the publication "Minimal crystallographic descriptors of sorption properties in hypothetical MOFs and role in sequential learning optimization" (https://www.nature.com/articles/s41524-022-00806-7). Datasets and trained pipelines are published on our Zenodo repository https://doi.org/10.5281/zenodo.6351366.

In particular:

Folder Models training + SHAP contains four .ipnyb files (one for each of the target properties of interest) to train a Random Forest based pipeline with hyperparameter tuning in 5-fold cross validation + SHAP analysis for detecting important features;
Folder Sequental learning contains the code for running the SL (three .m files for Kriging, one .ipnyb file for Random-Forest- and COMBO-based methodologies);
Folder Variable importances contains the complete ranking of the features, used to train the RF-regression models, provided by the SHAP analysis (for their meaning, please refer to "Machine learning with force-field inspired descriptors for materials: fast screening and mapping energy landscape", doi: 10.1103/physrevmaterials.2.083801);
File 2D Map & Database optimum.ipynb contains the Database optimum in terms of the specific energy, its thermodynamic ideal cycle (Fig. 7 of the paper), and the comparative 2D Map (Fig. 6 of the paper);
File policy.py is our modified version of the same file in the COMBO package https://github.com/tsudalab/combo (path combo/search/discrete/policy.py), to have, at each iteration, 100 evaluations of the next point to be tested, picking only the most preferred one.

Datasets creation

We constructed the four MOFs datasets (published here https://doi.org/10.5281/zenodo.6351366), each one with the same 1557 features and a different target property among Henry coefficient for CO2 (column name 'henry_coefficient_CO2_298K [mol/kg/bar]'), working capacity for CO2 (column name 'working_capacity_vacuum_swing_REPEAT_chg [mmol/g]'), Henry coefficient for H2O (column name 'henry_coefficient_H2O_298K [mol/kg/bar]') and surface area (column name 'surface_area [m^2/g]'), taking from here https://archive.materialscloud.org/2018.0016/v3

the properties of interest, screening_data.tar.gz, file top_MOFs_screening_data.csv containing over 8000 potential MOFs
the descriptors from the featurization (see below) of the corresponding over 8000 CIF files among the 300000 in MOF_database.tar.gz

Usage/Examples

To use one of the RF-pretrained models for doing new predictions:

Featurize your Crystallographic Information Files (CIFs)

import matminer
from matminer.featurizers.structure import JarvisCFID
import tqdm
import pandas as pd
import pymatgen as mp
import os

cif_path = 'type the path of the folder containing your CIFs'
cif_files = os.listdir(cif_path) 
jarvis = JarvisCFID()

jarvis_features = []

for cif in tqdm.tqdm(cif_files):
    cif_struc = mp.Structure.from_file(cif_path + cif)
    cif_feature = jarvis.featurize(cif_struc)
    jarvis_features.append(cif_feature)

Matminer_labels = jarvis.feature_labels()
Data = pd.DataFrame(jarvis_features, index = cif_files, columns = Matminer_labels)

Go on our Zenodo https://doi.org/10.5281/zenodo.6351366 and download the model you are interested in, then import the custom class for dropping the most correlated features, i.e.,

from sklearn.base import TransformerMixin, BaseEstimator
from joblib import dump, load

class MyDecorrelator(BaseEstimator, TransformerMixin): 
    
    def __init__(self, threshold):
        self.threshold = threshold
        self.correlated_columns = None

    def fit(self, X, y=None):
        correlated_features = set()  
        X = pd.DataFrame(X)
        corr_matrix = X.corr()
        for i in range(len(corr_matrix.columns)):
            for j in range(i):
                if abs(corr_matrix.iloc[i, j]) > self.threshold: # we are interested in absolute coeff value
                    colname = corr_matrix.columns[i]  # getting the name of column
                    correlated_features.add(colname)
        self.correlated_features = correlated_features
        return self

    def transform(self, X, y=None, **kwargs):
        return (pd.DataFrame(X)).drop(labels=self.correlated_features, axis=1)

Henry_H2O_model = load('Henry_H2O_model.joblib')

Predict with

Henry_H2O_model.predict(Data)

Otherwise, to use one of the AutoMatminer pretrained pipelines (supplementary material of the paper), download the one you are interested in from our Zenodo https://doi.org/10.5281/zenodo.6351366, and, after the featurization step, follow the instructions here https://hackingmaterials.lbl.gov/automatminer/basic.html#making-predictions.

Citations

If you find useful this repository for your research we would appreciate a citation to our paper: https://www.nature.com/articles/s41524-022-00806-7

giotre/MOFs

Sorption MOFs SL

Datasets creation

Usage/Examples

Citations