ppdebreuck/modnet

Does MODNet have facilities for including state variables such as temperature or pressure?

sgbaird opened this issue · 8 comments

I was looking through the docs and example notebooks and didn't see where this information might fit in with the typical pipelines, but maybe I missed something.

Hi @sgbaird,

At this stage there is nothing that "easily" includes state variables (at least not explicitly). Two quick solutions exist, though. If the property is available over a fixed range of the state variable (e.g. a temperature-dependent property), it can be used as a vector of targets (multi-property). Another (explicit) way is to append the state to the generated features.

Hi @ppdebreuck,

Thanks for the quick reply! For the first one, it sounds like you mean adding the property at each temperature as an additional target? For the second one, perhaps I can just append a column to the MODData.df_featurized attribute?

class MODData:
    """The MODData class takes a list of `pymatgen.Structure`
    objects and creates a `pandas.DataFrame` that contains many matminer
    features per structure. It then uses mutual information between
    features and targets, and between the features themselves, to
    perform feature selection using relevance-redundancy indices.

    Attributes:
        df_structure (pd.DataFrame): dataframe storing the `pymatgen.Structure`
            representations for each structure, indexed by ID.
        df_targets (pd.DataFrame): dataframe storing the prediction targets
            per structure, indexed by ID.
        df_featurized (pd.DataFrame): dataframe with columns storing all
            computed features per structure, indexed by ID.
        optimal_features (List[str]): if feature selection has been performed
            this attribute stores a list of the selected features.
        optimal_features_by_target (Dict[str, List[str]]): If feature selection has been performed
            this attribute stores a list of the selected features, broken down by target property.
        featurizer (MODFeaturizer): the class used to featurize the data.
        __modnet_version__ (str): The MODNet version number used to create the object
        cross_nmi (pd.DataFrame): If feature selection has been performed, this attribute
            stores the normalized mutual information between all features.
        feature_entropy (Dictionary): Information entropy of all features. Only computed after a call to compute cross_nmi.
        num_classes (Dictionary): Defining the target types (classification or regression).
            Should be constructed as follows: key: string giving the target name; value: integer n,
            with n=0 for regression and n>=2 for classification with n the number of classes.
    """

Maybe something like the following:

from modnet.preprocessing import MODData
from modnet.models import MODNetModel

# structures, targets, temperatures, target_hierarchy, weights,
# new_structures and new_temperatures are assumed to be defined beforehand.

# Creating MODData
data = MODData(materials=structures,
               targets=targets,
               )
data.featurize()
# Add the state variable as an extra feature column (one value per sample,
# in the same order as the rows of df_featurized)
data.df_featurized["T"] = temperatures
data.feature_selection(n=200)

# Creating MODNetModel
model = MODNetModel(target_hierarchy,
                    weights,
                    num_neurons=[[256],[64,64],[32]],
                    )
model.fit(data)

# Predicting on unlabeled data
data_to_predict = MODData(new_structures)
data_to_predict.featurize()
data_to_predict.df_featurized["T"] = new_temperatures
df_predictions = model.predict(data_to_predict)  # returns dataframe containing the prediction on new_structures

modified from Getting Started

I haven't tried this yet, but if it seems reasonable I will probably give it a go later today.
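One pandas detail that matters for the "append a column" step above: adding a feature means assigning a new column, and when the values are a Series the assignment aligns on the index rather than on row order. The sketch below uses a small made-up dataframe as a stand-in for data.df_featurized (the feature names, IDs and temperatures are invented for illustration and do not come from MODNet):

```python
import pandas as pd

# Stand-in for data.df_featurized: one row per sample ID, one column per feature.
# Feature names and values are made up for illustration.
df_featurized = pd.DataFrame(
    {"feat_a": [0.1, 0.2, 0.3], "feat_b": [1.0, 2.0, 3.0]},
    index=["id0", "id1", "id2"],
)

# Temperatures keyed by the same IDs, deliberately listed in a different order.
temperatures = pd.Series({"id2": 500.0, "id0": 5.0, "id1": 300.0})

# Assigning a Series aligns on the index, so each row gets its own temperature.
df_featurized["T"] = temperatures

print(df_featurized)
```

If the temperatures are a plain list or array instead, they are assigned positionally, so they need to be in the same order as the rows.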

For solution (1), yes the idea would be to have one target per temperature, like the thermodynamical data notebook.

# Creating MODNetModel
model = MODNetModel([[["S_5K","S_300K","S_500K"]]],
                    {"S_5K":1,"S_300K":1,"S_500K":1},
                    num_neurons=[[256],[64],[64],[32]],
                    )

This comes with a few limitations: the temperature dependence is only implicit, the temperatures are fixed, every target should be available for each sample, and training is slower.
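As a rough sketch of how the two arguments above could be generated for an arbitrary list of fixed temperatures, assuming the "S_<T>K" naming from the snippet and equal weights (the variable names here are only illustrative):

```python
# Fixed temperatures (in K) at which the property is available for every sample.
temperatures_K = [5, 300, 500]

# One target name per temperature, following the "S_<T>K" pattern used above.
target_names = [f"S_{t}K" for t in temperatures_K]

# All targets sit together in a single group of a single hierarchy level.
target_hierarchy = [[target_names]]

# Equal weight for every target.
weights = {name: 1 for name in target_names}

print(target_hierarchy)
print(weights)
```

These two objects have the same shape as the first two arguments passed to MODNetModel in the snippet above.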

I would indeed try what you suggested.

It took some time, but I got it figured out and made an example notebook (see the PR above)

Cool! Indeed, option (1) was infeasible here. Thanks for this addition. A simple hyperparameter optimization might be worth adding as an example:

from modnet.hyper_opt import FitGenetic
ga = FitGenetic(train)
model = ga.run(refit=0, nested=0, size_pop=10, num_generations=3, n_jobs=20) 
# size_pop, num_generations and n_jobs can be increased if computational power available

This avoids dealing with the model setup (number of neurons etc.), takes around 5 minutes to run, and lowers the MAE to about 2.2.

Btw, are any benchmarking results available for this dataset?

@ppdebreuck thanks!

Can you use both hyper_opt and EnsembleMODNetModel simultaneously? I'm guessing this just means using hyper_opt and then passing a list of the optimized parameters to EnsembleMODNetModel. I tried the EnsembleMODNetModel (no hyper_opt) and got a test MAE of around 3.1 and a test R^2 of 0.81.

As for benchmarking, in VickersHardnessPrediction/hv_predictions.py they split data according to:

train_test_split(train_size=0.9, test_size=0.1, random_state=100, shuffle=True)

They use XGBoost with recursive feature elimination (RFE) on physical descriptors. In the paper, they report an MSE of 5.7 GPa (i.e. an RMSE of about 2.4) and an R-squared value of 0.97 (see also the parity plots in Figure 2 of 10.1002/adma.202005112). The scripts they give in the repo aren't in a working state, and it looks like a decent bit of work to resolve all the errors. If I don't get a response I might continue trying to refactor the repo. I'm also not sure whether the repo is a reproducer for the paper results, so I wanted to run it myself.

Btw, it looks like modnet.hyper_opt isn't included in 0.1.11, so I used:
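For reference, the MSE-to-RMSE conversion quoted above is just a square root. The sketch below also computes MAE and RMSE on a tiny set of made-up errors, as a reminder that RMSE is never smaller than MAE for the same predictions, so the MAE values in this thread and the paper's RMSE are not directly comparable:

```python
import math

# Reported MSE from the paper, as quoted above.
mse_reported = 5.7
rmse_reported = math.sqrt(mse_reported)
print(f"RMSE from reported MSE: {rmse_reported:.2f}")

# Made-up prediction errors, purely to illustrate MAE vs RMSE.
errors = [1.0, -1.0, 4.0]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print(f"MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```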

pip install git+https://github.com/ppdebreuck/modnet@master

I know that MEGNet is geared towards state variables, but MEGNet only takes structures as inputs, not compositions. It can be paired with something like BOWSR, but I'd only imagine that working for single-phase structures (i.e. it is sort of a nonsensical physical representation if alloys are involved).

FitGenetic.run() will in fact always return an EnsembleMODNetModel, with the ensemble depending on the refit and nested arguments.

  • If refit = 0: no refitting is done. Fitted models from the (nested) validation are simply reused. An ensemble is constructed from the best architecture over the inner folds (thus size 1 if nested=0, and size x if nested=x).
  • If refit = x > 0: the best parameters are refitted x times. Thus, an ensemble of x refitted models is returned. All models have the same architecture (i.e. the best one found by the GA). This is exactly what you want, I think. (Using refit=1 would just be one MODNetModel inside the ensemble container.)
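The two rules above can be restated as a small function. This is only a paraphrase of the description in this comment, not code from MODNet; expected_ensemble_size is a hypothetical helper made up for illustration:

```python
def expected_ensemble_size(refit: int, nested: int) -> int:
    """Number of models in the returned ensemble, following the rules above.

    Hypothetical helper for illustration only; it is not part of modnet.
    """
    if refit < 0 or nested < 0:
        raise ValueError("refit and nested must be non-negative")
    if refit > 0:
        # Best parameters are refitted `refit` times.
        return refit
    # No refitting: reuse the fitted models from the (nested) validation.
    return nested if nested > 0 else 1


print(expected_ensemble_size(refit=0, nested=0))
print(expected_ensemble_size(refit=0, nested=5))
print(expected_ensemble_size(refit=3, nested=0))
```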

Thanks for the info! Yep, we need to clean things up a bit and make a new release on PyPI when we find time :p

Ah, gotcha. Thank you!