Does MODNet have facilities for including state variables such as temperature or pressure?
sgbaird opened this issue · 8 comments
I was looking through the docs and example notebooks and didn't see where this information might fit in with the typical pipelines, but maybe I missed something.
Hi @sgbaird,
There is at this stage nothing that "easily" includes state variables (at least not explicitly). However, two quick solutions exist. If the property is available over a fixed range (e.g. a temperature-dependent property), it can be treated as a vector (multi-property). Another (explicit) way is to append the state variables to the generated features.
Hi @ppdebreuck,
Thanks for the quick reply! For the first one, it sounds like you mean adding the temperature as an additional target property? For the second one, perhaps I can just append a column to the ModData.df_featurized attribute?
modnet/modnet/preprocessing.py, lines 530 to 556 at 719e028
Maybe something like the following:
from modnet.preprocessing import MODData
from modnet.models import MODNetModel

# Creating MODData
data = MODData(materials=structures,
               targets=targets,
               )
data.featurize()
data.df_featurized["T"] = temperatures  # append temperature column to the features
data.feature_selection(n=200)

# Creating MODNetModel
model = MODNetModel(target_hierarchy,
                    weights,
                    num_neurons=[[256], [64, 64], [32]],
                    )
model.fit(data)

# Predicting on unlabeled data
data_to_predict = MODData(new_structures)
data_to_predict.featurize()
data_to_predict.df_featurized["T"] = new_temperatures
df_predictions = model.predict(data_to_predict)  # returns a dataframe containing the predictions on new_structures
modified from Getting Started
I haven't tried this yet, but if it seems reasonable I will probably give it a go later today.
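Since df_featurized is a pandas DataFrame, column assignment (rather than DataFrame.append, which adds rows) is the operation that attaches a state variable as a new feature column. A minimal sketch with a toy stand-in DataFrame (the column names and values here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for MODData.df_featurized (hypothetical feature names)
df_featurized = pd.DataFrame({"feat_1": [0.1, 0.2], "feat_2": [1.0, 2.0]})
temperatures = [300, 500]  # one state value per sample

# Assigning a new column keeps the existing features and adds "T"
df_featurized["T"] = temperatures
print(list(df_featurized.columns))  # ['feat_1', 'feat_2', 'T']
```

The temperature list must have one entry per row of the featurized DataFrame, in the same sample order.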
For solution (1), yes, the idea would be to have one target per temperature, as in the thermodynamic data notebook.
# Creating MODNetModel
model = MODNetModel([[["S_5K", "S_300K", "S_500K"]]],
                    {"S_5K": 1, "S_300K": 1, "S_500K": 1},
                    num_neurons=[[256], [64], [64], [32]],
                    )
This comes with a few limitations: the temperature dependence is implicit, the temperatures are fixed, a value must be available for each sample at each temperature, and training is slower.
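For concreteness, the per-temperature target table for solution (1) might be laid out as below. This is a sketch with made-up entropy values, assuming the column names used in the snippet above; the key constraint is the one just noted, that every sample needs a value at every fixed temperature:

```python
import pandas as pd

# Hypothetical per-temperature targets: one column per fixed temperature
targets = pd.DataFrame({
    "S_5K":   [0.5, 0.7],
    "S_300K": [12.0, 15.3],
    "S_500K": [20.1, 24.8],
})

# Solution (1) cannot handle gaps: check that no sample is missing a temperature
print(targets.isna().any().any())  # False
```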
I would indeed try what you suggested.
It took some time, but I got it figured out and made an example notebook (see the PR above)
Cool! Indeed, option (1) was infeasible here. Thanks for this addition. A simple hyperparameter optimization might be worth adding as an example:
from modnet.hyper_opt import FitGenetic
ga = FitGenetic(train)
model = ga.run(refit=0, nested=0, size_pop=10, num_generations=3, n_jobs=20)
# size_pop, num_generations, and n_jobs can be increased if computational power is available
which avoids dealing with the model setup (number of neurons, etc.), takes around 5 minutes to run, and lowers the MAE to roughly 2.2.
By the way, are any benchmarking results available on this dataset?
@ppdebreuck thanks!
Can you use both hyper_opt and EnsembleMODNetModel simultaneously? I'm guessing this just means using hyper_opt and then passing in a list of the optimized parameters to EnsembleMODNetModel. I tried with the EnsembleMODNetModel (no hyper_opt) and got a test MAE of around 3.1 and a test R^2 of 0.81.
As for benchmarking, in VickersHardnessPrediction/hv_predictions.py they split data according to:
train_test_split(train_size=0.9, test_size=0.1, random_state=100, shuffle=True)
They use XGBoost with recursive feature elimination (RFE) on physical descriptors. In the paper, they report an MSE of 5.7 GPa (RMSE ≈ 2.4) and an R-squared value of 0.97 (see also parity plots in Figure 2 of 10.1002/adma.202005112). The scripts they give in the repo aren't in a working state, and it looks like a decent bit of work to resolve all the errors. If I don't get a response I might continue trying to refactor the repo. I'm also not sure if the repo is a reproducer for the paper results, so I wanted to run it myself.
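As a quick sanity check on those numbers, the quoted RMSE follows directly from the reported MSE (note that an MSE of 5.7 would strictly carry units of GPa², with the RMSE in GPa):

```python
import math

mse = 5.7  # value reported in the paper (GPa^2 if hardness is in GPa)
rmse = math.sqrt(mse)
print(round(rmse, 2))  # 2.39, i.e. the "RMSE ≈ 2.4" quoted above
```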
By the way, it looks like modnet.hyper_opt isn't contained in 0.1.11, so I used:
pip install git+https://github.com/ppdebreuck/modnet@master
I know that MEGNet is geared towards state variables, but MEGNet only takes structures as inputs, not compositions. It can be paired with something like BOWSR, but I'd only imagine that working for single-phase structures (i.e. sort of a non-sensical physical representation if alloys are involved).
FitGenetic.run() will in fact always return an EnsembleModel, with the ensemble depending on the refit and nested arguments.
- If refit = 0: no refitting is done; the fitted models from the (nested) validation are simply reused. An ensemble is constructed from the best architecture over the inner folds (thus of size 1 if nested=0, and of size x if nested=x).
- If refit = x > 0: the best parameters are refitted x times, so an ensemble of x refitted models is returned. All models have the same architecture (i.e. the best found by the GA). This is exactly what you want, I think. (Using refit=1 would just be one MODNetModel inside the Ensemble container.)
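The ensemble-size rule above can be written out as a tiny standalone helper. This is not part of the modnet API, just a hypothetical function encoding the behavior described:

```python
def ensemble_size(refit: int, nested: int) -> int:
    """Hypothetical helper: size of the ensemble returned by FitGenetic.run(),
    per the rule described above (not an actual modnet function)."""
    if refit == 0:
        # Models from the (nested) validation are reused:
        # 1 model if nested == 0, otherwise one per inner fold
        return 1 if nested == 0 else nested
    # Best architecture refitted `refit` times
    return refit

print(ensemble_size(refit=0, nested=0))  # 1
print(ensemble_size(refit=0, nested=5))  # 5
print(ensemble_size(refit=3, nested=0))  # 3
```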
Thanks for the info! Yep, we need to clean things up a bit and make a new release on PyPI when we find time :p
Ah, gotcha. Thank you!