lab-cosmo/glosim

parameters to reproduce the prediction on QM7b with an average kernel

lixinyuu opened this issue · 2 comments

Hi Dr Sandip and Professor Ceriotti,
I want to reproduce your Science Advances work on the QM7b dataset. It looks like I did some steps wrong, because I always get an MAE higher than 8 kcal/mol. Could you please give me some tips about where I went wrong?
What I did is as follows.
1. I use

python glosim.py qm7.xyz -n 9 -l 9 -g 0.3 -c 3 --zeta 2 --peratom --kernel average

to generate the kernel matrix. According to the SI of your paper, the parameters should be -n 9 -l 9 -g 0.3 -c 3 --zeta 2 --periodic --norm, but I replaced --periodic with --peratom and dropped --nonorm because glosim.py has been updated.
2. I use ShuffleSplit from scikit-learn to get the training kernel matrix and the test kernel matrix. (I used a random split instead of choosing the training set by FPS.)

from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)  # 10 independent random splits
for train_index, test_index in rs.split(X):
    X_train, y_train = X[train_index][:, train_index], y[train_index]
    X_test, y_test = X[test_index][:, train_index], y[test_index]

3. I fit a Gaussian process regression on X_train with different regularization parameters and use it to predict on X_test.
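
For reference, here is a minimal sketch of steps 2 and 3 combined, using kernel ridge regression with a precomputed kernel in place of the GPR (the file names are placeholders for the glosim output and the property file):

import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error

X = np.loadtxt("qm7.k")         # (n, n) global kernel matrix from glosim.py (placeholder name)
y = np.loadtxt("energies.dat")  # one target value per structure (placeholder name)

rs = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
for train, test in rs.split(X):
    K_train = X[train][:, train]            # train-train kernel block
    K_test = X[test][:, train]              # test-train kernel block
    for alpha in (1e-3, 1e-2, 1e-1):        # regularization parameters to scan
        model = KernelRidge(alpha=alpha, kernel="precomputed")
        model.fit(K_train, y[train])
        mae = mean_absolute_error(y[test], model.predict(K_test))
        print(alpha, mae)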

The lowest MAE I got was 8 kcal/mol with the average kernel (cutoff radius = 4 Å). The random split may contribute to the error, but I don't think it can explain such a high MAE. Could you please give me some suggestions?

An MAE of 8 kcal/mol is definitely not even close to the right ballpark. What are you using for training? Make sure you learn the energies per atom, not the total atomization energy (i.e., that you divide the QM7 energies by the size of each molecule).
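
For example, a sketch of that conversion (hypothetical file names; ASE is used here only to count the atoms in each structure):

import numpy as np
from ase.io import read

y_total = np.loadtxt("energies.dat")                                 # total atomization energies (placeholder)
natoms = np.array([len(mol) for mol in read("qm7.xyz", index=":")])  # atoms per molecule

y = y_total / natoms   # per-atom energies: the quantity to learn on
# After predicting per-atom energies on the test set, scale back before computing the MAE:
# mae = np.mean(np.abs(y_pred_per_atom * natoms[test] - y_total[test]))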

Thanks Prof. Ceriotti, that was the point. I was learning the total atomization energy instead of the energies per atom. Now the MAE has been reduced to 0.7 kcal/mol. Thank you. Have a good weekend.