The overall goal of this project is to predict the bandgap of organic semiconductors with machine learning methods. The dataset we use is the Harvard Clean Energy Project Database. To achieve this goal, our tasks are:
- Calculate molecular descriptors for organic semiconductors from SMILES strings;
- Determine predictors for the machine learning methods by LASSO regression;
- Screen and optimize regression models;
- Build a wrapper function that helps users use our model.
RDKit is a very useful open-source package that is easy to download. Using the map calculation in the package, we can easily get thousands of descriptors from the SMILES strings, and then use several methods to screen the predictors.
For all the regression models we chose, 75% of the data is used to train the model and 25% is used to test it. To choose the model, we randomly drew several small subsets of the data, ran each model on them several times, and calculated the average statistics.
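The split-and-average procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's actual code: the function names, the `score_fn` callback, and the subsample sizes are hypothetical.

```python
import random
import statistics


def train_test_split_indices(n_samples, test_fraction=0.25, seed=None):
    """Shuffle indices 0..n_samples-1 and split them into (train, test) lists."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def average_score_over_subsamples(n_samples, subsample_size, n_repeats,
                                  score_fn, seed=0):
    """Draw several random subsamples, split each 75/25, and average the scores.

    score_fn(train_idx, test_idx) is a hypothetical callback that fits a model
    on the training rows and returns its score on the test rows.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        # Pick a small random subset of row indices from the full dataset.
        subsample = rng.sample(range(n_samples), subsample_size)
        # Split positions within the subset, then map back to dataset rows.
        train_pos, test_pos = train_test_split_indices(
            len(subsample), 0.25, seed=rng.randrange(10**9))
        train_idx = [subsample[i] for i in train_pos]
        test_idx = [subsample[i] for i in test_pos]
        scores.append(score_fn(train_idx, test_idx))
    return statistics.mean(scores)
```

In practice `score_fn` would wrap whichever regressor is being compared, so every model is evaluated with the same protocol.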
Import Linear Regression by using
from sklearn.linear_model import LinearRegression
The score of this model is 0.59.
Import Polynomial Regression by using
from sklearn.preprocessing import PolynomialFeatures
The score of this model is 0.54.
Import Random Forest Regression by using
from sklearn.ensemble import RandomForestRegressor
The score of this model is 0.67.
Import Keras by using
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
The score of this model is 0.57.
*Note: TensorFlow is required!
The scatter plots of the predicted bandgap for the 4 models are as follows.
The calculated statistics are as follows:

Error | Multiple Linear | Random Forest | Polynomial | Neural Network |
---|---|---|---|---|
MSE | 0.0450 | 0.0357 | 0.0503 | 0.1732 |
MAE | 0.1659 | 0.1425 | 0.1728 | 0.3349 |
MAPE | 0.0928 | 0.0792 | 0.0959 | 0.1883 |
R² | 0.5933 | 0.6772 | 0.5458 | 0.5665 |
Kfold | 0.5906 | 0.6819 | 0.5906 | -0.0620 |
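For reference, the error measures in the table can be computed as in the sketch below. It uses the standard textbook definitions and assumes MAPE is reported as a fraction rather than a percentage, which is consistent with the magnitudes in the table. The function name is hypothetical and this is not necessarily how the project computed them.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, MAPE (as a fraction) and R^2 for two equal-length lists.

    Assumes no true value is zero (needed for MAPE) and that the true values
    are not all identical (needed for R^2).
    """
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n            # mean squared error
    mae = sum(abs(e) for e in errors) / n           # mean absolute error
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "MAE": mae, "MAPE": mape, "R2": r2}
```

A model that always predicts the mean of the true values gets an R² of 0 under this definition, and a negative value means the model does worse than that baseline.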
Based on the figure and the table above, we chose Random Forest Regression as our prediction model. By optimizing it, we have a really good model, which has
$ conda create -c rdkit -n my-rdkit-env rdkit
This is the recommended method, and it is how we installed RDKit. Here is the link for RDKit. Note: the RDKit package must be installed on your computer, and you may need to download it manually.
1. $ pip install git+https://github.com/xuliang2019/Orc_Band.git