This report aims to interpret the importance of features of the pumpkin seed data set in predicting the class of the seed (Çerçevelik or Ürgüp Sivrisi) across three machine learning algorithms:
- Logistic Regression
- Random Forest
- Light GBM
These are the independent features(X) of the pumpkin data set:
eccentricity, solidity, extent, roundness, aspect_ration, compactness
The coefficients and features from the logistic regression from highest to lowest magnitude:
Features | coefficient | magnitude |
---|---|---|
convex_area | 2.591466 | 2.591466 |
aspect_ration | 1.785008 | 1.785008 |
solidity | 0.703159 | 0.703159 |
major_axis_length | 0.346242 | 0.346242 |
extent | 0.018317 | 0.018317 |
minor_axis_length | -0.230135 | 0.230135 |
roundness | -1.034266 | 1.034266 |
perimeter | -1.161847 | 1.161847 |
equiv_diameter | -1.547765 | 1.547765 |
eccentricity | -1.597774 | 1.597774 |
compactness | -1.969038 | 1.969038 |
According to the Logistic regression model, the convex
feature has the highest co-efficient. This DOES NOT mean that it affects the model the result of the model's prediction positively. The compactness
feature has the lowest co-efficient. This also This DOES NOT that it affects the model the result of the model's prediction negatively. We will put this to a test in the next section.
From the experiment in the notebook, it was discovered that the coefficients of the model can be interpreted thus:
- Increasing the
compactness
is likely to push the prediction towardsÇerçevelik
- Increasing the
convex area
is likely to push the prediction towardsÜrgüp Sivrisi
- Decreasing the
compactness
is likely to push the prediction towardsÜrgüp Sivrisi
- Decreasing the
convex area
is likely to push the prediction towardsÇerçevelik
The feature importance of the random forest model was found using two methods:
- The random forest
feature_importances_
method - ELI5
.feature_importances_ method
Features | importance |
---|---|
aspect_ration | 0.204342 |
eccentricity | 0.200888 |
compactness | 0.175086 |
roundness | 0.106204 |
major_axis_length | 0.070736 |
solidity | 0.053879 |
minor_axis_length | 0.042710 |
extent | 0.042271 |
perimeter | 0.036539 |
equiv_diameter | 0.033754 |
ELI5
Weights | Features |
---|---|
0.2043 ± 0.4834 | aspect_ration |
0.2009 ± 0.4856 | eccentricity |
0.1751 ± 0.4416 | compactness |
0.1062 ± 0.3060 | roundness |
0.0707 ± 0.1798 | major_axis_length |
0.0539 ± 0.0288 | solidity |
0.0427 ± 0.0697 | minor_axis_length |
0.0423 ± 0.0398 | extent |
0.0365 ± 0.0478 | perimeter |
0.0338 ± 0.0230 | equiv_diameter |
0.0336 ± 0.0284 | convex_area |
- Both methods ranked
aspect_ration
as the feature that affects the model's prediction the most andequiv_diameter
as the 10th most important feature.
From the experiment in the notebook, it was discovered that the coefficients of the model can be interpreted thus:
- Increasing the
aspect_ration
is likely to push the prediction towardsÜrgüp Sivrisi
- Increasing the
equiv_diameter
is likely to push the prediction towardsÜrgüp Sivrisi
- Decreasing the
aspect_ration
is likely to push the prediction towardsÇerçevelik
- Decreasing the
equiv_diameter
is likely to push the prediction towardsÇerçevelik
Light GB: Feature Importance in Gradient Boosting Model
- Solidity is pushing the prediction towards an higher raw score
- While compactness, aspect_ration and roundness are pushing the prediction towards a lower raw score
- Roundness is pushing the prediction towards an higher raw score
- While compactness, aspect_ration, solidity and eccentricity are pushing the prediction towards a lower raw score