The dataset I am working with contains geographical and genetic data about different varieties of soybeans. I am using various regression models to predict which growing conditions and plant varieties produce optimal yields.
Different attempts at predicting yield live in different branches, with supporting files on the master branch. The completed branches so far are:
- univariate-linear-regression - Using scikit-learn to predict yield based
  only on temperature.
  Result: 151.16 mean squared error
- multivariate-linear-regression - Using scikit-learn to predict yield based
  on all given geographical data, including features like topsoil, soil pH
  level, irrigation, precipitation, and many others.
  Result: 102.72 mean squared error
- xgboost - Using XGBoost to predict yield from all given geographical data
  with gradient boosting. XGBoost has proven quite successful in recent
  Kaggle competitions.
  Result: 76.47 mean squared error
- xgboost-with-feature-engineering - On top of XGBoost, I have one-hot
  encoded a few columns, including year, plant variety, and plant family.
  Result: 53.95 mean squared error on only 1/5 of the training data.
  Improvement!!
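The linear-regression branches above follow the standard scikit-learn fit/predict/score pattern. This is a minimal sketch of the univariate case using synthetic data; the actual dataset, feature names, and the true temperature-yield relationship are stand-ins, not values from the repo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: yield as a noisy linear
# function of temperature (coefficients here are made up).
temperature = rng.uniform(15, 35, size=(200, 1))
crop_yield = 3.0 * temperature[:, 0] + rng.normal(0, 5, size=200)

# Fit a one-feature linear model and score it with MSE, mirroring
# the metric reported for each branch.
model = LinearRegression().fit(temperature, crop_yield)
predictions = model.predict(temperature)
mse = mean_squared_error(crop_yield, predictions)
```

The multivariate branch is the same pattern with a wider feature matrix (one column per geographical feature) passed to `fit`.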
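The xgboost branches use XGBoost's `XGBRegressor`. As a sketch of the same gradient boosting idea without requiring the xgboost package, this example uses scikit-learn's `GradientBoostingRegressor` on synthetic data; the feature matrix and target function are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic nonlinear data: 5 stand-in "geographical" features,
# only the first two actually drive the target.
X = rng.uniform(0, 1, size=(500, 5))
y = 10 * np.sin(np.pi * X[:, 0]) + 5 * X[:, 1] ** 2 + rng.normal(0, 0.5, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient boosting fits an ensemble of shallow trees, each one
# correcting the residuals of the previous ones.
booster = GradientBoostingRegressor(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=0
)
booster.fit(X_train, y_train)
boost_mse = mean_squared_error(y_test, booster.predict(X_test))
```

With xgboost installed, `xgboost.XGBRegressor` drops into the same fit/predict slots; the nonlinear target here is exactly the kind of relationship that drove the jump from 102.72 to 76.47 MSE over the linear model.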
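The feature-engineering branch one-hot encodes year, plant variety, and plant family so the booster can split on category membership rather than on arbitrary integer codes. A minimal sketch with `pandas.get_dummies`; the rows and exact column names here are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows mirroring the columns mentioned above.
df = pd.DataFrame({
    "year": [2018, 2019, 2018],
    "plant_variety": ["A", "B", "A"],
    "plant_family": ["f1", "f1", "f2"],
    "temperature": [21.5, 23.0, 19.8],
})

# Expand each categorical column into one 0/1 indicator column
# per observed category, leaving numeric columns untouched.
encoded = pd.get_dummies(
    df, columns=["year", "plant_variety", "plant_family"]
)
```

Each categorical column becomes one indicator column per level (e.g. `year_2018`, `year_2019`), so a tree can isolate a single variety or season in one split.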