Biomass Forecasting and Supply Chain Optimization for the Shell.ai Hackathon 2023
To get started, make sure you have Python 3.10 installed and follow these steps:
- Install dependencies: `pip install -r requirements.txt`
- Set the notebook kernel to the right environment.
You can find the code for generating the biomass forecast in the generate_forecast.ipynb notebook.
To improve data accuracy, the notebook identifies duplicated values recorded before the 2014 census and fills them in.
The clustering process is primarily based on district names, followed by checking correlations for each index within each district. Each index is assigned to the district with the highest Pearson correlation.
A table containing crop production data for each district is created based on Desagri data. Missing values before 2014 are filled in using production conservation ratios.
Cropland data from EarthStat and elevation data from NASA Earth Observations (NEO) are integrated into the analysis.
The model pipeline consists of a MaxAbsScaler and an ExtraTreeRegressor. Cross-validation is performed for each year based on all other years, with the following results:
Year | Test MAE |
---|---|
2010 | 22.6 |
2011 | 19.4 |
2012 | 27.7 |
2013 | 32.9 |
2014 | 24.9 |
2015 | 20.8 |
2016 | 29.1 |
2017 | 29.6 |
Avg | 25.9 |
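The per-year validation scheme can be sketched with scikit-learn as below. The data is synthetic and the hyperparameters are placeholders. The sketch also assumes the ensemble `ExtraTreesRegressor`; the text above names ExtraTreeRegressor, and scikit-learn also ships a single-tree `sklearn.tree.ExtraTreeRegressor`, so swap it in if that is what the notebook uses.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic stand-in data: 50 samples per year, 4 features (illustration only).
rng = np.random.default_rng(0)
years = np.repeat(np.arange(2010, 2018), 50)
X = rng.random((years.size, 4))
y = 100 * X[:, 0] + 10 * rng.random(years.size)

mae_by_year = {}
# Each split holds out one year and trains on all the others.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=years):
    model = make_pipeline(
        MaxAbsScaler(),
        ExtraTreesRegressor(n_estimators=50, random_state=0),
    )
    model.fit(X[train_idx], y[train_idx])
    held_out_year = int(years[test_idx][0])
    mae_by_year[held_out_year] = mean_absolute_error(
        y[test_idx], model.predict(X[test_idx])
    )

for year, mae in sorted(mae_by_year.items()):
    print(year, round(mae, 2))
```

Building the pipeline inside the loop means the scaler is fitted only on the training years of each split.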
The model, trained on historical data, is used for inference on 2018 and 2019, and the forecast is stored for further use in the optimization step.
The code for generating optimized locations can be found in the generate_optimized_locations.ipynb notebook.
The number of refineries is set so that 80% of biomass production can be collected (a problem constraint). Initial refinery positions are placed at the centers of the main biomass clusters.
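As a rough illustration of how the refinery count follows from the 80% constraint, assuming each refinery has a fixed processing capacity. The function name and the numbers below are hypothetical and are not taken from the notebook.

```python
import math


def refinery_count(total_biomass, refinery_capacity, collected_fraction=0.8):
    """Smallest number of refineries able to process the required share of biomass."""
    return math.ceil(collected_fraction * total_biomass / refinery_capacity)


# Hypothetical values: 430,000 units forecast, 100,000 units per refinery.
n_refineries = refinery_count(430_000, 100_000)
print(n_refineries)
```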
The process includes the following steps, repeated until a maximum iteration is reached:
- Start with around 60 depots spread in regions with high biomass (>200) in a random manner.
- Calculate the flux to refineries using linear optimization.
- Remove the depot that is the most underutilized.
- Stop if constraints cannot be satisfied when calculating flux from depots to refineries.
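The steps above can be sketched as a pruning loop. This is a minimal sketch under several assumptions: `solve_flux` is a hypothetical callable standing in for the linear optimization, and `toy_solve_flux` is a simple placeholder (not a linear program) that only makes the example runnable. The high-biomass filter for the starting depots is not modeled.

```python
import math
import random


def prune_depots(depots, solve_flux, max_iter=100):
    """Drop the least-used depot each round; stop when the flux problem is infeasible.

    solve_flux(depots) returns (cost, utilization), with one utilization value
    per depot, or None when the constraints cannot be satisfied.
    """
    depots = list(depots)
    best_cost, best_depots = float("inf"), None
    for _ in range(max_iter):
        result = solve_flux(depots)
        if result is None:  # constraints can no longer be met
            break
        cost, utilization = result
        if cost < best_cost:
            best_cost, best_depots = cost, list(depots)
        if len(depots) == 1:
            break
        worst = min(range(len(depots)), key=utilization.__getitem__)
        depots.pop(worst)  # remove the most underutilized depot
    return best_cost, best_depots


def toy_solve_flux(depots, demand=40.0, capacity=1.0, fixed_cost=1.0):
    """Placeholder solver: one refinery at the origin, far depots are used less."""
    if len(depots) * capacity < demand:
        return None
    distances = [math.hypot(x, y) for x, y in depots]
    cost = sum(fixed_cost + d for d in distances)
    utilization = [1.0 / (1.0 + d) for d in distances]
    return cost, utilization


rng = random.Random(0)
runs = []
for _ in range(3):  # several random starts; keep the cheapest result
    start = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(60)]
    runs.append(prune_depots(start, toy_solve_flux))

best_cost, best_depots = min(runs, key=lambda run: run[0])
print(len(best_depots), round(best_cost, 2))
```

In a real run, `solve_flux` would be replaced by the linear program that routes biomass from sites to depots and from depots to refineries.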
The final depot positions giving the best cost over all runs are extracted.
For each depot + refinery index:
- Calculate costs in multiple directions for an initial distance.
- Keep the new position if the cost is lower.
If the cost is not improved after a full loop, the distance is increased.
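The refinement described above resembles a direction search, which might look like the sketch below. The names, the number of directions, the step growth factor, and the stopping thresholds are all assumptions, and `toy_cost` is a placeholder for the real supply chain cost.

```python
import math


def refine_positions(positions, cost, step=1.0, growth=2.0, max_step=16.0,
                     n_dirs=8, max_loops=1000):
    """Move one facility at a time along fixed directions, keeping improvements."""
    positions = list(positions)
    best = cost(positions)
    directions = [
        (math.cos(2 * math.pi * k / n_dirs), math.sin(2 * math.pi * k / n_dirs))
        for k in range(n_dirs)
    ]
    for _ in range(max_loops):
        if step > max_step:
            break
        improved = False
        for i in range(len(positions)):
            for dx, dy in directions:
                x, y = positions[i]
                trial = positions.copy()
                trial[i] = (x + step * dx, y + step * dy)
                trial_cost = cost(trial)
                if trial_cost < best:  # keep the move only if it lowers the cost
                    positions, best = trial, trial_cost
                    improved = True
        if not improved:
            step *= growth  # no gain over a full loop: try a larger distance
    return positions, best


def toy_cost(positions):
    """Placeholder cost: squared distance of each facility to a fixed target."""
    target = (3.0, 4.0)
    return sum((x - target[0]) ** 2 + (y - target[1]) ** 2 for x, y in positions)


start = [(0.0, 0.0), (10.0, 10.0)]
refined, final_cost = refine_positions(start, toy_cost)
print(refined, round(final_cost, 3))
```

Because a move is accepted only when it lowers the cost, the result is never worse than the starting layout.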
Year | Forecast MAE | Optimization Cost | Score |
---|---|---|---|
2018 | 24.39 | 44,150 | 83.49 |
2019 | 30.69 | 26,786 | 83.84 |
Note that the optimization cost for 2018 is high because the same infrastructure had to be used for both years and the final submission is optimized on 2019.