This project is a continuation and methodological update of the work of Ramankutty et al. (2008). We combine subnational-level census data and national-level FAOSTAT data to develop a global spatial dataset of croplands and pastures on a 5 arcminute (~10 km) graticule. The repository is organized as follows:
- configs (user-specified settings)
- utils (helper functions and tools)
- census_processor (country class files for countries with subnational data, plus data-loading helpers)
- gdd (scripts and data for gdd filter mask)
- land_cover (scripts and data for land cover maps)
- models (scripts and pre-trained model weights)
- FAOSTAT_data (FAOSTAT dataset)
- subnational_stats (subnational census dataset)
- shapefile (shapefile data from GADM)
- evaluation (code and results for evaluating map predictions against independent sources)
- experiments (a collection of mlflow experiment scripts)
- outputs (a collection of experiment results)
- docs/source (results figures and visualization scripts)
- Option 1 - PIP
- Ubuntu users can install the requirements directly with:
pip install -r requirements.txt
- Option 2 - Docker
- Dockerfile
- if you encounter issues importing "gdal_array", a fix is included
We use subnational data whenever it is available and fill in with national-level data from FAOSTAT elsewhere. The census and FAOSTAT data are therefore merged to generate the input dataset for our machine learning model. During the merging process, two filters are applied: a NaN filter and a GDD filter.
- NaN filter
- Remove samples with NaN in either CROPLAND or PASTURE attribute (e.g. this happens if we had data for cropland but not for pasture for this unit)
- GDD filter
- Remove samples that geographically lie within the GDD mask
Note: prior to training, samples whose CROPLAND and PASTURE values sum to more than 100% are also scaled down so the two fractions still form a valid probability distribution (a minimal sketch of the filtering and scaling appears after the command below). To run the census pipeline, adjust the yaml files in /configs and run:
python census.py
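The filters and the scaling step can be illustrated with a minimal pandas sketch. The DataFrame layout, the `gdd_masked` column, and the function name are assumptions made for illustration, not the project's actual schema:

```python
import pandas as pd

def prepare_census_samples(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative version of the census filtering/scaling steps.

    Assumes CROPLAND and PASTURE are percentages of the unit area and
    gdd_masked flags units that fall inside the GDD mask (hypothetical
    column names, not the project's real schema).
    """
    # NaN filter: drop units missing either cropland or pasture
    df = df.dropna(subset=["CROPLAND", "PASTURE"])

    # GDD filter: drop units that lie within the GDD mask
    df = df[~df["gdd_masked"]].copy()

    # Scale units whose cropland + pasture exceeds 100% so the two
    # fractions remain a valid probability distribution
    total = df["CROPLAND"] + df["PASTURE"]
    over = total > 100.0
    df.loc[over, ["CROPLAND", "PASTURE"]] = (
        df.loc[over, ["CROPLAND", "PASTURE"]].div(total[over] / 100.0, axis=0)
    )
    return df
```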
A visualization of the census inputs is also provided below.
All training-related configs can be found under /configs/training_cfg.yaml. Note that feature selection can be enabled by specifying features (i.e. land cover types) to remove. Removing a land cover feature does not simply drop it; instead, a factor of 1/(1 - removed_class_sum) is applied to the remaining features so they still form a probability distribution (see the sketch after the training command below). All implementation details can be found here. We employ several variations of gradient-boosted tree models with cross-validation. To start training, run:
python train.py
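The rescaling described above can be sketched as follows. This is a minimal illustration, assuming the land cover features are fractional columns of a NumPy array; the function and argument names are hypothetical:

```python
import numpy as np

def remove_features(X: np.ndarray, removed_idx: list[int]) -> np.ndarray:
    """Drop selected land cover columns and rescale the rest by
    1 / (1 - removed_class_sum) so each row still sums to 1.

    X: (n_samples, n_land_cover_classes) array of fractional covers.
    Assumes the removed classes never cover an entire sample.
    """
    removed_sum = X[:, removed_idx].sum(axis=1)              # per-sample removed fraction
    keep = [i for i in range(X.shape[1]) if i not in removed_idx]
    factor = 1.0 / (1.0 - removed_sum)                       # rescaling factor
    return X[:, keep] * factor[:, None]
```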
During deployment, 20 x 20 blocks of 500 m MODIS grid cells are used as inputs to the model (the detailed process is explained under Prediction Input and Aggregation; a rough sketch follows the deployment command below). Deployment configs can be modified under /configs/deploy_setting_cfg.yaml. Make sure the deploy configs are aligned with the training configs. The post-processing implementation can be found here. To generate the final cropland and pasture maps, run:
python deploy.py
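As a rough illustration of the block-aggregation idea (not the project's exact code), each 20 x 20 block of 500 m cells can be reduced to per-class land cover fractions that the model consumes. The sketch assumes a class-coded 2-D array whose shape is a multiple of the block size:

```python
import numpy as np

def block_class_fractions(lc: np.ndarray, n_classes: int, block: int = 20) -> np.ndarray:
    """Aggregate a class-coded 500 m grid into per-block class fractions.

    lc: 2-D array of integer land cover class codes (illustrative input).
    Returns an array of shape (rows // block, cols // block, n_classes).
    """
    rows, cols = lc.shape
    out = np.zeros((rows // block, cols // block, n_classes))
    for i in range(rows // block):
        for j in range(cols // block):
            tile = lc[i * block:(i + 1) * block, j * block:(j + 1) * block]
            counts = np.bincount(tile.ravel(), minlength=n_classes)
            out[i, j] = counts / tile.size   # fraction of each class in the block
    return out
```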
All visualization scripts are placed under /docs/source/scripts/. Make sure the project path ${workspaceFolder} is added to PYTHONPATH, then run:
cd docs/source/scripts/
python SCRIPT_TO_RUN [FLAG] [ARG]
The final complete dataset can be found in /outputs/all_correct_to_FAO_scale_itr3_fr_0/agland_map_output_3.tif (where the numerical suffix corresponds to the iteration number). Users may also want the output data already disaggregated into separate cropland and pasture layers.
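To inspect the output map, the GeoTIFF can be read with rasterio. This is a minimal sketch; the assumption that cropland and pasture are stored as the first two bands should be checked against the file's metadata:

```python
import rasterio

# Path to the final map in the repository's outputs directory
path = "outputs/all_correct_to_FAO_scale_itr3_fr_0/agland_map_output_3.tif"

with rasterio.open(path) as src:
    print(src.count, src.crs, src.res)   # number of bands, CRS, pixel size
    # Assumed band order (verify with src.descriptions): 1 = cropland, 2 = pasture
    cropland = src.read(1)
    pasture = src.read(2)
```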
Mehrabi, Z., Tong, K., Fortin, J., Stanimirova, R., Friedl, M., and Ramankutty, N.: Geospatial database of global agricultural lands in the year 2015, Zenodo [dataset], https://doi.org/10.5281/zenodo.11540554, 2024.