GLocal Aggregations

"GLocal" stands for global datasets that allow for international comparisons, but are geographically granular enough to capture subnational variation. This project hopes to provide datasets that support both applied and academic research.

This repository houses the code used for creating the aggregations.

Data are available at Harvard Dataverse.

Here is the directory structure of the project:

.
└── src
    ├── case_study
    ├── data_validation
    ├── google_earth_engine
    ├── raster_aggregations
    │   ├── dmsp
    │   ├── elevation
    │   ├── emissions
    │   ├── fao
    │   ├── gdelt
    │   ├── ghsl
    │   ├── ntl_dmsp_ext
    │   ├── ntl_dvnl
    │   ├── population
    │   ├── precipitation
    │   ├── ruggedness
    │   ├── solar_potential
    │   ├── telecom_mobile_coverage
    │   ├── temperature
    │   ├── vegetation
    │   ├── viirs
    │   └── wind_potential
    ├── road_lengths
    └── supporting_data
        ├── code
        └── data
            ├── clean
            └── raw
                ├── Airports
                ├── GADM
                ├── Ports
                ├── coastline
                └── rivers_lakes

A brief description of the main directories, in the order in which their code should be run:

  • src: Contains the code used for creating the aggregations.
  • src/supporting_data: Contains the code and data used for creating miscellaneous supporting data, such as the centroids of administrative areas and the locations of airports and ports.
  • src/google_earth_engine: Contains the code used for computations on Google Earth Engine.
  • src/raster_aggregations: Contains the code used for zonal statistics from raster files hosted locally.
  • src/road_lengths: Contains the code used for computing road lengths from OSM data.
  • src/data_validation: Contains the code used for data validation.
  • src/case_study: Contains the code used for the case study.

Environment setup

Python

Use conda to create the environment from src/environment.yml.

First, navigate to the project's root directory, then run:

conda env create -f src/environment.yml

This will install all the necessary Python packages.
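
After the environment is created, activate it before running any of the Python code. The environment name used below is an assumption; substitute the name defined in the name: field of src/environment.yml:

conda activate glocal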

R

We use R to run the zonal statistics so that we can use exactextractr, the R bindings to the exactextract library written in C++. Our benchmarking against several Python zonal statistics tools showed that exactextractr is the fastest, and the most feature-complete, for our use case; a minimal usage sketch follows the installation snippet below.

You will need to install the following R packages: tidyverse, arrow, exactextractr, sf, terra, raster, and here.

You can use the following snippet to install these packages:

packages <-
  c("tidyverse",
    "arrow",
    "exactextractr",
    "sf",
    "terra",
    "raster",
    "here")
install.packages(packages)
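
For reference, here is a minimal sketch of the coverage-weighted zonal statistics pattern that exactextractr enables; the raster and boundary file paths are hypothetical placeholders, not files shipped with this repository:

library(exactextractr)
library(sf)
library(terra)

# Administrative boundaries (e.g., GADM) and a raster to aggregate;
# both paths are illustrative placeholders
polygons <- st_read("raw/shapefiles/gadm/gadm_boundaries.shp")
r <- rast("raw/rasters/population/population_2020.tif")

# Coverage-fraction-weighted mean and sum of raster cells in each polygon
stats <- exact_extract(r, polygons, c("mean", "sum"))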

Configuring the Project

To get the project up and running on your local system, you'll need to adjust the settings in the src/.config file according to your environment. Follow these steps to modify the configuration file:

Open the src/.config file with a text editor of your choice. You will see several sections marked by brackets ([ ]). The comments in the file should guide you on what each parameter does and how it should be updated. For example (a hypothetical sample follows the list below):

  • PROJECT_ROOT: This should be the path to the root directory of your project. Modify it if the default does not match your directory structure.
  • GLOCAL_DATA_PATH: Update this to the location where your Glocal dataset is stored.
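
For illustration, the relevant part of src/.config might look like the sample below. The section name [paths] and the path values are hypothetical placeholders; only PROJECT_ROOT and GLOCAL_DATA_PATH are the parameters described above:

[paths]
# Root directory of the project
PROJECT_ROOT = /home/username/glocal
# Location of the Glocal dataset
GLOCAL_DATA_PATH = /home/username/glocal_data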

Raw data

The raw data used in the project should be placed in the following directory structure:

.
└── raw
    ├── airports
    │   ├── datahub
    │   └── nunn_and_puga
    ├── country_codes
    ├── metadata
    │   └── 15_ADMProjects
    ├── mining
    ├── point_data
    │   └── telecom_antennas
    ├── ports
    │   ├── wld
    │   └── wpi
    ├── rasters
    │   ├── dmsp
    │   ├── elevation
    │   ├── emissions
    │   ├── fao
    │   ├── gdelt_v2
    │   ├── ghsl
    │   ├── gpcp
    │   ├── ntl_dmsp_ext
    │   ├── ntl_dvnl
    │   ├── population
    │   ├── precipitation
    │   ├── roads
    │   ├── ruggedness
    │   ├── sentinel_2
    │   ├── solar_potential
    │   ├── telecom_mobile_coverage
    │   ├── temperature
    │   ├── viirs
    │   └── wind_potential
    ├── remote_sensed
    │   ├── ACLED
    │   ├── all_countries_with_eth
    │   ├── all_FAO
    │   ├── CRU
    │   ├── cru_raster
    │   ├── elevation
    │   ├── GDELT_ethnic
    │   ├── GPCC
    │   ├── GPCC_raster
    │   ├── gpcp
    │   ├── Latinobarometro
    │   ├── nightlights
    │   ├── population
    │   ├── SCAD
    │   ├── SCPDSI
    │   ├── SCPDSI_raster
    │   └── temperature
    ├── roads
    │   ├── code
    │   └── test
    └── shapefiles
        ├── gadm
        ├── gadm_geoparquet
        └── ghs

Further, the intermediate data folder should have the following structure. Some of these folders are created automatically by the code, whereas others (such as the downloads from Google Cloud Storage) need to be created manually. To simplify the process, it is best to create all of the folders up front; a snippet that does this is sketched after the tree below.

.
└── intermediate
    ├── acled
    ├── gadm_without_geometry
    ├── gee_accessibility_agg
    ├── gee_forest_change
    ├── gee_landcover_agg
    ├── gee_landcover_compiled
    ├── gee_landcover_processed
    ├── gee_viirs_agg
    ├── gee_viirs_monthly
    ├── ghs
    ├── ghs_gadm
    ├── individual_aggregations
    ├── market_access
    ├── mining
    ├── overture
    ├── raster_aggregations
    │   ├── dmsp
    │   ├── elevation
    │   ├── emissions
    │   ├── fao
    │   ├── ntl_dmsp_ext
    │   ├── ntl_dvnl
    │   ├── population
    │   ├── precipitation
    │   ├── roads
    │   ├── ruggedness
    │   ├── solar_potential
    │   ├── telecom_mobile_coverage
    │   ├── temperature
    │   ├── viirs
    │   └── wind_potential
    ├── roads
    │   └── grip
    ├── viirs
    └── viirs_monthly

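If you prefer not to create these folders by hand, the following R snippet sketches one way to create them all at once. It is a convenience helper, not part of the repository, and INTERMEDIATE_ROOT is a placeholder for wherever your intermediate data lives:

INTERMEDIATE_ROOT <- "intermediate"
subdirs <- c(
  "acled", "gadm_without_geometry", "gee_accessibility_agg",
  "gee_forest_change", "gee_landcover_agg", "gee_landcover_compiled",
  "gee_landcover_processed", "gee_viirs_agg", "gee_viirs_monthly",
  "ghs", "ghs_gadm", "individual_aggregations", "market_access",
  "mining", "overture",
  file.path("raster_aggregations",
            c("dmsp", "elevation", "emissions", "fao", "ntl_dmsp_ext",
              "ntl_dvnl", "population", "precipitation", "roads",
              "ruggedness", "solar_potential", "telecom_mobile_coverage",
              "temperature", "viirs", "wind_potential")),
  file.path("roads", "grip"), "viirs", "viirs_monthly"
)
# Create every folder, including parents; skip any that already exist
invisible(lapply(file.path(INTERMEDIATE_ROOT, subdirs),
                 dir.create, recursive = TRUE, showWarnings = FALSE))
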
Need Help?

If you encounter any issues or have questions about configuring your environment, feel free to open an issue in the project repository or contact the project authors.