/spatio_temporal_causality

Towards causal inference for spatio-temporal data: conflict and forest loss in Colombia


This folder contains data and scripts for reproducing the manuscript “Towards Causal Inference for Spatio-Temporal Data: Conflict and Forest Loss in Colombia”, submitted to JASA Case Studies on April 30, 2021.

Part 1: Data

  • This paper does not involve analysis of external data (i.e., no data are used or the only data are generated by the authors via simulation in their code).
  • I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.

Abstract

The data consist of four data sets, all of which are publicly available:

  1. Forest loss (remote sensing based spatio-temporal data set)
  2. Conflict events (georeferenced spatio-temporal data set, based on reports from newspapers and NGOs)
  • Distance to road (Euclidean distance to the closest road, calculated from a vector-based road data set)
  • Population density (global population density data set, temporally explicit at a 5-year resolution)

From the above primary data sets, we generate a single summary data set on which most of our analysis is based. It can be found in the supplementary materials under the filename data_xy_colombia_20200616.txt.
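Conceptually, the merge performed by merge_data.R joins the per-grid-cell summaries of the primary data sets on a shared cell identifier. The following minimal Python sketch illustrates that join; the column names ("cell_id", "forest_loss", "conflict_events") are hypothetical placeholders, not the actual schema of the summary data set.

```python
# Hypothetical sketch of merging per-grid-cell summaries from the primary
# data sets into one table; the actual merge is performed by merge_data.R.
# Column names are illustrative placeholders, not the real schema.

forest = {1: 0.12, 2: 0.05}   # cell_id -> forest loss summary (toy values)
conflict = {1: 3, 2: 0}       # cell_id -> conflict event count (toy values)

summary = [
    {"cell_id": cid,
     "forest_loss": forest[cid],
     "conflict_events": conflict.get(cid, 0)}
    for cid in sorted(forest)
]
```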

Availability

  • Data are publicly available.
  • Data cannot be made publicly available.

If the data are publicly available, see the Publicly available data section. Otherwise, see the Non-publicly available data section, below.

Publicly available data

  • Data are available online at:
  1. https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html
  2. https://ucdp.uu.se/downloads/index.html#ged_global
  3. http://biogeo.ucdavis.edu/data/diva/rds/COL_rds.zip
  4. https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-rev11/data-download
  • Data are available as part of the paper’s supplementary material.

  • Data are publicly available by request, following the process described here:

  • Data are or will be made available through some other mechanism, described here:

Non-publicly available data

Description

File format(s)

  • CSV or other plain text.
  • Software-specific binary format (.Rda, Python pickle, etc.): pickle
  • Standardized binary format (e.g., netCDF, HDF5, etc.):
  • Other (please specify): .vrt, SHP, SHP

Data dictionary

  • Provided by authors in the following file(s): data_dictionary.txt
  • Data file(s) is(are) self-describing (e.g., netCDF files)
  • Available at the following URL:

Additional Information (optional)

Part 2: Code

Abstract

Description

Code format(s)

  • Script files
    • R
    • Python
    • Matlab
    • Other:
  • Package
    • R
    • Python
    • MATLAB toolbox
    • Other:
  • Reproducible report
    • R Markdown
    • Jupyter notebook
    • Other:
  • Shell script
  • Other (please specify):

Supporting software requirements

R, Python, ArcGIS

Version of primary software used

R version 3.6.1 (2019-07-05), Python version 3.7, ArcGIS Version 10.5

Libraries and dependencies used by the code

In our R scripts, we used the following packages:

  • ggplot2 (version 3.2.1)
  • foreach (version 1.4.7)
  • gridExtra (version 2.3)
  • reshape2 (version 1.4.3)
  • viridis (version 0.5.1)
  • fields (version 10.0)
  • dplyr (version 0.8.3)

In our Python scripts, we used the following packages:

  • gdal (version 3.0.2)
  • osr (version 3.0.2)
  • ogr (version 3.0.2)
  • numpy (version 1.18.1)
  • joblib (version 0.15.1)
  • tqdm (version 4.46.0)
  • urllib3 (version 1.25.8)

Supporting system/hardware requirements (optional)

Parallelization used

  • No parallel code used
  • Multi-core parallelization on a single machine/node
    • Number of cores used: 8
  • Multi-machine/multi-node parallelization
    • Number of nodes and cores used: 3 nodes, 243 cores
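The per-grid-cell computations in the Python scripts are embarrassingly parallel, which is why a simple multi-core map (via joblib, per the dependency list above) suffices. The sketch below illustrates the same pattern using only the standard library's ProcessPoolExecutor as a stand-in for joblib; the function summarize_cell is a hypothetical placeholder for the actual per-cell work.

```python
from concurrent.futures import ProcessPoolExecutor

def summarize_cell(cell_id):
    # Placeholder for the per-grid-cell work done in the summary scripts;
    # the real computation aggregates raster pixels within each cell.
    return cell_id * cell_id

if __name__ == "__main__":
    # The actual scripts use joblib on 8 cores; ProcessPoolExecutor from
    # the standard library illustrates the same parallel map over cells.
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(summarize_cell, range(10)))
```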

License

  • MIT License (default)
  • BSD
  • GPL v3.0
  • Creative Commons
  • Other: (please specify below)

Additional information (optional)

Part 3: Reproducibility workflow

Scope

The provided workflow reproduces:

  • Any numbers provided in text in the paper
  • All tables and figures in the paper
  • Selected tables and figures in the paper, as explained and justified below:

Workflow

Format(s)

  • Single master code file
  • Wrapper (shell) script(s)
  • Self-contained R Markdown file, Jupyter notebook, or other literate programming approach
  • Text file (e.g., a readme-style file) that documents workflow
  • Makefile
  • Other (more detail in Instructions below)

Instructions

Summary data set

We first describe how to generate the summary data set data_xy_colombia_20200616.txt from the publicly available data sets mentioned above.

  1. Run 00_Download_ForestLossData.py, which downloads the forest loss data from https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html and saves it to the files Forest2000.vrt, Gain.vrt, and LossYear.vrt.
  2. Run script 01_ForestLoss_Summaries.py, which summarizes the data to 10x10km grid cells and saves it to Hansen_Summaries_ALL_20190821.csv.
  3. Download the conflict data from https://ucdp.uu.se/downloads/index.html#ged_global (choose “UCDP Georeferenced Event Dataset (GED) Global version 19.1” in SHX format) and save it to a file named ged181.shp.
  4. Run 02_Conflict-Summaries.py, which then summarizes the data to 10x10km grid cells and saves it to PRIO-summaries_bestEstimate_20191217.
  5. Download the road data for Colombia from http://biogeo.ucdavis.edu/data/diva/rds/COL_rds.zip
  6. Run Script 03_CalculateDistanceToRoad.py, which calculates the Euclidean distance to roads at 1km spatial resolution and stores the data to DistanceToRoad_1km.tif (caution: modify the workFolder in the script header)
  7. Run Script 04_RoadDist_Summaries.py, which aggregates DistanceToRoad_1km.tif to 10x10km grid cells, and saves it to RoadDist_Summaries.csv
  8. Download the population density data for each of the years 2000, 2005, 2010, and 2015 from https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-rev11/data-download (choose Temporal: Single Year, File Format: GeoTiff, Resolution: 30 seconds (approx. 1 km)) and save it to files named
    • gpw_v4_population_density_rev10_2000_30_sec.tif
    • gpw_v4_population_density_rev10_2005_30_sec.tif
    • gpw_v4_population_density_rev10_2010_30_sec.tif
    • gpw_v4_population_density_rev10_2015_30_sec.tif
  9. Run 05_Population-Summaries.py, which summarizes the data to 10x10km grid cells and saves it to GPWv4_summary_20190129.csv
  10. Run merge_data.R, which takes as input the following five data sets and generates the summary data set data_xy_colombia_20200616.txt.
    • Hansen_Summaries_ALL_20190821.csv
    • PRIO-summaries_bestEstimate_20191217
    • RoadDist_Summaries.csv
    • GPWv4_summary_20190129.csv
    • uniqueID_country_Biome_LS_Anthrome_20190130.csv
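Several of the steps above summarize fine-resolution raster data to 10x10 km grid cells. The core aggregation can be sketched as a block mean over pixel values; this standalone toy example (plain lists, not the actual raster tooling, which uses gdal) is only meant to illustrate that operation.

```python
def block_mean(grid, block):
    """Aggregate a square 2-D list of pixel values into block x block cell means.

    Hypothetical stand-in for the summarization in the *_Summaries.py scripts,
    which aggregate ~1 km pixels to 10x10 km grid cells using gdal."""
    n = len(grid)
    out = []
    for i in range(0, n, block):
        row = []
        for j in range(0, n, block):
            vals = [grid[a][b]
                    for a in range(i, i + block)
                    for b in range(j, j + block)]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out

# Toy 4x4 pixel grid aggregated to 2x2 cells:
pixels = [[1, 1, 2, 2],
          [1, 1, 2, 2],
          [3, 3, 4, 4],
          [3, 3, 4, 4]]
cells = block_mean(pixels, 2)  # -> [[1.0, 2.0], [3.0, 4.0]]
```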

Figures

Apart from Figures 1 and 5, all our figures can be generated from the data set data_xy_colombia_20200327.txt by running corresponding R scripts. The file wrapper-figures.R produces all figures sequentially, and saves them as .pdf files.

  • Figure 1 (temporally aggregated summary of data set):
    • Run construct_summary_dataset.R, which generates data_xy_colombia_temporally_aggregated_20200616.txt
    • Layers were visually assembled in ArcMap 10.5
  • Figure 2 (conceptual idea for estimating causal effects from obs. data): run concept.R
  • Figure 3 (simulation experiment): run CSTM_example_consistency.R
  • Figure 4 (results of resampling test on country level): run resampling_sim.R and resampling_plot.R
  • Figure 5 (regional summaries, causal effects and FARC presence):
    • run construct_summary_dataset.R, which generates data_xy_colombia_regional_summaries_20200432.txt
    • run regional_effects.R, which generates the file regional_effects.txt
    • data on FARC presence stem from: https://pares.com.co/2015/04/24/los-mapas-del-conflicto/
    • Layers were visually assembled in ArcMap 10.5
  • Figure 6 (Colombian peace process): run peace_agreement.R
  • Figure E1 (Plot illustrating the example in Remark 6; supplementary material): run plot_remark.R
  • Figure F1 (Histogram of coefficient estimates; supplementary material): run time-varying-confounders-sim.R and time-varying-confounders-plot.R
  • Figure F2 (Power curves; supplementary material): run power-analysis-sim.R and power-analysis-plot.R
  • Figure F3 (Bias-heatmaps; supplementary material): run bias-analysis-sim.R and bias-analysis-plot.R
  • Figure G1 (Spatial block-permutation scheme; supplementary material): run spatial_blockresampling.R

Results

All results on test statistics and p-values stated in our manuscript can be reproduced from data_xy_colombia_20200327.txt using the scripts below. The file wrapper-results.R produces all results sequentially, and saves them as .txt files.

  • instantaneous effects, standard resampling: run resampling_sim.R
  • temporally lagged effects, standard resampling: run resampling_tlag1.R
  • instantaneous effects, temporal block-resampling: run blockresampling.R
  • temporally lagged effects, temporal block-resampling: run blockresampling_tlag1.R
  • instantaneous effects, spatial block-resampling: run spatial_blockresampling.R
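All of these scripts follow the same resample-and-compare pattern: compute a test statistic on the observed data, recompute it on many resampled data sets, and report the fraction of resamples with a statistic at least as extreme. The sketch below illustrates that generic pattern with a plain permutation test for a difference in means; the manuscript's actual test statistic and (block-)resampling schemes are those implemented in the R scripts above, not this toy version.

```python
import random

def permutation_pvalue(x, y, n_resamples=1000, seed=0):
    """Two-sided permutation p-value for a difference in means.

    Illustrative only: shows the generic resampling logic, not the test
    statistic or resampling scheme used in the manuscript's R scripts."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    count = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        xs, ys = pooled[:len(x)], pooled[len(x):]
        stat = abs(sum(xs) / len(xs) - sum(ys) / len(ys))
        if stat >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive.
    return (count + 1) / (n_resamples + 1)
```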

Expected run-time

Approximate time needed to reproduce the analyses on a standard desktop machine:

  • < 1 minute
  • 1-10 minutes
  • 10-60 minutes
  • 1-8 hours
  • > 8 hours
  • Not feasible to run on a desktop machine, as described here: