This folder contains the data and scripts for reproducing the manuscript “Towards Causal Inference for Spatio-Temporal Data: Conflict and Forest Loss in Colombia”, submitted to JASA Case Studies on April 30, 2021.
- This paper does not involve analysis of external data (i.e., no data are used or the only data are generated by the authors via simulation in their code).
- I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.
The data consist of four data sets, all of which are publicly available:
- Forest loss (remote sensing based spatio-temporal data set)
- Conflict events (georeferenced spatio-temporal data set, based on reports from newspapers and NGOs)
- Distance to road (Euclidean distance to the closest road, calculated from a vector-based road dataset)
- Population density (global population density dataset, temporally explicit at a 5-year resolution)
From the above primary data sets, we generate a single summary data set on which most of our analysis is based. It can be found in the supplementary materials under the filename data_xy_colombia_20200616.txt.
- Data are publicly available.
- Data cannot be made publicly available.
If the data are publicly available, see the Publicly available data section. Otherwise, see the Non-publicly available data section, below.
- Data are available online at:
- https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html
- https://ucdp.uu.se/downloads/index.html#ged_global
- http://biogeo.ucdavis.edu/data/diva/rds/COL_rds.zip
- https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-rev11/data-download
- Data are available as part of the paper’s supplementary material.
- Data are publicly available by request, following the process described here:
- Data are or will be made available through some other mechanism, described here:
- CSV or other plain text.
- Software-specific binary format (.Rda, Python pickle, etc.): pickle
- Standardized binary format (e.g., netCDF, HDF5, etc.):
- Other (please specify): .vrt, SHP, SHP
- Provided by authors in the following file(s): data_dictionary.txt
- Data file(s) is(are) self-describing (e.g., netCDF files)
- Available at the following URL:
- Script files
- R
- Python
- Matlab
- Other:
- Package
- R
- Python
- MATLAB toolbox
- Other:
- Reproducible report
- R Markdown
- Jupyter notebook
- Other:
- Shell script
- Other (please specify):
R, Python, ArcGIS
R version 3.6.1 (2019-07-05), Python version 3.7, ArcGIS Version 10.5
In our R scripts, we used the following packages
- ggplot2 (version 3.2.1)
- foreach (version 1.4.7)
- gridExtra (version 2.3)
- reshape2 (version 1.4.3)
- viridis (version 0.5.1)
- fields (version 10.0)
- dplyr (version 0.8.3)
In our Python scripts, we used the following packages
- gdal (version 3.0.2)
- osr (version 3.0.2)
- ogr (version 3.0.2)
- numpy (version 1.18.1)
- joblib (version 0.15.1)
- tqdm (version 4.46.0)
- urllib3 (version 1.25.8)
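Reproducing the analysis requires the pinned package versions listed above. A minimal, hypothetical helper (not part of the original materials) that reports any installed Python packages deviating from the pinned versions:

```python
# Hypothetical version check against the versions pinned in this checklist.
# PINNED and check_versions are our names, not part of the original scripts.
from importlib.metadata import version, PackageNotFoundError

PINNED = {
    "numpy": "1.18.1",
    "joblib": "0.15.1",
    "tqdm": "4.46.0",
    "urllib3": "1.25.8",
}

def check_versions(pinned):
    """Return {package: (pinned, installed-or-None)} for mismatched or missing packages."""
    report = {}
    for pkg, want in pinned.items():
        try:
            have = version(pkg)
        except PackageNotFoundError:
            have = None  # package not installed at all
        if have != want:
            report[pkg] = (want, have)
    return report

mismatches = check_versions(PINNED)
```

Note that gdal, osr, and ogr ship together in the `osgeo` bindings, so their versions are best checked via `from osgeo import gdal; gdal.__version__`.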
- No parallel code used
- Multi-core parallelization on a single machine/node
- Number of cores used: 8
- Multi-machine/multi-node parallelization
- Number of nodes and cores used: 3 nodes, 243 cores
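The single-machine parallelization follows the usual map-over-independent-tasks pattern (the actual scripts use joblib.Parallel with 8 workers). A minimal stdlib sketch of the same pattern, with illustrative names:

```python
# Sketch of the per-task parallelization pattern; the real scripts use
# joblib.Parallel. summarize_cell stands in for the per-cell computation.
from concurrent.futures import ThreadPoolExecutor

def summarize_cell(cell_id):
    # Placeholder for the real per-grid-cell summary computation.
    return cell_id * cell_id

def parallel_map(func, items, workers=8):
    """Apply func to items using a pool of `workers` threads, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))

results = parallel_map(summarize_cell, range(10))
```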
- MIT License (default)
- BSD
- GPL v3.0
- Creative Commons
- Other: (please specify below)
The provided workflow reproduces:
- Any numbers provided in text in the paper
- All tables and figures in the paper
- Selected tables and figures in the paper, as explained and justified below:
- Single master code file
- Wrapper (shell) script(s)
- Self-contained R Markdown file, Jupyter notebook, or other literate programming approach
- Text file (e.g., a readme-style file) that documents workflow
- Makefile
- Other (more detail in Instructions below)
We first describe how to generate the summary data set data_xy_colombia_20200616.txt from the publicly available data sets mentioned above.
- Run 00_Download_ForestLossData.py, which downloads the forest loss data from https://earthenginepartners.appspot.com/science-2013-global-forest/download_v1.6.html and saves it to the files Forest2000.vrt, Gain.vrt, and LossYear.vrt.
- Run script 01_ForestLoss_Summaries.py, which summarizes the data to 10x10km grid cells and saves it to Hansen_Summaries_ALL_20190821.csv.
- Download the conflict data from https://ucdp.uu.se/downloads/index.html#ged_global (choose “UCDP Georeferenced Event Dataset (GED) Global version 19.1” in SHP format) and save it to a file named ged181.shp.
- Run 02_Conflict-Summaries.py, which summarizes the data to 10x10km grid cells and saves it to PRIO-summaries_bestEstimate_20191217.
- Download the road data for Colombia from http://biogeo.ucdavis.edu/data/diva/rds/COL_rds.zip
- Run Script 03_CalculateDistanceToRoad.py, which calculates the Euclidean distance to roads at 1km spatial resolution and stores the data to DistanceToRoad_1km.tif (caution: modify the workFolder in the script header)
- Run Script 04_RoadDist_Summaries.py, which aggregates DistanceToRoad_1km.tif to 10x10km grid cells, and saves it to RoadDist_Summaries.csv
- Download the population density data for each of the years 2000, 2005, 2010, and 2015 from https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-rev11/data-download (choose Temporal: Single Year, File Format: GeoTiff, Resolution: 30 seconds (approx. 1km)) and save it to files named
  - gpw_v4_population_density_rev10_2000_30_sec.tif
  - gpw_v4_population_density_rev10_2005_30_sec.tif
  - gpw_v4_population_density_rev10_2010_30_sec.tif
  - gpw_v4_population_density_rev10_2015_30_sec.tif
- Run 05_Population-Summaries.py, which summarizes the data to 10x10km grid cells and saves it to GPWv4_summary_20190129.csv
- Run merge_data.R, which takes as input the following five data sets and generates the summary data set data_xy_colombia_20200616.txt.
- Hansen_Summaries_ALL_20190821.csv
- PRIO-summaries_bestEstimate_20191217
- RoadDist_Summaries.csv
- GPWv4_summary_20190129.csv
- uniqueID_country_Biome_LS_Anthrome_20190130.csv
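The summarization steps above share one core operation: aggregating roughly 1 km raster pixels into 10x10km grid cells. A minimal sketch of that block aggregation with numpy (the real scripts read the rasters via gdal; `block_mean` is an illustrative name, not one of the authors' functions):

```python
# Sketch of block aggregation: average (factor x factor) pixel blocks,
# e.g. 1 km pixels -> 10x10 km grid cells with factor=10.
import numpy as np

def block_mean(arr, factor):
    """Aggregate a 2D array into (factor x factor) blocks by averaging."""
    h, w = arr.shape
    assert h % factor == 0 and w % factor == 0, "dimensions must be divisible"
    return arr.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
```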
Apart from Figures 1 and 5, all our figures can be generated from the data set data_xy_colombia_20200616.txt by running the corresponding R scripts. The file wrapper-figures.R produces all figures sequentially and saves them as .pdf files.
- Figure 1 (temporally aggregated summary of data set):
- Run construct_summary_dataset.R, which generates data_xy_colombia_temporally_aggregated_20200616.txt
- Layers were visually assembled in ArcMap 10.5
- Figure 2 (conceptual idea for estimating causal effects from obs. data): run concept.R
- Figure 3 (simulation experiment): run CSTM_example_consistency.R
- Figure 4 (results of resampling test on country level): run resampling_sim.R and resampling_plot.R
- Figure 5 (regional summaries, causal effects and FARC presence):
- run construct_summary_dataset.R, which generates data_xy_colombia_regional_summaries_20200432.txt
- run regional_effects.R, which generates the file regional_effects.txt
- data on FARC presence stem from: https://pares.com.co/2015/04/24/los-mapas-del-conflicto/
- Layers were visually assembled in ArcMap 10.5
- Figure 6 (Colombian peace process): run peace_agreement.R
- Figure E1 (Plot illustrating the example in Remark 6; supplementary material): run plot_remark.R
- Figure F1 (Histogram of coefficient estimates; supplementary material): run time-varying-confounders-sim.R and time-varying-confounders-plot.R
- Figure F2 (Power curves; supplementary material): run power-analysis-sim.R and power-analysis-plot.R
- Figure F3 (Bias-heatmaps; supplementary material): run bias-analysis-sim.R and bias-analysis-plot.R
- Figure G1 (Spatial block-permutation scheme; supplementary material): run spatial_blockresampling.R
All results on test statistics and p-values stated in our manuscript can be reproduced from data_xy_colombia_20200616.txt using the scripts below. The file wrapper-results.R produces all results sequentially and saves them as .txt files.
- instantaneous effects, standard resampling: run resampling_sim.R
- temporally lagged effects, standard resampling: run resampling_tlag1.R
- instantaneous effects, temporal block-resampling: run blockresampling.R
- temporally lagged effects, temporal block-resampling: run blockresampling_tlag1.R
- instantaneous effects, spatial block-resampling: run spatial_blockresampling.R
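The scripts above all follow the same resampling-test logic: compute a test statistic on the observed data, recompute it under many resampled versions, and report the fraction of resamples at least as extreme. A generic sketch of that logic (function names and the example statistic are illustrative, not the authors' actual test statistic or their R implementation):

```python
# Generic permutation/resampling test sketch; the actual tests use
# temporal and spatial block-resampling rather than plain shuffling.
import random

def resampling_pvalue(x, y, statistic, n_resamples=999, seed=0):
    """Two-sided resampling p-value for statistic(x, y) under shuffling of y."""
    rng = random.Random(seed)
    observed = abs(statistic(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(y_perm)  # block-resampling would permute whole blocks here
        if abs(statistic(x, y_perm)) >= observed:
            hits += 1
    # +1 correction so the p-value is never exactly zero
    return (hits + 1) / (n_resamples + 1)

def mean_diff(x, y):
    # Example statistic: difference in mean of x between y == 1 and y == 0.
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    return sum(g1) / len(g1) - sum(g0) / len(g0)
```

The block-resampling variants differ only in the resampling step: instead of shuffling individual observations, contiguous temporal (or spatial) blocks are permuted as units to preserve within-block dependence.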
Approximate time needed to reproduce the analyses on a standard desktop machine:
- < 1 minute
- 1-10 minutes
- 10-60 minutes
- 1-8 hours
- > 8 hours
- Not feasible to run on a desktop machine, as described here: