Particle components aggregation

This repository contains code for aggregating the particle component predictions generated by Heresh Amini (preprint: https://doi.org/10.21203/rs.3.rs-1745433/v1) to Census geographies. The computationally intensive processing happens at the Census block level; aggregates at higher levels (i.e. block groups, tracts, and counties) simply roll up the block-level results, and the methodology extends trivially to other, higher-level Census geographies.

Processing takes place in several distinct stages. All scripts in a given stage are meant to be run and completed before continuing to the next stage. Some stages have several scripts; these can safely be run in parallel to speed up the processing.

In addition to the processing scripts, the util.R script, which is sourced by every R script, contains constants and functions shared across multiple scripts. In particular, all of the directory definitions live in this script, so that paths can be re-linked quickly if directories are moved.
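For illustration, the centralized constants might look like the following sketch. The paths are placeholders, and only GEOID_LENGTHS is actually named in this repository; treat everything else as hypothetical.

```r
# Hypothetical sketch of util.R-style constants; actual paths are project-specific.
PRED_DIR  <- "/path/to/predictions"   # raw particle component predictions
GPKG_PATH <- "/path/to/working.gpkg"  # working GeoPackage built in stage 2
OUT_DIR   <- "/path/to/aggregates"    # aggregated outputs

# Standard Census GEOID prefix lengths (block GEOIDs are 15 characters).
GEOID_LENGTHS <- c(county = 5, tract = 11, block_group = 12, block = 15)
```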

NOTE: Urban predictions for non-urban geographies are not valid. For simplicity, the pipeline automatically generates aggregates for both urban and non-urban Census geographies. Non-urban predictions can be used nationwide; urban predictions, however, have only been generated within Census-designated urban areas. Take care to use the urban aggregations only for urban geographies (as defined by the files generated by stage1_detect_urban.R). The stage6_distrib.R script helps you avoid this pitfall by generating files containing only the relevant valid data.

Stage 1: Initial setup and consistency checks

The first stage sets up some initial files that are used in the later stages. Specifically:

stage1_detect_urban.R determines which Census geographies are considered "urban". As Census blocks are the smallest geographies for which population data are available, a simple point-in-polygon test of area-weighted block centroids against Census-designated urban areas determines whether or not each Census block is "urban". Higher-level geographies, in turn, are considered "urban" if at least 75% of the population of their constituent blocks lives in blocks classified as "urban".
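A minimal sketch of this classification using sf and dplyr, assuming a block-centroid layer with geoid and pop columns (all layer and column names here are hypothetical):

```r
library(sf)
library(dplyr)

# Hypothetical layers: block centroids (with population) and urban area polygons.
blocks_pt <- st_read("working.gpkg", layer = "block_centroids")
urban_ua  <- st_read("working.gpkg", layer = "urban_areas")

# Point-in-polygon: a block is urban if its centroid falls inside an urban area.
blocks_pt$urban <- lengths(st_intersects(blocks_pt, urban_ua)) > 0

# A higher-level geography (here, tracts) is urban if at least 75% of its
# population lives in urban blocks; tract GEOIDs are the first 11 characters.
tract_urban <- blocks_pt |>
  st_drop_geometry() |>
  mutate(tract_geoid = substr(geoid, 1, 11)) |>
  group_by(tract_geoid) |>
  summarise(urban = sum(pop[urban]) / sum(pop) >= 0.75)
```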

stage1_extract_grids.R generates GeoPackage files containing the grids used for the predictions. This is done by extracting the coordinates from the prediction files. All prediction files for a given Census region use the same grid for every year and particle component; therefore, we read only the first year of the component with the smallest prediction files (NO3). From this grid, we generate a gridcell_idx as 1:nrow(df), where df is the data frame of coordinates in the first year of NO3 predictions for a given region. These are saved to GeoPackage files to take advantage of the format's portability and spatial indexing functionality.
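In outline, assuming the predictions load as a data frame with lon/lat columns (file, column, and layer names below are illustrative):

```r
library(sf)

# df: coordinates from the first year of NO3 predictions for one region.
df$gridcell_idx <- seq_len(nrow(df))  # i.e. 1:nrow(df)

# Save as a spatial layer to get GeoPackage portability and spatial indexing.
grid <- st_as_sf(df, coords = c("lon", "lat"), crs = 4326)
st_write(grid, "grids.gpkg", layer = "grid_south")  # hypothetical layer name
```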

Additionally, stage1_check_missing.R scans all input files and prints out any that may have missing data. Input files with missing data will not make it past stage4_aggregate_blocks.R, but running this check can surface such issues before they reach that stage.
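The check amounts to something like the following, assuming CSV inputs and the hypothetical PRED_DIR constant sketched above:

```r
# Flag any input file containing missing values.
for (f in list.files(PRED_DIR, full.names = TRUE)) {
  df <- read.csv(f)
  if (anyNA(df)) message("Possible missing data: ", f)
}
```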

Stage 2: Working GeoPackage setup

The second stage sets up a single working GeoPackage database for all subsequent geoprocessing. The stage2_setup_gpkg.sh script imports saved Census block boundaries from TIGER/Line and the grids generated earlier. Additionally, the script creates subsets of blocks by Census region and calculates centroids for each subset of Census blocks. This is done by directly interfacing with the GeoPackage via -nln imports and -sql queries using the ogr2ogr command-line tool.
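The same operations can be sketched from R via sf::gdal_utils(), which wraps ogr2ogr; the paths, layer names, and SQL below are illustrative, not the script's actual commands:

```r
library(sf)

# Import TIGER/Line block boundaries into the working GeoPackage
# (-nln names the new layer).
gdal_utils("vectortranslate", "tl_2010_us_tabblock10.shp", "working.gpkg",
           options = c("-nln", "blocks_2010"))

# Subset one Census region and compute centroids with an -sql query.
gdal_utils("vectortranslate", "working.gpkg", "working.gpkg",
           options = c("-update", "-nln", "block_centroids_south",
                       "-dialect", "SQLite",
                       "-sql", paste(
                         "SELECT GEOID10, ST_Centroid(geom) AS geom",
                         "FROM blocks_2010",
                         "WHERE substr(GEOID10, 1, 2) = '12'")))  # e.g. Florida
```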

NOTE: This script contains constants separate from those in util.R, since it is written in Bash rather than R. Take care to update both sets of constants when modifying input and output directory paths.

Stage 3: Spatial join

The third stage joins the grid cell centroids to Census blocks in two different ways:

stage3_nearest.R uses the nabor library to find the nearest grid cell to each Census block centroid. This guarantees that every Census block, urban or not, is matched to exactly one grid cell.
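A minimal sketch of this nearest-neighbour query with nabor::knn(), assuming matrices of projected coordinates (all object names are hypothetical):

```r
library(nabor)

# grid_xy: one x/y row per grid cell; block_xy: one row per block centroid.
nn <- knn(data = grid_xy, query = block_xy, k = 1)

crosswalk_nn <- data.frame(
  block_geoid  = block_geoids,          # 15-character block GEOIDs
  gridcell_idx = as.vector(nn$nn.idx),  # index of the nearest grid cell
  dist         = as.vector(nn$nn.dists) # distance, useful for sanity checks
)
```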

stage3_within.txt includes instructions for how to perform a point-in-polygon spatial join of grid cells to Census blocks. No R script is provided because the memory cost of such a large spatial join using the sf library is prohibitive. (This was attempted on a server with 500 GB of RAM and failed.) Instead, instructions are provided for performing the spatial join in QGIS, where remarkably low amounts of memory are used, owing to the use of GeoPackages. Ideally, the entire GeoPackage would be imported into a PostGIS database for computation, but this may be logistically difficult.

NOTE: Unlike the nearest-neighbour join, the point-in-polygon join does not produce a match for every Census block. This is because it is a left spatial join on the grid cells themselves, and not every Census block is guaranteed to contain a grid cell: for example, very small blocks, or (in the case of the urban predictions) blocks off the coast that lie outside the prediction area.

Stage 4: Census block aggregations

In stage 4, the predictions are joined with the nearest-neighbour and point-in-polygon crosswalks generated in the third stage, then aggregated. The stage4_aggregate_blocks.R script first joins the predictions to both crosswalk files, separately, and aggregates grid cell predictions into Census block-level predictions as follows:

  • For the nearest-neighbour crosswalk, the script simply uses the prediction from the matched nearest grid cell.

  • For the point-in-polygon crosswalk, the script averages the predictions from the grid cells contained in each Census block.

The script then performs a left join of the point-in-polygon aggregates on the nearest-neighbour predictions, as the nearest-neighbour predictions are guaranteed to have all blocks represented. (The point-in-polygon crosswalk does not; see the note in stage 3.) The resulting file has $n \times y$ rows, where $n$ is the number of Census blocks and $y$ is the number of years predicted.
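Putting the stage together, a minimal dplyr sketch with hypothetical column names, where pred holds one prediction row per grid cell and year:

```r
library(dplyr)

# Nearest-neighbour: take the prediction of each block's matched grid cell.
nn_blocks <- crosswalk_nn |>
  left_join(pred, by = "gridcell_idx") |>
  select(block_geoid, year, value_nn = value)

# Point-in-polygon: average the grid cells contained in each block.
pip_blocks <- crosswalk_pip |>
  left_join(pred, by = "gridcell_idx") |>
  group_by(block_geoid, year) |>
  summarise(value_pip = mean(value), .groups = "drop")

# Left join onto the NN table, which is guaranteed to cover every block,
# yielding n * y rows (blocks times years).
block_preds <- nn_blocks |>
  left_join(pip_blocks, by = c("block_geoid", "year"))
```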

Stage 5: Aggregations for other Census geographies

In stage 5, the Census block-level aggregates are used to generate aggregates for higher-level geographies. stage5_aggregate_other.R first settles on a single block-level prediction per block, taking the point-in-polygon aggregate when available and falling back to the nearest-neighbour prediction otherwise. Then, the 15-character Census block GEOID (BLKIDFP00 for the 2000 Census, or GEOID10 for the 2010 Census) is truncated to obtain the block group, tract, and county GEOIDs, using the lengths defined in the GEOID_LENGTHS variable in util.R. For more information about how this works, see the Understanding Geographic Identifiers (GEOIDs) page on the U.S. Census Bureau website.
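The coalesce-and-truncate step might look like this; the aggregation shown is a plain mean for illustration only (the real script may differ, e.g. by weighting):

```r
library(dplyr)

higher <- block_preds |>
  mutate(value        = coalesce(value_pip, value_nn),  # prefer PIP, fall back to NN
         tract_geoid  = substr(block_geoid, 1, 11),     # per GEOID_LENGTHS
         county_geoid = substr(block_geoid, 1, 5))

tract_preds <- higher |>
  group_by(tract_geoid, year) |>
  summarise(value = mean(value), .groups = "drop")  # illustrative, unweighted
```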

Stage 6: Create files for distribution

So far, we have aggregated the data to all Census blocks and higher-level geographies, disregarding the urban/non-urban classifications generated by stage1_detect_urban.R. In this stage, stage6_distrib.R joins those classifications (which include population counts) to the aggregated files, subsets to urban areas where necessary (i.e. for the urban predictions), and writes out files for distribution.
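In outline, with hypothetical file and column names:

```r
library(dplyr)

urban_flags <- readRDS("block_urban.rds")  # stage 1 output: urban flag + population

distrib <- block_preds |>
  left_join(urban_flags, by = "block_geoid")

# Urban predictions are only valid within urban geographies.
write.csv(filter(distrib, urban), "urban_blocks.csv", row.names = FALSE)

# Non-urban predictions can be distributed nationwide.
write.csv(distrib, "nonurban_blocks.csv", row.names = FALSE)
```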