This repository documents the data processing and the analysis performed on the data to support the findings presented in the article "Measuring Inequality in Community Resilience to Natural Disasters Using Large-scale Mobility Data" published in Nature Communications.
Title: Measuring Inequality in Community Resilience to Natural Disasters Using Large-scale Mobility Data
Authors: Boyeong Hong1, Bartosz J. Bonczak1, Arpit Gupta2, and Constantine E. Kontokosta1,3,*
1New York University, Marron Institute of Urban Management, 60 5th Avenue, New York, NY 10011
2New York University, Stern School of Business, 44 West 4th Street, New York, NY 10012
3New York University, Center for Urban Science and Progress, 370 Jay Street, Brooklyn, NY 11201
*Correspondence and requests for materials should be addressed to Constantine E. Kontokosta (email: ckontokosta@nyu.edu).
While conceptual definitions provide a foundation for the study of disasters and their impacts, the challenge for researchers and practitioners alike has been to develop objective and rigorous measures of resilience that are generalizable and scalable, taking into account spatio-temporal dynamics in the response and recovery of localized communities. In this paper, we analyze mobility patterns of more than 800,000 anonymized mobile devices in Houston, Texas, representing approximately 35% of the local population, in response to Hurricane Harvey in 2017. Using changes in mobility behavior before, during, and after the disaster, we empirically define community resilience capacity as a function of the magnitude of impact and time-to-recovery. Overall, we find clear socioeconomic and racial disparities in resilience capacity and evacuation patterns. Our work provides new insight into the behavioral response to disasters and provides the basis for data-driven public sector decisions that prioritize the equitable allocation of resources to vulnerable neighborhoods.
Hong, B., Bonczak, B. J., Gupta, A., & Kontokosta, C. E. (2021). Measuring inequality in community resilience to natural disasters using large-scale mobility data. Nature Communications, 12(1), 1870.
The repository documents the data processing and analysis performed to support the findings of the study reported in the published article. It is structured as follows:
.\root
|--- data # placeholder directory intended to contain raw data
|--- figures # figures generated for the publication purposes
|--- scripts # contains all scripts used in the study
|    |--- data_processing # scripts processing raw data with PySpark on HDFS cluster
|    |--- analysis # scripts documenting analytical tasks
The primary data used for the main analysis are anonymized, geolocated mobile application activity records provided by VenPath, Inc., a data marketplace company providing mobile application data and business analytics services. The data are extracted from more than 200 mobile applications and cover more than 120 million devices every month across the U.S. The initial dataset comprises 23 TB of compressed comma-separated values (CSV) files, covers the period from June 2016 through October 2017, and contains more than 320 billion data points.
The anonymized geolocated mobile application activity data that support the findings of this study are available from VenPath, Inc., but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. The data are, however, available from the authors upon reasonable request and with permission of VenPath, Inc.
The mobility data are organized by the date on which each record was registered in the database, not by the date on which the ping was created, and are partitioned by year, month, and day. The data have the following structure:
Column | Type | Description |
---|---|---|
venpath_id | int | Auto-incrementing key to denote that this is a unique locate |
app_id | str | The source app that provided this data |
ad_id | str | The anonymized advertising ID |
id_type | str | iOS (idfa) or Android (afid) |
country_code | str | The locale setting on the user's phone |
device_make | str | The make of the user's device |
device_model | str | The model of the user's device |
device_os | str | The operating system of the user's device |
device_os_version | str | The version of the operating system on the user's device |
latitude | float | The geographical latitude |
longitude | float | The geographical longitude |
timestamp | timestamp | The UTC timestamp / datestamp of when this location point was collected |
ip_address | str | The IP address of the user at the time of the data collection |
horizontal_accuracy | float | The horizontal accuracy of the location point, in meters |
vertical_accuracy | str | The vertical accuracy of the location point, in meters |
foreground | bool | true if the ping was collected in the foreground or false if it was collected in the background |
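
As a minimal sketch of how the structure above might be consumed, the snippet below converts one CSV record into typed Python values and lists the daily partitions to scan for a window of ping dates. Several details are assumptions made for illustration and are not confirmed by this repository: that files are headerless with columns in the order of the table, that timestamps are formatted as `YYYY-MM-DD HH:MM:SS`, that partitions follow a `year=/month=/day=` directory layout under a hypothetical `data` root, and that a few extra days (`lag_days`) are enough to catch pings registered after they were created. The sample record is synthetic. The study itself processed the raw data with PySpark; plain Python is used here only to keep the example self-contained.

```python
import csv
import io
from datetime import date, datetime, timedelta, timezone


def to_bool(value):
    return value.strip().lower() == "true"


def to_utc(value):
    # ASSUMPTION: timestamps look like "2017-08-25 12:00:00"; the real format may differ.
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


# Column names and types follow the table above; the converters are illustrative.
SCHEMA = [
    ("venpath_id", int), ("app_id", str), ("ad_id", str), ("id_type", str),
    ("country_code", str), ("device_make", str), ("device_model", str),
    ("device_os", str), ("device_os_version", str),
    ("latitude", float), ("longitude", float), ("timestamp", to_utc),
    ("ip_address", str), ("horizontal_accuracy", float),
    ("vertical_accuracy", str), ("foreground", to_bool),
]


def parse_row(fields):
    """Map one positional CSV record to a dict of typed values (empty -> None)."""
    if len(fields) != len(SCHEMA):
        raise ValueError(f"expected {len(SCHEMA)} fields, got {len(fields)}")
    return {
        name: (convert(raw) if raw != "" else None)
        for (name, convert), raw in zip(SCHEMA, fields)
    }


def partition_paths(first_day, last_day, lag_days=3, root="data"):
    """List daily partitions to scan for pings created in [first_day, last_day].

    Partitions follow the registration date, so `lag_days` later partitions
    are included; records must still be filtered on `timestamp` afterwards.
    ASSUMPTION: the directory layout and the lag value are hypothetical.
    """
    n_days = (last_day - first_day).days + lag_days
    days = (first_day + timedelta(days=i) for i in range(n_days + 1))
    return [f"{root}/year={d.year}/month={d.month:02d}/day={d.day:02d}" for d in days]


# Synthetic record, for illustration only.
sample = ("1,app42,abc-123,idfa,US,Apple,iPhone 7,iOS,10.3,"
          "29.7604,-95.3698,2017-08-25 12:00:00,10.0.0.1,5.0,3.0,true")
record = parse_row(next(csv.reader(io.StringIO(sample))))
paths = partition_paths(date(2017, 8, 25), date(2017, 8, 26), lag_days=1)
```

Because a ping can be registered in a later partition than the day it was created, selecting by partition alone would miss or misdate records; reading a padded range of partitions and then filtering on the `timestamp` column avoids that.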
The study relies on publicly available data sources as well, which are listed in the table below.
Data | Size | Description | Spatial granularity | Data coverage | Source | URL |
---|---|---|---|---|---|---|
Parcel level land use | 1.3 GB | Houston Harris County parcel level land use information such as property owner, land use type, and location attributes | Parcel | 2018 | Texas Natural Resources Information System | https://data.tnris.org/collection/2679b514-bb7b-409f-97f3-ee3879f34448 |
Park and open space | 8 MB | Texas State parcel-based areas of parks and open spaces | Dissolved by park | 2018 | City of Houston GIS (COHGIS) Open Data Portal | https://cohgis-mycity.opendata.arcgis.com/datasets/coh-park-boundary |
Physical topology | 1.1 GB | Physical topology associated information including elevation | Related polygon | 2013 | USGS National Elevation Dataset (NED - 1/3 arc-second) | https://catalog.data.gov/dataset/national-elevation-dataset-ned-1-3-arc-second-downloadable-data-collection-national-geospatial |
National Flood Hazard Layer | 11 GB | Geographical information of an area of land adjacent to watershed and valley with high likelihood of flooding | Related polygon | 2018 | Federal Emergency Management Agency | https://www.fema.gov/flood-maps/products-tools/national-flood-hazard-layer |
Evacuation zone | - | List of zipcodes of planned evacuation zones | - | 2018 | Harris County Emergency Management | https://prepare.readyharris.org/evacuation-map |
FEMA shelters | 28 MB | FEMA National Shelter System information with geotagged locations | GPS (X,Y) | 2017 | Rice University Houston Urban Data Platform | https://www.kinderudp.org/#/datasetCatalog/va7b869ng5dv |
Major roads | 650 MB | Texas statewide and Houston citywide major roadway information associated with spatial segments, including road type and number of lanes | Road segment | 2019 | Texas Department of Transportation, Houston Department of Transportation | https://gis-txdot.opendata.arcgis.com/datasets/d4f7206d27af4358acb70cb1cc819d10_0, https://cohgis-mycity.opendata.arcgis.com/datasets/coh-major-road |
FEMA disaster assistance application | - | Federal aid for residents who suffered damage and losses from Hurricane Harvey | Zipcode | 2017 | Federal Emergency Management Agency | https://www.fema.gov/about/openfema/data-sets |
U.S. American Community Survey | 20 MB | 2017 5-year estimate. U.S. demographic and household socioeconomic characteristics data, such as population, age, race, income level, and household type | Census Tract | 2017 | U.S. Census Bureau (U.S. Department of Commerce) | https://www.census.gov/programs-surveys/acs/technical-documentation/table-and-geography-changes/2017/5-year.html |
Mobility data were provided via a dedicated AWS S3 service and were managed in accordance with NYU Institutional Review Board approval IRB-FY2018-1645. They were stored and accessed in a secured environment at New York University’s Center for Urban Science and Progress (NYU CUSP) Research Computing Facility (RCF), which is equipped with High Performance Computing (HPC) infrastructure, controlled access, and restricted connectivity. Raw data were deployed to a Cloudera Hadoop cluster of 20 nodes, each with 256 GB of memory and a 2.1 GHz CPU, for a total of 1280 cores and 5.1 TB of memory. Initial data processing was conducted using Apache PySpark version 2.4. The subsequent analysis of the processed data was conducted on the RCF's High Memory computing server, with 1 TB of total memory and Intel(R) Xeon(R) CPU E5-4640 0 @ 2.40GHz (4x8 cores), using predominantly Python version 3.7 and Quantum GIS version 3.4 Madeira. Each script file relies on a separate set of libraries, which are listed at the beginning of each file together with the library version used for the particular task.
MIT License
Copyright (c) 2021 Urban Intelligence Lab
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.