UNDER CONSTRUCTION

Hack the Bay: Data

This repo is intended to provide data and starter code for the Hack the Bay hackathon (Aug. 3-Sep. 14, 2020).

DOWNLOAD WATER QUALITY & BENTHIC DATA HERE (coming soon)

Overview

The Chesapeake Monitoring Cooperative’s (CMC) data is intended to fill spatial and temporal gaps in the federal Chesapeake Bay Program’s (CBP) database. Both CMC and CBP measure the health of the watershed by monitoring chemical water quality indicators and by counting the presence of different benthic organisms. For all hackathon challenges, we recommend that participants use both CMC and CBP water quality data; for Challenges 1 and 2, we also recommend using CMC and CBP benthic sample data. We have downloaded the water quality and benthic datasets and made them available in the Google Drive folder linked above. In addition to these datasets, links to more suggested datasets for each challenge are listed below.

*Note on the data: The CBP data (both water quality and benthic) includes only samples collected after 2005; CMC data goes back to 1992. For older CBP data, you can download additional water quality data and nontidal benthic sample data directly from CBP. CMC data was downloaded from the Chesapeake Data Explorer.

Contents

  1. Data to Download [Google Drive]
  2. Recommended Datasets
  3. Data Dictionaries
  4. Understanding the Data
  5. Code for Generating Final Datasets

Recommended Datasets

X = primary dataset (strongly recommended)

o = optional (suggested)

Source Dataset Challenge 1 Challenge 2 Challenge 3 Challenge 4
CBP / CMC Water Quality X X X X
CBP / CMC Benthic X X o o
USGS Stream Flow o o o
USGS Pollution Yields and Loads o o
USGS Geology o o
NOAA Weather o
CBP Nutrient Point Source Database o o
CBP Land Cover (Under GIS Datasets) o o o
CBP Public Access Data (Under GIS Datasets) o
Chesapeake Conservancy Land Use X X
EPA Environmental Justice (EJ) Screen o
CDC Social Vulnerability Index o
US Census Demographic / Economic Data o
US Census County Boundary Maps X
USDA HUC12 Boundary Maps X X X o

Data Dictionaries

Understanding the Data

Geospatial Density

The Chesapeake Bay watershed spans Virginia, Maryland, Delaware, West Virginia, Pennsylvania, New York, and Washington, DC. CMC’s data has greater coverage in some states than in others; coverage depends largely on the activity level and participation of monitoring groups in each state. As of July 2020, CMC’s water quality database included samples from over 1,600 unique collection points (compared to 887 unique collection points in CBP’s database from 2005-2020).

[Figure: Geospatial Data]
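
Counts like the ones above can be reproduced by grouping samples and counting distinct stations. The following is a minimal sketch, not the code used in this repo: the field names (`source`, `station_id`, `state`) and the sample records are assumptions for illustration, and the real column names in the downloaded datasets may differ.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are assumptions,
# not the real column names in the CMC or CBP exports.
samples = [
    {"source": "CMC", "station_id": "CMC-001", "state": "MD"},
    {"source": "CMC", "station_id": "CMC-001", "state": "MD"},
    {"source": "CMC", "station_id": "CMC-002", "state": "VA"},
    {"source": "CBP", "station_id": "CBP-A", "state": "MD"},
]


def unique_points_by(records, *keys):
    """Count distinct station IDs for each combination of the given keys."""
    stations = defaultdict(set)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        stations[group].add(rec["station_id"])
    return {group: len(ids) for group, ids in stations.items()}


# Unique collection points per source, and per source and state.
by_source = unique_points_by(samples, "source")
by_source_state = unique_points_by(samples, "source", "state")
print(by_source)
print(by_source_state)
```

Repeated samples at the same station count once, so the result reflects collection points rather than sample volume.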

Temporal Density

CMC’s water quality data goes back as far as 1992, with the majority of its data collected after 2017.

[Figure: Temporal Data]
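
One way to check temporal density yourself is to count samples per year and compute the share collected after a given year. This is a small sketch under stated assumptions: the dates below are made up for illustration, and in practice they would be parsed from the date column of the downloaded data.

```python
from collections import Counter
from datetime import date

# Made-up sample dates for illustration only.
sample_dates = [
    date(1992, 4, 3),
    date(2017, 6, 1),
    date(2018, 5, 20),
    date(2018, 9, 12),
    date(2019, 7, 4),
]

# Number of samples collected in each calendar year.
per_year = Counter(d.year for d in sample_dates)

# Fraction of all samples collected after 2017 (that is, 2018 onward).
share_after_2017 = sum(n for year, n in per_year.items() if year > 2017) / len(sample_dates)

for year in sorted(per_year):
    print(year, per_year[year])
print(f"Share after 2017: {share_after_2017:.0%}")
```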

What is a Data Gap?

Participants exploring CMC’s data for the first time will notice that data collection varies widely across time, space, and water quality parameter. Data sparsity is a reality of many environmental datasets. To help you plan your analysis, each challenge recommends focusing on either a geographic area or one of a few parameters that can be compared across space and time.

To tell a story across space, we need data that effectively covers a region with samples to represent the condition of the area.

  • It is important to represent different habitats and land uses, such as forests, agricultural fields, and urban settings.
  • Other habitat contrasts may also be worth representing, such as headwater streams versus lowland streams, or ponds and reservoirs versus rivers and bays. Comparing land use and land cover categories in this way helps show where data is abundant and where gaps exist.

To tell a story across time, we need data collected seasonally and annually so that change over time can be evaluated. Trend analyses tend to require at least 4 years of data, and records spanning 10 or more years are especially valuable.

  • CMC is most interested in locations that have collected data in the last 5 years. (Locations with robust historic data that have not collected new data in the last 5 years would be considered a gap.)
  • The ideal sampling rhythm is 1x per month for water quality and 2x per year for benthic observations.
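
The temporal criteria above can be turned into simple per-station checks. The sketch below is one possible reading, not an official definition: it treats a station as a gap if it has no samples in roughly the last 5 years, and it treats "at least 4 years of data" as samples in at least 4 distinct calendar years. The function name `summarize_station` and its parameters are hypothetical.

```python
from datetime import date, timedelta


def summarize_station(sample_dates, as_of, recent_years=5, min_trend_years=4):
    """Summarize one station's sampling record against the criteria above.

    Assumptions: "recent" is approximated as 365.25 days per year before
    `as_of`, and trend support means samples in enough distinct calendar years.
    """
    cutoff = as_of - timedelta(days=round(365.25 * recent_years))
    years = {d.year for d in sample_dates}
    months = {(d.year, d.month) for d in sample_dates}
    return {
        # No sample on or after the cutoff means the station counts as a gap.
        "is_gap": not any(d >= cutoff for d in sample_dates),
        "supports_trend": len(years) >= min_trend_years,
        # Distinct year-month pairs, to compare against monthly sampling.
        "months_sampled": len(months),
    }


# Made-up records: one station sampled monthly from 2005 to 2009,
# and one station with two recent samples.
historic = [date(y, m, 15) for y in range(2005, 2010) for m in range(1, 13)]
recent = [date(2019, 6, 1), date(2020, 6, 1)]
as_of = date(2020, 7, 1)

historic_summary = summarize_station(historic, as_of)
recent_summary = summarize_station(recent, as_of)
print(historic_summary)
print(recent_summary)
```

In this example the historic station has enough years for a trend but is still a gap, while the recent station is active but too short for trend analysis.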

Selecting Locations

If you select a challenge that recommends picking a specific part of the watershed to focus your analysis on, consider some ways that the watershed could be separated geographically:

  • Hydrologic Unit Code (HUC): HUCs are a specific type of boundary for bodies of water that range in detail from HUC-2 (2-digit HUC) to HUC-12 (12-digit HUC). For environmental analyses, evaluating parts of the watershed by HUC-12 is recommended (e.g., for Challenges 1, 2, and 3).
  • County / Municipality: Using administrative boundaries makes sense when comparing environmental data to social and demographic data (e.g., for Challenge 4).
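
Because HUCs are nested, a coarser unit can be derived from a HUC-12 code by taking its leading digits. The sketch below assumes codes are handled as strings; if a code was read as an integer, its leading zero (Chesapeake Bay codes fall in region 02) is lost and has to be restored. The helper names and the example code `020700100103` are illustrative only, not taken from the datasets.

```python
# Standard HUC levels, expressed as number of digits.
HUC_LEVELS = (2, 4, 6, 8, 10, 12)


def normalize_huc12(value):
    """Return a 12-character HUC string, restoring a lost leading zero."""
    text = str(value).strip()
    # An 11-digit value is assumed to be a HUC-12 that lost its leading zero.
    if not text.isdigit() or len(text) not in (11, 12):
        raise ValueError(f"not a HUC-12 code: {value!r}")
    return text.zfill(12)


def huc_parent(huc12, digits):
    """Return the enclosing HUC at the requested level (2 to 12 digits)."""
    if digits not in HUC_LEVELS:
        raise ValueError(f"unsupported HUC level: {digits}")
    return normalize_huc12(huc12)[:digits]


# Illustrative code only; look up real HUC-12 codes in the boundary dataset.
example = 20700100103  # read as an integer, so the leading zero is missing
print(normalize_huc12(example))
print(huc_parent(example, 8))
print(huc_parent(example, 2))
```

Grouping stations by `huc_parent(code, 8)` instead of the full HUC-12 is one way to trade spatial detail for more samples per unit.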

Code

This repo contains notebooks with the code used to join the raw exported datasets from CMC and CBP and to add HUC12 and FIPS codes.
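
For orientation, the general idea of joining two exports with different headers can be sketched as follows. This is not the code from the notebooks: the column mappings, the shared schema, and the sample rows are all hypothetical, and the real exports may use different names. Refer to the notebooks for the actual steps, including how HUC12 and FIPS codes are assigned.

```python
# Hypothetical column mappings from each raw export to a shared schema.
CMC_COLUMNS = {
    "StationCode": "station_id",
    "Date": "sample_date",
    "ParameterName": "parameter",
    "MeasureValue": "value",
}
CBP_COLUMNS = {
    "Station": "station_id",
    "SampleDate": "sample_date",
    "Parameter": "parameter",
    "MeasureValue": "value",
}


def harmonize(rows, column_map, source):
    """Rename columns to the shared schema and tag each row with its source."""
    return [
        {**{new: row[old] for old, new in column_map.items()}, "source": source}
        for row in rows
    ]


# Made-up rows standing in for the raw exports.
cmc_rows = [{"StationCode": "X1", "Date": "2019-05-01", "ParameterName": "pH", "MeasureValue": 7.4}]
cbp_rows = [{"Station": "CB1.1", "SampleDate": "2019-05-02", "Parameter": "PH", "MeasureValue": 7.9}]

# Stack both sources into one list with identical keys.
combined = harmonize(cmc_rows, CMC_COLUMNS, "CMC") + harmonize(cbp_rows, CBP_COLUMNS, "CBP")
for row in combined:
    print(row)
```

Keeping a `source` field on every row makes it easy to compare CMC and CBP coverage after the datasets are combined.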

Questions?

If you have any questions about the data or information in this repo, contact Kate Dowdy (dowdy_katherine@bah.com). More resources for the Hack the Bay hackathon can be found here.