This project's goal is to create an internal and public-facing resource (e.g. an interactive map) for people looking to find healthy, affordable food. Pittsburgh Food Policy Council, an umbrella organization for food-related nonprofits, is the project sponsor. More information about the need for this project can be found here.
There are many food-related nonprofits in the Pittsburgh area, and each maintains datasets about different food access programs and where they are offered (for example, Greater Pittsburgh Community Food Bank maintains a list of food pantries). The data processing part of this project gathers data from various sources and merges the datasets into a common format.
The map is located at the following address: https://codeforpittsburgh.github.io/FoodAccessMap/

Code for the map is located in a different repo: https://github.com/CodeForPittsburgh/CodeForPittsburgh.github.io/tree/master/FoodAccessMap
The map relies on the following steps to provide results:
- Raw data is manually gathered from various providers at the federal and local level and saved in the GitHub repository.
- A GitHub Action spins up a virtual machine containing the various scripts, which then clean, transform, deduplicate, and collate the multiple data sources into a single file for use by the map.
- The map is hosted in another Code for Pittsburgh GitHub repo.
Volunteers can help in a number of ways, including developing code, fixing bugs, and improving project documentation. A list of outstanding issues can be found on the issues page, but if you can't find an issue you think you can work on, don't hesitate to ask one of us for help figuring out how you can contribute!
- Python: Some of the data processing scripts are written in Python.
- R: Some of the data processing scripts are written in R.
There are multiple ways to access and manipulate the data, but for simplicity's sake, this README recommends either Python or R.
This project uses Python 3, pipenv, and pytest.
Required packages are listed in the `Pipfile` and can be installed using:

`$ pipenv install`

This installs the packages in a virtual environment, a Python convention that allows different projects to have different dependencies at different versions.

You can run a single command inside the virtual environment using `pipenv run`, or open a shell using:

`$ pipenv shell`
Tests are stored in the `tests/` directory and include any file of the form `test_*.py`. You can run them using:

`$ pytest`
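If you're new to pytest, a test file is just a Python file whose test functions start with `test_`. The sketch below is purely illustrative; the helper function in it is made up for this example, and real tests would import and exercise code from the project's data prep scripts instead.

```python
# tests/test_example.py -- a minimal, self-contained illustration of a pytest test file.
# The helper below is hypothetical; real tests import functions from the project's scripts.

def strip_whitespace(value: str) -> str:
    """Trim leading/trailing whitespace from a raw data field."""
    return value.strip()


def test_strip_whitespace_removes_padding():
    assert strip_whitespace("  123 Main St  ") == "123 Main St"


def test_strip_whitespace_leaves_clean_values_alone():
    assert strip_whitespace("Pittsburgh") == "Pittsburgh"
```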
When you're done with the virtual environment, you can leave it using:

`$ exit`
It is recommended to use the RStudio IDE to interact with the data.
- Download/Install R
- Download RStudio
- Start an RStudio Project (recommended)
- Install the `tidyverse` package with the following line of code (one-time action): `install.packages("tidyverse")`
- Start a new R script or RMarkdown document and read in the data with the following lines of code:
  - `library(tidyverse)`
  - `my_data <- read_csv("https://raw.githubusercontent.com/CodeForPittsburgh/food-access-map-data/master/merged_datasets.csv")`
- Once you've entered these lines of code, you have access to the data. You can use the various functions in base R or the tidyverse to explore it. For example, the command `names(my_data)` shows the attributes of the data table.
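If you would rather work in Python than R, the same file can be loaded with pandas from the same raw GitHub URL. This is a sketch, not an official workflow; it assumes pandas is available in your environment (for example via `pipenv install pandas`).

```python
# A minimal sketch of loading the merged dataset in Python instead of R.
# Assumes the pandas package is installed in your environment.
import pandas as pd

URL = (
    "https://raw.githubusercontent.com/CodeForPittsburgh/"
    "food-access-map-data/master/merged_datasets.csv"
)

my_data = pd.read_csv(URL)

# Rough equivalent of names(my_data) in R: list the columns of the data table.
print(my_data.columns.tolist())
```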
Data for the food access map:
- `merged_datasets.csv` is the most current version of compiled PFPC data.
- To regenerate `merged_datasets.csv` with new data, run the "Generate Merged Dataset" GitHub Action. This calls `data_prep_scripts/run.sh`, which runs the following scripts in order (a rough sketch of this orchestration appears below):
- auto_agg_clean_data.R -- Reads in previously prepared data sources and aggregates them to a single data frame.
- auto_text_process_name.R -- Assigns types (like Chain Grocery Store, Farmer's Market, etc.) to different addresses
- auto_geocode_wrapper.R -- Uses geocoding to obtain latitude and longitude coordinates for addresses without them
- auto_clean_addresses_wrapper.py -- Cleans up addresses to a standardized format
- auto_id_duplicates_wrapper.py -- Identifies duplicate rows
- auto_merge_duplicates_wrapper.py -- Merges duplicate rows, resolving conflicts on critical information by prioritizing some data sources
The end result of all of these steps is a new merged_datasets.csv, which the map points to for its data!
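To make the orchestration concrete, `data_prep_scripts/run.sh` essentially runs those six scripts one after another. The sketch below is a hypothetical Python rendering of that idea; the real file is a shell script, and the interpreter names and working directory here are assumptions about the environment.

```python
# Illustrative only: a Python rendering of what data_prep_scripts/run.sh does,
# i.e. run each prep/merge step in order and stop if any step fails.
# Interpreter names ("Rscript", "python") and paths are assumptions, not project config.
import subprocess

STEPS = [
    ["Rscript", "auto_agg_clean_data.R"],
    ["Rscript", "auto_text_process_name.R"],
    ["Rscript", "auto_geocode_wrapper.R"],
    ["python", "auto_clean_addresses_wrapper.py"],
    ["python", "auto_id_duplicates_wrapper.py"],
    ["python", "auto_merge_duplicates_wrapper.py"],
]

for step in STEPS:
    print("Running:", " ".join(step))
    subprocess.run(step, check=True)  # check=True stops the pipeline if a step fails

print("Done: merged_datasets.csv should be regenerated.")
```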
- Farmers Market Nutrition Program - Established by Congress in 1992 to provide fresh, unprepared, locally grown fruits and vegetables to WIC participants, and to expand the awareness, use of, and sales at farmers' markets
- Greater Pittsburgh Community Food Bank - Food bank for the Greater Pittsburgh Area
- Just Harvest - "Nonprofit organization that reduces hunger through sustainable, long-term approaches that lift up those in need"
- Pittsburgh Food Policy Council - "The mission of the Pittsburgh Food Policy Council is to build a food system that benefits our communities, economy, and environment in ways that are just, equitable and sustainable"
- USDA Food and Nutrition Service - Agency of the US Department of Agriculture responsible for administering the nation's domestic nutrition assistance programs
Sources are obtained and prepared for additional processing via our data prep scripts. The source rules for utilizing those scripts can be found here.
These labels are listed in `merged_datasets.csv` and are used to denote particular unique traits of each food source (see the filtering example after this list).
- SNAP - Whether the site accepts SNAP
- WIC - Whether the site accepts WIC
- FMNP - Whether the site accepts Farmers Market Nutrition Program (FMNP) benefits
- fresh_produce - Whether the site offers fresh produce
- food_bucks - Whether the site accepts food bucks
- free_distribution - Whether the site offers free food assistance
- open_to_spec_group - Whether the site is only open to special groups
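As a quick example of how these labels can be used, the sketch below filters `merged_datasets.csv` on a couple of the label columns. It assumes pandas is installed and that the labels are stored as 0/1 flags; check the actual file, since the encoding may differ.

```python
# Hedged example: filter merged_datasets.csv using the label columns listed above.
# Assumes pandas is installed and that labels are 0/1 flags -- verify against the CSV.
import pandas as pd

URL = (
    "https://raw.githubusercontent.com/CodeForPittsburgh/"
    "food-access-map-data/master/merged_datasets.csv"
)
data = pd.read_csv(URL)

# Sites that accept SNAP and also offer fresh produce.
snap_with_produce = data[(data["SNAP"] == 1) & (data["fresh_produce"] == 1)]
print(len(snap_with_produce), "sites accept SNAP and offer fresh produce")

# Sites offering free food assistance.
free_sites = data[data["free_distribution"] == 1]
print(len(free_sites), "sites offer free distribution")
```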
New datasets can be added as they are discovered. A prep script can be added to this directory.
New datasets need to correspond to the project-wide schema set in schema.xlsx. Cleaned data should be saved in .csv format here.
Any new prep script also needs to be added to the source_r_scripts.R or source_python_scripts.py files in the same prep_source_scripts directory. The source scripts control what prep scripts are run to update the full dataset.
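As a starting point, a new prep script typically reads the raw source, renames and reshapes its fields to match the columns defined in schema.xlsx, and writes out a cleaned .csv. The skeleton below is hypothetical: the input filename, the column mapping, and the output path are placeholders you would replace with the real source and the actual schema columns.

```python
# Hypothetical skeleton for a new prep script. The input file, column mapping,
# and output path are placeholders -- match them to schema.xlsx and the real source.
import pandas as pd

RAW_FILE = "new_source_raw.csv"        # placeholder: the newly discovered dataset
CLEAN_FILE = "new_source_cleaned.csv"  # placeholder: cleaned output in .csv format

raw = pd.read_csv(RAW_FILE)

# Map the source's column names onto the project-wide schema columns (placeholders).
cleaned = raw.rename(
    columns={
        "site_name": "name",
        "street_address": "address",
    }
)

# Record where the rows came from so downstream steps can prioritize sources.
cleaned["source"] = "New Source Name"

cleaned.to_csv(CLEAN_FILE, index=False)
```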
Because we are combining multiple datasets, we use a deduplication process to identify and resolve possible duplicates.
The "training set", which is used to train the deduplication program in what is and is not likely to be a duplicate, is located here. Adding to the training data primarily consists of added cases of address strings that are duplicates, as well as cases of address strings that aren't. To train new data, you can utilize the IPython Notebook located here.