Introduction to Data Science for Public Policy is a survey course on the fundamentals of data science. The course focuses on evaluating and analyzing public policy, and on telling stories with data to make compelling, fact-based arguments.
The objective of the course is to equip students with the skills to tell stories with data and drive action. Public policy is part of a large and sprawling social system. Parsing causality from a system of variables where everything is related requires a scalpel. This refined approach can be assembled from pre-written code and routines; but it still requires skilled assembly. We will teach an approach that leverages analytical routines that have already been written. The value of this course is in the mortar, not the bricks.
Jeff Chen is the Deputy Chief Data Officer of the U.S. Department of Commerce. He has led wide-ranging initiatives across 30+ fields, from emergency services to international public health to legal affairs to trade economy. Jeff previously served as the Director of Analytics at the NYC Fire Department, leading development of fire prediction algorithms; held senior data roles in the NYC Mayor’s Office during the Bloomberg Administration, focusing on city operations and health + human services; and advised governments, corporations, and non-profits on applied data for strategy and operations.
Dan Hammer is currently a Senior Policy Advisor at the White House, where he works with the U.S. Chief Technology Officer and U.S. Chief Data Scientist on the public finance of data infrastructure. He was previously the Chief Data Scientist at two environmental non-profits. He cofounded Global Forest Watch, a web application to monitor forests from satellite imagery. He is a Fellow at the Berkeley Institute for Data Science and a PhD candidate in environmental economics at UC Berkeley.
Prior to their current positions, Jeff and Dan worked together as White House Presidential Innovation Fellows at NASA.
Classes will be held on Mondays from 6:30pm to 9:00pm in Reiss 283.
- January: 11 (Wednesday), 23 (Monday), 30
- February: 6, 13, 27
- March: 13, 20, 27
- April: 3, 10, 24
- May: 1
Students are expected to sign up for a GitHub account (https://github.com). Readings and materials will be available from the class GitHub repository (https://github.com/GeorgetownMcCourt/data-science).
Students will be evaluated on the basis of five problem sets (60%) and one final project (40%). Late problem sets will be penalized by 10% per day late. All problem sets will be submitted electronically. The final project will be due on Monday, May 8th. As this class is quite hands-on, students are expected to bring their computers to class to take part in computational activities.
Data science is dependent on sound application of computer programming, mathematics/statistics, and communication. This course is thus organized into three units that dive into the fundamentals. Particular emphasis is placed on skilled assembly of empirical ideas, drawing from standard and non-standard data.
Data science is about designing and building data products that derive insight. This first section will focus on developing fundamental skills required to build effective products.
The objective of the first lecture is to overcome the coefficient of static friction in using R for data science. Students will learn to execute simple R scripts to read, write, and extract data elements.
Lecture objectives
- Data science: What is it? What is the lay of the land?
- Languages of data science
- Basics of R programming
- Read data from CSV and JSON
- Data types and classes, including matrix, data.frame, list, and vectors
- Extracting rows, columns, and specific elements from a data frame
- Basic operations (e.g., sum, mean) on rows; useful as consistency checks.
- Write data to CSV and JSON
- Getting started with GitHub
Example application
- Graphing photovoltaic energy data from the National Institute of Standards and Technology's Net Zero Energy Residential Test Facility
The objective of this lecture is to present the most important and fundamental elements of data manipulation. These core operations include sort, merge, reshape, and collapse. We will also present loops through multiple rows or columns, and other alternatives to operate on partitions of data frames.
Lecture objectives
- Sound data manipulation as the basis of good data science
- Sort data based on column values
- Subset data frames
- Reshape data table, wide <--> long
- Merge data frames
- Collapse data frames
- Text processing: capitalization, substring, regex
- Looping through basic operations (bonus: same idea without loops)
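The core operations named above (sort, merge, collapse, reshape) can be sketched in a few lines. The course itself works in R; this is only a minimal illustration in plain Python, and the records, column names, and values are hypothetical, made up for the example.

```python
from collections import defaultdict

# Hypothetical toy records (made-up values): yearly spending by state.
spending = [
    {"state": "VA", "year": 2015, "spend": 12.0},
    {"state": "MD", "year": 2015, "spend": 9.5},
    {"state": "VA", "year": 2016, "spend": 13.0},
    {"state": "MD", "year": 2016, "spend": 10.5},
]
regions = [
    {"state": "VA", "region": "South"},
    {"state": "MD", "region": "South"},
]

# Sort: order rows by state, then by year.
sorted_rows = sorted(spending, key=lambda r: (r["state"], r["year"]))

# Merge: left join each spending row to its region on the shared "state" key.
region_by_state = {r["state"]: r["region"] for r in regions}
merged = [{**r, "region": region_by_state.get(r["state"])} for r in sorted_rows]

# Collapse: aggregate to one total per state.
totals = defaultdict(float)
for r in merged:
    totals[r["state"]] += r["spend"]
collapsed = dict(totals)

# Reshape long -> wide: one entry per state, one column per year.
wide = defaultdict(dict)
for r in merged:
    wide[r["state"]][r["year"]] = r["spend"]
wide = dict(wide)

print(collapsed)
print(wide)
```

The collapse and reshape steps both loop over rows explicitly; the "same idea without loops" is what vectorized data frame operations provide.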
The objective of this lecture is to handle missing values appropriately and script visual checks to find errors introduced in data input/output. We will also begin to look at computational optimization techniques, such as taking advantage of multiple cores for heavy-duty operations (parallel processing).
Lecture objectives
- Understanding data structures
- Statistical measures
- Graph and visual analytics
Example application
- Finding health coverage patterns using the US Census American Community Survey
- Analyzing missing values in weather anomaly data from 1880 to present using the National Oceanic and Atmospheric Administration's GHCN-M
Building upon basic data manipulation and high level analytical tasks, this session will focus on programming paradigms that are commonly relied upon when practicing data science.
Lecture objectives
- Custom functions for consistency and efficiency
- Control structures: Loops, if statements
- Suitable practices
Example applications
- Smoothing conventional gasoline time series data from the Energy Information Administration (EIA)
Homework Assignment
- TBA
The use case drives the technique. In public policy, data can be used to support evaluation of programs to understand causal mechanisms (e.g., retrospective focus) or enable the creation of data-rooted products that drive action (e.g., deployed applications). Machine learning and data analysis enable both uses of data and will be the focus of the next five courses.
Formal statistics offers methods to calculate closed-form, analytical answers to the limits of OLS regression. Data science offers a more immediate and arguably more accessible solution: simulate conditions and examine the outcomes. We begin to apply the visualization techniques taught in a previous lecture to analysis.
Lecture objectives
- Simulating OLS and identifying p-values
- For-loops versus apply for simulations
- Visualizing distributions with ggplot
Example application
- Schooling outcomes data
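The "simulate conditions and examine the outcomes" idea can be sketched as follows. The course works in R; this is a hedged illustration in plain Python with assumed settings (50 observations, 2,000 simulations, standard normal data, and an approximate critical value), none of which come from the course materials. Data are generated with a true slope of zero, so about 5% of simulated t-statistics should exceed the 5% critical value by chance alone.

```python
import random
import statistics


def ols_slope_and_t(x, y):
    """Fit y = a + b*x by ordinary least squares; return (b, t-statistic of b)."""
    n = len(x)
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    return slope, slope / std_err


random.seed(1)              # fixed seed so the run is reproducible
N_OBS, N_SIMS = 50, 2000    # assumed sizes, chosen for illustration
T_CRIT = 2.01               # approximate two-sided 5% critical value, 48 df

t_stats = []
for _ in range(N_SIMS):
    x = [random.gauss(0, 1) for _ in range(N_OBS)]
    y = [random.gauss(0, 1) for _ in range(N_OBS)]  # y is unrelated to x
    _, t_stat = ols_slope_and_t(x, y)
    t_stats.append(t_stat)

# Share of simulations that would be called "significant" despite no effect.
rejection_rate = sum(abs(t) > T_CRIT for t in t_stats) / N_SIMS
print(f"Share of simulations with |t| > {T_CRIT}: {rejection_rate:.3f}")
```

Plotting a histogram of `t_stats` gives the visual counterpart: the simulated distribution approximates the theoretical t distribution without deriving it.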
Supervised learning is the most relied-upon class of techniques, enabling both causal inference and deployed precision policy. How does changing one variable independently impact another variable? We begin to introduce basic regression analysis, correlation coefficients, ordinary least squares, and the relationship between the concepts. Note that this is a very cursory review, and the deep assumptions are not tested or expounded upon.
Lecture objectives
- What is supervised learning?
- Structure of a supervised learning project: target variables, input variables, objective function and evaluation measures, model experiment design, cross validation versus train/validate/test, regression versus classifiers
- Ordinary Least Squares (OLS)
- K-Nearest Neighbors (kNN)
Example application
- Prediction of missing values in satellite imagery using kNN
Homework Assignment
- Lec 6: Satellite imagery for predicting employment
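As a rough illustration of how kNN can fill in a missing value, the sketch below predicts a missing pixel as the average of its k nearest observed neighbors. It is written in plain Python rather than the R used in class, and the pixel coordinates, brightness values, and the `knn_predict` helper are all hypothetical.

```python
import math


def knn_predict(train, query, k=3):
    """Average the target values of the k training points nearest to `query`.

    `train` is a list of (coordinates, target) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical observed pixels: ((row, col), brightness). Values are made up.
pixels = [
    ((0, 0), 10.0),
    ((0, 1), 12.0),
    ((1, 0), 11.0),
    ((5, 5), 50.0),
    ((5, 6), 52.0),
]

# Predict the missing pixel at (1, 1) from its three nearest neighbors.
filled = knn_predict(pixels, (1, 1), k=3)
print(filled)
```

The choice of k trades off noise against locality: a small k follows nearby pixels closely, while a large k averages over pixels that may belong to a different region.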
Classification models are one of the workhorses of data science. Classifiers enable data-driven applications such as risk scoring, lawsuit outcome prediction, marketing lead generation, facial detection and computer vision, and spam filtering, among other use cases. This session will focus on the fundamentals of classification models, types of models, and daily applications.
Lecture objectives
- Three common problems using classifiers
- Structure of a classification project: target variables, input variables, objective function and evaluation measures, model experiment design, cross validation versus train/validate/test, confusion matrix, TPR, TNR, AUC
- Framing dataset
- Models: decision trees, logistic regression, k-nearest neighbors. For each: statistical assumptions and mechanics, risks/strengths, implementation, non-technical explanation
- Appropriate uses of classification techniques: scoring, prediction and prioritization, propensity score matching
Example application
- Healthcare insurance coverage data
Homework Assignments
- Lec 7: Predict activity using smartphone accelerometer data (due Lec 8).
- Lec 8: Hand out class project instructions, one page proposal of what you'll do due by Lec 9.
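The evaluation measures listed in the objectives (confusion matrix, TPR, TNR) reduce to simple counts. The sketch below uses plain Python and made-up labels; the function name and data are assumptions for illustration, not course code.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, TN, FN) for binary labels coded as 1 (positive) and 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return tp, fp, tn, fn


# Hypothetical labels: 1 = has coverage, 0 = no coverage.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 0, 1]

tp, fp, tn, fn = confusion_counts(actual, predicted)
tpr = tp / (tp + fn)  # true positive rate (sensitivity)
tnr = tn / (tn + fp)  # true negative rate (specificity)
print(tp, fp, tn, fn, tpr, tnr)
```

Reporting TPR and TNR separately matters when classes are imbalanced, since overall accuracy can look high even if one class is rarely predicted correctly.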
No, this is not an independent study session. Unsupervised learning techniques such as clustering and principal components analysis help to identify recognizable patterns when no labels are provided. In sales and recruitment offices, customer segmentation may use current customer data, then use clustering techniques to identify k-number of distinct customer profiles. In resourceful law firms, data scientists may develop topic modeling algorithms to automatically tag and cluster hundreds of thousands of documents for improved search. This session will focus on clustering methodologies that are commonly employed in applied research.
Lecture objectives
- Three common problems using unsupervised learning
- Structure of an unsupervised learning project: input variables, optimization methods
- Framing dataset
- Models: k-means clustering, Principal Components Analysis (PCA)/dimensionality reduction, hierarchical clustering (if time permits). For each: statistical assumptions and mechanics, risks/strengths, implementation, sanity checks, non-technical explanation
- Appropriate uses of k-means and PCA
Example application
- Univariate clustering application: k-means
- Multivariate clustering application: Customer segmentation using Census American Community Survey
Homework Assignment
- Lec 9: Write prototypical functions that will help you do your project. Due Lec 10.
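For the univariate k-means application, the mechanics can be sketched in a few lines: assign each value to its nearest center, move each center to the mean of its members, and repeat until nothing changes. This is a minimal plain-Python illustration with hypothetical data and hand-picked starting centers; real implementations typically use random or smarter initialization.

```python
def nearest_center(value, centers):
    """Index of the center closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(value - centers[i]))


def kmeans_1d(values, initial_centers, max_iter=100):
    """One-dimensional k-means; returns (centers, cluster label per value)."""
    centers = list(initial_centers)
    for _ in range(max_iter):
        labels = [nearest_center(v, centers) for v in values]
        new_centers = []
        for i, old in enumerate(centers):
            members = [v for v, lab in zip(values, labels) if lab == i]
            # Keep the old center if no value was assigned to it.
            new_centers.append(sum(members) / len(members) if members else old)
        if new_centers == centers:  # converged: assignments are stable
            break
        centers = new_centers
    labels = [nearest_center(v, centers) for v in values]
    return centers, labels


# Hypothetical values with two visible groups; starting centers are arbitrary.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, labels = kmeans_1d(values, initial_centers=[0.0, 5.0])
print(centers, labels)
```

A useful sanity check is to rerun with different starting centers: k-means only finds a local optimum, so unstable results suggest the chosen k does not fit the data.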
Beyond the data preparation and modeling, the ‘presentation layer’ is the glue that will allow a data science project to stick with target audiences. Often, presentation is graphical and relies upon a rich ecosystem of visualization, web services, and interactive applications to communicate pertinent issues.
Knowing how to develop models is not enough. Often there is a need to extract data from databases as well as develop a web-based presentation to demonstrate results. This lesson focuses on the beginning and end of the data analytics process: the data extraction process typically relies on Structured Query Language (SQL) to make requests from databases, and the results of statistical models can be presented in websites using HyperText Markup Language (HTML) and styled using Cascading Style Sheets (CSS).
Lecture objectives
- Understand how to write SQL queries
- Clean and join two or more datasets using SQL
- Understand the underlying architecture of websites and how to build a static webpage
Example application
- Entity resolution for individuals and organizations monitored by security agencies
- Building a basic website
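Cleaning and joining two datasets with SQL might look like the sketch below. It runs the query against an in-memory SQLite database from Python's standard library so that it is self-contained; the table names, columns, and rows are hypothetical, and the database used in class may differ.

```python
import sqlite3

# In-memory database with two hypothetical tables (all rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE agencies (agency_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grants (grant_id INTEGER PRIMARY KEY, agency_id INTEGER, amount REAL);
    INSERT INTO agencies VALUES (1, ' commerce '), (2, 'NASA');
    INSERT INTO grants VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0);
    """
)

# Clean the name (trim whitespace, standardize case), join on the shared key,
# and total the grant amounts per agency.
query = """
SELECT UPPER(TRIM(a.name)) AS agency, SUM(g.amount) AS total
FROM agencies AS a
JOIN grants AS g ON g.agency_id = a.agency_id
GROUP BY a.agency_id, a.name
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)
```

Standardizing names before matching is also the first step of entity resolution, where the same organization may appear under several spellings.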
The objective of this section is to introduce spatial analysis and web service APIs in R. The auxiliary objectives include learning basic web mapping through Carto and practicing some classification techniques. We will focus on two applications – farmers’ markets and wind turbines in the United States.
Example application
- Extracting elevation data from the Google Elevation API
- Identifying the characteristics of farmers’ markets in the Southwest United States.
This class provides an overview of two unrelated topics. We start with an overview of cloud computing, specifically opportunities to leverage parallel processing to speed up computation. Particular emphasis is placed on parallelism and how it can be effectively applied. To provide context for the role of the data scientist in organizations, we will then explore the other actors who contribute to data teams and how those teams are organized.
Data science, statistics, and machine learning are agnostic of languages. Each programming language offers different advantages. R is well-designed for modeling and research, but not web application development. Python is well-suited for production-grade web applications. This lesson extends basic statistical computing into the realm of Python.
If you believe you have a disability, then you should contact the Academic Resource Center (arc@georgetown.edu) for further information. The Center is located in the Leavey Center, Suite 335 (202-687-8354). The Academic Resource Center is the campus office responsible for reviewing documentation provided by students with disabilities and for determining reasonable accommodations in accordance with the Americans with Disabilities Act (ADA) and University policies. For more information, go to http://academicsupport.georgetown.edu/disability/.
McCourt School students are expected to uphold the academic policies set forth by Georgetown University and the Graduate School of Arts and Sciences. Students should therefore familiarize themselves with all the rules, regulations, and procedures relevant to their pursuit of a Graduate School degree. The policies are located at: http://grad.georgetown.edu/academics/policies/
Georgetown University promotes respect for all religions. Any student who is unable to attend classes or to participate in any examination, presentation, or assignment on a given day because of the observance of a major religious holiday (see below) or related travel shall be excused and provided with the opportunity to make up, without unreasonable burden, any work that has been missed for this reason and shall not in any other way be penalized for the absence or rescheduled work. Students will remain responsible for all assigned work. Students should notify professors in writing at the beginning of the semester of religious observances that conflict with their classes. The Office of the Provost, in consultation with Campus Ministry and the Registrar, will publish, before classes begin for a given term, a list of major religious holidays likely to affect Georgetown students. The Provost and the Main Campus Executive Faculty encourage faculty to accommodate students whose bona fide religious observances in other ways impede normal participation in a course. Students who cannot be accommodated should discuss the matter with an advising dean.
Please know that as a faculty member I am committed to supporting survivors of sexual misconduct, including relationship violence, sexual harassment and sexual assault. However, university policy also requires me to report any disclosures about sexual misconduct to the Title IX Coordinator, whose role is to coordinate the University’s response to sexual misconduct.
Georgetown has a number of fully confidential professional resources who can provide support and assistance to survivors of sexual assault and other forms of sexual misconduct. These resources include:
Jen Schweer, MA, LPC
Associate Director
Health Education Services for Sexual Assault Response and Prevention
(202) 687-0323
jls242@georgetown.edu
Erica Shirley
Trauma Specialist
Counseling and Psychiatric Services (CAPS)
(202) 687-6985
els54@georgetown.edu
More information about campus resources and reporting sexual misconduct can be found at http://sexualassault.georgetown.edu.