Data and Databases

Columbia University, Lede Program

Tuesdays and Thursdays, May 24th 2016 through July 7th 2016, 10am

Office hours: By appointment only.

For FERPA reasons, I ask that you e-mail me at my Columbia address when discussing any matters related to this class or your grade. Personal or professional inquiries can go to my personal address.

Description

Consideration of both the scientific and social implications of counting, of turning the world into bits. Through the process of gaining fluency in the use of Python, students will spend some time thinking through representations of core "data types" like time, location, text, image, sound and relationships (or networks), and the computational "affordances" associated with each. Students will study several common metaphors for organizing and storing data – from structureless key-value stores, to a single table or spreadsheet, to the "multiple tables" of a relational database. We will also discuss ideas behind publishing or sharing data, moving from HTML documents and Web 1.0 to data services and APIs in Web 2.0, to semantics in Web 3.0. Student work and discussion will underscore the reality that data are plentiful and circulate and interact in a kind of informational ecosystem. As researchers, our students will be called on both to access and to publish data products.

Notes for previous versions of the course:

2014
2015

Homework assignments

There will be six homework assignments in this class, each assigned on Thursday and due the following Tuesday before the beginning of class. The homework assignments are designed to test and expand your knowledge of the technical concepts introduced in class. Each homework assignment is worth 10% of your grade.

With the exception of the first assignment, every homework assignment will take the form of an IPython Notebook that you fill in and send to a TA for grading. (We'll discuss the specifics of this in class.)

Grading

40% Attendance and participation
60% Homework assignments (10% each)

Schedule and notes

Week 1 (May 24 and 26)

Orientation
Student introductions
SQL basics

Homework #1 (due May 31): Read and respond to the following.

Relational and Non-Relational Models in the Entextualization of Bureaucracy by Michael Castelle
Literature is not Data: Against the Digital Humanities by Stephen Marche
Machine Bias by Julia Angwin, Jeff Larson, Surya Mattu and Lauren Kirchner

These essays each address the limits and consequences of data-driven analysis and public policy. Your response should take the form of a brief e-mail (no more than 3-5 paragraphs) sent to me. In your response, describe the critique of one or more of the essays and discuss how (if at all) you might incorporate their critique(s) into your practice as a journalist. Also in your e-mail, include and comment on a link to an essay or article that you feel "speaks to" the points raised in one or more of the essays (e.g., agrees with, provides a counterexample, expands upon, responds to).

Week 2 (May 31 and June 2)

SQL continued
IPython/Jupyter Notebooks: Basics, Running Code, Markdown tutorial
Installing Python Libraries (other notes TK)
Using SQL in Python
SQL and CSVs
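
As a rough illustration of the "Using SQL in Python" and "SQL and CSVs" topics, the sketch below loads CSV rows into a database and queries them with SQL. It assumes Python 3 and the standard-library sqlite3 and csv modules; the class notes may use a different database or library. The table name, column names, and sample data are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV data; in practice this would usually be read from a file.
CSV_TEXT = """name,borough,population
Alpha,Manhattan,120
Beta,Brooklyn,300
Gamma,Brooklyn,150
"""


def load_csv_into_sqlite(csv_text):
    """Create an in-memory SQLite database and fill it from CSV text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE places (name TEXT, borough TEXT, population INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["name"], r["borough"], int(r["population"])) for r in reader]
    # The ? placeholders keep the values separate from the SQL text.
    conn.executemany("INSERT INTO places VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def population_by_borough(conn):
    """Return (borough, total population) pairs, sorted by borough."""
    cursor = conn.execute(
        "SELECT borough, SUM(population) FROM places "
        "GROUP BY borough ORDER BY borough"
    )
    return cursor.fetchall()


conn = load_csv_into_sqlite(CSV_TEXT)
totals = population_by_borough(conn)
print(totals)  # [('Brooklyn', 450), ('Manhattan', 120)]
```

To work from a real file, the io.StringIO wrapper could be replaced with an open file handle passed to csv.DictReader.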

To install Jupyter Notebook on OS X:

sudo pip3 install jupyter

Depending on how you've installed Python on Windows, try one of the following:

pip3 install jupyter
py -3 -m pip install jupyter

More info here.

Homework #2 (due June 7): Working with SQL.

Week 3 (June 7 and 9)

Scraping HTML with Beautiful Soup

Homework #3 (due June 14): Web scraping.

Week 4 (June 14 and 16)

Working with unstructured data
List comprehensions (scroll down, needs to be translated to Python 3)
Strings and regular expressions (note: needs to be translated to Python 3!)
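
For a sense of how this week's two topics fit together, here is a minimal Python 3 sketch that uses a regular expression inside list comprehensions to pull dates out of unstructured text. The sample lines and the date pattern are hypothetical and are not taken from the class notes.

```python
import re

# Hypothetical lines of unstructured text.
lines = [
    "Meeting on 2016-05-24 at 10am",
    "No date here",
    "Homework due 2016-05-31, review on 2016-06-02",
]

# Matches dates written as YYYY-MM-DD, capturing each part in a group.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

# Filter: keep only the lines that contain at least one date.
dated_lines = [line for line in lines if DATE_PATTERN.search(line)]

# Nested comprehension: every match in every line becomes a
# (year, month, day) tuple of integers.
dates = [
    tuple(int(part) for part in match.groups())
    for line in lines
    for match in DATE_PATTERN.finditer(line)
]

print(len(dated_lines))  # 2
print(dates)  # [(2016, 5, 24), (2016, 5, 31), (2016, 6, 2)]
```

The pattern only checks the shape of a date, not whether it is a valid calendar date, so a string like 2016-99-99 would also match.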

Homework #4 (due June 21): List comprehensions and regular expressions

Week 5 (June 21 and 23)

HTML to SQL

Homework #5 (due June 28): SQL schema design

Week 6 (June 28 and 30)

Making a Flask app (the template files referenced can be found in the templates folder of this repository)

Homework #6: Web applications.

Week 7 (July 5 and 7)

The Twitter API (see completed lake_bot.py in this repo)
Homework review
If we have time: Intro to NLP with TextBlob

Extra credit homework assignment: Create and deploy a Twitter bot. The bot should either respond to updates on an external data source (like NYT 4th Down Bot or Congress Edits) or iterate through/randomly select data for presentation (like Census Americans). This extra credit assignment can make up for up to 5% of your final grade. Complete this assignment by July 12th. Send me a link to the Twitter bot and a zip file with the source code for the bot.
