Here you will find the curriculum for the 11-week Data Science course at General Assembly.
Instructor: Anthony Erlinger
Teaching Assistants: Josh Schneier and Deepti Gottipati
Don't submit assignments here! The student repository for submitting assignments can be found at https://github.com/ga-students/DAT_20_Students.
Please use this form when submitting assignments: http://goo.gl/forms/YZWvn9MUlt
Course Description:
This course takes a practical approach to the knowledge and skills required to excel in the field of data science. Through case studies, real-world examples, and guest speakers, students will be exposed to the basics of data science, fundamental modeling techniques, and a variety of tools for making predictions and decisions from data. Students will gain practical computational experience by running machine learning algorithms and learning how to choose the models that best represent their data. Python is used throughout the course.
Prerequisites:
- Some experience with programming languages (preferably R or Python) and familiarity with the UNIX command-line interface.
- A laptop running OS X (Mac) or UNIX/Linux
Students are expected to complete approximately **10 hours of pre-work** before the course begins, as outlined in this [pre-work document](./Assignments/PreWork/ds_pre_work.pdf).
The typical structure of each session is 40% lecture and 60% exercises/labs.

Before the first class, you should:
- Install the Anaconda distribution of Python 2.7.x.
- Install Git and create a GitHub account.
- Once you receive an email invitation from Slack, join our "Data Science 20" team and add your photo!
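If you'd like to verify your setup before the first class, the following sanity check is a minimal sketch (it assumes only the packages bundled with Anaconda and is not part of the official pre-work):

```python
# Quick setup check: confirm Python 2.7.x and the core scientific
# packages from the Anaconda distribution are importable.
import sys
import numpy
import pandas
import sklearn

print("Python version: %s" % sys.version.split()[0])  # expect 2.7.x
print("numpy %s, pandas %s, scikit-learn %s" % (
    numpy.__version__, pandas.__version__, sklearn.__version__))
```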
The pre-model workflow: Mining and Representing Data (5 Lectures)
Tuesday | Thursday |
---|---|
(No Class) | Lecture 1 (3/12) Introduction |
Lecture 2 (3/17) Git and Python | Lecture 3 (3/19) Mining data from the web |
Lecture 4 (3/24) Statistics with Pandas and Numpy | Lecture 5 (3/26) Visualizing Data |
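As a taste of what this unit covers, here is a minimal pandas sketch of loading and summarizing a dataset (the filename `data.csv` is a placeholder, not a course dataset):

```python
# Load a CSV file and compute basic summary statistics with pandas.
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder: any CSV with a header row
print(df.head())              # first five rows
print(df.describe())          # count, mean, std, quartiles for numeric columns
```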
Learning from Data: Building Predictive Models (11 Lectures)
Tuesday | Thursday |
---|---|
Lecture 6 (3/31) Machine Learning With SKLearn | Lecture 7 (4/2) Linear Regression |
Lecture 8 (4/7) Polynomial Regression and the Problem of Overfitting | Lecture 9 (4/9) Logistic Regression |
Lecture 10 (4/14) Text Analysis with Naive Bayes and a Brief Intro to Natural Language Processing | Lecture 11 (4/16) Model Evaluation and Cross-Validation Strategies |
Lecture 12 (4/21) Decision Trees | Lecture 13 (4/23) Support Vector Machines and the basics of the kernel space |
Lecture 14 (4/28) PCA and dimensionality reduction | Lecture 15 (4/30) K-means clustering and KNN |
Lecture 16 (5/5) Ensemble Learning and Random Forests | (see next unit) |
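Most of this unit follows scikit-learn's fit/predict pattern. The sketch below is one illustrative example (using the library's built-in iris dataset and a KNN classifier), not an assigned exercise:

```python
# The recurring scikit-learn pattern: instantiate a model, fit it to
# data, then use it to predict. Shown here with KNN on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=5)  # classify by 5 nearest neighbors
knn.fit(X, y)                              # learn from all 150 samples
print(knn.predict(X[:3]))                  # predicted classes for 3 rows
```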
Intro to Data Engineering: Processing Data at Scale (6 Lectures)
Tuesday | Thursday |
---|---|
(see previous unit) | Lecture 17 (5/7) Querying Data in Relational Databases |
Lecture 18 (5/12) Recommender Systems and Network Analysis | Lecture 19 (5/14) Processing Data at Scale Using MapReduce |
Lecture 20 (5/19) Working Session for Final Project | Lecture 21 (5/21) Open Session: Speaker or Course Review |
Lecture 22 (5/26) Project Presentations | |
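For a rough intuition of the MapReduce model covered in Lecture 19, here is a toy word count in plain Python (a conceptual sketch only; real MapReduce jobs distribute these steps across machines):

```python
# Toy word count illustrating the map-reduce idea:
# the "map" step emits (word, 1) pairs, the "reduce" step sums per word.
from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit a (key, value) pair for every word occurrence.
pairs = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle/reduce: group pairs by key and sum the values.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # e.g. {'the': 3, 'quick': 2, 'dog': 2, ...}
```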
#### Each lesson in this curriculum contains the following:
- Agenda
- Slides
- In-Class Exercises (which can include code)
- Additional Resources
The final project should represent significant original work applying data science techniques to an interesting problem. Final projects are individual efforts, but you should be talking frequently with your instructors and classmates about them.
Address a data-related problem in your professional field or a field you're interested in. Pick a subject that you're passionate about; if you're strongly interested in the subject matter it'll be more fun for you and you'll produce a better project!
To stimulate your thinking, an extensive list of public data sources is included below. Using public data is the most common choice. If you have access to private data, that's also an option, though you'll have to be careful about what results you can release. You are also welcome to compete in a Kaggle competition as your project, in which case the data will be provided to you.
You should also take a look at past projects from other GA Data Science students, to get a sense of the variety and scope of projects.
You will be assigned to review the project drafts of two of your peers, and you will have one week to provide them with feedback. Upload your feedback as a Markdown (or plain text) document to the "reviews" folder of DAT_20. If your last name is Smith and you are reviewing Jones, name your file smith_reviews_jones.md.
Expectations:
- Read everything they wrote!
- If they provided their data, review it and try to understand it.
- Read their code and try to understand their thought process.
- If their code can be run, try running it.
- Spend at least one hour reviewing their project (including the time it takes to write the feedback).
Your feedback would ideally consist of:
- Strengths of their project (things you particularly like about it)
- Comments about things you think could be improved
- Questions about things you don't understand
- Comments about their code
- Suggestions for next steps
- Guiding principle: Give feedback that would be helpful to you if it were your project!
You should take a quick glance through their project as soon as possible, to make sure you understand what they have given you and what files you should be reviewing. If you're unclear, ask them about it!
You are responsible for creating a project paper and a project presentation. The paper should be written with a technical audience in mind, while the presentation should target a more general audience. You will deliver your presentation (including slides) during the final week of class, though you are also encouraged to present it to other audiences.
Here are the components you should aim to cover in your paper:
- Problem statement and hypothesis
- Description of your data set and how it was obtained
- Description of any pre-processing steps you took
- What you learned from exploring the data, including visualizations
- How you chose which features to use in your analysis
- Details of your modeling process, including how you selected your models and validated them (see the validation sketch below)
- Your challenges and successes
- Possible extensions or business applications of your project
- Conclusions and key learnings
Your presentation should cover these components with less breadth and less depth. Focus on creating an engaging, clear, and informative presentation that tells the story of your project.
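For the validation step, one common approach (covered in Lecture 11) is k-fold cross-validation. The sketch below uses scikit-learn's built-in iris data and logistic regression purely as placeholders; substitute your own model and dataset:

```python
# 5-fold cross-validation: train on 4/5 of the data, score on the
# remaining 1/5, and rotate, averaging the five accuracy scores.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

iris = load_iris()
model = LogisticRegression()

scores = cross_val_score(model, iris.data, iris.target, cv=5)
print(scores.mean())  # average accuracy across the 5 folds
```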
You should create a GitHub repository for your project that contains the following:
- Project paper: any format (PDF, Markdown, etc.)
- Presentation slides: any format (PDF, PowerPoint, Google Slides, etc.)
- Code: commented Python scripts, and any other code you used in the project
- Data: data files in "raw" or "processed" format
- Data dictionary (aka "code book"): description of each variable, including units
If it's not possible or practical to include your entire dataset, you should link to your data source and provide a sample of the data. (GitHub has a size limit of 100 MB per file and 1 GB per repository.) If your data is private, you can either include an "anonymized" version of your data or create a private GitHub repository.
See the Resources folder.
- Python
  - Anaconda Python Distribution
  - Learn Python in X Minutes
  - Learn Python the Hard Way
  - Learn Python (interactive)
  - Google's Python Class
  - The Python Tutorial
  - IPython
  - The Python Package Index
  - SciPy
  - NumPy
  - Matplotlib
  - pyvideo.org
- Wolfram Alpha
- Jake Hofman Data Links
- Peter Skomoroch (Linkedin) Data Links
- Hilary Mason (bitly) Data Links
- Wikipedia Database
- IMDB Data
- Last.fm Database
- Quandl
- Datamob
- Factual
- Metro Boston Data Common
- Census.gov
- Data.gov
- Dataverse Network
- Infochimps
- Linked Data
- Guardian DataBlog
- Data Market
- Reddit Open Data
- Climate Data Sources
- Climate Station Records
- CDC Data
- World Bank Catalog
- Free SVG Maps
- Office for National Statistics
- StateMaster
- Open data catalogs from various governments and NGOs:
  - NYC Open Data
  - DC Open Data Catalog / OpenDataDC
  - DataLA
  - data.gov (see also: Project Open Data Dashboard)
  - data.gov.uk
  - US Census Bureau
  - World Bank Open Data
  - Humanitarian Data Exchange
  - Sunlight Foundation: government-focused data
  - ProPublica Data Store
- Datasets hosted by academic institutions:
  - UC Irvine Machine Learning Repository: datasets specifically designed for machine learning
  - Stanford Large Network Dataset Collection: graph data
  - Inter-university Consortium for Political and Social Research
  - Pittsburgh Science of Learning Center's DataShop
  - Academic Torrents: distributed network for sharing large research datasets
- Datasets hosted by private companies:
  - Quandl: over 10 million financial, economic, and social datasets
  - Amazon Web Services Public Data Sets
  - Kaggle provides datasets with their challenges, but each competition has its own rules as to whether the data can be used outside of the scope of the competition.
- Big lists of datasets:
  - Rdatasets: collection of 700+ datasets originally distributed with R packages
  - RDataMining.com
  - KDnuggets
  - inside-R
  - 100+ Interesting Data Sets for Statistics
  - 20 Free Big Data Sources
- APIs:
  - Apigee: explore dozens of popular APIs
  - Python APIs: Python wrappers for many APIs
- Other interesting datasets:
  - FiveThirtyEight: data and code related to their articles
  - Donors Choose: data related to their projects
  - 200,000+ Jeopardy questions
- Other resources:
  - Datasets subreddit: ask for help finding a specific data set, or post your own
  - Center for Data Innovation: blog posts about interesting, recently-released data sets
This is just the tip of the iceberg; there's a lot of data out there!