
General Assembly Data Science (NYC 20)

Here you will find the curriculum for the 11 week Data Science course at GA.

Instructor: Anthony Erlinger
Teaching Assistants: Josh Schneier and Deepti Gottipati

Don't submit assignments here! The student repository for submitting assignments can be found at https://github.com/ga-students/DAT_20_Students.

Please use this form when submitting assignments: http://goo.gl/forms/YZWvn9MUlt

Course Description:

This course is a practical introduction to the knowledge and skills required to excel in the field of data science. Through case studies, real-world examples, and guest speakers, students will be exposed to the basics of data science, fundamental modeling techniques, and the tools used to make predictions and decisions from data. Students will gain hands-on computational experience by running machine learning algorithms and learning how to choose the models that best represent their data. Python is used throughout the course.

Prerequisites:

  • Some experience with a programming language (preferably Python or R) and familiarity with the command-line interface (UNIX).
  • A laptop running OS X (Mac) or UNIX/Linux.

Students are expected to complete approximately **10 hours of prework** before the course begins, as outlined in this [pre-work document](./Assignments/PreWork/ds_pre_work.pdf).

The typical structure of each session is 40% lecture and 60% exercises/labs.

Installation and Setup

  • Install the Anaconda distribution of Python 2.7.x
  • Install Git and create a GitHub account.
  • Once you receive an email invitation from Slack, join our "Data Science 20 team" and add your photo!
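
Once everything is installed, a quick sanity check helps catch setup problems before the first lecture. The short script below is a suggested check, not part of the official prework; it prints the version of each core package and flags anything missing:

```python
import sys

# Packages the course leans on; tweak the tuple if your setup differs.
CORE_PACKAGES = ("numpy", "pandas", "matplotlib", "sklearn")

def check_environment(packages=CORE_PACKAGES):
    """Return a dict mapping each package name to its version string,
    or None if the package cannot be imported."""
    versions = {}
    for name in packages:
        try:
            module = __import__(name)
            versions[name] = getattr(module, "__version__", "installed")
        except ImportError:
            versions[name] = None
    return versions

if __name__ == "__main__":
    print("Python %s" % sys.version.split()[0])
    for name, version in check_environment().items():
        print("%-12s %s" % (name, version or "MISSING -- try: conda install " + name))
```

Run it with the Anaconda `python` so you are checking the interpreter the course will actually use.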


SYLLABUS

Unit I

The Pre-Model Workflow: Mining and Representing Data (5 Lectures)

| Tuesday | Thursday |
| --- | --- |
| (No Class) | Lecture 1 (3/12): Introduction |
| Lecture 2 (3/17): Git and Python | Lecture 3 (3/19): Mining Data from the Web |
| Lecture 4 (3/24): Statistics with Pandas and NumPy | Lecture 5 (3/26): Visualizing Data |

Unit II

Learning from Data: Building Predictive Models (11 Lectures)

| Tuesday | Thursday |
| --- | --- |
| Lecture 6 (3/31): Machine Learning with SKLearn | Lecture 7 (4/2): Linear Regression |
| Lecture 8 (4/7): Polynomial Regression and the Problem of Overfitting | Lecture 9 (4/9): Logistic Regression |
| Lecture 10 (4/14): Text Analysis with Naive Bayes; Brief Intro to Natural Language Processing | Lecture 11 (4/16): Model Evaluation and Cross-Validation Strategies |
| Lecture 12 (4/21): Decision Trees | Lecture 13 (4/23): Support Vector Machines and the Basics of the Kernel Space |
| Lecture 14 (4/28): PCA and Dimensionality Reduction | Lecture 15 (4/30): K-Means Clustering and KNN |
| Lecture 16 (5/5): Ensemble Learning and Random Forests | |

Unit III

Intro to Data Engineering: Processing Data at Scale (6 Lectures)

| Tuesday | Thursday |
| --- | --- |
| | Lecture 17 (5/7): Querying Data in Relational Databases |
| Lecture 18 (5/12): Recommender Systems and Network Analysis | Lecture 19 (5/14): Processing Data at Scale Using MapReduce |
| Lecture 20 (5/19): Working Session for Final Project | Lecture 21 (5/21): Open Session: Speaker or Course Review |
| Lecture 22 (5/26): Project Presentations | |

#### Each lesson in this curriculum contains the following:

  • Agenda
  • Slides
  • In-Class Exercises (which can include code)
  • Additional Resources

Course Project

Overview

The final project should represent significant original work applying data science techniques to an interesting problem. Final projects are individual efforts, but you should talk with your instructors and classmates about them frequently.

Address a data-related problem in your professional field or a field you're interested in. Pick a subject that you're passionate about; if you're strongly interested in the subject matter it'll be more fun for you and you'll produce a better project!

To stimulate your thinking, there is an excellent list of public data sources below. Using public data is the most common choice. If you have access to private data, that's also an option, though you'll have to be careful about which results you can release. You are also welcome to enter a Kaggle competition as your project, in which case the data will be provided to you.

You should also take a look at past projects from other GA Data Science students, to get a sense of the variety and scope of projects.

Peer Review Guidelines

You will be assigned to review the project drafts of two of your peers. You will have one week to provide them with feedback. You should upload your feedback as a Markdown (or plain text) document to the "reviews" folder of DAT_20. If your last name is Smith and you are reviewing Jones, you should name your file smith_reviews_jones.md.

Expectations:

  • Read everything they wrote!
  • If they provided their data, review it and try to understand it.
  • Read their code and try to understand their thought process.
  • If their code can be run, try running it.
  • Spend at least one hour reviewing their project (including the time it takes to write the feedback).

Your feedback would ideally consist of:

  • Strengths of their project (things you particularly like about it)
  • Comments about things you think could be improved
  • Questions about things you don't understand
  • Comments about their code
  • Suggestions for next steps
  • Guiding principle: Give feedback that would be helpful to you if it was your project!

You should take a quick glance through their project as soon as possible, to make sure you understand what they have given you and what files you should be reviewing. If you're unclear, ask them about it!

Project Deliverables

You are responsible for creating a project paper and a project presentation. The paper should be written with a technical audience in mind, while the presentation should target a more general audience. You will deliver your presentation (including slides) during the final week of class, though you are also encouraged to present it to other audiences.

Here are the components you should aim to cover in your paper:

  • Problem statement and hypothesis
  • Description of your data set and how it was obtained
  • Description of any pre-processing steps you took
  • What you learned from exploring the data, including visualizations
  • How you chose which features to use in your analysis
  • Details of your modeling process, including how you selected your models and validated them
  • Your challenges and successes
  • Possible extensions or business applications of your project
  • Conclusions and key learnings
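
The model-selection and validation bullets above come together in cross-validation, covered in Lecture 11. As a reminder of the mechanics, here is a minimal pure-Python sketch of k-fold index generation (scikit-learn's `KFold` does this, and much more, for you):

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.
    Every sample lands in exactly one test fold."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Example: 10 samples, 5 folds of 2 test samples each.
for train, test in k_fold_indices(10, k=5):
    print("test fold:", test)
```

Fitting your model on each `train` set and scoring it on the matching `test` set gives you k scores whose average is a far more honest estimate of performance than a single train/test split.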

Your presentation should cover these components with less breadth and less depth. Focus on creating an engaging, clear, and informative presentation that tells the story of your project.

You should create a GitHub repository for your project that contains the following:

  • Project paper: any format (PDF, Markdown, etc.)
  • Presentation slides: any format (PDF, PowerPoint, Google Slides, etc.)
  • Code: commented Python scripts, and any other code you used in the project
  • Data: data files in "raw" or "processed" format
  • Data dictionary (aka "code book"): description of each variable, including units

If it's not possible or practical to include your entire dataset, you should link to your data source and provide a sample of the data. (GitHub has a size limit of 100 MB per file and 1 GB per repository.) If your data is private, you can either include an "anonymized" version of your data or create a private GitHub repository.
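
If your dataset is too large for GitHub, one way to produce such a sample is reservoir sampling, which draws a uniform random sample in a single pass without loading the file into memory. The sketch below is a suggestion, not a course requirement, and the commented-out file paths are placeholders:

```python
import csv
import random

def sample_csv(src_path, dst_path, n=1000, seed=42):
    """Copy the header plus a uniform random sample of n data rows
    from src_path to dst_path using single-pass reservoir sampling."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    with open(src_path) as src:
        reader = csv.reader(src)
        header = next(reader)
        reservoir = []
        for i, row in enumerate(reader):
            if i < n:
                reservoir.append(row)   # fill the reservoir first
            else:
                j = rng.randint(0, i)   # keep row with probability n/(i+1)
                if j < n:
                    reservoir[j] = row
    with open(dst_path, "w") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)

# Example (hypothetical paths):
# sample_csv("data/raw/full_dataset.csv", "data/sample.csv", n=500)
```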



Additional Resources

See the Resources folder

Python

Data Sources

This is just the tip of the iceberg; there's a lot of data out there!

Web Sites & Blogs