DAT5 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (3/18/15 - 6/3/15).

Instructors: Brandon Burroughs and Kevin Markham (Data School blog, email newsletter, YouTube channel)

| Monday | Wednesday |
| --- | --- |
|  | 3/18: Introduction and Python |
| 3/23: Git and Command Line | 3/25: Exploratory Data Analysis |
| 3/30: Visualization and APIs | 4/1: Machine Learning and KNN |
| 4/6: Bias-Variance and Model Evaluation | 4/8: Kaggle Titanic |
| 4/13: Web Scraping, Tidy Data, Reproducibility | 4/15: Linear Regression |
| 4/20: Logistic Regression and Confusion Matrices | 4/22: ROC and Cross-Validation |
| 4/27: Project Presentation #1 | 4/29: Naive Bayes |
| 5/4: Natural Language Processing | 5/6: Kaggle Stack Overflow |
| 5/11: Decision Trees | 5/13: Ensembles |
| 5/18: Clustering and Regularization | 5/20: Advanced scikit-learn and Regex |
| 5/25: No Class | 5/27: Databases and SQL |
| 6/1: Course Review | 6/3: Project Presentation #2 |

Key Project Dates

  • 3/30: Deadline for discussing your project idea(s) with an instructor
  • 4/6: Project question and dataset (write-up)
  • 4/27: Project presentation #1 (slides, code, visualizations)
  • 5/18: First draft due (draft of project paper, code, visualizations)
  • 5/25: Peer review due
  • 6/3: Project presentation #2 (project paper, slides, code, visualizations, data, data dictionary)

Key Project Links

Logistics

  • Office hours will take place every Saturday and Sunday.
  • Homework will be assigned every Wednesday and due on Monday, and you'll receive feedback by Wednesday.
  • Our primary tool for out-of-class communication will be a private chat room through Slack.

Submission Forms

Before the Course Begins

Python Resources


Class 1: Introduction and Python

  • Introduction to General Assembly
  • Course overview (slides)
  • Brief tour of Slack
  • Checking the setup of your laptop
  • Python lesson with airline safety data (code)

Homework:

Optional:

  • If we discovered any setup issues with your laptop, please resolve them before Monday.
  • If you're not feeling comfortable in Python, keep practicing using the resources above!

Class 2: Git and Command Line

  • Any questions about the course project?
  • Command line (slides)
  • Git and GitHub (slides)

Homework:

Optional:

Resources:


Class 3: Pandas

Homework:

Optional:


Class 4: Visualization and APIs

Homework:

Optional:

  • Watch Look at Your Data (18 minutes) for an excellent example of why visualization is useful for understanding your data.

Resources:


Class 5: Data Science Workflow, Machine Learning, KNN

Homework:

Optional:

Resources:


Class 6: Bias-Variance Tradeoff and Model Evaluation

  • Brief introduction to the IPython Notebook
  • Exploring the bias-variance tradeoff (notebook)
  • Discussion of the assigned reading on the bias-variance tradeoff
  • Model evaluation procedures (notebook)

Resources:

  • If you would like to learn the IPython Notebook, the official Notebook tutorials are useful.
  • To get started with Seaborn for visualization, the official website has a series of tutorials and an example gallery.
  • Hastie and Tibshirani have an excellent video (12 minutes, starting at 2:34) that covers training error versus testing error, the bias-variance tradeoff, and train/test split (which they call the "validation set approach").
  • Caltech's Learning From Data course includes a fantastic video (15 minutes) that may help you to visualize bias and variance.
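
The train/test split idea from these resources can be sketched in a few lines of plain Python. This is a minimal illustration, not the class notebook: the data is synthetic, and `train_test_split`, `knn_predict`, and `accuracy` are hypothetical helpers written for this example. It compares training accuracy with testing accuracy for KNN at several values of K.

```python
import math
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def knn_predict(train, point, k):
    """Predict the majority label among the k nearest training points."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)


def accuracy(train, rows, k):
    """Fraction of rows whose label is predicted correctly from the training set."""
    correct = sum(knn_predict(train, features, k) == label
                  for features, label in rows)
    return correct / len(rows)


# Synthetic two-class data (made up for illustration): two overlapping clusters.
rng = random.Random(42)
data = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(100)]
data += [((rng.gauss(1.5, 1), rng.gauss(1.5, 1)), 1) for _ in range(100)]

train, test = train_test_split(data)

# Odd values of K avoid tied votes between the two classes.
for k in (1, 15, 99):
    print(k, accuracy(train, train, k), accuracy(train, test, k))
```

With K=1 every training point is its own nearest neighbor, so training accuracy is perfect and says little about how the model will do on new data. The held-out testing accuracy is the more honest estimate, which is why it is the number to compare across values of K.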

Class 7: Kaggle Titanic

  • Guest instructor: Josiah Davis
  • Participate in Kaggle's Titanic competition
    • Work in pairs, but the goal is for every person to make at least one submission by the end of the class period!

Homework:

  • Option 1 is to do the Glass identification homework. This is a good option if you are still getting comfortable with what we have learned so far, and prefer a very structured assignment. (solution)
  • Option 2 is to keep working on the Titanic competition, and see if you can make some additional progress! This is a good assignment if you are feeling comfortable with the material and want to learn a bit more on your own.
  • In either case, please submit your code as usual, and include lots of code comments!

Class 8: Web Scraping, Tidy Data, Reproducibility

Resources:


Class 9: Linear Regression

  • Linear regression (notebook)
    • Simple linear regression
    • Estimating and interpreting model coefficients
    • Confidence intervals
    • Hypothesis testing and p-values
    • R-squared
    • Multiple linear regression
    • Feature selection
    • Model evaluation metrics for regression
    • Handling categorical predictors
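
As a rough companion to the topics above, simple linear regression and R-squared can be computed by hand. This sketch uses made-up numbers and hypothetical function names; it is not the code from the class notebook, which may rely on a library instead.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Proportion of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data for illustration: one feature and a roughly linear response.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.9, 5.1, 7.2, 8.8, 11.0]

intercept, slope = fit_simple_linear_regression(x, y)
print("intercept:", intercept)
print("slope:", slope)
print("R-squared:", r_squared(x, y, intercept, slope))
```

The slope is interpreted as the expected change in the response for a one-unit increase in the feature, and the intercept as the expected response when the feature is zero.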

Homework:

  • If you're behind on homework, use this time to catch up.
  • Keep working on your project. Your first presentation is in less than two weeks!

Resources:


Class 10: Logistic Regression and Confusion Matrices

  • Logistic regression (slides and code)
  • Confusion matrices (same links as above)
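
To make the confusion matrix concrete, here is a small sketch that counts the four cells by hand for a binary classifier. The labels are invented for illustration, and `confusion_counts` is a hypothetical helper, not something taken from the class code.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating 1 as the positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["TP" if actual == 1 else "FP"] += 1
        else:
            counts["TN" if actual == 0 else "FN"] += 1
    return counts


# Made-up true labels and predicted labels.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

c = confusion_counts(y_true, y_pred)
accuracy = (c["TP"] + c["TN"]) / len(y_true)
sensitivity = c["TP"] / (c["TP"] + c["FN"])  # share of actual positives caught
specificity = c["TN"] / (c["TN"] + c["FP"])  # share of actual negatives caught

print(c)
print(accuracy, sensitivity, specificity)
```

For these labels there are three true positives, five true negatives, one false positive, and one false negative. Sensitivity and specificity separate the two kinds of error that accuracy alone hides.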

Homework:

Resources:


Class 11: ROC Curves and Cross-Validation

Homework:

  • Your first project presentation is on Monday! Please submit a link to your project repository (with slides, code, and visualizations) before class using the homework submission form.

Optional:

Resources:


Class 12: Project Presentation #1

  • Project presentations!

Homework:


Class 13: Naive Bayes

Homework:

  • Please download and install the following for the NLP class on Monday:
    • In Spyder, import nltk and run nltk.download('all'). This downloads all of the necessary resources for the Natural Language Tool Kit.
    • We'll be using two new packages/modules for this class: textblob and lda. Please install them. Hint: In the Terminal (Mac) or Git Bash (Windows), run pip install textblob and pip install lda.

Resources:

  • For other intuitive introductions to Bayes' theorem, here are two good blog posts that use ducks and legos.
  • For more on conditional probability, these slides may be useful.
  • For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
  • If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
  • If you're planning on using text features in your project, it's worth exploring the different types of Naive Bayes and the many options for CountVectorizer.
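
As a starting point for experimenting with text features, the sketch below feeds `CountVectorizer` output into scikit-learn's `MultinomialNB`. The messages and labels are invented for illustration, and the `stop_words` and `ngram_range` settings are just two of the options worth exploring, not a recommendation.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set: three "spam" and three "ham" messages.
train_text = [
    "win free money now",
    "free prize claim now",
    "claim your free cash prize",
    "meeting at noon tomorrow",
    "lunch with the project team",
    "notes from the team meeting",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Drop common English words and count both single words and word pairs.
vect = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X_train = vect.fit_transform(train_text)

nb = MultinomialNB()
nb.fit(X_train, train_labels)

# New messages must be transformed with the vocabulary learned from training.
new_text = ["free cash prize", "team meeting tomorrow"]
predictions = nb.predict(vect.transform(new_text))
print(list(predictions))
```

Note that the vectorizer is fit on the training text only. New text goes through `transform`, so words that were never seen during training are ignored.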

Class 14: Natural Language Processing

  • Natural Language Processing (notebook)
  • NLTK: tokenization, stemming, lemmatization, part of speech tagging, stopwords, Named Entity Recognition, LDA
  • Alternative: TextBlob

Resources:


Class 15: Kaggle Stack Overflow

Optional:

  • Keep working on this competition! You can make up to 5 submissions per day, and the competition doesn't close until 6:30pm ET on Wednesday, May 27 (class 20).

Resources:


Class 16: Decision Trees

Resources:

Installing Graphviz (optional):

  • Mac:
  • Windows:
    • Download and install MSI file
    • Add it to your Path: Go to Control Panel, System, Advanced System Settings, Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as: C:\Program Files (x86)\Graphviz2.38\bin

Class 17: Ensembles

  • Ensembles and random forests (notebook)

Homework:

  • Your project draft is due on Monday! Please submit a link to your project repository (with paper, code, and visualizations) before class using the homework submission form.
    • Your peers and your instructors will be giving you feedback on your project draft.
    • Here's an example of a great final project paper from a past student.
  • Make at least one new submission to our Kaggle competition! We suggest trying Random Forests or building your own ensemble of models. For assistance, you could use this framework code, or refer to the complete code from class 15. You can optionally submit your code to us if you want feedback.

Resources:


Class 18: Clustering and Regularization

Homework:

  • You will be assigned to review the project drafts of two of your peers. You have until next Monday to provide them with feedback, according to these guidelines.

Resources:


Class 19: Advanced scikit-learn and Regular Expressions

Optional:

  • Use regular expressions to create a list of causes from the homicide data. Your list should look like this: ['shooting', 'shooting', 'blunt force', ...]. If the cause is not listed for a particular homicide, include it in the list as 'unknown'.
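
One possible approach is sketched below. The record layout here (a `Cause:` field followed by the cause text) is an assumption made for illustration, so the pattern will need adjusting to match the actual homicide data.

```python
import re

# Hypothetical records; the real data may be formatted differently.
records = [
    "Name: A. Example, Age: 24, Cause: Shooting",
    "Name: B. Example, Cause: Blunt Force, Age: 31",
    "Name: C. Example, Age: 40",
]

# Capture the letters and spaces that follow "Cause:", stopping at punctuation.
pattern = re.compile(r"Cause:\s*([A-Za-z ]+)")

causes = []
for record in records:
    match = pattern.search(record)
    if match:
        causes.append(match.group(1).strip().lower())
    else:
        causes.append("unknown")  # no cause listed for this record

print(causes)  # ['shooting', 'blunt force', 'unknown']
```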

Resources:

  • scikit-learn has an incredibly active mailing list that is often much more useful than Stack Overflow for researching a particular function.
  • The scikit-learn documentation includes a machine learning map that may help you to choose the "best" model for your task.
  • If you want to build upon the regex material presented in today's class, Google's Python Class includes an excellent lesson (with an associated video).
  • regex101 is an online tool for testing your regular expressions in real time.
  • If you want to go really deep with regular expressions, RexEgg includes endless articles and tutorials.
  • Exploring Expressions of Emotions in GitHub Commit Messages is a fun example of how regular expressions can be used for data analysis.

Class 20: Databases and SQL

Homework:

  • Read this classic paper, which may help you to connect many of the topics we have studied throughout the course: A Few Useful Things to Know about Machine Learning.
  • Your final project is due next Wednesday!
    • Please submit a link to your project repository before Wednesday's class using the homework submission form.
    • Your presentation should start with a recap of the key information from the previous presentation, but you should spend most of your presentation discussing what has happened since then.
    • Don't forget to practice your presentation and time yourself!

Resources:


Class 21: Course Review

  • Pipelines (code)
  • Class review
  • Creating an ensemble (code)
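
The idea behind creating an ensemble can be illustrated with a simple majority vote. This is a minimal sketch with invented predictions, not the linked class code, and `majority_vote` is a hypothetical helper. The three sets of predictions were chosen so that the models make their mistakes on different observations.

```python
from collections import Counter


def majority_vote(predictions_by_model):
    """For each observation, return the label predicted by the most models."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_by_model)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up true labels and predictions from three hypothetical models.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 1, 0, 1]
model_b = [1, 1, 1, 1, 0, 0, 0, 0]
model_c = [0, 0, 1, 1, 1, 1, 0, 0]

ensemble = majority_vote([model_a, model_b, model_c])

for name, preds in [("a", model_a), ("b", model_b), ("c", model_c),
                    ("ensemble", ensemble)]:
    print(name, accuracy(y_true, preds))
```

Each individual model gets 6 of 8 observations right, while the vote gets all 8 right because no two models are wrong on the same observation. Real models tend to make correlated errors, so the gain from an ensemble is usually smaller than in this contrived case.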

Resources:


Class 22: Project Presentation #2

  • Presentations!

Class is over! What should I do now?

  • Take a break!
  • Go back through class notes/code/videos to make sure you feel comfortable with what we've learned.
  • Take a look at the Resources for each class to get a deeper understanding of what we've learned. Start with the Resources from Class 21 and move to topics you are most interested in.
  • You might not realize it, but you are at a point where you can continue learning on your own. You have all of the skills necessary to read papers, blogs, documentation, etc.
  • GA Data Guild
    • 8/24/2015
    • 9/21/2015
    • 10/19/2015
    • 11/9/2015
  • Follow data scientists on Twitter. This will help you stay up on the latest news/models/applications/tools.
  • Participate in Data Community DC events. They sponsor meetups, workshops, and more, notably the Data Science DC Meetup. Sign up for their newsletter too!
  • Read blogs to keep learning. I really like District Data Labs.
  • Do Kaggle competitions! This is a good way to keep honing your skills, and you'll learn a ton along the way.

And finally, don't forget about graduation!