Code2020Materials

This repository will contain all of the materials for the 2020 Erdős Institute Cőde Boot Camp.

The material needed for each day of the course will be added the day before it is covered.

As we go through the course, the material we cover will be summarized below.

Material Covered

Day 1 - Data Gathering

All the lecture materials can be found in the Lectures/DataGathering folder. We'll be working through Notebooks 1-4 in that folder.

On this day you'll do the following:

  • Work through a pandas refresher,
  • Be introduced to some popular data websites,
  • Go through an introduction to HTML scraping with BeautifulSoup,
  • Learn the fundamentals of relational databases in Python.
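
The notebooks cover these topics in full; as a quick sketch of the scraping idea (the HTML snippet and tag names here are made up for illustration, not taken from the notebooks), BeautifulSoup turns raw HTML into a searchable tree:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a downloaded page
html = """
<html><body>
  <h2 class="title">First Post</h2>
  <h2 class="title">Second Post</h2>
</body></html>
"""

# Parse the document, then pull out every h2 with class "title"
soup = BeautifulSoup(html, "html.parser")
titles = [h.text for h in soup.find_all("h2", class_="title")]
# titles is ["First Post", "Second Post"]
```

In practice the `html` string would come from a request to a live site rather than a literal.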

Day 2 - Regression 1

All the lecture materials can be found in the Lectures/Regression folder. On this day we'll complete notebooks 1 and 2, then start notebook 3.

On this day we'll:

  • Introduce Regression,
  • Learn about Statistical Learning,
  • Start with Simple Linear Regression,
  • Touch on SLR assumptions,
  • Begin Multiple Linear Regression.
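
The notebooks develop the theory; as a minimal sketch of simple linear regression (the toy data here is my own, chosen so the fit is exact), scikit-learn's `LinearRegression` recovers the slope and intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly, so the fit is deterministic
X = np.arange(10).reshape(-1, 1)  # sklearn expects a 2D feature array
y = 2 * X.ravel() + 1

reg = LinearRegression()
reg.fit(X, y)
# reg.coef_[0] is the slope (2), reg.intercept_ is the intercept (1)
```

Multiple linear regression uses the same API, just with more columns in `X`.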

Day 3 - Regression 2

The lecture materials for this day are notebooks 3 and 4 in the Regression folder.

We:

  • Discuss regression models with multiple features,
  • Introduce cross validation as a model selection tool,
  • Talk about interactions,
  • Include polynomial and other nonlinear transformations in our regressions.
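
As a hedged sketch of cross validation for model selection (the synthetic quadratic data and the candidate degrees are my own choices, not the notebooks'), we can compare polynomial degrees by their cross-validated error:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 120).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.1, size=120)  # quadratic signal plus noise

scores = {}
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # 5-fold CV, scored by negative mean squared error (higher is better)
    scores[degree] = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_error").mean()

best = max(scores, key=scores.get)
```

A straight line (degree 1) scores far worse than the quadratic fit here, which is exactly the comparison cross validation is meant to make.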

Day 4 - Regression 3

The lecture materials are the end of notebook 4, notebook 5, and notebook 6 in the Regression folder.

On our final regression day we:

  • Show sklearn preprocessing tools and tricks,
  • Discuss overfitting and multicollinearity,
  • Introduce the Bias-Variance Trade-Off in light of regression,
  • Learn about regularization with ridge and lasso,
  • See how we can use lasso for feature selection.
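
As a small sketch of lasso-based feature selection (the toy data, the `alpha` value, and the number of features are my own assumptions for illustration), the L1 penalty drives the coefficients of irrelevant features to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
# Five candidate features, but only the first actually drives y
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Nonzero coefficients are the "selected" features
selected = np.flatnonzero(lasso.coef_)
```

Ridge regression (`sklearn.linear_model.Ridge`) shrinks coefficients the same way but never zeroes them out, which is why lasso is the one used for feature selection.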

Day 5 - Time Series 1

The lecture materials are Notebooks 1, 2, 3, and potentially 4 in the Time Series folder.

On this day we:

  • Show how to handle time series data in python,
  • Learn three basic time series forecasting techniques,
  • Introduce lag plots, autocorrelation, and correlograms,
  • Adapt cross-validation for time series data, and
  • Begin exponential smoothing.
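
The key twist when adapting cross-validation to time series is that a model must never train on the future. As a hedged sketch (the toy series and split count are my own), scikit-learn's `TimeSeriesSplit` enforces this by always placing the test fold after the training fold:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

ts = np.arange(10)  # a toy "time series" of 10 observations
splitter = TimeSeriesSplit(n_splits=3)

folds = list(splitter.split(ts))
# Every training set ends strictly before its test set begins,
# so no fold ever trains on the future
for train_idx, test_idx in folds:
    assert train_idx.max() < test_idx.min()
```

Contrast this with ordinary k-fold CV, which shuffles observations and would leak future values into training.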

Day 6 - Time Series 2 and Classification 1

The lecture materials are Time Series Notebooks 4 and 5 as well as Classification Notebooks 1-3.

On this day we:

  • Reviewed the end of exponential smoothing,
  • Went through ARIMA,
  • Discussed the goals of classification problems,
  • Learned our first simple classification algorithm, k-nearest neighbors (kNN),
  • Introduced stratified train-test splits, and
  • Demonstrated the Logistic Regression algorithm.
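
As a compact sketch combining two of these ideas (the iris dataset, `test_size`, and `n_neighbors` are my own choices, not the notebooks'), a stratified split keeps class proportions balanced, and kNN classifies by majority vote among nearby points:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions identical in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Classify each test point by the majority class of its 5 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
acc = knn.score(X_test, y_test)
```

Logistic regression slots into the same fit/score pattern via `sklearn.linear_model.LogisticRegression`.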

Day 7 - Classification 2

The lecture materials are Classification Notebooks 4-6.

On this day we:

  • Cleared up any lingering questions on Logistic Regression,
  • Introduced Decision Trees,
  • Saw how we can combine many decision trees into a random forest,
  • Learned about some shortcomings of the sklearn tree-based methods with regard to categorical data.
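
As a small sketch of the tree-to-forest step (dataset, split parameters, and `n_estimators` are my own illustrative choices), a random forest averages many decorrelated decision trees to reduce variance:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 100 decision trees, each grown on a bootstrap sample with random
# feature subsets, then combined by majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
acc = forest.score(X_test, y_test)

# Note: sklearn trees expect numeric arrays, so categorical columns
# must be encoded (e.g. one-hot) before fitting -- the shortcoming above
```

Each individual tree is available via `forest.estimators_` if you want to inspect what the ensemble is averaging over.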

Day 8 - Classification 3

The lecture materials are Classification Notebooks 7 and 8.

Today we:

  • Reviewed support vector machine algorithms,
  • Expanded upon our ensemble method techniques with:
    • Voter Methods,
    • Bagging and Pasting, and
    • Boosting.
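
As a hedged sketch of the voter-method idea (the three base models and the dataset are my own picks for illustration), `VotingClassifier` lets several different models each cast one vote per prediction:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# "Hard" voting: the predicted class is the majority vote of the three models
voter = VotingClassifier(estimators=[
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=0)),
    ("knn", KNeighborsClassifier()),
], voting="hard")
voter.fit(X, y)
acc = voter.score(X, y)
```

Bagging and boosting follow the same ensemble spirit with a single base model: `BaggingClassifier` trains copies on bootstrap samples, while boosters such as `AdaBoostClassifier` train them sequentially on reweighted data.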

Day 9 - Unsupervised Learning 1

The lecture materials are What's Different With Unsupervised Learning, and Unsupervised Learning Notebooks 1 and 2.

We:

  • Introduced the goal of Unsupervised learning techniques,
  • Reviewed an array of clustering algorithms including:
    • k-Means Clustering,
    • Hierarchical Clustering, and
    • DBSCAN,
  • Described two Clustering practice problems.
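
As a minimal sketch of the clustering idea (the two synthetic blobs and `n_clusters=2` are my own setup, not one of the practice problems), k-means assigns each point to the nearest of k learned centers:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated 2D blobs of 50 points each
a = rng.normal(loc=0.0, scale=0.3, size=(50, 2))
b = rng.normal(loc=5.0, scale=0.3, size=(50, 2))
X = np.vstack([a, b])

# k-means with k=2; labels[i] is the cluster assigned to point i
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

Unlike k-means, DBSCAN (`sklearn.cluster.DBSCAN`) needs no cluster count up front, and hierarchical clustering (`sklearn.cluster.AgglomerativeClustering`) builds a full merge tree instead.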

Day 10 - Unsupervised Learning 2

The lecture materials are Unsupervised Learning Notebook 3, Proper Data Preprocessing Steps, and potentially the beginning of Unsupervised Learning Notebook 4.

We:

  • Discussed the desire to perform dimensionality reduction,
  • Introduced Principal Components Analysis and derived the mathematical technique,
  • Showed how to interpret the output of PCA,
  • Laid out the "Elbow Method" with the Explained Variance Curve to determine the "natural" dimensionality of the data set,
  • Reviewed proper preprocessing procedures and more advanced pipelines.
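
As a closing sketch tying PCA to the preprocessing point (the synthetic data, which is essentially one-dimensional by construction, is my own), the explained variance ratio is what the "Elbow Method" reads off, and scaling first matters because PCA is driven by feature variances:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three features that are all noisy copies of one latent variable,
# so the data is "really" one-dimensional
x1 = rng.normal(size=200)
X = np.column_stack([
    x1,
    x1 + rng.normal(scale=0.05, size=200),
    x1 + rng.normal(scale=0.05, size=200),
])

# Standardize first: otherwise high-variance features dominate the components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# The Explained Variance Curve plots these ratios; the "elbow" after the
# first component reveals the natural dimensionality (here, one)
explained = pca.explained_variance_ratio_
```

In practice the scaler and PCA would be chained in a `Pipeline` so the same preprocessing is applied consistently at train and predict time.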