This video series will teach you how to solve machine learning problems using Python's popular scikit-learn library. It was featured on Kaggle's blog in 2015.
There are 9 video tutorials totaling 4 hours, each with a corresponding Jupyter notebook. The notebook contains everything you see in the video: code, output, images, and comments.
Note: The notebooks in this repository have been updated to use Python 3.6 and scikit-learn 0.19.1. The original notebooks (shown in the video) used Python 2.7 and scikit-learn 0.16, and can be downloaded from the archive branch. You can read about how I updated the code in this blog post.
You can watch the entire series on YouTube, and view all of the notebooks using nbviewer.
Once you complete this video series, I recommend enrolling in my online course, Machine Learning with Text in Python, to gain a deeper understanding of scikit-learn and Natural Language Processing.
-
What is machine learning, and how does it work? (video, notebook, blog post)
- What is machine learning?
- What are the two main categories of machine learning?
- What are some examples of machine learning?
- How does machine learning "work"?
-
Setting up Python for machine learning: scikit-learn and Jupyter Notebook (video, notebook, blog post)
- What are the benefits and drawbacks of scikit-learn?
- How do I install scikit-learn?
- How do I use the Jupyter Notebook?
- What are some good resources for learning Python?
-
Getting started in scikit-learn with the famous iris dataset (video, notebook, blog post)
- What is the famous iris dataset, and how does it relate to machine learning?
- How do we load the iris dataset into scikit-learn?
- How do we describe a dataset using machine learning terminology?
- What are scikit-learn's four key requirements for working with data?
-
Training a machine learning model with scikit-learn (video, notebook, blog post)
- What is the K-nearest neighbors classification model?
- What are the four steps for model training and prediction in scikit-learn?
- How can I apply this pattern to other machine learning models?
-
Comparing machine learning models in scikit-learn (video, notebook, blog post)
- How do I choose which model to use for my supervised learning task?
- How do I choose the best tuning parameters for that model?
- How do I estimate the likely performance of my model on out-of-sample data?
-
Data science pipeline: pandas, seaborn, scikit-learn (video, notebook, blog post)
- How do I use the pandas library to read data into Python?
- How do I use the seaborn library to visualize data?
- What is linear regression, and how does it work?
- How do I train and interpret a linear regression model in scikit-learn?
- What are some evaluation metrics for regression problems?
- How do I choose which features to include in my model?
-
Cross-validation for parameter tuning, model selection, and feature selection (video, notebook, blog post)
- What is the drawback of using the train/test split procedure for model evaluation?
- How does K-fold cross-validation overcome this limitation?
- How can cross-validation be used for selecting tuning parameters, choosing between models, and selecting features?
- What are some possible improvements to cross-validation?
-
Efficiently searching for optimal tuning parameters (video, notebook, blog post)
- How can K-fold cross-validation be used to search for an optimal tuning parameter?
- How can this process be made more efficient?
- How do you search for multiple tuning parameters at once?
- What do you do with those tuning parameters before making real predictions?
- How can the computational expense of this process be reduced?
-
Evaluating a classification model (video, notebook, blog post)
- What is the purpose of model evaluation, and what are some common evaluation procedures?
- What is the usage of classification accuracy, and what are its limitations?
- How does a confusion matrix describe the performance of a classifier?
- What metrics can be computed from a confusion matrix?
- How can you adjust classifier performance by changing the classification threshold?
- What is the purpose of an ROC curve?
- How does Area Under the Curve (AUC) differ from classification accuracy?
At the PyCon 2016 conference, I taught a 3-hour tutorial that builds upon this video series and focuses on text-based data. You can watch the tutorial video on YouTube.
Here are the topics I covered:
- Model building in scikit-learn (refresher)
- Representing text as numerical data
- Reading a text-based dataset into pandas
- Vectorizing our dataset
- Building and evaluating a model
- Comparing models
- Examining a model for further insight
- Practicing this workflow on another dataset
- Tuning the vectorizer (discussion)
Visit this GitHub repository to access the tutorial notebooks and many other recommended resources.