This repository contains materials for the "Predicting Flu Vaccination: An Introduction to Machine Learning" tutorial session at Good Tech Fest 2020.
Interested in getting started with machine learning? In this tutorial, we will walk through the full process of building a simple machine learning model using Python. We will explain the basic tools in the Python toolkit, do some light exploratory data analysis, and then build and evaluate a model.
We will be using the data and prediction task from "Flu Shot Learning," a practice competition hosted by DrivenData: predicting whether respondents to the U.S. National 2009 H1N1 Flu Survey got H1N1 and seasonal flu vaccines using information they shared about their backgrounds, opinions, and health behaviors. DrivenData is a data science competition platform that hosts competitions exclusively in the data for social good space.
As part of Good Tech Fest, DrivenData will have a special community leaderboard for conference participants. We encourage everyone to take a shot at this competition and see how your submission stacks up. We are also looking for speakers for a lightning talk session where you can share your work on the competition.
If you are participating through the conference, please join the #drivendata-challenge and #predictingfluvaccinationanintroductiontomachinelearning Slack channels.
Note that in order to get the data, you should sign up for the Flu Shot Learning competition on drivendata.org.
To get this repository, the best way is to have `git` installed and use `git clone`:

```bash
git clone https://github.com/drivendataorg/flu-shot-learning-tutorial.git
```

Then, enter the project directory:

```bash
cd flu-shot-learning-tutorial
```
This project is a simplified version of what is generated using Cookiecutter Data Science, a standardized structure we recommend for data science projects.
To access the data, please sign up for the Flu Shot Learning competition on drivendata.org. Then, you can find the data on the data download page.
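Once downloaded, a natural place to put the CSV files is `data/raw/`, matching the project structure described at the end of this README. As a minimal sketch (assuming the competition's standard file names and the `data/raw/` location), you could load the training data with pandas:

```python
# Minimal sketch: load the Flu Shot Learning training data.
# Assumes the files from the competition's data download page were saved
# under data/raw/ with their original names.
import pandas as pd

features = pd.read_csv("data/raw/training_set_features.csv", index_col="respondent_id")
labels = pd.read_csv("data/raw/training_set_labels.csv", index_col="respondent_id")

print(features.shape, labels.shape)
print(labels.columns.tolist())  # the two targets: h1n1_vaccine and seasonal_vaccine
```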
This project requires Python 3.
Virtual environments are important for creating reproducible analyses. One popular tool for managing Python and virtual environments is `conda`. You can set up the environment for this project with `conda` using the commands below.

```bash
conda create -n flu-shot-learning-tutorial python=3.7
conda activate flu-shot-learning-tutorial
pip install -r requirements.txt
```
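To confirm the environment is set up correctly, you can check that the core libraries import from within the activated environment (this assumes `requirements.txt` includes `pandas` and `scikit-learn`, which the tutorial relies on):

```python
# Optional sanity check: run inside the activated conda environment.
# Assumes requirements.txt pins pandas and scikit-learn.
import pandas
import sklearn

print("pandas", pandas.__version__)
print("scikit-learn", sklearn.__version__)
```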
The analysis is saved as a Jupyter notebook file. You can use Jupyter Lab to view and edit it:

```bash
jupyter lab
```
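The notebook walks through exploration, model building, and evaluation end to end. As a rough sketch of the kind of model it builds (not the notebook's exact code), here is a simple scikit-learn pipeline for the `h1n1_vaccine` target, scored with ROC AUC, the metric the competition uses:

```python
# Hedged sketch (not the tutorial notebook's exact code): a simple baseline for the
# h1n1_vaccine target, evaluated with ROC AUC on a held-out split.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load the training data as in the earlier snippet (assumed file names and paths).
features = pd.read_csv("data/raw/training_set_features.csv", index_col="respondent_id")
labels = pd.read_csv("data/raw/training_set_labels.csv", index_col="respondent_id")

numeric_cols = features.select_dtypes(include="number").columns
categorical_cols = features.select_dtypes(exclude="number").columns

preprocess = ColumnTransformer([
    # Fill missing numeric survey answers and standardize them.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Fill missing categorical answers and one-hot encode them.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

X_train, X_eval, y_train, y_eval = train_test_split(
    features, labels["h1n1_vaccine"], test_size=0.3, random_state=42,
    stratify=labels["h1n1_vaccine"],
)

model.fit(X_train, y_train)
pred_probs = model.predict_proba(X_eval)[:, 1]
print("ROC AUC:", roc_auc_score(y_eval, pred_probs))
```

The competition scores predicted probabilities for both targets (`h1n1_vaccine` and `seasonal_vaccine`), so the same pipeline can be fit separately for the seasonal target or wrapped in a multi-output estimator.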
The project is organized as follows:

```
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
└── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
                          generated with `pip freeze > requirements.txt`
```
Project based on the cookiecutter data science project template. #cookiecutterdatascience