Welcome to your first week of work at the Disease And Treatment Agency, division of Societal Cures In Epidemiology and New Creative Engineering (DATA-SCIENCE). Time to get to work!
Due to the recent epidemic of West Nile Virus in the Windy City, we've had the Department of Public Health set up a surveillance and control system. We're hoping it will let us learn something from the mosquito population as we collect data over time. Pesticides are a necessary evil in the fight for public health and safety, not to mention expensive! We need to derive an effective plan to deploy pesticides throughout the city, and that is exactly where you come in!
The dataset, along with description, can be found here: https://www.kaggle.com/c/predict-west-nile-virus/.
This is also where you will be submitting your predictions for evaluation. We will be using the Kaggle Leaderboard to keep track of your score. The public leaderboard uses roughly 30% of the test data to compute an AUC (area under the ROC curve) score. You can read more about the scoring metric here.
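To get a feel for what the AUC number on the leaderboard means, here is a minimal sketch that computes it by brute force. It relies on the standard interpretation of AUC as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counted as half. The function name `auc_score` and the toy labels and scores are made up for illustration; for real work you would more likely reach for a library implementation.

```python
def auc_score(labels, scores):
    """Area under the ROC curve, computed by comparing every
    positive/negative pair of examples (ties count as half a win)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example (hypothetical data): 3 of the 4 positive/negative pairs
# are ranked correctly, so the score is 0.75.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because only the ordering of the scores matters, AUC rewards ranking the virus-positive records above the negative ones; the predicted probabilities do not need to be well calibrated to score well.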
If you do not already have a Kaggle account, you will need to sign up on the website. Also note that you will be submitting a "Late Submission" on Kaggle because the official competition has ended. You can use the leaderboard to see how your results compare against roughly 1300 other data science teams!
You can submit predictions as many times as you want to Kaggle, but there is a limit of 5 submissions per day. Be intentional with your submissions!
This project will be executed as a group. To make your team as effective and efficient as possible, you should create a shared GitHub repo and a project planning document, as described in the deliverables section below.
GitHub Repo
- Create a GitHub repository for the group. Each member should be added as a contributor.
- Retrieve the dataset and upload it into a directory named assets.
- Generate a .py or .ipynb file that imports the available data.
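As a starting point for the import file, here is a minimal sketch that reads every CSV found in the assets directory using only the standard library. The function name `load_csv_files` is hypothetical, and the sketch assumes the files are UTF-8 encoded CSVs with a header row; most teams will probably swap in a data-frame library once the repo is set up.

```python
import csv
from pathlib import Path


def load_csv_files(directory="assets"):
    """Read every .csv file in `directory` into a list of row dictionaries,
    keyed by the file name without its extension."""
    tables = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.stem] = list(csv.DictReader(handle))
    return tables


if __name__ == "__main__":
    # Print a one-line overview of each file that was found.
    for name, rows in load_csv_files().items():
        columns = list(rows[0].keys()) if rows else []
        print(f"{name}: {len(rows)} rows, columns = {columns}")
```

Note that `csv.DictReader` returns every value as a string, so dates and numbers still need to be converted before analysis.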
Project Planning
- Define your deliverable - what is the end result?
- Break that deliverable up into its components, and then go further down the rabbit hole until you have actionable items. Document these using a project management tool to track things getting done. The tool you use is up to you; it could be Trello, a spreadsheet, GitHub issues, etc.
- Begin deciding priorities for each task. These are subject to change, but it's good to get an initial consensus. Order these priorities however you would like.
- Your planning documentation (or a link to it) should be included in your GitHub repo.
EDA
- Describe the data. What does it represent? What types are present? What does each feature's distribution look like? Discuss these questions, and your own, with your partners. Document your conclusions.
- What kind of cleaning is needed? Document any potential issues that will need to be resolved.
Note: As you know, EDA is the single most important part of data science. This is where you should be spending most of your time. Knowing your data, and understanding the status of its integrity, is what makes or breaks a project.
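One possible first pass at the questions above is a per-column summary: how many values are missing, how many distinct values there are, and whether the column looks numeric. The sketch below assumes the data has been read into a list of dictionaries with string values (for example by `csv.DictReader`); the function name `summarize_columns` and the sample rows are invented for illustration and are not taken from the actual dataset.

```python
def summarize_columns(rows):
    """Summarize each column of a list of row dictionaries:
    missing count, distinct count, and basic stats for numeric columns."""
    summary = {}
    columns = list(rows[0].keys()) if rows else []
    for column in columns:
        values = [row.get(column) for row in rows]
        present = [v for v in values if v not in ("", None)]

        # Try to parse every present value as a number.
        numeric = []
        for v in present:
            try:
                numeric.append(float(v))
            except (TypeError, ValueError):
                pass

        info = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            # A column counts as numeric only if every present value parsed.
            "numeric": bool(present) and len(numeric) == len(present),
        }
        if info["numeric"]:
            info["min"] = min(numeric)
            info["max"] = max(numeric)
            info["mean"] = sum(numeric) / len(numeric)
        summary[column] = info
    return summary


# Hypothetical rows, only to show the shape of the output.
sample_rows = [
    {"Species": "A", "Count": "1"},
    {"Species": "B", "Count": ""},
    {"Species": "A", "Count": "3"},
]
for column, info in summarize_columns(sample_rows).items():
    print(column, info)
```

A summary like this is only a way into the data. Placeholder codes for missing readings, duplicated rows, and skewed distributions usually show up only once you plot the columns and read the data description closely.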
Modeling
- The goal is of course to build a model and make predictions that the city of Chicago can use when it decides where to spray pesticides! Your team should have a clean Jupyter Notebook that shows your EDA process, your modeling and predictions.
- Conduct a cost-benefit analysis. This should include annual cost projections for various levels of pesticide coverage (cost) and the effect of these various levels of pesticide coverage (benefit). (Hint: How would we quantify the benefit of pesticide spraying? To get "maximum benefit," what does that look like and how much does that cost? What if we cover less and therefore get a lower level of benefit?)
- Your final submission CSV should be in your GitHub repo.
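One way to frame the cost-benefit hint is to rank areas by the model's predicted risk, spray the riskiest ones first, and tabulate cost against the share of total predicted risk covered at each coverage level. The sketch below does exactly that and nothing more. Every name and number in it is an assumption made for illustration: `coverage_tradeoff`, the per-area spray cost, the number of sprays per year, and the risk values are all placeholders that your team would need to replace with researched figures, and "share of predicted risk covered" is only a stand-in for a real measure of benefit.

```python
def coverage_tradeoff(area_risks, cost_per_area, sprays_per_year=1,
                      levels=(0.25, 0.5, 0.75, 1.0)):
    """For each coverage level, spray the highest-risk areas first and
    report the annual cost and the share of total predicted risk covered."""
    ranked = sorted(area_risks, reverse=True)
    total_risk = sum(ranked)
    table = []
    for level in levels:
        # Number of areas sprayed at this coverage level (nearest whole area).
        n_sprayed = round(level * len(ranked))
        covered = sum(ranked[:n_sprayed])
        table.append({
            "coverage": level,
            "areas_sprayed": n_sprayed,
            "annual_cost": n_sprayed * cost_per_area * sprays_per_year,
            "risk_covered": covered / total_risk if total_risk else 0.0,
        })
    return table


# Hypothetical inputs: four areas with made-up risk scores and costs.
for row in coverage_tradeoff([4, 1, 3, 2], cost_per_area=100, sprays_per_year=2):
    print(row)
```

Because the riskiest areas are sprayed first, each additional step in coverage buys a smaller share of the remaining risk, which is the pattern the hint about covering less and accepting a lower level of benefit is pointing at.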
Presentation
- Audience: You are presenting to members of the CDC. Some members of the audience will be biostatisticians and epidemiologists who will understand your models and metrics and will want more information. Others will be decision-makers, focusing almost exclusively on your cost-benefit analysis. Your job is to convince both groups of the best course of action in the same meeting and be able to answer questions that either group may ask.
- The length of your presentation should be about 20 minutes (a rough guideline: 2-minute intro, 10 minutes on the model, 5 minutes on cost-benefit analysis, 3 minutes on recommendations/conclusion). Touch base with your local instructor... er, manager... for specific logistical requirements!
Your project is due at 10:00 AM EST/9:00 AM CST on Friday, June 15.
Data science is a field in which we apply data to solve real-world problems. Therefore, projects and presentations are means by which we can assess your ability to solve real-world problems in a data-driven manner.
Your final assessment ("grade," if you will) will be calculated based on a topical rubric. For each category, you will receive a score of 0-3. From the rubric you can see descriptions of each score and what is needed to attain those scores. Make sure you look at the "Rubric P4" tab of the spreadsheet.