This week, you've learned about accessing and utilizing remote databases, as well as more advanced topics: conducting logistic regression, selecting features, and building machine learning pipelines. Now, let's put these skills to the test!
You're working as a data scientist with a research firm. Your firm is bidding on a big project that will involve working with thousands or possibly tens of thousands of features. You know it will be impossible to use conventional feature selection techniques. You propose that one way to win the contract is to demonstrate a capacity to identify relevant features using machine learning. Your boss says, "Great idea. Write it up." You figure that working with the Madelon synthetic dataset is an excellent way to demonstrate your abilities.
A data engineer colleague sets up a remote PostgreSQL database for you to work with. You can connect to that database at `joshuacook.me:5432` with user `dsi_student` and password `correct horse battery staple`. You can connect via the command line using

```
$ psql -h joshuacook.me -p 5432 -d dsi -U dsi_student
```

and entering the password when prompted.
(Optional) You tell your colleague thanks, but you prefer to run your database locally using Docker.
Regardless of whether you use the remote database or Docker, your colleague encourages you to use `sqlalchemy` to connect Postgres to pandas. He suggests that the following code might be useful, but he seems distracted and rushed and tells you to check Stack Overflow when you push for more:

```python
engine = create_engine("postgresql://{}:{}@{}:{}/{}".format(user, password, url, port, database))
```
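Filling in your colleague's snippet under the assumption that the credentials above are used, a connection might look like the following. Note that the password contains spaces, so it needs URL-escaping; the table name `madelon` and the `fetch_table` helper are illustrative guesses, not part of the brief:

```python
from urllib.parse import quote_plus

import pandas as pd
from sqlalchemy import create_engine

# Connection details from the brief; quote_plus escapes the
# spaces in the password so the connection URL parses cleanly.
user, password = "dsi_student", "correct horse battery staple"
host, port, database = "joshuacook.me", 5432, "dsi"
conn_str = "postgresql://{}:{}@{}:{}/{}".format(
    user, quote_plus(password), host, port, database
)


def fetch_table(table_name, limit=100):
    """Query a table into a pandas DataFrame (hypothetical helper).

    create_engine is lazy about networking, but the psycopg2 driver
    must be installed for the postgresql dialect to load.
    """
    engine = create_engine(conn_str)
    query = "SELECT * FROM {} LIMIT {}".format(table_name, limit)
    return pd.read_sql(query, con=engine)


# e.g. df = fetch_table("madelon")  # table name is an assumption
```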
Your challenge here is to implement three machine learning pipelines designed to demonstrate your ability to select salient features programmatically.
Your final product will consist of:
- A prepared report
- Three Jupyter notebooks to be used to control your pipelines
- A library of Python code you will use to build your pipelines
Your report should
- be a pdf
- include a well-posed problem statement with Domain, Data, Problem, Solution, Metric, Benchmark
- optionally include EDA & Data Description
- present results from Step 1 - Benchmarking
- present results from Step 2 - Identify Salient Features
- present results from Step 3 - Build Model
- compare results obtained by LogisticRegression and KNearestNeighbors in Step 3
- compare the features identified as important in Step 2 with those identified in Step 3
- recommend feature engineering steps for a potential next phase of the project
- build a pipeline to perform a naive logistic regression as a baseline model
    - in order to do this, you will need to set a high `C` value to perform minimal regularization
- build a pipeline with `LogisticRegression` using the l1 penalty
    - use the constructed model to identify important features
- construct a Pipeline that uses `SelectKBest` to transform the data
- construct a Pipeline that uses `LogisticRegression` to model the data
- construct a Pipeline that uses `KNearestNeighbors` to model the data
- grid search optimal parameters for logistic regression and KNN
- write docstrings for all wrapper functions that describe their inputs and outputs
- write a wrapper function to connect the remote data source to pandas
    - receives database connection information
    - queries and sorts the data
    - returns a DataFrame
- write a wrapper function to split the data into a dictionary object to be passed through the pipeline
    - receives a DataFrame and a random state
    - performs a train-test split
    - returns a data dictionary containing all necessary data objects
- write a wrapper function to perform a general transformation on the data
    - receives a data dictionary
    - fits on train data
    - transforms train and test data
    - returns a data dictionary with updated train and test data and the fitted transformer
- write a wrapper function to build a general model using the data
    - receives a data dictionary
    - fits on train data
    - scores on train and test data
    - returns a data dictionary with the model and scores added
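As a sketch of the three required pipelines, using synthetic stand-in data rather than the actual Madelon tables (the specific `C`, `k`, and `n_neighbors` values below are placeholder assumptions to tune against your real data):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for Madelon: 20 features, only 5 informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Step 1: naive baseline -- a huge C means (almost) no regularization
baseline = Pipeline([("scale", StandardScaler()),
                     ("lr", LogisticRegression(C=1e10, max_iter=1000))])
baseline.fit(X_train, y_train)

# Step 2: the l1 penalty drives coefficients of irrelevant features to zero
l1_model = Pipeline([("scale", StandardScaler()),
                     ("lr", LogisticRegression(penalty="l1", C=0.1,
                                               solver="liblinear"))])
l1_model.fit(X_train, y_train)
n_kept = int((l1_model.named_steps["lr"].coef_ != 0).sum())

# Step 3: SelectKBest passes the k most salient features to the estimator;
# GridSearchCV tunes k jointly with the estimator's own hyperparameters
knn_pipe = Pipeline([("scale", StandardScaler()),
                     ("kbest", SelectKBest(f_classif)),
                     ("knn", KNeighborsClassifier())])
param_grid = {"kbest__k": [5, 10, 20], "knn__n_neighbors": [3, 5, 9]}
grid = GridSearchCV(knn_pipe, param_grid, cv=3)
grid.fit(X_train, y_train)
```

The same `SelectKBest` step can be swapped in front of `LogisticRegression` for the third pipeline's logistic variant; gridding over `kbest__k` alongside each estimator's parameters is what lets you compare the features each approach favors.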
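A minimal sketch of the split, transform, and model wrapper functions described above (the data-dictionary keys and the assumption of a target column named `label` are our own choices, not given in the brief; the database-connection wrapper is omitted since it mirrors the sqlalchemy snippet earlier):

```python
from sklearn.model_selection import train_test_split


def make_data_dict(df, random_state=None):
    """Split a labeled DataFrame into a data dictionary.

    Assumes the target lives in a column named 'label' -- adjust
    for the actual Madelon schema.
    """
    X = df.drop("label", axis=1)
    y = df["label"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, random_state=random_state)
    return {"X_train": X_train, "X_test": X_test,
            "y_train": y_train, "y_test": y_test,
            "transformers": [], "models": []}


def general_transformer(transformer, data_dict):
    """Fit a transformer on train data only, then transform train and test."""
    # y is passed so supervised transformers like SelectKBest also work;
    # unsupervised ones such as StandardScaler simply ignore it
    transformer.fit(data_dict["X_train"], data_dict["y_train"])
    data_dict["X_train"] = transformer.transform(data_dict["X_train"])
    data_dict["X_test"] = transformer.transform(data_dict["X_test"])
    data_dict["transformers"].append(transformer)
    return data_dict


def general_model(model, data_dict):
    """Fit a model on train data and record train/test scores."""
    model.fit(data_dict["X_train"], data_dict["y_train"])
    data_dict["models"].append(model)
    data_dict["train_score"] = model.score(data_dict["X_train"],
                                           data_dict["y_train"])
    data_dict["test_score"] = model.score(data_dict["X_test"],
                                          data_dict["y_test"])
    return data_dict
```

Because every wrapper receives and returns the same dictionary, the three notebooks can chain them: `general_model(clf, general_transformer(scaler, make_data_dict(df, 42)))`.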
- A local PostgreSQL database housing your remote data.
- A Jupyter Notebook with the required problem statement, goals, and technical data.
- A written report of your findings that details the accuracy and assumptions of your model.

Bonus:

- Create a blog post of at least 500 words (and 1-2 graphics!) describing your data, analysis, and approach. Link to it in your Jupyter notebook.
- Materials must be in a clearly labeled Jupyter notebook.
- Materials must be submitted via a Github PR to the instructor's repo.
- Materials must be submitted by the end of week 5.
```
DSI_SM_3/projects/project-05 (master)$ tree
.
├── README.md
├── assets
│   ├── benchmarking.png
│   ├── build_model.png
│   └── identify_features.png
├── lib
│   ├── __init__.py
│   └── project_5.py
├── project-05-rubric.md
├── step_1-benchmarking.ipynb
├── step_2-identify_features_l1_penalty.ipynb
└── step_3-build_model.ipynb
```
- Read in your dataset.
- Write pseudocode before you write actual code. Thinking through the logic of something helps.
- Read the docs for whatever technologies you use. Most of the time, there is a tutorial that you can follow, but not always, and learning to read documentation is crucial to your success!
- Document everything.
- Look up sample executive summaries online.
Attached here is a complete rubric for this project.
Your instructors will score each of your technical requirements using the scale below:
Score | Expectations
----- | ------------
**0** | _Incomplete._
**1** | _Does not meet expectations._
**2** | _Meets expectations, good job!_
**3** | _Exceeds expectations, you wonderful creature, you!_
This will serve as a helpful overall gauge of whether you met the project goals, but the more important scores are the individual ones above, which can help you identify where to focus your efforts for the next project!