
Finding Donors for CharityML

Training six different supervised machine learning models in the scikit-learn framework, then tuning and optimizing the best of them to increase accuracy.

This project is part of the Udacity Machine Learning Nanodegree.

Table of Contents

  1. Introduction
  2. Project Overview
  3. Prerequisites
  4. Starting the Project
    1. Code
    2. Run
    3. Data
    4. Results
    5. Loading the Trained Model
  5. References
  6. Author
  7. License

Introduction

In this project, I trained and tested six different supervised machine learning models on the dataset. I used the F-beta score as an evaluation metric because it considers both precision and recall:

F_β = (1 + β²) · (precision · recall) / (β² · precision + recall)

In particular, when β = 0.5, more emphasis is placed on precision. This is called the F_0.5 score (or F-score for simplicity).
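A minimal sketch of computing this metric with scikit-learn's fbeta_score (the y_true and y_pred arrays here are illustrative placeholders, not project data):

from sklearn.metrics import fbeta_score

# Illustrative binary labels only; the project computes this on real predictions.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

# beta=0.5 weighs precision more heavily than recall.
print(fbeta_score(y_true, y_pred, beta=0.5))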

The tested models are:

1. Random Forest
2. Gradient Boosting
3. Logistic Regression
4. Decision Trees
5. AdaBoost
6. Support Vector Machine

Then I visualized graphs to compare:

  • Accuracy Score on Training & Testing Subsets.
  • F-Score on Training & Testing Subsets.
  • Time of Model Training & Testing.
[Figures: per-model performance plots for 1. Gradient Boosting Classifier, 2. Random Forest Classifier, 3. Logistic Regression, 4. Decision Tree Classifier, 5. AdaBoost Classifier, 6. Support Vector Machine Classifier]
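A hedged sketch of how such a comparison can be produced in scikit-learn (the split names X_train, X_test, y_train, y_test, binary 0/1 labels, and the default hyperparameters are assumptions for illustration, not the notebook's exact setup):

from time import time

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, fbeta_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# One instance of each of the six models, with default hyperparameters.
models = [RandomForestClassifier(), GradientBoostingClassifier(),
          LogisticRegression(), DecisionTreeClassifier(),
          AdaBoostClassifier(), SVC()]

for model in models:
    start = time()
    model.fit(X_train, y_train)          # assumed preprocessed training split
    train_time = time() - start

    start = time()
    predictions = model.predict(X_test)  # assumed preprocessed testing split
    test_time = time() - start

    print('{}: accuracy={:.4f}, F0.5={:.4f}, fit={:.2f}s, predict={:.2f}s'.format(
        type(model).__name__,
        accuracy_score(y_test, predictions),
        fbeta_score(y_test, predictions, beta=0.5),
        train_time, test_time))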

From the graphs, the AdaBoost Classifier performs best on both accuracy and F-score. So I then used grid search to tune its hyperparameters and further increase the accuracy and F-score, as shown in the Results section below.
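A minimal sketch of this tuning step with scikit-learn's GridSearchCV, scoring candidates with the same F_0.5 metric (the parameter grid and the X_train/y_train names are illustrative assumptions, not the notebook's exact values):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV

clf = AdaBoostClassifier(random_state=42)

# Illustrative hyperparameter grid; the notebook's actual grid may differ.
parameters = {'n_estimators': [100, 300, 500],
              'learning_rate': [0.5, 1.0, 1.5]}

# Rank candidates by the F_0.5 score used throughout the project.
scorer = make_scorer(fbeta_score, beta=0.5)

grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
best_clf = grid_obj.fit(X_train, y_train).best_estimator_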

Project Overview

I applied supervised learning techniques and an analytical mind on data collected for the U.S. census to help CharityML (a fictitious charity organization) identify people most likely to donate to their cause. I started by exploring the data to learn how the census data is recorded. Next, I applied a series of transformations and preprocessing techniques to manipulate the data into a workable format. Then I evaluated several supervised models on the data and considered which is best suited for the solution. Afterwards, I optimized the selected model.
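The exact transformations live in the notebook; as a hedged sketch, preprocessing for this census dataset commonly involves log-transforming the skewed monetary features, scaling the numerical features, and one-hot encoding the categorical ones:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv('census.csv')

# Log-transform the highly skewed monetary features.
skewed = ['capital-gain', 'capital-loss']
data[skewed] = data[skewed].apply(lambda x: np.log(x + 1))

# Scale all numerical features to [0, 1].
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
data[numerical] = MinMaxScaler().fit_transform(data[numerical])

# One-hot encode categorical features and binarize the income target.
features = pd.get_dummies(data.drop('income', axis=1))
income = (data['income'] == '>50K').astype(int)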

Prerequisites

This project uses Python and the following libraries: NumPy, pandas, matplotlib, and scikit-learn.

You will also need to have software installed to run and execute a Jupyter Notebook.

If you do not have Python installed yet, it is highly recommended that you install the Anaconda distribution of Python, which already has the above packages and more included.

Starting the Project

This project contains three files:

  • finding-donors-for-charityML.ipynb: This is the main file where you will find all the work on the project.
  • census.csv: The project dataset, which is loaded in the notebook.
  • visuals.py: A Python file containing visualization code that is run behind the scenes. Do not modify this file.

Code

Template code is provided in the finding-donors-for-charityML.ipynb notebook file. The visuals.py script is also required for the visualization functions, along with the census.csv dataset file.

Run

In a terminal or command window, navigate to the top-level project directory Finding-Donors-for-CharityML/ (that contains this README) and run one of the following commands:

ipython notebook finding-donors-for-charityML.ipynb

or

jupyter notebook finding-donors-for-charityML.ipynb

This will open the Jupyter Notebook software and the project file in your browser.

Data

The modified census dataset consists of approximately 32,000 data points, with each data point having 13 features. This dataset is a modified version of the dataset published in the paper "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", by Ron Kohavi. You may find this paper online, with the original dataset hosted on UCI.

  • Features

    • age: Age
    • workclass: Working Class (Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked)
    • education_level: Level of Education (Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool)
    • education-num: Number of educational years completed
    • marital-status: Marital status (Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse)
    • occupation: Work Occupation (Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces)
    • relationship: Relationship Status (Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried)
    • race: Race (White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black)
    • sex: Sex (Female, Male)
    • capital-gain: Monetary Capital Gains
    • capital-loss: Monetary Capital Losses
    • hours-per-week: Average Hours Per Week Worked
    • native-country: Native Country (United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands)
  • Target Variable

    • income: Income Class (<=50K, >50K)
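A quick way to inspect the raw data against this schema (assuming census.csv sits in the working directory):

import pandas as pd

data = pd.read_csv('census.csv')
print(data.shape)                     # roughly 32,000 rows per the description above
print(data['income'].value_counts())  # class balance between '<=50K' and '>50K'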

Results

Metric           Naive Predictor   Unoptimized Model   Optimized Model
Accuracy Score   0.2478            0.8638              0.8709
F-score          0.2917            0.7333              0.7446
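For context, the naive predictor here appears to be the usual always-predict->50K baseline: its recall is 1.0 and its precision equals the positive-class fraction, which reproduces the table's F-score (this interpretation is an assumption; the arithmetic is a sanity check):

# Sanity check of the naive baseline: always predicting '>50K' yields
# precision equal to the positive-class fraction and recall of 1.0.
precision, recall, beta = 0.2478, 1.0, 0.5
fscore = (1 + beta**2) * (precision * recall) / (beta**2 * precision + recall)
print(round(fscore, 4))  # 0.2917, matching the table above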

Loading the Trained Model

  • You can download the optimized model and load it with the following code:

import pickle

# Read the pickled model back from disk.
filename = 'optimized_model.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
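Once loaded, the model behaves like any fitted scikit-learn estimator; for example (X_test here is a placeholder for preprocessed feature data, not a variable defined above):

# The counterpart save step would look like this (assumed, not shown here):
# with open('optimized_model.sav', 'wb') as f:
#     pickle.dump(best_clf, f)

predictions = loaded_model.predict(X_test)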

References

Author

License

MIT License

Inspired by Udacity Machine Learning Engineer Nanodegree.