The project has three major steps: the customer segmentation report, the supervised learning model, and the Kaggle Competition.
- Customer Segmentation Report: This section will be similar to the corresponding project in Term 1 of the program, but the datasets now include more features that you can potentially use. You'll begin the project by using unsupervised learning methods to analyze attributes of established customers and the general population in order to create customer segments.
- Supervised Learning Model: You'll have access to a third dataset with attributes from targets of a mail order campaign. You'll use the previous analysis to build a machine learning model that predicts whether or not each individual will respond to the campaign.
- Kaggle Competition: Once you've chosen a model, you'll use it to make predictions on the campaign data as part of a Kaggle Competition. You'll rank the individuals by how likely they are to convert to being a customer, and see how your modeling skills measure up against your fellow students.
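The last two steps (fit a response model, then rank individuals by how likely they are to respond) can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's code: the feature names, the data-generating rule, and the use of scikit-learn's `GradientBoostingClassifier` as a stand-in for the final model are all assumptions made for the example, and ROC AUC is used here only as one plausible way to score a ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in for the campaign data (hypothetical feature names).
X = pd.DataFrame(rng.normal(size=(1000, 5)),
                 columns=[f"feat_{i}" for i in range(5)])

# Hypothetical, imbalanced response: only feat_0 influences the outcome.
logits = 1.5 * X["feat_0"] - 3.0
y = (rng.uniform(size=1000) < 1 / (1 + np.exp(-logits))).astype(int)

# Stratify so the rare positive class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Stand-in classifier; any model exposing predict_proba would fit here.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Probability of the positive class (responds to the campaign).
proba = model.predict_proba(X_test)[:, 1]

# Rank individuals from most to least likely to respond.
ranking = (
    pd.DataFrame({"individual": X_test.index,
                  "response_probability": proba})
    .sort_values("response_probability", ascending=False)
    .reset_index(drop=True)
)

# A ranking metric such as ROC AUC depends only on the order of the scores.
auc = roc_auc_score(y_test, proba)
print(ranking.head())
print(f"ROC AUC on the held-out split: {auc:.3f}")
```

Because only the ordering of the scores matters for a ranking, the predicted probabilities are submitted as they are rather than being thresholded into 0/1 labels.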
The project was developed with the following library versions.
- Python 3.6.8
- numpy 1.16.4
- pandas 0.24.2
- matplotlib 3.0.2
- seaborn 0.9.0
- scikit-learn 0.21.2
- xgboost 0.90
- ipython 6.1.0
- ipython-genutils 0.2.0
- jupyter-client 5.1.0
- jupyter-console 5.2.0
- jupyter-core 4.3.0
- jupyterlab 0.35.6
- jupyterlab-launcher 0.4.0
- jupyterlab-server 0.2.0
Alternatively, run the following commands to set up the environment.
conda create --name arvato-project python=3.6
source activate arvato-project
pip install -r requirements/requirements.txt
- terms_and_conditions
|- terms_completed.md
|- terms.md
|- terms.pdf
- Arvato Project Workbook.ipynb
- Arvato Project Workbook.html
- Project Rubric.pdf
- requirements.txt
- README.md
- Data Preprocessing
  - missing value distribution
  - missing value distribution after converting unknown values to NA
  - categorical features
  - quantitative features
  - quantitative features after log transform and outlier capping
  - drop
- Customer Segmentation Report
  - PCA
  - KMeans
  - cluster
  - feature difference
- Supervised Learning Model
  - data distribution
  - model selection
  - XGBoost
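The preprocessing and segmentation steps in the outline above can be sketched roughly as follows. This is an illustrative example on toy data, not the notebook's code: the column names, the unknown-value code `-1`, the 1st/99th percentile caps, the median imputation, and the choice of 2 components and 3 clusters are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)

def make_toy(n, shift):
    """Toy data with hypothetical columns; -1 encodes 'unknown' in `score`."""
    return pd.DataFrame({
        "score": rng.choice([-1, 1, 2, 3, 4, 5], size=n).astype(float),
        "income": rng.lognormal(mean=10 + shift, sigma=1.0, size=n),
        "age": rng.normal(loc=45 + 10 * shift, scale=12, size=n),
    })

population = make_toy(500, shift=0.0)
customers = make_toy(200, shift=0.5)

# Hypothetical mapping of columns to the codes that mean "unknown".
UNKNOWN_CODES = {"score": [-1]}

def clean(df):
    out = df.copy()
    # Convert unknown-value codes to NA.
    for col, codes in UNKNOWN_CODES.items():
        out[col] = out[col].mask(out[col].isin(codes))
    # Log-transform the skewed quantitative feature.
    out["income"] = np.log1p(out["income"])
    return out

pop_clean, cust_clean = clean(population), clean(customers)

# Cap outliers at population percentiles, then impute with population medians.
lower, upper = pop_clean.quantile(0.01), pop_clean.quantile(0.99)
pop_clean = pop_clean.clip(lower=lower, upper=upper, axis=1)
cust_clean = cust_clean.clip(lower=lower, upper=upper, axis=1)
medians = pop_clean.median()
pop_clean, cust_clean = pop_clean.fillna(medians), cust_clean.fillna(medians)

# Fit scaling, PCA and KMeans on the population, then apply to the customers.
scaler = StandardScaler().fit(pop_clean)
pca = PCA(n_components=2, random_state=42).fit(scaler.transform(pop_clean))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
pop_labels = kmeans.fit_predict(pca.transform(scaler.transform(pop_clean)))
cust_labels = kmeans.predict(pca.transform(scaler.transform(cust_clean)))

# Compare the share of each cluster in the two groups.
share = pd.DataFrame({
    "population": pd.Series(pop_labels).value_counts(normalize=True),
    "customers": pd.Series(cust_labels).value_counts(normalize=True),
}).fillna(0.0).sort_index()
share["ratio"] = share["customers"] / share["population"]
print(share)
```

Clusters with a ratio above 1 are over-represented among customers relative to the general population. Fitting every transformer on the population and only applying it to the customers keeps both groups in the same feature space, so their cluster shares are comparable.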
The main findings of the code can be found at the post available here.
In addition to Udacity's Terms of Use and other policies, your downloading and use of the AZ Direct GmbH data solely for use in the Unsupervised Learning and Bertelsmann Capstone projects are governed by the following additional terms and conditions. The big takeaways:
- You agree to AZ Direct GmbH's General Terms provided below and that you only have the right to download and use the AZ Direct GmbH data solely to complete the data mining task which is part of the Unsupervised Learning and Bertelsmann Capstone projects for the Udacity Data Science Nanodegree program.
- You are prohibited from using the AZ Direct GmbH data in any other context.
- You are also required and hereby represent and warrant that you will delete any and all data you downloaded within 2 weeks after your completion of the Unsupervised Learning and Bertelsmann Capstone projects and the program.
- If you do not agree to these additional terms, you will not be allowed to access the data for this project. The full terms are provided in the workspace below, and you may download them if you would like to read them in full. In the next workspace, you will be asked to agree to these terms before gaining access to the project.