Customer Analysis

Dataset

The dataset is from kaggle.com and can be found here.

It contains user data from October and November 2019.

Usage

Download the data from here.
Put the data in /src/data and unzip it to CSV.
Make sure you have installed all modules from requirements.txt.
Run explore.ipynb for an initial analysis. (Attention: the whole dataset is very large. For debugging, use src/data/test_data.csv.)
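
Before opening the notebook, it can help to check that the debug file is readable and has the expected columns. The following is only a minimal sketch using the standard library; the helper name peek_csv is hypothetical and not part of this repository, and it assumes the CSV has a header row and an event_type column (the column mentioned in the FPGrowth section).

```python
import csv
from collections import Counter
from itertools import islice


def peek_csv(path, n_rows=1000):
    """Return the column names and event_type counts of the first n_rows rows.

    Hypothetical helper for a quick sanity check; not part of the project code.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Read only a slice so the function stays cheap on the full dataset.
        rows = list(islice(reader, n_rows))
        columns = reader.fieldnames or []
    # Count event types only if the column exists in this file.
    counts = Counter(row["event_type"] for row in rows if "event_type" in row)
    return columns, counts
```

Calling peek_csv("src/data/test_data.csv") from the repository root should then return the header and a rough distribution of event types, without loading the file into Spark.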

Usage K-Means

Execute cluster_customer.py
In case of an error, create a log folder inside the modelling folder with mkdir log (the empty folder for log files is not synced with git).
Check that the call train, test, dev = customer.prep_data(False) in line 175 passes False, so that the data is created from scratch.
Coffee break! This will take some time ;)
Rename the created customer profile CSV (data/customer_profile_new.csv) inside the created folder and move it into src/data.
Next time, use the exported CSV instead of the raw data: set the argument in line 175 back to True.
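
The True/False switch described above is a simple "build once, then reuse the export" pattern. The sketch below illustrates the idea with the standard library only; it is not the actual implementation of customer.prep_data, and the names load_or_build, cache_path, build_rows and use_cache are hypothetical. It assumes the build step returns a non-empty list of dictionaries.

```python
import csv
import os


def load_or_build(cache_path, build_rows, use_cache):
    """Load rows from cache_path if use_cache is True, else rebuild and export them.

    Hypothetical illustration of the flag in line 175; not the project's code.
    """
    if use_cache:
        if not os.path.exists(cache_path):
            raise FileNotFoundError(
                f"No exported CSV at {cache_path}; run once with use_cache=False."
            )
        with open(cache_path, newline="", encoding="utf-8") as handle:
            return list(csv.DictReader(handle))

    rows = build_rows()  # the slow step on the raw data
    # Create the target folder if it is missing (empty folders are not kept in git).
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Note that values read back from the CSV are strings, so a real pipeline would need to restore numeric types after loading.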

Usage FPGrowth

File: /modelling/product_analysis/product_fpm.py

To use the entire dataset, set small_dataset=False in line 70.
To filter the data on the event_type column, set filter_element in line 71 to purchases, carts, or views.
The "focus" element in line 71 is the column used for pattern mining. Useful columns to analyse are "product_id" and "category_code".
The parameters in line 72 are passed directly to the default FPGrowth model from PySpark.
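
To make the roles of the filter and the "focus" column concrete, here is a rough sketch of the general approach, not the code in product_fpm.py. It assumes that events are grouped into baskets by a user_session column and that the raw event_type value for purchases is "purchase"; both are assumptions about the dataset, and the function names are hypothetical. The second function assumes the DataFrame-based pyspark.ml.fpm.FPGrowth with its minSupport and minConfidence parameters.

```python
from collections import defaultdict


def build_baskets(events, focus="product_id", event_type="purchase"):
    """Group the distinct `focus` values of matching events by user session."""
    # Sets are used because FPGrowth requires unique items within a basket.
    baskets = defaultdict(set)
    for event in events:
        if event["event_type"] == event_type and event.get(focus):
            baskets[event["user_session"]].add(event[focus])
    # Sort sessions and items so the output is deterministic.
    return [sorted(items) for _, items in sorted(baskets.items())]


def mine_patterns(baskets, min_support=0.01, min_confidence=0.5):
    """Fit PySpark's FPGrowth on the baskets; return (frequent itemsets, rules)."""
    # Imported lazily so build_baskets works without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth

    spark = SparkSession.builder.appName("fpgrowth-sketch").getOrCreate()
    df = spark.createDataFrame(
        [(i, items) for i, items in enumerate(baskets)], ["id", "items"]
    )
    model = FPGrowth(
        itemsCol="items", minSupport=min_support, minConfidence=min_confidence
    ).fit(df)
    return model.freqItemsets, model.associationRules
```

Switching focus to "category_code" mines patterns over categories instead of individual products; the thresholds shown are placeholder values and would need tuning on the real data.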

About the project

Team

Our Team:

Project Target

The goal of this project is to analyse customer behavior in online shops / e-commerce.

Tools

Spark

Spark, with PySpark as its Python module, is used as the "analytics engine" and modelling tool.

For more about Spark, see here.

Plotly

For easy, interactive plotting, the library Plotly is used.

The figures are saved in src/data/exports. For more about Plotly, see here.

Documentation

Our more detailed documentation (in German) can be found here.

To view the interactive Plotly plots, open the HTML files in src/data/exports. For more plots, switch to the branch "notebook_runned" to see the results on the whole dataset (runtime >3h).

BennerLukas/customeranalysis