In this project, we use machine learning models to analyze e-commerce data in order to determine the probability of a website visitor making a purchase.
The goals of the project are to conduct exploratory data analysis, build one or more models for the task, interpret the model predictions, create an interactive dashboard, and package it in a Docker container.
- Project Description
- Files
- Dataset
- ML Model
- Deployment
- How to Install and Run the Project
- How to Use the Project
- Credits
- License
- EDA.ipynb : Jupyter Notebook with Exploratory Data Analysis
- pipeline.ipynb : Jupyter Notebook with ML pipeline
- dashboard.html : HTML ExplainerDashboard
- Dockerfile : File for creating a Docker Container with ExplainerDashboard
- app.py : File that generates the dashboard
- dashboard.py : File that launches the dashboard
We obtained the dataset for our analysis from the UC Irvine Machine Learning Repository (Online Shoppers Purchasing Intention Dataset). Each row represents one visit "session" of a user on an e-commerce website and contains a feature vector describing that session. The dataset was constructed so that each session belongs to a different user within a 1-year period, which avoids bias toward a specific campaign, special day, user profile, or period. The total number of sessions in the dataset is 12,330.
Target Variable
Revenue (categorical, bool) : whether the user has made a purchase
All features
Feature | Description | Type |
---|---|---|
Revenue | TARGET LABEL: whether the visitor made a purchase (True) or not (False) | Categorical, boolean |
Administrative | the number of pages of this type (administrative) that the user visited | Numerical, int |
Administrative_Duration | the amount of time spent on pages of this category (administrative) | Numerical, float |
Informational | the number of pages of this type (informational) that the user visited | Numerical, int |
Informational_Duration | the amount of time spent on pages of this category (informational) | Numerical, float |
ProductRelated | the number of pages of this type (product related) that the user visited | Numerical, int |
ProductRelated_Duration | the amount of time spent on pages of this category (product related) | Numerical, float |
BounceRates | the percentage of visitors who enter the website through that page and exit without triggering any additional requests (a Google Analytics metric) | Numerical, float |
ExitRates | the percentage of pageviews on the website that end at that specific page (a Google Analytics metric) | Numerical, float |
PageValues | the average value of a page that the user visited before reaching the target page and/or completing an e-commerce transaction (a Google Analytics metric) | Numerical, float |
SpecialDay | the closeness of the visit date to a specific special day (e.g. Mother's Day) | Numerical, float |
Month | the month of the year in which the visit took place | Categorical, object |
OperatingSystems | user's operating system | Categorical, int |
Browser | user's browser | Categorical, int |
Region | user's region | Categorical, int |
TrafficType | the type of traffic that brought the visitor to the website | Categorical, int |
VisitorType | visitor type (Returning_Visitor, New_Visitor, or Other) | Categorical, object |
Weekend | whether the visit took place on a weekend | Categorical, boolean |
Before modeling and prediction, we perform exploratory data analysis (EDA) to identify the main characteristics of the data and to test the assumptions on which the model is built.
During EDA, we analyze the distributions of the features, identify outliers, duplicates, and missing values, and plot graphs to visually assess the relationships between variables. We also examine correlation coefficients and the results of Pearson's chi-squared test.
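As a rough illustration of the chi-squared step, the sketch below tests whether a categorical feature is independent of the target. It assumes pandas and SciPy are installed, and the handful of rows is made up for the example; it is not taken from the real dataset or from EDA.ipynb.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Tiny made-up sample, NOT the real dataset: it only shows the mechanics.
sessions = pd.DataFrame({
    "VisitorType": ["Returning_Visitor"] * 6 + ["New_Visitor"] * 6,
    "Revenue": [False, False, False, False, False, True,
                False, False, True, True, True, True],
})

# Contingency table: counts of each (VisitorType, Revenue) combination.
contingency = pd.crosstab(sessions["VisitorType"], sessions["Revenue"])

# Pearson's chi-squared test of independence between the two variables.
chi2, p_value, dof, expected = chi2_contingency(contingency)

print(contingency)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

A small p-value suggests that the feature and the target are not independent. On the real data the same two calls could be applied to each categorical column in turn.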
In this project, we have built several models that solve the problem.
We settled on the NB model with hyperparameters selected by GridSearchCV. The F1 score was chosen as the target metric. The variables were also encoded with one-hot encoding and standardized with StandardScaler.
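The setup described above could look roughly like the following scikit-learn sketch. Several things here are assumptions rather than the contents of pipeline.ipynb: "NB" is taken to mean Gaussian Naive Bayes, the encoding is taken to be one-hot, the searched grid over `var_smoothing` is illustrative, and the data is a random synthetic stand-in that borrows three column names from the feature table.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the sessions table (random values, not real data).
rng = np.random.default_rng(42)
n = 300
X = pd.DataFrame({
    "PageValues": rng.exponential(scale=5.0, size=n),
    "ExitRates": rng.uniform(0.0, 0.2, size=n),
    "VisitorType": rng.choice(
        ["Returning_Visitor", "New_Visitor", "Other"], size=n
    ),
})
# Artificial target so that the example has something to learn.
y = (X["PageValues"] > 5.0).astype(int).to_numpy()

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["PageValues", "ExitRates"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["VisitorType"]),
    ],
    sparse_threshold=0.0,  # always return a dense array, as GaussianNB needs
)
pipeline = Pipeline([("preprocess", preprocess), ("model", GaussianNB())])

# Grid search over one hyperparameter, scored by F1 with 3-fold CV.
search = GridSearchCV(
    pipeline,
    param_grid={"model__var_smoothing": [1e-9, 1e-6, 1e-3]},
    scoring="f1",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are refit on each training fold, so the cross-validated F1 score is not inflated by information from the validation fold.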
In this project, we built an interactive dashboard and wrapped it in a Docker container.
Using Docker, we created a Dockerfile that contains all the necessary instructions to build a container with our dashboard.
Please note that you need to have Docker installed on your computer to perform the steps below. If you do not have it, please install Docker before starting.
To start the dashboard, follow these steps:
- Open a terminal on your computer.
- Download the image from Docker Hub by running the command:
$ docker pull moxeeeem/explainerdashboard
This will download the dashboard image to your computer.
- Start a container from the downloaded image by running the command:
$ docker run moxeeeem/explainerdashboard
This will create and start the dashboard container.
- A link will appear in the terminal. Copy this link and paste it into the address bar of your web browser.
After clicking the link, you will see the dashboard open in your web browser. You can now view and use the dashboard to analyze the data.
This dashboard allows you to investigate SHAP values, permutation importances, interaction effects, partial dependence plots, all kinds of performance plots, and even individual decision trees inside a random forest.
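To give a feel for one of these views, the sketch below computes permutation importances directly with scikit-learn. It is not the dashboard's own code: the model, the two feature names, and the data are hypothetical and synthetic, chosen so that one feature determines the label and the other is pure noise.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.naive_bayes import GaussianNB

# Synthetic data: the first column determines the label, the second is noise.
rng = np.random.default_rng(0)
n = 400
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

model = GaussianNB().fit(X, y)

# Shuffle each column in turn and measure how much the F1 score drops.
result = permutation_importance(
    model, X, y, scoring="f1", n_repeats=10, random_state=0
)
for name, mean in zip(["informative", "noise"], result.importances_mean):
    print(f"{name}: {mean:.3f}")
```

A feature whose shuffling barely changes the score contributes little to the predictions, which is the same reading the permutation importance plot in the dashboard supports.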
This project was completed as part of the "Exploratory Data Analysis and Development Fundamentals" course offered by AI Education.
This project is licensed under the MIT license. For more information, see the LICENSE file.