This project is a data analysis exercise based on data mining tools and carried out in Python. The guidelines require addressing a set of specific tasks and reporting the results in a single paper. Well-commented Python notebooks contain the code for each task.
Before going into the details of the tasks, here are some tips for running and viewing the supplied notebooks correctly.
Create a virtual environment and install the dependencies:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
In the presentations subfolder you can find the slides we used to present our project and to discuss Evaluation of Explainable AI (link to the original paper).
Explore the dataset with the analytical tools studied and write a concise “data understanding” report that describes the data semantics and assesses data quality, the distributions of the variables, and the pairwise correlations.
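A minimal sketch of such a first pass with pandas is shown here. The column names (`CustomerID`, `Qta`, `Sale`) and the toy values are hypothetical placeholders, not the project dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; column names and values are hypothetical.
df = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3, None],
    "Qta": [2, 5, 1, 1, 10, 3],
    "Sale": [1.5, 3.0, 20.0, 20.0, 0.5, 2.0],
})

# Data quality: missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# Distributions: summary statistics of the numeric variables.
summary = df.describe()

# Pairwise correlations between numeric variables (Pearson by default).
correlations = df[["Qta", "Sale"]].corr()

print(missing_per_column, duplicate_rows, summary, correlations, sep="\n\n")
```

On real data the same calls would typically be complemented by histograms and box plots of each variable.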
Improve the quality of the data and prepare it by extracting new features that describe the customer profile and purchasing behavior.
Based on the customer profiles, explore the dataset using various clustering techniques. Different algorithms and approaches must be compared:
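As an illustration, per-customer features can be derived by aggregating the transaction rows. The sketch below assumes a table with one row per purchased item; all column names, and the choice of indicators, are hypothetical.

```python
import pandas as pd

# Hypothetical transaction table: one row per purchased item.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "BasketID": ["a", "a", "b", "c", "c"],
    "ProdID": ["p1", "p2", "p1", "p3", "p3"],
    "Qta": [2, 1, 4, 1, 3],
    "Sale": [1.5, 10.0, 1.5, 7.0, 7.0],  # assumed to be the unit price
})
df["Amount"] = df["Qta"] * df["Sale"]  # money spent on each row

# One row per customer with a few profile indicators.
profile = df.groupby("CustomerID").agg(
    total_items=("Qta", "sum"),
    distinct_items=("ProdID", "nunique"),
    n_baskets=("BasketID", "nunique"),
    total_spent=("Amount", "sum"),
)
profile["avg_basket_value"] = profile["total_spent"] / profile["n_baskets"]
print(profile)
```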
- K-means
- Density-based clustering (DBSCAN)
- Hierarchical clustering
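One possible way to compare the three approaches is to fit each of them on the same scaled features and compute a common quality measure. The sketch below uses scikit-learn on synthetic data; the parameter values (`n_clusters`, `eps`, `min_samples`, Ward linkage) and the use of the silhouette score are assumptions for illustration, not the settings used in the notebooks.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer features: two well-separated groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(30, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(30, 2)),
])
X_scaled = StandardScaler().fit_transform(X)

models = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "dbscan": DBSCAN(eps=0.5, min_samples=5),
    "hierarchical": AgglomerativeClustering(n_clusters=2, linkage="ward"),
}

scores = {}
for name, model in models.items():
    labels = model.fit_predict(X_scaled)
    n_clusters = len(set(labels) - {-1})  # DBSCAN marks noise points as -1
    # The silhouette score is only defined for at least two clusters.
    scores[name] = silhouette_score(X_scaled, labels) if n_clusters >= 2 else None
    print(name, n_clusters, scores[name])
```

Note that when DBSCAN produces noise points, this sketch lets the silhouette score treat them as one extra group; filtering them out first is an equally reasonable choice.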
Consider the problem of predicting, for each customer, a label that indicates whether they are a high-spending, medium-spending, or low-spending customer. After defining some indicators for assigning these labels, perform the predictive analysis by comparing the performance of different models:
- Decision Tree
- Random Forest
- SVM
- KNN
- Naive Bayes
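The comparison can be sketched with scikit-learn as follows. The data is synthetic, the feature names are hypothetical, and the labelling rule (terciles of total spending) is only one assumed way of defining the three classes; the hyperparameters are illustrative defaults.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic customer profiles; feature names are hypothetical.
rng = np.random.default_rng(0)
n = 300
profile = pd.DataFrame({
    "n_baskets": rng.integers(1, 50, size=n),
    "distinct_items": rng.integers(1, 200, size=n),
})
total_spent = (
    20 * profile["n_baskets"] + profile["distinct_items"] + rng.normal(0, 30, size=n)
)

# Assumed labelling rule: terciles of total spending.
y = pd.qcut(total_spent, q=3, labels=["low", "medium", "high"]).astype(str)

# total_spent defines the label, so it is kept out of the features.
X = profile

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
}

# Mean 5-fold cross-validated accuracy for each model.
accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

SVM and KNN are wrapped in a pipeline with a scaler because both depend on distances between feature vectors; the tree-based models and Naive Bayes are left unscaled.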
Model the customer as a sequence of baskets and apply the sequential pattern mining algorithm.
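To make the idea concrete, the sketch below represents each customer as a time-ordered list of baskets and counts the support of candidate sequences. It is a plain support-counting illustration on hypothetical items, not the specific algorithm or library used in the notebooks.

```python
from itertools import product

def contains(sequence, pattern):
    """Return True if `pattern` is a subsequence of `sequence`.

    Both are lists of sets (baskets); each pattern itemset must be a subset
    of a distinct basket, and the baskets must appear in the same order.
    """
    position = 0
    for itemset in pattern:
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next itemset must match a later basket
    return True

def support(database, pattern):
    """Fraction of customer sequences that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database) / len(database)

# Toy database: one time-ordered list of baskets per customer.
database = [
    [{"bread"}, {"milk", "eggs"}, {"bread", "jam"}],
    [{"bread", "milk"}, {"jam"}],
    [{"milk"}, {"bread"}, {"jam"}],
]

items = sorted(set().union(*(basket for seq in database for basket in seq)))
min_support = 2 / 3

# Frequent patterns made of two single-item baskets, e.g. <{bread}, {jam}>.
frequent = {}
for first, second in product(items, repeat=2):
    pattern = [{first}, {second}]
    s = support(database, pattern)
    if s >= min_support:
        frequent[(first, second)] = s
print(frequent)
```

A full miner would grow longer candidates level by level from the frequent shorter ones instead of enumerating every pair.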
Extra task: frequent pattern and association rule analysis using the Apriori algorithm.
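For reference, a compact pure-Python sketch of Apriori and of rule confidence is given here. The baskets and the `min_support` threshold are hypothetical, and the notebooks may rely on a library implementation instead.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
            and support(c) >= min_support
        }
    return frequent

# Toy baskets with hypothetical items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "jam", "milk"}),
    frozenset({"bread", "jam"}),
    frozenset({"milk", "eggs"}),
]
frequent = apriori(transactions, min_support=0.5)

# Association rules X -> Y with confidence = support(X | Y) / support(X).
rules = {}
for itemset, sup in frequent.items():
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            rules[(antecedent, itemset - antecedent)] = sup / frequent[antecedent]
for (lhs, rhs), conf in rules.items():
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```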