This project analyzes the content of an E-commerce database that lists purchases made by ∼4000 customers over a period of one year (from 2010/12/01 to 2011/12/09). Based on this analysis, I develop a model that anticipates the purchases a new customer will make during the following year, starting from their first purchase.
1. Data Preparation
2. Exploring the content of variables
2.1 Countries
2.2 Customers and products
2.2.1 Cancelling orders
2.2.2 StockCode
2.2.3 Basket price
3. Insight on product categories
3.1 Product description
3.2 Defining product categories
3.2.1 Data encoding
3.2.2 Clusters of products
3.2.3 Characterizing the content of clusters
4. Customer categories
4.1 Formatting data
4.1.1 Grouping products
4.1.2 Time splitting of the dataset
4.1.3 Grouping orders
4.2 Creating customer categories
4.2.1 Data encoding
4.2.2 Creating categories
5. Classifying customers
5.1 Support Vector Machine Classifier (SVC)
5.1.1 Confusion matrix
5.1.2 Learning curves
5.2 Logistic regression
5.3 k-Nearest Neighbors
5.4 Decision Tree
5.5 Random Forest
5.6 AdaBoost
5.7 Gradient Boosting Classifier
5.8 Let's vote!
6. Testing the predictions
7. Conclusion
Then, I load the data. Once done, I also give some basic information on the content of the dataframe: the type of each variable, the number of null values, and their percentage with respect to the total number of entries.

This dataframe contains 8 variables that correspond to:
InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantity of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric, product price per unit in sterling.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.

In the dataframe, products are uniquely identified through the StockCode variable. A short description of the products is given in the Description variable. In this section, I intend to use the content of this latter variable to group the products into different categories.

In this part, the objective is to fit a classifier that assigns consumers to the client categories established in the previous section. The objective is to make this classification possible at the first visit. To fulfill this objective, I will test several classifiers implemented in scikit-learn. First, to simplify their use, I define a class that provides a common interface to several of the functionalities shared by these different classifiers.
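As an illustration of what such an interface class could look like, here is a minimal sketch. The class name ClassifierWrapper, its method names, and the synthetic data are all hypothetical and are not taken from the notebook; the score is plain accuracy from scikit-learn, which is an assumption about what the "Precision" figures report.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC


class ClassifierWrapper:
    """Hypothetical helper giving every classifier the same tune/fit/score steps."""

    def __init__(self, clf_class, **params):
        # Instantiate any scikit-learn estimator class with optional parameters.
        self.clf = clf_class(**params)
        self.grid = None

    def grid_search(self, param_grid, n_folds=5):
        # Prepare a cross-validated search over the given hyperparameters.
        self.grid = GridSearchCV(estimator=self.clf, param_grid=param_grid, cv=n_folds)

    def fit(self, X, y):
        if self.grid is None:
            raise RuntimeError("call grid_search() before fit()")
        self.grid.fit(X, y)
        return self

    def predict(self, X):
        # GridSearchCV predicts with the best estimator found during fit().
        return self.grid.predict(X)

    def score_percent(self, X, y):
        # Assumption: the reported score is the accuracy, in percent.
        return 100 * accuracy_score(y, self.predict(X))


# Synthetic stand-in for the customer features and category labels.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=0)

svc = ClassifierWrapper(SVC)
svc.grid_search({"C": [0.01, 0.1, 1.0, 10.0]})
svc.fit(X_train, y_train)
score = svc.score_percent(X_test, y_test)
print(f"Precision: {score:.2f} %")
```

Swapping SVC for another estimator class (for example LogisticRegression or RandomForestClassifier) with a matching parameter grid reuses the same three calls, which is the point of wrapping them.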
Support Vector Machine Precision: 65.93 %
Logistic Regression Precision: 71.34 %
k-Nearest Neighbors Precision: 67.58 %
Decision Tree Precision: 71.38 %
Random Forest Precision: 75.38 %
Gradient Boosting Precision: 75.23 %
In the previous section, a few classifiers were trained in order to categorize customers. Until that point, the whole analysis was based on the data of the first 10 months. In this section, I test the model on the last two months of the dataset, which have been stored in the set_test dataframe.

The work described in this notebook is based on a database providing details on purchases made on an E-commerce platform over a period of one year. Each entry in the dataset describes the purchase of a product by a particular customer at a given date. In total, approximately 4000 clients appear in the database. Given the available information, I decided to develop a classifier that anticipates the type of purchase a customer will make, as well as the number of visits they will make during a year, starting from their first visit to the E-commerce site.

The first stage of this work consisted in describing the different products sold by the site, which was the subject of a first classification. There, I grouped the different products into 5 main categories of goods. In a second step, I performed a classification of the customers by analyzing their consumption habits over a period of 10 months. I classified clients into 11 major categories based on the type of products they usually buy, the number of visits they make, and the amount they spent during the 10 months. Once these categories were established, I finally trained several classifiers whose objective is to assign consumers to one of these 11 categories from their first purchase. For this, the classifier relies on the following variables:
mean: amount of the basket of the current purchase
categ_N with N ∈ [0:4]: percentage spent in product category with index N

Finally, the quality of the predictions of the different classifiers was tested over the last two months of the dataset. The data were then processed in two steps: first, all the data (over the 2 months) were used to define the category to which each client belongs, and then the classifier predictions were compared with this category assignment. I then found that 75% of clients are assigned to the correct class. The performance of the classifier therefore seems correct given the potential shortcomings of the current model. In particular, a bias that has not been dealt with concerns the seasonality of purchases and the fact that purchasing habits will potentially depend on the time of year (for example, Christmas). In practice, this seasonal effect may cause the categories defined over a 10-month period to be quite different from those extrapolated from the last two months. To correct such bias, it would be beneficial to have data covering a longer period of time.
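To make the input variables listed above (mean and categ_N) concrete, here is a minimal pandas sketch of how they could be derived. The baskets table, its values, and the basket_price column are hypothetical; only the names mean and categ_N come from the text.

```python
import pandas as pd

# Hypothetical table: one row per basket, with the amount spent
# in each of the 5 product categories.
baskets = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "categ_0": [30.0, 10.0, 0.0],
    "categ_1": [0.0, 10.0, 50.0],
    "categ_2": [10.0, 0.0, 0.0],
    "categ_3": [0.0, 0.0, 25.0],
    "categ_4": [0.0, 20.0, 25.0],
})

categ_cols = [f"categ_{n}" for n in range(5)]

# Total amount of each basket.
baskets["basket_price"] = baskets[categ_cols].sum(axis=1)

# Per customer: mean basket amount and total amount spent.
per_customer = baskets.groupby("CustomerID").agg(
    mean=("basket_price", "mean"),
    total=("basket_price", "sum"),
)

# Per customer: share of spending in each category, in percent.
spent = baskets.groupby("CustomerID")[categ_cols].sum()
features = spent.div(per_customer["total"], axis=0) * 100
features.insert(0, "mean", per_customer["mean"])

print(features)
```

For a customer with a single basket, mean reduces to the amount of that basket and the categ_N columns to its split across categories, which matches the first-purchase setting described above.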