This project shows how Machine Learning can help automate the categorization procedure at an online marketplace, where thousands of new products are listed every day. Auto-categorization with reasonable accuracy helps the company to reduce manual work, speed up the product display & recommendation on the platform, hence boosting sales activities & improve customer shopping experience.
The text data in French are processed using pos-tags, lemmatization and tokenization with TfidfVectorizer
and Keras Tokenizer
. Then, some simple classification models are used to demonstrate the categorization step, evaluate the results & suggest future improvement.
Models used include:
SGDClassifier
: accuracy score atMultilayer Perceptron
withTensorFlow/Keras
:- Settings: 1 hidden layer
- To prevent overfitting: using a
Dropout
layer andEarlyStopping
. - Hyperparameter tuning using 2 Keras Tuners:
RandomSearch
&Hyperband
.
The full Python code can be downloaded from this repo. A browser view is also available here: Auto_Categorization.ipynb .
The dataset contains product information as follows:
article_id
: Unique id for each itembrand
: Item brandprovider
: Seller's nametitle
: Product titledescription
: Detailed descriptionprice
: Item pricecategory_from_provider
: Product category set by sellercategory_id
: Target category to predict
There are 10,000 instances provided, corresponding to nearly 2,000 product categories. As the number of instances are relatively small compared to the number of classes, the models are likely to underfit and perform poorly. For the demontration purpose, I'll use instances correspond to the 500 most popular categories, accounting for +7,300 rows. In practice, the same steps can be applied to higher number of categories, provided that the available train data is sufficiently large.
This dataset is provided & used for educational purpose only.