
In this repository, I use text data to help auto-categorize new products on an e-commerce platform. Steps include filtering by POS tags, lemmatization, tokenization using TfidfVectorizer, and simple modeling.

Primary language: Jupyter Notebook


This project shows how machine learning can help automate product categorization at an online marketplace, where thousands of new products are listed every day. Auto-categorization with reasonable accuracy helps the company reduce manual work and speed up product display and recommendation on the platform, which in turn boosts sales and improves the customer shopping experience.

Text preprocessing

The text data, which is in French, is processed using POS tagging, lemmatization, and tokenization with TfidfVectorizer and the Keras Tokenizer. Some simple classification models are then used to demonstrate the categorization step, evaluate the results, and suggest future improvements.
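As a rough illustration of the vectorization step, here is a minimal sketch with scikit-learn's TfidfVectorizer. The three French strings are invented for the example and are assumed to be already POS-filtered and lemmatized; the settings shown are library defaults, not necessarily the ones used in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy French product texts, invented for illustration. They are assumed to be
# already POS-filtered and lemmatized (those steps are not shown here).
docs = [
    "chaussure running homme",
    "chaussure ville femme cuir",
    "casque audio bluetooth",
]

# Default settings: lowercasing, word unigrams, L2-normalized rows.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each learned term to its column index in X.
terms = sorted(vectorizer.vocabulary_)
print(X.shape)
print(terms)
```

Each row of `X` is a weighted bag of words for one product, which is the kind of input a linear classifier can consume directly.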


Models used include:

  • SGDClassifier: accuracy score at
  • Multilayer Perceptron with TensorFlow/Keras:
    • Settings: 1 hidden layer
    • To prevent overfitting: using a Dropout layer and EarlyStopping.
    • Hyperparameter tuning using 2 Keras Tuners: RandomSearch & Hyperband.
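To give an idea of the simpler of the two models, here is a minimal sketch of a TF-IDF + SGDClassifier pipeline. The titles and the category ids (101, 202) are hypothetical toy values, and the classifier uses default hyperparameters plus a fixed `random_state`; the notebook's actual settings may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: product titles and made-up category ids.
titles = [
    "chaussure running homme",
    "chaussure ville femme cuir",
    "basket sport enfant",
    "casque audio bluetooth",
    "enceinte bluetooth portable",
    "ecouteur audio sport",
]
labels = [101, 101, 101, 202, 202, 202]

# Chain vectorization and a linear model trained with stochastic gradient
# descent (default hinge loss, i.e. a linear SVM).
clf = make_pipeline(TfidfVectorizer(), SGDClassifier(random_state=0))
clf.fit(titles, labels)

# Words unseen during training ("sans", "fil") are simply ignored.
preds = clf.predict(["chaussure cuir homme", "casque sans fil"])
print(preds)
```

Wrapping both steps in one pipeline keeps the vocabulary fitted on training data only, which matters once a train/test split is introduced.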


The full Python code can be downloaded from this repo. A browser view is also available here: Auto_Categorization.ipynb.


The dataset contains product information as follows:

  • article_id: Unique id for each item
  • brand: Item brand
  • provider: Seller's name
  • title: Product title
  • description: Detailed description
  • price: Item price
  • category_from_provider: Product category set by seller
  • category_id: Target category to predict

There are 10,000 instances provided, corresponding to nearly 2,000 product categories. As the number of instances is relatively small compared to the number of classes, the models are likely to underfit and perform poorly. For demonstration purposes, I'll use the instances that correspond to the 500 most popular categories, which account for more than 7,300 rows. In practice, the same steps can be applied to a higher number of categories, provided that the available training data is sufficiently large.
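The category filter described above can be sketched with pandas as follows. The helper name `keep_top_categories` and the toy DataFrame are assumptions made for this example; only the `article_id` and `category_id` column names come from the dataset description.

```python
import pandas as pd


def keep_top_categories(df, n, target="category_id"):
    """Keep only the rows whose target is among the n most frequent values."""
    counts = df[target].value_counts()   # rows per category, descending
    top = counts.nlargest(n).index       # ids of the n most frequent categories
    return df[df[target].isin(top)]


# Toy data for illustration: category 1 has 3 rows, 2 has 2 rows, 3 has 1 row.
toy = pd.DataFrame({
    "article_id": range(6),
    "category_id": [1, 1, 1, 2, 2, 3],
})

subset = keep_top_categories(toy, n=2)
coverage = len(subset) / len(toy)  # share of rows kept after filtering
print(len(subset), round(coverage, 2))
```

On the real data, the same call with `n=500` would correspond to the subset of more than 7,300 rows mentioned above.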


This dataset is provided and used for educational purposes only.