Product Classification

The goal of the task is to predict the primary category of a product using the description of the product. The dataset used for the task is based on Flipkart Products.

Data Preparation

Labels

  • The dataset does not directly have a primary category attribute. However, it contains an attribute named product_category_tree, from which the primary category can be extracted.
  • A few products are not assigned to any primary category; I have grouped these together and dropped the corresponding rows.
  • Further, the dataset is very imbalanced.
    • Approximately 30% of the products belong to the 'Clothing' category alone, while many categories contain only a handful of products.
    • Thus, I have focused only on the top 15 categories by product count (see the sketch after this list).
    • Adding more categories would decrease accuracy because those categories lack training examples.
    • On the other hand, restricting to only the top 5 or top 10 categories would improve accuracy further.
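A minimal sketch of the label-extraction step, assuming the raw CSV stores product_category_tree values as strings like ["Clothing >> Women's Clothing >> ..."] (the file name and this exact string format are assumptions):

```python
import pandas as pd

# Load the raw Flipkart products dump (file name is an assumption).
df = pd.read_csv("flipkart_com-ecommerce_sample.csv")

def primary_category(tree: str) -> str:
    # Strip the surrounding brackets/quotes and keep the first level of the tree.
    return str(tree).strip('[]"').split(">>")[0].strip()

# Drop products without a category tree or a description.
df = df.dropna(subset=["product_category_tree", "description"])
df["primary_category"] = df["product_category_tree"].apply(primary_category)
df = df[df["primary_category"] != ""]

# Keep only the 15 most frequent categories.
top15 = df["primary_category"].value_counts().nlargest(15).index
df = df[df["primary_category"].isin(top15)]
```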

Description

  • The description has been processed as follows (a sketch of this pipeline is shown below):
    • Remove any non-alphabetic characters and extra whitespace.
    • Convert the strings to lowercase.
    • Remove stopwords using the stopword list from the spaCy library.
  • The preprocessing produces preprocessed.csv, which is used for the LSTM and BERT models and contains only two attributes: primary_category and description.
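A minimal sketch of the three preprocessing steps, using spaCy's built-in English stopword list (the exact order of operations in the notebook is an assumption):

```python
import re
from spacy.lang.en.stop_words import STOP_WORDS

def preprocess(text: str) -> str:
    # Keep alphabetic characters only, collapse extra whitespace, and lowercase.
    text = re.sub(r"[^a-zA-Z\s]", " ", str(text))
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Drop spaCy's English stopwords.
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

df["description"] = df["description"].apply(preprocess)
df[["primary_category", "description"]].to_csv("preprocessed.csv", index=False)
```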

Insights and Visualisations

  • I have mainly focused on two visualisations (a pandas sketch of both follows this list):
    • Category-wise product counts
      • Through this, I was able to eliminate categories with very low counts that would possibly decrease the classifier's accuracy.
    • Category-wise average description lengths
      • This gives us some insight into the way products of different categories are described.
      • Categories like 'Automotive' and 'Computers' have very lengthy descriptions on average. This can be attributed to the need for technical specifications for such products.
  • I have also used the Scattertext library to find the terms most associated with different categories.
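A quick pandas sketch of the two summaries above (description length is measured in words here; whether the notebook counts words or characters is an assumption):

```python
import matplotlib.pyplot as plt

# Category-wise product counts.
df["primary_category"].value_counts().plot(kind="bar", title="Products per category")
plt.tight_layout()
plt.show()

# Category-wise average description length, measured in words.
word_counts = df["description"].str.split().str.len()
word_counts.groupby(df["primary_category"]).mean().sort_values(ascending=False) \
    .plot(kind="bar", title="Average description length (words)")
plt.tight_layout()
plt.show()
```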

Models and Results

  • I have considered three different models for the classification task: a traditional ML model, a sequence model, and a transformer model.
  • The experiments have been conducted with a Train:Test split of 80:20.
  • The reported metrics correspond to the test set.
  • Metrics used (computed as in the sketch below):
    • Mean accuracy
    • Weighted F1, since the dataset is imbalanced
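A minimal sketch of how these two metrics can be computed with scikit-learn (the variable names are placeholders for the test-set labels and the model's predictions):

```python
from sklearn.metrics import accuracy_score, f1_score

accuracy = accuracy_score(y_test, y_pred)
weighted_f1 = f1_score(y_test, y_pred, average="weighted")
print(f"Mean accuracy: {accuracy:.3%}, weighted F1: {weighted_f1:.3%}")
```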

Naive Bayes Classifier

  • I have considered two variants (see the sketch after this list):
    • Count Vectorizer
      • Mean Accuracy: 95.759%
      • Weighted F1: 95.583%
    • TF-IDF
      • Mean Accuracy: 89.848%
      • Weighted F1: 88.623%
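A minimal scikit-learn sketch of the two variants, assuming a MultinomialNB classifier with default hyperparameters (the exact Naive Bayes variant and settings used in the notebook are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df["description"], df["primary_category"], test_size=0.2, random_state=42
)

for name, vectorizer in [("Count Vectorizer", CountVectorizer()),
                         ("TF-IDF", TfidfVectorizer())]:
    clf = make_pipeline(vectorizer, MultinomialNB())
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", clf.score(X_test, y_test))
```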

LSTM Classifier

  • I have used a model with 3 bidirectional LSTM layers (see the sketch after this list). The loss function is CrossEntropyLoss and the optimiser is Adam.
  • The model was trained for 25 epochs on Google Colab.
  • Results:
    • Mean Accuracy: 97.433%
    • Weighted F1: 97.484%
  • To increase the accuracy further:
    • Since the average description length is greater than 300 (approximately 315), it is possible that the LSTM is unable to capture all of the relevant information.
    • We could experiment with increasing the number of layers and the hidden dimension size.
    • However, a better approach would be to introduce an attention mechanism, so that the model assigns higher weight to the terms most strongly associated with each category.
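A minimal PyTorch sketch of the architecture described above (the embedding size, hidden size, vocabulary size, and the use of the final time step for classification are assumptions; the notebook's exact configuration may differ):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embedding -> 3 stacked bidirectional LSTM layers -> linear classifier."""

    def __init__(self, vocab_size: int, embed_dim: int = 128,
                 hidden_dim: int = 128, num_classes: int = 15):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        output, _ = self.lstm(embedded)        # (batch, seq_len, 2 * hidden_dim)
        return self.fc(output[:, -1, :])       # logits over the 15 categories

model = LSTMClassifier(vocab_size=20_000)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```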

BERT Classifier

  • I have fine-tuned the bert-base-uncased model from the transformers library, along with its corresponding tokenizer (see the sketch after this list).
  • The model was trained for just one epoch on Google Colab.
  • Results:
    • Mean Accuracy: 94.058%
    • Weighted F1: 93.660%
  • To increase the accuracy further:
    • Train for more epochs.
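A minimal sketch of the fine-tuning setup using the transformers Trainer API (whether the notebook uses Trainer or a manual training loop, and the batch size, are assumptions):

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=15
)

# train_dataset and eval_dataset are assumed to be torch Datasets whose items
# contain the tokenizer's output for a description plus an integer "labels" field.
args = TrainingArguments(output_dir="bert_output", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
```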

Instructions to Run

  1. Clone the repository: git clone
  2. Run the bash script to create a new virtual environment and install required libraries: ./setup.sh
  3. Run naive_bayes.ipynb first since it also contains the exploration and visualisation code.

Resources