Fashion Product Image-to-Text Prediction

This project predicts the display names of fashion products from their images using two distinct approaches. The dataset comprises fashion product images and their attributes, such as category, color, and season. The goal is to convert each image into a descriptive display name.

Dataset

The dataset used for this project is the Fashion Product Images Dataset from Kaggle. It includes:

Images of fashion products.
Attributes for each image, including category, color, brand, and season (see Classification of These Attributes).
Display names, which are the target labels to predict.
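Because the display names are the targets of a sequence model, they have to be turned into token ids at some point. The following is a minimal sketch of that step, not the notebooks' actual preprocessing: the in-memory CSV rows are made up, the column names (`baseColour`, `season`, `productDisplayName`) are assumed to match the Kaggle metadata file, and `build_vocab`, `encode`, and the special tokens are hypothetical helpers introduced for illustration.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for the dataset's metadata file; the rows are invented
# and the column names are an assumption about the Kaggle CSV.
SAMPLE_CSV = """id,baseColour,season,productDisplayName
1,Navy Blue,Fall,Brand A Men Navy Blue Shirt
2,Black,Summer,Brand B Women Black Heels
3,Black,Fall,Brand A Men Black Shirt
"""

# Hypothetical special tokens: padding, sequence start/end, unknown word.
SPECIALS = ["<pad>", "<start>", "<end>", "<unk>"]


def build_vocab(names, min_count=1):
    """Map each lowercase whitespace token to an integer id."""
    counts = Counter(tok for name in names for tok in name.lower().split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {tok: idx for idx, tok in enumerate(SPECIALS + words)}


def encode(name, vocab):
    """Turn a display name into <start> ... <end> token ids."""
    unk = vocab["<unk>"]
    body = [vocab.get(tok, unk) for tok in name.lower().split()]
    return [vocab["<start>"]] + body + [vocab["<end>"]]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
names = [row["productDisplayName"] for row in rows]
vocab = build_vocab(names)
encoded = encode("Brand A Women Navy Shirt", vocab)
print(encoded)
```

Words that never appear in the training names map to `<unk>`, so raising `min_count` trades vocabulary size against how many rare brand or product words the decoder can emit.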

Pre-Trained Classification Model

Both approaches below use a pre-trained model developed for multi-label classification of fashion products.

You can find the details and code for this model in the Fashion Product Multilabel Classification repository.

Approaches

Two approaches were implemented to tackle the image-to-text prediction problem:

Approach One: End-to-End Transfer Learning with RNN

Model: Uses the pre-trained classification model from the Fashion Product Multilabel Classification repository as a feature extractor. An RNN (LSTM) head is added on top to predict the display name directly from the image features.
Implementation: Kaggle Notebook - Approach One
Performance:
- Average BLEU Score: 0.8994
- Average ROUGE-1 F1 Score: 0.9532
- Average ROUGE-2 F1 Score: 0.9394
- Average ROUGE-L F1 Score: 0.9532
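To make the ROUGE figures above concrete, here is a minimal sketch of ROUGE-N and ROUGE-L F1 for a single prediction. It assumes lowercase whitespace tokenization; the notebook's actual tokenization and scoring library are not stated here, so its numbers may be computed differently. The example display names are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision (overlap/pred) and recall (overlap/ref)."""
    if overlap == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(prediction, reference, n=1):
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction, reference):
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(pred, ref), len(pred), len(ref))


# Made-up names: one wrong color word out of five tokens.
reference = "Brand A Black Running Shoes"
prediction = "Brand A Blue Running Shoes"
r1 = rouge_n_f1(prediction, reference, n=1)
r2 = rouge_n_f1(prediction, reference, n=2)
rl = rouge_l_f1(prediction, reference)
```

A single wrong word breaks two bigrams, so ROUGE-2 falls faster than ROUGE-1 and ROUGE-L for the same error. Averaging these per-sample scores over a test set gives figures comparable in kind to the ones listed.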
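The Approach One design (a pre-trained image model as feature extractor, with an LSTM head producing the display name) can be sketched roughly as below. This is an illustration under assumptions, not the notebook's code: PyTorch is assumed as the framework, the tiny convolutional `backbone` is a stand-in for the pre-trained classification model, `DisplayNameHead` and all layer sizes are hypothetical, and freezing the backbone is a choice of this sketch.

```python
import torch
from torch import nn


class DisplayNameHead(nn.Module):
    """LSTM decoder conditioned on a fixed-size image feature vector."""

    def __init__(self, feature_dim, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        # Project the image features into the LSTM's initial hidden and cell state.
        self.init_h = nn.Linear(feature_dim, hidden_dim)
        self.init_c = nn.Linear(feature_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, features, tokens):
        # features: (batch, feature_dim); tokens: (batch, seq_len) of token ids
        h0 = torch.tanh(self.init_h(features)).unsqueeze(0)  # (1, batch, hidden)
        c0 = torch.tanh(self.init_c(features)).unsqueeze(0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(hidden)  # (batch, seq_len, vocab_size) next-token logits


# Stand-in for the pre-trained classifier with its classification layer removed.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),  # -> (batch, 8)
)
for param in backbone.parameters():
    param.requires_grad = False  # assumption: frozen; remove to fine-tune end to end
backbone.eval()

head = DisplayNameHead(feature_dim=8, vocab_size=50)
images = torch.randn(2, 3, 64, 64)        # dummy batch of two RGB images
tokens = torch.randint(0, 50, (2, 5))     # dummy teacher-forcing input ids
with torch.no_grad():
    features = backbone(images)
logits = head(features, tokens)
print(logits.shape)
```

During training, the logits at each position would typically be compared with the next token of the true display name using cross-entropy; at inference, tokens are generated one at a time starting from a start token.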

Approach Two: Multi-Stage Model

Segment One: Attribute Classification
- Model: A fine-tuned ResNet-50 that classifies attributes of fashion images such as category, base color, brand, and season.
- Output: The predicted class for each attribute.
Segment Two: Display Name Prediction
- Inputs:
  - The class predictions from Segment One.
  - The encoded image features from the ResNet-50 model.
- Model: An RNN (LSTM) model that takes these inputs and predicts the display name of the product.
- Implementation: Kaggle Notebook - Approach Two
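One plausible way to combine the two Segment Two inputs is to one-hot encode each predicted attribute class and concatenate the result with the image feature vector before feeding it to the LSTM. The sketch below shows only that combination step. How the notebook actually merges the inputs is not stated here, so the concatenation, the label sets in `ATTRIBUTE_CLASSES`, the helper names, and the four-value feature vector are all assumptions.

```python
# Hypothetical label sets for the attributes named in Segment One.
ATTRIBUTE_CLASSES = {
    "category": ["Apparel", "Footwear", "Accessories"],
    "base_color": ["Black", "Blue", "White"],
    "brand": ["Brand A", "Brand B"],
    "season": ["Summer", "Fall", "Winter"],
}


def one_hot(label, classes):
    """Return a vector with 1.0 at the label's position and 0.0 elsewhere."""
    vec = [0.0] * len(classes)
    vec[classes.index(label)] = 1.0
    return vec


def build_decoder_input(predicted, image_features):
    """Concatenate one-hot attribute predictions with the image features."""
    parts = []
    for attribute, classes in ATTRIBUTE_CLASSES.items():
        parts.extend(one_hot(predicted[attribute], classes))
    return parts + list(image_features)


# Made-up Segment One output and a toy feature vector (real features are far longer).
predicted = {
    "category": "Footwear",
    "base_color": "Black",
    "brand": "Brand B",
    "season": "Summer",
}
image_features = [0.12, -0.40, 0.93, 0.05]
decoder_input = build_decoder_input(predicted, image_features)
print(len(decoder_input), decoder_input)
```

Passing the predicted classes explicitly gives the decoder direct access to words that often appear in display names, such as the brand and color, while the image features carry whatever the attribute labels do not capture.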