
AmazonProductSearch

This is an English/multilingual hybrid product search (semantic and syntactic) application built on Amazon's ESCI dataset. It combines the Pinecone vector DB, English and multilingual embeddings (notably Voyage embeddings), Pinecone hybrid search, reranking, and evaluation of recommendation quality via hit_rate@N, hits@N, precision@N, recall@N, f1@N, and MRR.

Amazon Product Search Github link : https://github.com/AlexBlazee/AmazonProductSearch

Dataset: I am using Amazon's ESCI dataset (https://github.com/amazon-science/esci-data), which contains around 1.8 million products and 2.6 million query-product judgements.
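To make the setup concrete, here is a minimal loading sketch following the example in the esci-data README; the parquet file names come from that repo, and the local directory path is an assumption.

```python
import pandas as pd

# File names as published in the amazon-science/esci-data repo (local path assumed).
df_examples = pd.read_parquet("shopping_queries_dataset/shopping_queries_dataset_examples.parquet")
df_products = pd.read_parquet("shopping_queries_dataset/shopping_queries_dataset_products.parquet")

# Join query-product judgements with product metadata.
df = pd.merge(df_examples, df_products, how="left", on=["product_locale", "product_id"])
```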

Data preprocessing and cleaning (a cleaning sketch follows):

  1. Data sampling: Dropped any product row containing NaN values. Since I plan to use Pinecone as the vector DB, the free tier allows upserting roughly 450,000 products at 1024 dimensions.
  2. English dataset: 437,953 products whose product locale is 'us'.
  3. Multilingual dataset: Strategy-based selection (kept at most 10 products per brand for brands with more than 10 products), giving 422,015 products.
  4. Cleaning: Removed HTML markup, emoticons, etc.
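A minimal cleaning sketch, assuming the product fields live in a pandas DataFrame; the regexes and the column names (product_title, product_brand) are illustrative rather than the repo's exact code.

```python
import re
import pandas as pd

def clean_text(text: str) -> str:
    """Strip HTML tags, emoji/emoticons, and redundant whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)                               # HTML tags/scripts
    text = re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", " ", text)  # common emoji ranges
    return re.sub(r"\s+", " ", text).strip()

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna()                                       # drop rows with any NaN value
    df["product_title"] = df["product_title"].map(clean_text)
    return df

# Multilingual subset: cap each brand at 10 products; brands with 10 or fewer keep all.
def cap_brands(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    return df.groupby("product_brand").head(n)
```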

Embedding Models: The hybrid search combines sparse and dense embeddings (see the sketch after this list).

Sparse embedding model:

  1. BM25 from Pinecone (the pinecone-text library)

English dense embedding models:

  1. Voyage AI – voyage-large-2-instruct – dim 1024
  2. all-MiniLM-L6-v2 – dim 384

Multilingual dense embedding models:

  1. Voyage AI – voyage-multilingual-2 – dim 1024
  2. LaBSE – dim 768
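A sketch of producing both representations for the English pipeline, using pinecone-text for BM25, the voyageai client, and sentence-transformers; the sample titles are made up, and the Voyage client assumes VOYAGE_API_KEY is set in the environment.

```python
from pinecone_text.sparse import BM25Encoder
from sentence_transformers import SentenceTransformer
import voyageai

titles = ["hydro flask 32 oz wide mouth", "usb c to hdmi adapter 4k"]  # toy corpus

# Sparse: fit BM25 on the corpus, then encode documents.
bm25 = BM25Encoder()
bm25.fit(titles)
sparse_vecs = bm25.encode_documents(titles)   # [{"indices": [...], "values": [...]}, ...]

# Dense (English): voyage-large-2-instruct, 1024 dimensions.
vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
dense_vecs = vo.embed(titles, model="voyage-large-2-instruct", input_type="document").embeddings

# Dense alternative: local all-MiniLM-L6-v2, 384 dimensions.
minilm = SentenceTransformer("all-MiniLM-L6-v2")
dense_vecs_local = minilm.encode(titles)
```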

Vector DB:

  1. Pinecone (hybrid upsert and query sketched below)
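Continuing the embedding sketch above, a hybrid upsert and query against Pinecone might look like the following. The index name and API key are placeholders, the index is assumed to use the dotproduct metric, and the alpha weighting helper follows the convex-combination pattern from Pinecone's hybrid search docs.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")        # hypothetical key
index = pc.Index("amazon-products")          # hypothetical index (dotproduct metric)

# Upsert one product with both dense and sparse representations.
index.upsert(vectors=[{
    "id": "B000123",
    "values": dense_vecs[0],                 # 1024-dim Voyage embedding
    "sparse_values": sparse_vecs[0],         # BM25 {"indices": [...], "values": [...]}
    "metadata": {"title": titles[0]},
}])

def hybrid_scale(dense, sparse, alpha: float):
    """Convex combination: alpha=1 is pure semantic, alpha=0 is pure lexical."""
    scaled_sparse = {"indices": sparse["indices"],
                     "values": [v * (1 - alpha) for v in sparse["values"]]}
    return [v * alpha for v in dense], scaled_sparse

query = "insulated water bottle"             # example query
q_dense = vo.embed([query], model="voyage-large-2-instruct", input_type="query").embeddings[0]
q_sparse = bm25.encode_queries(query)
d, s = hybrid_scale(q_dense, q_sparse, alpha=0.7)
results = index.query(vector=d, sparse_vector=s, top_k=10, include_metadata=True)
```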

Re-ranker Model:

  1. Jina AI’s jina-reranker-v2-base-multilingual (usage sketched below)
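One way to run this reranker locally, per the model card on Hugging Face (the model ships custom code, hence trust_remote_code); the candidate titles here continue the hybrid-query sketch above.

```python
from transformers import AutoModelForSequenceClassification

reranker = AutoModelForSequenceClassification.from_pretrained(
    "jinaai/jina-reranker-v2-base-multilingual",
    trust_remote_code=True,
    torch_dtype="auto",
)
reranker.eval()

candidates = [m.metadata["title"] for m in results.matches]
# compute_score is provided by the model's custom code on the Hub.
scores = reranker.compute_score([[query, doc] for doc in candidates], max_length=1024)
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
```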

Recommendation Engine: Returns recommendations for a query in fewer than 10 lines of code (see the sketch after this list).
Capabilities:

  1. Single query Search
  2. Bulk Query Search in batches
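A hypothetical composition of the pieces above into a single-query search function, just to make the "under 10 lines" claim concrete; the repo's actual API may differ.

```python
def recommend(query: str, alpha: float = 0.7, top_k: int = 10):
    """Embed the query, run a hybrid Pinecone search, then rerank the hits."""
    q_dense = vo.embed([query], model="voyage-large-2-instruct", input_type="query").embeddings[0]
    q_sparse = bm25.encode_queries(query)
    d, s = hybrid_scale(q_dense, q_sparse, alpha)
    hits = index.query(vector=d, sparse_vector=s, top_k=top_k, include_metadata=True).matches
    titles = [h.metadata["title"] for h in hits]
    scores = reranker.compute_score([[query, t] for t in titles], max_length=1024)
    return [t for t, _ in sorted(zip(titles, scores), key=lambda p: p[1], reverse=True)]
```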

Bulk query search is most useful with proprietary embedding models on a free tier, where rate limits apply. Processing in batches of 128 works through the data without hitting those limits (in my case, Voyage AI embeddings for both documents and queries); a throttling sketch follows.
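A minimal throttling sketch using the Voyage client from the earlier sketch; the batch size matches the 128 mentioned above, while the pause between batches is an assumption to stay under per-minute request limits.

```python
import time

def embed_in_batches(texts, batch_size: int = 128, pause_s: float = 1.0):
    """Embed documents batch by batch to respect free-tier rate limits."""
    out = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        out.extend(vo.embed(batch, model="voyage-large-2-instruct",
                            input_type="document").embeddings)
        time.sleep(pause_s)  # simple throttle between API calls
    return out
```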

Evaluation Data: A strategized random selection of 10K easy and 5K hard queries for each of the English and multilingual datasets. Strategy: select queries whose judged products carry ['E', 'S'] (exact/substitute) labels and are present in the Pinecone vector database above a threshold. This yielded roughly 30K easy and 15K hard queries, from which 10K easy and 5K hard were randomly sampled; a selection sketch follows.
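A sketch of that selection with pandas. The E/S filter and the membership check against the index follow the description above; the easy/hard split criterion and the threshold of 5 judged products are my assumptions, since they are not spelled out here, and indexed_ids stands for the set of product IDs actually upserted to Pinecone.

```python
labeled = df[df["esci_label"].isin(["E", "S"])]               # exact/substitute judgements
labeled = labeled[labeled["product_id"].isin(indexed_ids)]    # only products in the index

per_query = labeled.groupby("query")["product_id"].nunique()
easy = per_query[per_query >= 5].index.to_series()            # threshold of 5: an assumption
hard = per_query[per_query < 5].index.to_series()

easy_sample = easy.sample(n=10_000, random_state=42)
hard_sample = hard.sample(n=5_000, random_state=42)
```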

Evaluator: Evaluation runs in a single line of code (a parallelization sketch follows the list).
Capabilities:

  1. Single Query Evaluation
  2. Parallelized Evaluation – query by query
  3. Batch Parallelized Evaluation – query in batches
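A minimal sketch of the two parallelized modes using a thread pool; evaluate_one and evaluate_batch are placeholders for whatever per-query and per-batch metric functions the evaluator exposes.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_parallel(queries, evaluate_one, max_workers: int = 8):
    """Mode 2: one task per query."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_one, queries))

def evaluate_batched(queries, evaluate_batch, batch_size: int = 128, max_workers: int = 8):
    """Mode 3: one task per batch of queries."""
    batches = [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [r for batch_result in pool.map(evaluate_batch, batches) for r in batch_result]
```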

Evaluation Metrics (standard definitions sketched after the list):

  1. Hit_rate @ (1, 5, 10)
  2. Hits @ (1, 5, 10)
  3. Precision @ (1, 5, 10)
  4. Recall @ (1, 5, 10)
  5. F1 @ (1, 5, 10)
  6. MRR
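The standard definitions of these metrics for a single query, sketched in Python; retrieved is an ordered list of product IDs and relevant is the set of E/S-labeled products for the query. This mirrors the usual formulas rather than the repo's exact implementation.

```python
def metrics_at_k(retrieved, relevant, k: int):
    """hit_rate/hits/precision/recall/f1 at cutoff k for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for pid in top_k if pid in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"hit_rate": float(hits > 0), "hits": hits,
            "precision": precision, "recall": recall, "f1": f1}

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant product; 0 if none is retrieved."""
    for rank, pid in enumerate(retrieved, start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0
```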

Results:

English: [results image]
Multilingual: [results image]

Streamlit and FastAPI web app: [screenshot]