Amazon Product Search

This repo showcases and compares various search algorithms and models for the Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search.

Experiment results will be added to the wiki: https://github.com/rejasupotaro/amazon-product-search/wiki

Installation

Copy .envrc.example and fill in the required environment variables. Then, install the dependencies.

$ pyenv install 3.10.8
$ pyenv local 3.10.8
$ pip install poetry
$ poetry env use python
$ poetry install

The following libraries are required to process Japanese text.

# For macOS
$ brew install mecab mecab-ipadic
$ poetry run python -m unidic download

Dataset

Clone https://github.com/amazon-science/esci-data and copy esci-data/shopping_queries_dataset/* into amazon-product-search/data/raw/. Then, run the following command to preprocess the dataset.

$ poetry run inv data.merge-and-split

Index Products

This project indexes products into Elasticsearch. If you want to try it on your machine, launch Elasticsearch locally and run the document indexing pipeline against the index you created.

$ docker compose up
$ poetry run inv es.create-index --index-name=products_jp
$ poetry run inv es.index-docs \
  --index-name=products_jp \
  --locale=jp \
  --es-host=http://localhost:9200 \
  --extract-keywords \
  --encode-text \
  --nrows=100

See https://github.com/rejasupotaro/amazon-product-search/wiki/Indexing for more details.
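To give a feel for what indexing documents involves, the sketch below builds the newline-delimited JSON body that the Elasticsearch _bulk endpoint accepts. It is only an illustration and not the pipeline used by es.index-docs. The helper name build_bulk_payload, the choice of product_id as the document ID, and the sample documents are assumptions.

```python
import json


def build_bulk_payload(index_name, docs, id_field="product_id"):
    """Serialize docs into an NDJSON body for the Elasticsearch _bulk endpoint."""
    lines = []
    for doc in docs:
        # Action line: index this document under its ID.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc[id_field]}}))
        # Source line: the document itself (keep Japanese text readable).
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents for illustration.
docs = [
    {"product_id": "B01", "product_title": "USB-C ケーブル 1m"},
    {"product_id": "B02", "product_title": "ワイヤレスマウス"},
]
payload = build_bulk_payload("products_jp", docs)
```

The resulting payload could be sent with a POST to http://localhost:9200/_bulk using the Content-Type application/x-ndjson.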

Demo

The following command launches the Streamlit demo app.

$ docker compose up
$ poetry run inv demo

Experimentation

The demo app lets you run experiments with different experimental settings.

# src/demo/experimental_setup.py
"sparse_vs_dense": ExperimentalSetup(
    index_name="products_jp",
    locale="jp",
    num_queries=5000,
    variants=[
        Variant(name="sparse", fields=["product_title"]),
        Variant(name="dense", fields=["product_vector"]),
        Variant(name="hybrid", fields=["product_title", "product_vector"]),
    ],
),
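The setup above compares three variants that differ only in the fields they search. As a minimal sketch of how such a configuration might be consumed, the code below defines stand-in Variant and ExperimentalSetup dataclasses and a hypothetical build_search_body helper that turns a variant into an Elasticsearch search body. The class definitions, the helper, and the rule that field names ending in "_vector" are dense vector fields are all assumptions and may differ from the classes in src/demo/experimental_setup.py.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    name: str
    fields: list[str]


@dataclass
class ExperimentalSetup:
    index_name: str
    locale: str
    num_queries: int
    variants: list[Variant] = field(default_factory=list)


def build_search_body(variant, query, query_vector, size=10):
    """Translate a variant's fields into an Elasticsearch search body."""
    # Assumption: field names ending in "_vector" hold dense vectors.
    text_fields = [f for f in variant.fields if not f.endswith("_vector")]
    vector_fields = [f for f in variant.fields if f.endswith("_vector")]

    body = {"size": size}
    if text_fields:
        # Sparse (lexical) retrieval over the text fields.
        body["query"] = {"multi_match": {"query": query, "fields": text_fields}}
    if vector_fields:
        # Dense retrieval via approximate kNN on the first vector field.
        body["knn"] = {
            "field": vector_fields[0],
            "query_vector": query_vector,
            "k": size,
            "num_candidates": size * 10,
        }
    return body


setup = ExperimentalSetup(
    index_name="products_jp",
    locale="jp",
    num_queries=5000,
    variants=[
        Variant(name="sparse", fields=["product_title"]),
        Variant(name="dense", fields=["product_vector"]),
        Variant(name="hybrid", fields=["product_title", "product_vector"]),
    ],
)
# The query vector here is a placeholder; a real one would come from an encoder.
bodies = {v.name: build_search_body(v, "usb cable", [0.1, 0.2, 0.3]) for v in setup.variants}
```

In this sketch the sparse variant produces only a query clause, the dense variant only a knn clause, and the hybrid variant both, which recent Elasticsearch 8.x releases combine into a single ranked list.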

Development

Run the following tasks when you make changes.

$ poetry run inv format lint
$ poetry run pytest -vv