This repo showcases and compares various search algorithms and models for the Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search.
The results of experiments will be added to the wiki: https://github.com/rejasupotaro/amazon-product-search/wiki
Copy .envrc.example and fill in the required environment variables. Then, install the dependencies.
$ pyenv install 3.10.8
$ pyenv local 3.10.8
$ pip install poetry
$ poetry env use python
$ poetry install
The following libraries are required to process Japanese text.
# For macOS
$ brew install mecab mecab-ipadic
$ poetry run python -m unidic download
Clone https://github.com/amazon-science/esci-data and copy esci-data/shopping_queries_dataset/* into amazon-product-search/data/raw/. Then, run the following command to preprocess the dataset.
$ poetry run inv data.merge-and-split
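As a rough illustration of what a merge-and-split step could look like, the sketch below joins query/product examples with product metadata and groups the result by split. It is not the task's actual implementation: the function name merge_and_split and the toy rows are hypothetical, and the column names (product_id, product_locale, esci_label, split, product_title) are assumed to follow the esci-data files.

```python
from collections import defaultdict


def merge_and_split(examples, products):
    """Left-join examples with products, then group the rows by split."""
    # Index products by (locale, id) so each lookup is a dict access.
    product_by_key = {(p["product_locale"], p["product_id"]): p for p in products}

    rows_by_split = defaultdict(list)
    for example in examples:
        key = (example["product_locale"], example["product_id"])
        # Left join: keep the example even if no matching product exists.
        merged = {**product_by_key.get(key, {}), **example}
        rows_by_split[example["split"]].append(merged)
    return dict(rows_by_split)


# Toy data for illustration only (not taken from the dataset).
examples = [
    {"query": "usb cable", "product_id": "B001", "product_locale": "us",
     "esci_label": "E", "split": "train"},
    {"query": "usb cable", "product_id": "B002", "product_locale": "us",
     "esci_label": "I", "split": "test"},
]
products = [
    {"product_id": "B001", "product_locale": "us", "product_title": "USB-C cable 1m"},
    {"product_id": "B002", "product_locale": "us", "product_title": "Phone case"},
]

splits = merge_and_split(examples, products)
for name, rows in sorted(splits.items()):
    print(name, len(rows))
```

Joining on both locale and ID matters because the same product ID can appear under more than one locale.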
This project indexes products into Elasticsearch. To try it on your machine, launch Elasticsearch locally, create an index, and run the document indexing pipeline against that index.
$ docker compose up
$ poetry run inv es.create_index --index-name=products_jp
$ poetry run inv es.index-docs \
  --index-name=products_jp \
  --locale=jp \
  --es-host=http://localhost:9200 \
  --extract-keywords \
  --encode-text \
  --nrows=100
See https://github.com/rejasupotaro/amazon-product-search/wiki/Indexing for more details.
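For orientation, here is a minimal sketch of an index mapping that could hold both a text field and a vector field, which is what sparse and dense retrieval each need. The helper build_index_mapping and the default of 768 dimensions are assumptions for illustration; the mapping this project really uses is described in the wiki and may differ (for example, in analyzers and additional fields).

```python
import json


def build_index_mapping(vector_dims: int = 768) -> dict:
    """Return a hypothetical Elasticsearch mapping for product documents."""
    return {
        "mappings": {
            "properties": {
                # Exact-match identifier, not analyzed.
                "product_id": {"type": "keyword"},
                # Analyzed text used for lexical (sparse) matching.
                "product_title": {"type": "text"},
                # Fixed-size embedding used for dense retrieval.
                "product_vector": {"type": "dense_vector", "dims": vector_dims},
            }
        }
    }


print(json.dumps(build_index_mapping(), indent=2))
```

The value of dims has to match the output size of whichever encoder produces the vectors.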
The following command launches the Streamlit demo app.
$ docker compose up
$ poetry run inv demo
The demo app lets you run experiments that compare variants under different experimental settings.
# src/demo/experimental_setup.py
"sparse_vs_dense": ExperimentalSetup(
    index_name="products_jp",
    locale="jp",
    num_queries=5000,
    variants=[
        Variant(name="sparse", fields=["product_title"]),
        Variant(name="dense", fields=["product_vector"]),
        Variant(name="hybrid", fields=["product_title", "product_vector"]),
    ],
),
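To make the shape of such a setup concrete, the following is a self-contained sketch that assumes ExperimentalSetup and Variant are simple dataclasses carrying the fields shown in the snippet. The class definitions and the EXPERIMENTS dictionary name are guesses for illustration; the real definitions in src/demo/experimental_setup.py may have more fields or different types.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    """One search configuration to evaluate (hypothetical definition)."""
    name: str
    fields: list[str] = field(default_factory=list)


@dataclass
class ExperimentalSetup:
    """A named group of variants run against the same index and queries."""
    index_name: str
    locale: str
    num_queries: int
    variants: list[Variant] = field(default_factory=list)


# Hypothetical registry keyed by experiment name.
EXPERIMENTS = {
    "sparse_vs_dense": ExperimentalSetup(
        index_name="products_jp",
        locale="jp",
        num_queries=5000,
        variants=[
            Variant(name="sparse", fields=["product_title"]),
            Variant(name="dense", fields=["product_vector"]),
            Variant(name="hybrid", fields=["product_title", "product_vector"]),
        ],
    ),
}

setup = EXPERIMENTS["sparse_vs_dense"]
for variant in setup.variants:
    print(variant.name, variant.fields)
```

In this sketch, a variant is distinguished only by which fields it searches: a text field for sparse, a vector field for dense, and both for hybrid.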
Run the following tasks when you make changes.
$ poetry run inv format lint
$ poetry run pytest -vv