Baguetter is a flexible, efficient, and hackable search engine library implemented in Python. It's designed for quickly benchmarking, implementing, and testing new search methods. Baguetter supports sparse (traditional), dense (semantic), and hybrid retrieval methods.
Note: Baguetter is not built for production use-cases or scale. For such use-cases, please check out other search engine projects.
Paper: https://arxiv.org/abs/2408.06643
- Sparse retrieval using BM25 and BMX algorithms
- Dense retrieval using embeddings
- Hybrid retrieval combining sparse and dense methods
- Customizable text preprocessing pipeline
- Multi-threaded indexing and searching
- Evaluation tools for benchmarking
- Easy integration with HuggingFace datasets and models for sharing
- Hackable interface to quickly implement new methods
pip install baguetter
from baguetter.indices import BMXSparseIndex
# Create an index
idx = BMXSparseIndex()
# Add documents
docs = [
"We all love baguette and cheese",
"Baguette is a great bread",
"Cheese is a great source of protein",
"Baguette is a great source of carbs",
]
doc_ids = ["1", "2", "3", "4"]
idx.add_many(doc_ids, docs, show_progress=True)
# Search
results = idx.search("quick fox")
print(results)
# Search many
results = idx.search_many(["quick fox", "baguette is great"])
print(results)
Baguetter includes tools for evaluating search performance on standard benchmarks:
from baguetter.evaluation import datasets, evaluate_retrievers
from baguetter.indices import BM25SparseIndex, BMXSparseIndex
results = evaluate_retrievers(datasets.mteb_datasets_small, {"bm25": BM25SparseIndex, "bmx": BMXSparseIndex})
results.save("eval_results")
For more detailed usage instructions and API documentation, please refer to the full documentation.
Contributions are welcome! We are using the GitHub Pull Request workflow. Either open an issue first and create a PR or include a comprehensive commit message when opening a PR.
To get started, please create a clone of the repo (or a fork). We recommend working in a virtual environment.
python -m pip install -e ".[dev]"
pre-commit install
To test your changes, run:
pytest
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
Baguetter builds upon the work of several open-source projects:
-
retriv by AmenRa: Baguetter is a fork of retriv, adjusting it to our needs.
-
bm25s by xhluca: Our BM25 implementation is based on this project, which provides an efficient and effective implementation of the BM25 algorithm with different scoring functions.
-
USearch by unum-cloud for dense retrival.
Please check out the respective repositories and show some appreciation to the authors.
@article{li2024bmx,
title={BMX: Entropy-weighted Similarity and Semantic-enhanced Lexical Search},
author={Xianming Li and Julius Lipp and Aamir Shakir and Rui Huang and Jing Li},
year={2024},
eprint={2408.06643},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2408.06643},
}