Minimal example for Hybrid Search fails
cnndabbler opened this issue · 3 comments
cnndabbler commented
First, I really like this project!
The sparse and dense examples each work with a minimal setup.
The issue is with the hybrid mode.
Here is the code:
from retriv import HybridRetriever

collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

hr = HybridRetriever(
    # Shared params ------------------------------------------------------------
    index_name="hybrid-index",
    # Sparse retriever params --------------------------------------------------
    sr_model="bm25",
    min_df=1,
    tokenizer="whitespace",
    stemmer="english",
    stopwords="english",
    do_lowercasing=True,
    do_ampersand_normalization=True,
    do_special_chars_normalization=True,
    do_acronyms_normalization=True,
    do_punctuation_removal=True,
    # Dense retriever params ---------------------------------------------------
    dr_model="sentence-transformers/multi-qa-MiniLM-L6-dot-v1",
    normalize=True,
    max_length=128,
    use_ann=True,
)

he = hr.index(collection)

he.search(
    query="witches",   # what to search for
    return_docs=True,  # default value: return the text of the documents
    cutoff=5,          # number of results to return (default is 100)
)
Error:
Building TDF matrix: 100%|██████████| 4/4 [00:01<00:00, 3.41it/s]
Building inverted index: 100%|██████████| 13/13 [00:00<00:00, 6786.90it/s]
Embedding documents: 100%|██████████| 4/4 [00:00<00:00, 206.63it/s]
Building ANN Searcher
100%|██████████| 1/1 [00:00<00:00, 20661.60it/s]
100%|██████████| 1/1 [00:00<00:00, 99.58it/s]
0%| | 0/1 [00:00<?, ?it/s]
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /tmp/ipykernel_45461/1793453458.py:32 in <module> │
│ │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_45461/1793453458.py' │
│ │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/hybrid_retrieve │
│ r.py:255 in search │
│ │
│ 252 │ │ """ │
│ 253 │ │ │
│ 254 │ │ sparse_results = self.sparse_retriever.search(query, False, 1_000) │
│ ❱ 255 │ │ dense_results = self.dense_retriever.search(query, False, 1_000) │
│ 256 │ │ hybrid_results = self.merger.fuse([sparse_results, dense_results]) │
│ 257 │ │ return ( │
│ 258 │ │ │ self.prepare_results( │
│ │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/dense_retriever │
│ /dense_retriever.py:251 in search │
│ │
│ 248 │ │ │ │ self.load_embeddings() │
│ 249 │ │ │ doc_ids, scores = compute_scores(encoded_query, self.embeddings, cutoff) │
│ 250 │ │ │
│ ❱ 251 │ │ doc_ids = self.map_internal_ids_to_original_ids(doc_ids) │
│ 252 │ │ │
│ 253 │ │ return ( │
│ 254 │ │ │ self.prepare_results(doc_ids, scores) │
│ │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever. │
│ py:87 in map_internal_ids_to_original_ids │
│ │
│ 84 │ │ return results │
│ 85 │ │
│ 86 │ def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]: │
│ ❱ 87 │ │ return [self.id_mapping[doc_id] for doc_id in doc_ids] │
│ 88 │ │
│ 89 │ def save(self): │
│ 90 │ │ raise NotImplementedError() │
│ │
│ /home/didierlacroix1/anaconda3/envs/FastChat/lib/python3.10/site-packages/retriv/base_retriever. │
│ py:87 in <listcomp> │
│ │
│ 84 │ │ return results │
│ 85 │ │
│ 86 │ def map_internal_ids_to_original_ids(self, doc_ids: Iterable) -> List[str]: │
│ ❱ 87 │ │ return [self.id_mapping[doc_id] for doc_id in doc_ids] │
│ 88 │ │
│ 89 │ def save(self): │
│ 90 │ │ raise NotImplementedError() │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: -1
cnndabbler commented
OK, making the following change lets the code run to completion:
use_ann=False,
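For reference, a minimal sketch of the working setup (the parameters omitted here are the same as in the snippet above; only use_ann changes):

hr = HybridRetriever(
    index_name="hybrid-index",
    sr_model="bm25",
    dr_model="sentence-transformers/multi-qa-MiniLM-L6-dot-v1",
    use_ann=False,  # exact dense search instead of ANN for this tiny collection
)
hr.index(collection)
hr.search(query="witches", return_docs=True, cutoff=5)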
AmenRa commented
Hi, thanks for the kind words.
I suspect the issue is that four docs are not enough to build clusters with faiss.
Strangely, it works for the dense retriever but not for the hybrid one.
Also, did I report this example somewhere? I cannot find it in the documentation. :D
I know it is in the readme, but it was only intended for the sparse retriever.
In general, if you have fewer than 20k documents, it does not make sense to use approximate nearest neighbors.
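As an illustration of where the KeyError: -1 most likely comes from, here is a minimal sketch using faiss directly (not retriv's actual code): when an index is asked for more neighbors than it contains, faiss pads the returned ids with -1, and -1 has no entry in id_mapping. The hybrid retriever requests 1_000 candidates from each sub-retriever (see the traceback), so with only four documents this padding is expected.

import faiss
import numpy as np

# Minimal sketch, not retriv's code: ask a tiny index for more neighbors
# than it holds and observe the -1 placeholder ids in the result.
d = 384  # embedding dimension, arbitrary for this illustration
vectors = np.random.rand(4, d).astype("float32")  # only four "documents"

index = faiss.IndexFlatIP(d)  # inner-product index
index.add(vectors)

scores, ids = index.search(vectors[:1], k=10)  # request 10 neighbors from 4 vectors
print(ids)  # the slots beyond the four stored vectors are filled with -1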
AmenRa commented
Closing for inactivity.
Feel free to re-open.