OpenKaito/openkaito

Issue with Setup Semantic Search Dataset & Indexing


I ran into an issue while indexing the ETH Denver 2024 dataset for semantic search:

19129 files in /home/tao/Development/openkaito/datasets/eth_denver_dataset
Index already exists: eth_denver
Number of docs in eth_denver: 19129 == total files 19129, no need to reindex docs
Indexing embeddings: 5%|███▌ | 1000/19129 [09:44<2:56:33, 1.71it/s]
Traceback (most recent call last):
  File "/home/tao/Development/openkaito/scripts/vector_index_eth_denver_dataset.py", line 186, in <module>
    indexing_embeddings(search_client)
  File "/home/tao/Development/openkaito/scripts/vector_index_eth_denver_dataset.py", line 115, in indexing_embeddings
    for doc in tqdm(
        helpers.scan(search_client, index=index_name),
        desc="Indexing embeddings",
        total=search_client.count(index=index_name)["count"],
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/helpers/actions.py", line 755, in scan
    resp = scroll_client.scroll(
        scroll_id=scroll_id, scroll=scroll, **scroll_kwargs
    )
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/_sync/client/utils.py", line 446, in wrapped
    return api(*args, **kwargs)
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/_sync/client/__init__.py", line 3609, in scroll
    return self.perform_request(  # type: ignore[return-value]
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/_sync/client/_base.py", line 271, in perform_request
    response = self._perform_request(
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/_sync/client/_base.py", line 352, in _perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
        message=message, meta=meta, body=resp_body
    )
NotFoundError: NotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [12]')

This usually happens because each iteration takes longer than the Elasticsearch scroll keep-alive, so the scroll context expires before the next page is fetched. You can either increase the scroll time in the scan() operation (see https://elasticsearch-py.readthedocs.io/en/latest/helpers.html#scan:~:text=the%20search()%20api-,scroll,-(str)%20%E2%80%93%20Specify), or speed up the embedding step itself, e.g. by computing and writing embeddings in batches. A rough sketch of both is below.
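For reference, here is a minimal sketch combining the two workarounds, assuming the same `search_client`/`index_name` setup as in `vector_index_eth_denver_dataset.py`. The `get_embedding` callable, the `BATCH_SIZE` value, the `scroll="30m"` keep-alive, and the `"text"`/`"embedding"` field names are placeholders, not the script's actual names:

```python
from elasticsearch import helpers
from tqdm import tqdm

BATCH_SIZE = 64  # hypothetical batch size


def indexing_embeddings_batched(search_client, index_name, get_embedding):
    """Sketch: longer scroll keep-alive plus batched bulk updates of embeddings."""
    total = search_client.count(index=index_name)["count"]

    batch = []
    for doc in tqdm(
        # scroll="30m" keeps each scroll context alive for 30 minutes instead of
        # the 5-minute default, so slow embedding no longer invalidates it.
        helpers.scan(search_client, index=index_name, scroll="30m"),
        desc="Indexing embeddings",
        total=total,
    ):
        batch.append(doc)
        if len(batch) >= BATCH_SIZE:
            _flush(search_client, index_name, batch, get_embedding)
            batch = []
    if batch:
        _flush(search_client, index_name, batch, get_embedding)


def _flush(search_client, index_name, docs, get_embedding):
    # Compute embeddings for the whole batch at once (get_embedding is a
    # placeholder for whatever embedding call the script uses), then write
    # them back in a single bulk request of partial updates.
    embeddings = get_embedding([d["_source"]["text"] for d in docs])
    actions = [
        {
            "_op_type": "update",
            "_index": index_name,
            "_id": d["_id"],
            "doc": {"embedding": emb},
        }
        for d, emb in zip(docs, embeddings)
    ]
    helpers.bulk(search_client, actions)
```

Batching both the embedding calls and the Elasticsearch writes shortens each scroll iteration, which is usually enough on its own; the longer `scroll` value is just extra headroom.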

Closing this issue; feel free to reopen it if you still have questions :)