create_index: No validation when split_length <= split_overlap [BUG]
pandu-k opened this issue · 1 comment
pandu-k commented
Describe the bug
An internal error occurs on add_docs when split_length <= split_overlap (the step passed to windowed is split_length - split_overlap, which must be at least 1). This issue was raised on our forums here.
Reproducing the issue
To reproduce:
# create index:
curl -XPOST -H 'Content-type: application/json' http://localhost:8882/indexes/text-index -d '{ "index_defaults": { "text_preprocessing": { "split_length": 2, "split_overlap": 5, "split_method": "word" }, "treat_urls_and_pointers_as_images": false, "model": "hf/all_datasets_v4_MiniLM-L6", "normalize_embeddings": true, "image_preprocessing": { "patch_method": null }, "ann_parameters" : { "space_type": "cosinesimil", "parameters": { "ef_construction": 128, "m": 16 } } }, "number_of_shards": 3, "number_of_replicas": 0 }'
# add docs
curl -XPOST -H 'Content-type: application/json' http://localhost:8882/indexes/text-index/documents -d '{ "documents" : [{"_id":"1","title":"Fat cat","description":"The fat cat sits on the mat in the sunshine"},{"_id":"2","title":"Brown fox","description":"The quick brown fox jumps over the lazy dog"}], "tensorFields" : ["description"] }'
Adding the documents yields this error in the Marqo logs:
  File "/app/src/marqo/tensor_search/tensor_search.py", line 522, in add_documents
    content_chunks = text_processor.split_text(field_content, split_by=split_by,
  File "/app/src/marqo/s2_inference/processing/text.py", line 147, in split_text
    segments = list(windowed(split_text, n=split_length, step=split_length - split_overlap))
  File "/usr/local/lib/python3.8/dist-packages/more_itertools/more.py", line 841, in windowed
    raise ValueError('step must be >= 1')
ValueError: step must be >= 1
The response returned to the client is only the unhelpful message: Internal Server Error.
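To make the failure concrete, here is a minimal sketch of the step arithmetic seen in the traceback. The helper name `window_step` is hypothetical and only mirrors the expression `split_length - split_overlap` from `split_text`; it is not part of Marqo.

```python
def window_step(split_length: int, split_overlap: int) -> int:
    """Mirror the step expression passed to windowed() in the traceback.

    Hypothetical helper for illustration only, not a Marqo function.
    """
    return split_length - split_overlap


# Settings from the reproduction: the step is negative, so windowed() raises.
print(window_step(2, 5))  # -> -3

# Equal values give a step of 0, which windowed() also rejects (step must be >= 1).
print(window_step(5, 5))  # -> 0

# A workable configuration: the overlap is strictly smaller than the length.
print(window_step(5, 2))  # -> 3
```

This is why the condition in the title uses `<=` rather than `<`: a step of 0 fails in the same way as a negative step.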
Expected behavior
Index-creation-time validation should prevent creating an index with these problematic settings.
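One possible shape for such a check is sketched below. This is an assumption about how the validation could look, not Marqo's actual implementation: the function `validate_text_preprocessing` and the exception `InvalidIndexSettings` are hypothetical names, and only the keys `split_length` and `split_overlap` come from the request body above.

```python
class InvalidIndexSettings(ValueError):
    """Hypothetical error type for rejected index settings (not a Marqo class)."""


def validate_text_preprocessing(text_preprocessing: dict) -> dict:
    """Reject settings where the window step would be less than 1.

    Hypothetical sketch: expects the "text_preprocessing" object from the
    create-index request and returns it unchanged when it is valid.
    """
    split_length = text_preprocessing["split_length"]
    split_overlap = text_preprocessing["split_overlap"]

    # The step used at add-documents time is split_length - split_overlap,
    # so the overlap has to be strictly smaller than the length.
    if split_overlap >= split_length:
        raise InvalidIndexSettings(
            f"split_overlap ({split_overlap}) must be less than "
            f"split_length ({split_length})"
        )
    return text_preprocessing


if __name__ == "__main__":
    # The settings from the reproduction are rejected with a readable message.
    try:
        validate_text_preprocessing(
            {"split_length": 2, "split_overlap": 5, "split_method": "word"}
        )
    except InvalidIndexSettings as err:
        print(f"rejected: {err}")
```

Running a check like this when the index is created would let the API return a descriptive client error instead of failing later with Internal Server Error.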
Additional context
TeimasTeimoso commented
Hello @pandu-k, can I try to pick up on this issue?