marqo-ai/marqo

create_index: No validation when split_length <= split_overlap [BUG]

pandu-k opened this issue · 1 comment

Describe the bug
An internal error occurs on add_docs when split_length <= split_overlap. This issue was raised on our forums.

Reproducing the issue
To reproduce:

# create index: 

curl -XPOST -H 'Content-type: application/json' http://localhost:8882/indexes/text-index -d '{ "index_defaults": { "text_preprocessing": { "split_length": 2, "split_overlap": 5, "split_method": "word" }, "treat_urls_and_pointers_as_images": false, "model": "hf/all_datasets_v4_MiniLM-L6", "normalize_embeddings": true, "image_preprocessing": { "patch_method": null }, "ann_parameters" : { "space_type": "cosinesimil", "parameters": { "ef_construction": 128, "m": 16 } } }, "number_of_shards": 3, "number_of_replicas": 0 }'

# add docs

curl -XPOST -H 'Content-type: application/json' http://localhost:8882/indexes/text-index/documents -d '{ "documents" : [{"_id":"1","title":"Fat cat","description":"The fat cat sits on the mat in the sunshine"},{"_id":"2","title":"Brown fox","description":"The quick brown fox jumps over the lazy dog"}], "tensorFields" : ["description"] }'

Yields this error:

Marqo logs:

  File "/app/src/marqo/tensor_search/tensor_search.py", line 522, in add_documents
    content_chunks = text_processor.split_text(field_content, split_by=split_by,
  File "/app/src/marqo/s2_inference/processing/text.py", line 147, in split_text
    segments = list(windowed(split_text, n=split_length, step=split_length - split_overlap))
  File "/usr/local/lib/python3.8/dist-packages/more_itertools/more.py", line 841, in windowed
    raise ValueError('step must be >= 1')
ValueError: step must be >= 1

The response returned to the client is an unhelpful message: Internal Server Error.
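
The failure follows from the arithmetic visible in the traceback: the window step is computed as split_length - split_overlap, and more_itertools.windowed rejects any step below 1. A minimal sketch of that computation, using a hypothetical helper name (window_step) that is not part of Marqo:

```python
def window_step(split_length: int, split_overlap: int) -> int:
    """Step for the sliding window, computed the same way as in the traceback."""
    return split_length - split_overlap


# Settings from the reproduction above: the step is negative.
print(window_step(split_length=2, split_overlap=5))  # -3

# Equal values give a step of 0, which is also below the required minimum of 1.
print(window_step(split_length=5, split_overlap=5))  # 0
```

Both cases fall below the minimum step of 1, which is why the title uses <= rather than <.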

Expected behavior
Index-creation-time validation should prevent creating an index with these problematic settings.
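
One possible shape for such a check, as a rough sketch only. The function name validate_text_preprocessing and the error wording are assumptions made for illustration and are not Marqo's actual API; the sketch also assumes both keys are present in the settings.

```python
def validate_text_preprocessing(text_preprocessing: dict) -> dict:
    """Hypothetical index-creation check (not Marqo's actual validator).

    Rejects settings where the overlap is not strictly smaller than the
    split length, since the window step would then be less than 1.
    """
    split_length = text_preprocessing["split_length"]
    split_overlap = text_preprocessing["split_overlap"]
    if split_length <= split_overlap:
        raise ValueError(
            f"split_length ({split_length}) must be greater than "
            f"split_overlap ({split_overlap})"
        )
    return text_preprocessing


# The settings from the reproduction would be rejected before the index exists.
try:
    validate_text_preprocessing(
        {"split_length": 2, "split_overlap": 5, "split_method": "word"}
    )
except ValueError as err:
    print(f"Invalid index settings: {err}")
```

Raising at index-creation time would let the API return a descriptive client error instead of failing later on add_docs.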

Additional context

Hello @pandu-k, can I try to pick up on this issue?