quickwit-oss/quickwit

Ingestion bug when using partition key and unsorted data


Describe the bug
Can't ingest data when using a partition key if the number of partition key values is high and the NDJSON data is not sorted by partition key.

Steps to reproduce (if applicable)
Ingest 100M or more docs with a partition key and unsorted data, using the default configuration.
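
A sketch of the kind of index config that triggers this (field names, limits, and the index id are illustrative placeholders, not my actual setup):

```yaml
# Illustrative sketch only: "tenant_id" and the limits below are placeholders.
version: 0.8
index_id: my-partitioned-index
doc_mapping:
  field_mappings:
    - name: tenant_id
      type: u64
      fast: true
    - name: body
      type: text
  # Documents are routed into separate splits per partition key value.
  partition_key: tenant_id
  max_num_partitions: 200
indexing_settings:
  commit_timeout_secs: 60
```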

Expected behavior
Data should be ingested into the index without any issue.

Configuration:
Default configuration, version 0.8.1. For further details, please check this Discord thread (https://discord.com/channels/908281611840282624/1233364603480576092) where we discuss the issue and will hopefully find a solution.

The number of published documents is probably a limitation of the UI. I think we download a partial list of split metadata and do the computation in JavaScript.

@PieReissyet Concretely, what is the other issue you observe?
Ingestion blocks at one point for a few minutes, the push API returns 429, the number of splits then decreases and you can ingest again, or something like that?

My hunch is that heavy partitioning makes the load on merges heavier, especially if the merge policy is inappropriate and the number of partitions is high.

In that case the default setting for merge_concurrency is not sufficient, and you need to increase it.
merge_concurrency is an undocumented property in the indexer config (in quickwit.yaml).
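
Something like this in the node config (the value 6 is just an example, not a recommendation; size it to your CPU budget):

```yaml
# quickwit.yaml (node config) sketch: merge_concurrency caps how many merge
# operations run concurrently on the indexer. The value below is illustrative.
version: 0.8
indexer:
  merge_concurrency: 6
```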

You can confirm this hypothesis by looking at the pending merge / ongoing merge curves.
At each commit, the number of pending merges should increase sharply and then decrease as merges get done.

After 10 * commit_timeout, larger merges will occur and the peak will get even higher.
After 100 * commit_timeout, the same thing happens again at a larger scale.

The graph should show the number of pending merges eventually coming back to 0, or close to 0, rather than diverging.
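
For reference, that 10x / 100x cadence comes from the merge policy in the index config. A sketch with what I believe are roughly the default stable_log settings (double-check the docs for the exact defaults):

```yaml
# Index config excerpt (sketch): with merge_factor 10, about 10 small splits
# produced per commit_timeout get merged, and ~10 of those merged splits get
# merged again around 100 * commit_timeout, matching the peaks described above.
indexing_settings:
  commit_timeout_secs: 60
  merge_policy:
    type: stable_log
    merge_factor: 10
    max_merge_factor: 12
```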

Ingestion gets blocked, numbers go crazy on the UI, then eventually ingestion keeps going after a while. We had use cases where it was stuck for many hours and others where it recovered sooner, depending on the volumes.

We did not dig into the issue further since we managed to get rid of it by simply sorting our data by partition key.

I will definitely try the merge_concurrency setting if this happens again. Any idea what value I should use?

Ingestion gets blocked, numbers go crazy on the UI, then eventually ingestion keeps going after a while.

Yes, that is in line with what I thought.
If you have Prometheus, looking at the graph of pending/ongoing merges should be telling.

We did not dig into the issue further since we managed to get rid of it by simply sorting our data by partition key.
If this is something you can do, it is actually a nice hack to make Quickwit's work much lighter.
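
If anyone wants to do the same pre-sort, a minimal sketch (the field name tenant_id and the file paths are placeholders; for 100M+ docs you would want a streaming/external sort rather than loading everything in memory):

```python
import json

PARTITION_FIELD = "tenant_id"  # placeholder: use your actual partition key field

# Load the NDJSON file, sort by partition key, and write it back so that
# documents belonging to the same partition are ingested contiguously.
with open("docs.ndjson") as src:
    docs = [json.loads(line) for line in src if line.strip()]

docs.sort(key=lambda doc: doc[PARTITION_FIELD])

with open("docs.sorted.ndjson", "w") as dst:
    for doc in docs:
        dst.write(json.dumps(doc) + "\n")
```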

merge_concurrency value:

Actually, scratch that, I am wrong: merge_concurrency would only help if you had more than a single index. We do not allow parallel merges within a single pipeline at the moment. Right now the only thing you can do is wait for merges to progress.
What is the volume here?

We had issues with ~130M docs / ~45 GB of uncompressed data.