local file ingestion success, 0 published documents/splits
Describe the bug
I am trying to ingest about 1.2B documents (1.8GB x 1900 files) using a single-node configuration deployed on Amazon ECS. quickwit tool local-ingest seems to succeed, but there are no split files in the S3 bucket and nothing in the search results. quickwit index describe also shows nothing. However, when I look at the metastore.json file, there are mature splits, published docs, and so on. I wish someone from the team could tell me what I am missing here.
local-ingest success message:
❯ download: s3://my-bucket/.../896.json.gz to ./896.json.gz
-rw-r--r-- 1 root root 258M Apr 2 20:03 896.json.gz
❯ unzipping 896.json.gz
-rw-r--r-- 1 root root 1.7G Apr 2 20:03 896.json
❯ making jsonl from 896.json
-rw-r--r-- 1 root root 1.6G Apr 15 11:42 896.jsonl
❯ Ingesting documents locally...
--------------------------------------------------
Connectivity checklist
✔ metastore storage
✔ metastore
✔ index storage
✔ _ingest-cli-source
Num docs 2444 Parse errs 0 PublSplits 0 Input size 5MB Thrghput 2.75MB/s Time 00:00:02
Num docs 17136 Parse errs 0 PublSplits 0 Input size 38MB Thrghput 12.86MB/s Time 00:00:03
Num docs 36729 Parse errs 0 PublSplits 0 Input size 82MB Thrghput 20.66MB/s Time 00:00:04
...
Num docs 761206 Parse errs 0 PublSplits 0 Input size 1712MB Thrghput 0.00MB/s Time 00:01:17
Num docs 761206 Parse errs 0 PublSplits 0 Input size 1712MB Thrghput 0.00MB/s Time 00:01:18
Num docs 761206 Parse errs 0 PublSplits 1 Input size 1712MB Thrghput 0.00MB/s Time 00:01:19
Indexed 761,206 documents in 1m 19s.
2024-04-15T11:38:28.350Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=MergeSplitDownloader-empty-hLbL exit_status=DownstreamClosed
Now, you can query the index with the following command:
quickwit index search --index my-indexer-2 --config ./config/quickwit.yaml --query "my query"
Clearing local cache directory...
✔ Local cache directory cleared.
✔ Documents successfully indexed.
quickwit index describe --index my-indexer-2:
General Information
--------------------------------------------+---------------------------------------------------------------------------------------------
Index ID | my-indexer-2
Index URI | s3://my-bucket/qw_index/my-indexer-2
Number of published documents | 0 (0)
Size of published documents (uncompressed) | 0 B
Number of published splits | 0
Size of published splits | 0 B
Timestamp field | "timestamp"
Timestamp range start | Timestamp does not exist for the index.
Timestamp range end | Timestamp does not exist for the index.
quickwit index search --index my-indexer-2 --query "*":
{
"num_hits": 0,
"hits": [],
"elapsed_time_micros": 821,
"errors": []
}
metastore.json:
{
"version": "0.7",
"index": {
"version": "0.7",
"index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
"index_config": {
"version": "0.7",
"index_id": "my-indexer-2",
"index_uri": "s3://my-bucket/qw_index/my-indexer-2",
"doc_mapping": {
"field_mappings": [
...
],
"tag_fields": [
"some_url"
],
"store_source": false,
"index_field_presence": false,
"timestamp_field": "timestamp",
"mode": "dynamic",
"dynamic_mapping": {
...
},
"partition_key": "job_id",
"max_num_partitions": 10000,
"tokenizers": []
},
"indexing_settings": {
"commit_timeout_secs": 300,
"docstore_compression_level": 8,
"docstore_blocksize": 1000000,
"split_num_docs_target": 1000000,
"merge_policy": {
"type": "stable_log",
"min_level_num_docs": 100000,
"merge_factor": 10,
"max_merge_factor": 12,
"maturation_period": "6h"
},
"resources": {
"heap_size": "4.0 GB"
}
},
"search_settings": {
"default_search_fields": [
...
]
},
"retention": null
},
"checkpoint": {
"_ingest-api-source": {},
"_ingest-cli-source": {
"file:///quickwit/0.jsonl": "00000000001346960658",
"file:///quickwit/1.jsonl": "00000000001672670326",
"file:///quickwit/10.jsonl": "00000000001494934641",
"file:///quickwit/100.jsonl": "00000000001370648178",
"file:///quickwit/1000.jsonl": "00000000001654146853",
...
"file:///quickwit/895.jsonl": "00000000001734691539",
"file:///quickwit/896.jsonl": "00000000001701627756",
"file:///quickwit/897.jsonl": "00000000001788978023",
"file:///quickwit/898.jsonl": "00000000001726562370",
"file:///quickwit/899.jsonl": "00000000001798043217",
"file:///quickwit/9.jsonl": "00000000001432854516"
},
"_ingest-source": {}
},
"create_timestamp": 1712926950,
"sources": [
{
"version": "0.7",
"source_id": "_ingest-cli-source",
"max_num_pipelines_per_indexer": 1,
"desired_num_pipelines": 1,
"enabled": true,
"source_type": "ingest-cli",
"input_format": "json"
},
{
"version": "0.7",
"source_id": "_ingest-source",
"max_num_pipelines_per_indexer": 1,
"desired_num_pipelines": 1,
"enabled": false,
"source_type": "ingest",
"input_format": "json"
},
{
"version": "0.7",
"source_id": "_ingest-api-source",
"max_num_pipelines_per_indexer": 1,
"desired_num_pipelines": 1,
"enabled": true,
"source_type": "ingest-api",
"input_format": "json"
}
]
},
"splits": [
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1712927482,
"publish_timestamp": 1712927337,
"version": "0.7",
"split_id": "01HV96QF4JB9RK7WAK8F88BAD9",
"index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
"partition_id": 15543552982874619885,
"source_id": "_ingest-cli-source",
"node_id": "indexer-0",
"num_docs": 743311,
"uncompressed_docs_size_in_bytes": 1672670326,
"time_range": {
"start": 1702718790,
"end": 1705296854
},
"create_timestamp": 1712927332,
"maturity": {
"type": "immature",
"maturation_period_millis": 21600000
},
"tags": [
"job_id!",
"job_id:ff8fb966f1c74a769437a5d09eabd1f4"
],
"footer_offsets": {
"start": 663139484,
"end": 663563079
},
"delete_opstamp": 0,
"num_merge_ops": 0
},
{
"split_state": "MarkedForDeletion",
"update_timestamp": 1712927482,
"publish_timestamp": 1712927184,
"version": "0.7",
"split_id": "01HV96K8RFST0Z2SJS7ASF1RRN",
"index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
"partition_id": 15543552982874619885,
"source_id": "_ingest-cli-source",
"node_id": "indexer-0",
"num_docs": 600643,
"uncompressed_docs_size_in_bytes": 1346960658,
"time_range": {
"start": 1702543504,
"end": 1706364084
},
"create_timestamp": 1712927180,
"maturity": {
"type": "immature",
"maturation_period_millis": 21600000
},
"tags": [
"job_id!",
"job_id:ff8fb966f1c74a769437a5d09eabd1f4"
],
"footer_offsets": {
"start": 541671361,
"end": 542018700
},
"delete_opstamp": 0,
"num_merge_ops": 0
},
{
"split_state": "Published",
"update_timestamp": 1712927482,
"publish_timestamp": 1712927482,
"version": "0.7",
"split_id": "01HV96W77N5G1V8V6G9WXE292R",
"index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
"partition_id": 15543552982874619885,
"source_id": "_ingest-cli-source",
"node_id": "indexer-0",
"num_docs": 1343954,
"uncompressed_docs_size_in_bytes": 3019630984,
"time_range": {
"start": 1702543504,
"end": 1706364084
},
"create_timestamp": 1712927475,
"maturity": {
"type": "mature"
},
"tags": [
"job_id!",
"job_id:ff8fb966f1c74a769437a5d09eabd1f4"
],
"footer_offsets": {
"start": 1204412102,
"end": 1205130771
},
"delete_opstamp": 0,
"num_merge_ops": 1
},
{
"split_state": "Published",
"update_timestamp": 1712927761,
"publish_timestamp": 1712927761,
"version": "0.7",
"split_id": "01HV974Q8KPMA5HH8YCC7037CS",
"index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
"partition_id": 15543552982874619885,
"source_id": "_ingest-cli-source",
"node_id": "indexer-0",
"num_docs": 1273644,
"uncompressed_docs_size_in_bytes": 2865582819,
"time_range": {
"start": 1702585310,
"end": 1706069131
},
"create_timestamp": 1712927748,
"maturity": {
"type": "mature"
},
"tags": [
"job_id!",
"job_id:ff8fb966f1c74a769437a5d09eabd1f4"
],
"footer_offsets": {
"start": 1151113177,
"end": 1151796275
},
"delete_opstamp": 0,
"num_merge_ops": 1
},
...
],
"delete_tasks": []
}
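To compare this file with what quickwit index describe reports, it can help to tally splits and documents per split_state. The following is a minimal sketch, assuming the file has been downloaded and parsed with json.load; the function name summarize_splits is hypothetical and not part of Quickwit.

```python
from collections import Counter


def summarize_splits(metastore: dict) -> dict:
    """Count splits and documents per split_state in a parsed metastore.json."""
    splits_per_state = Counter()
    docs_per_state = Counter()
    # The excerpt above keeps the split list under the top-level "splits" key.
    for split in metastore.get("splits", []):
        state = split["split_state"]
        splits_per_state[state] += 1
        docs_per_state[state] += split["num_docs"]
    return {
        state: {"splits": splits_per_state[state], "docs": docs_per_state[state]}
        for state in splits_per_state
    }
```

Applied to the four splits shown above, the two MarkedForDeletion splits hold 743311 + 600643 = 1343954 documents, which equals the num_docs of the first Published (merged) split.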
Steps to reproduce (if applicable)
Steps to reproduce the behavior:
- create an index
curl -X POST http://my-indexer:7280/api/v1/indexes -H "Content-Type: application/yaml" --data-binary @index_config.yaml
- run local file ingest command
quickwit tool local-ingest --index my-index-2 --input-path filename.jsonl
- run query
quickwit index search --index my-index-2 --query "*"
Expected behavior
ingested docs show up in search
Configuration:
quickwit --version:
Quickwit v0.7.1 (01c2c7f 2024-01-23T01:49:36Z)
cat index_config.yaml:
version: 0.7
index_id: my-index-2
index_uri: "s3://my-bucket/qw_index/my-index-2"
doc_mapping:
  mode: dynamic
  dynamic_mapping:
    indexed: true
    stored: true
    tokenizer: default
    record: basic
    expand_dots: true
    fast: true
  field_mappings:
    - name: ob_id
      type: text
      fast: true
      tokenizer: raw
    - name: some_text
      type: text
      fast: true
      tokenizer: default
    - name: another_text
      type: text
      fast: true
      tokenizer: default
    - name: url
      type: text
      fast: true
      tokenizer: default
    - name: some_url
      type: text
      fast: true
      tokenizer: raw
    - name: origin_url
      type: text
      fast: true
      tokenizer: default
    - name: file_extension
      type: text
      fast: true
      tokenizer: raw
    - name: resource_type
      type: text
      fast: true
      tokenizer: raw
    - name: timestamp
      type: datetime
      input_formats:
        - unix_timestamp
      output_format: unix_timestamp_secs
      fast_precision: seconds
      fast: true
  tag_fields: ["some_url"]
  timestamp_field: timestamp
  partition_key: "job_id"
  max_num_partitions: 10000
indexing_settings:
  commit_timeout_secs: 300
  split_num_docs_target: 1000000
  merge_policy:
    type: "stable_log"
    merge_factor: 10
    max_merge_factor: 12
    maturation_period: 6h
  resources:
    heap_size: 4000000000
search_settings:
  default_search_fields: [some_text, another_text]
cat node_config.yaml:
version: 0.7
cluster_id: ${QW_CLUSTER_ID}
node_id: ${QW_NODE_ID}
listen_address: ${QW_LISTEN_ADDRESS}
metastore_uri: s3://${S3_BUCKET}/qw_index
default_index_root_uri: s3://${S3_BUCKET}/qw_index
rest:
  listen_port: ${QW_LISTEN_PORT:-7280}
grpc:
  max_message_size: 200MiB
storage:
  s3:
    region: ${AWS_REGION:-us-east-1}
    endpoint: https://s3.${AWS_REGION:-us-east-1}.amazonaws.com
indexer:
  split_store_max_num_bytes: ${QW_INDEX_STORAGE_SIZE:-5GiB}
  split_store_max_num_splits: 1000
  max_concurrent_split_uploads: 12
  cpu_capacity: 4
ingest_api:
  max_queue_memory_usage: ${QW_INGEST_MEMORY_SIZE:-4GiB}
  max_queue_disk_usage: ${QW_INGEST_STORAGE_SIZE:-4GiB}
searcher:
  fast_field_cache_capacity: 2G
  split_footer_cache_capacity: 1G
  partial_request_cache_capacity: 512M
  max_num_concurrent_split_searches: 512
  max_num_concurrent_split_streams: 512
  split_cache:
    max_num_bytes: ${QW_SPLIT_CACHE_STORAGE_SIZE:-5GiB}
    num_concurrent_downloads: 16
jaeger:
  enable_endpoint: false
ecs_task_definition:
8 vCPU
16GB memory
100 GB storage
Can you check whether any file was created on S3?
The S3 metastore is not safe for concurrent writers. I suspect the documents got indexed and a split was created, but its entry in the metastore was then overwritten by the node running the metastore service.
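As an illustration of the kind of lost update suspected here, the toy sketch below simulates two writers doing read-modify-write on a single shared object. It is not Quickwit code: FileBackedMetastore, publish_split, and the split ID are hypothetical names, and it only shows the general failure mode, not what actually happened in this deployment.

```python
import copy


class FileBackedMetastore:
    """Toy stand-in for a single metastore.json object; not Quickwit code."""

    def __init__(self):
        self._stored = {"splits": []}

    def read(self) -> dict:
        return copy.deepcopy(self._stored)

    def write(self, state: dict) -> None:
        # Whole-object overwrite, like a PUT on an object store.
        self._stored = copy.deepcopy(state)


def publish_split(state: dict, split_id: str) -> dict:
    state["splits"].append({"split_id": split_id, "split_state": "Published"})
    return state


store = FileBackedMetastore()

# Both writers read the same (empty) state before either one writes.
cli_view = store.read()
server_view = store.read()

# The CLI publishes its split and writes the whole object back.
store.write(publish_split(cli_view, "split-from-local-ingest"))
# The server then writes back its stale copy, silently dropping that split.
store.write(server_view)

remaining = [s["split_id"] for s in store.read()["splits"]]
```

After the second write, remaining is an empty list even though the first writer reported success.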
And in the metastore.json there are mature splits and published docs... which makes this even more confusing.
@trinity-1686a it seems each file was too big (1.8G, 600K lines). I split each file into chunks of 100K lines and they were ingested without trouble.
Is there a maximum input size when using quickwit tool local-ingest?
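The workaround described above could be sketched as follows. This is a minimal example, assuming newline-delimited JSON input; the helper split_jsonl and the .partNNNN naming scheme are hypothetical, and the 100K default simply mirrors the chunk size mentioned in the comment.

```python
from pathlib import Path


def split_jsonl(input_path: str, lines_per_chunk: int = 100_000) -> list[Path]:
    """Write consecutive chunks of `input_path` next to it and return their paths."""
    if lines_per_chunk <= 0:
        raise ValueError("lines_per_chunk must be positive")
    source = Path(input_path)
    chunk_paths: list[Path] = []
    out = None
    try:
        with source.open("rb") as src:
            for line_no, line in enumerate(src):
                # Start a new chunk file every `lines_per_chunk` lines.
                if line_no % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    chunk_path = source.with_name(
                        f"{source.stem}.part{len(chunk_paths):04d}{source.suffix}"
                    )
                    chunk_paths.append(chunk_path)
                    out = chunk_path.open("wb")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

Each returned path could then be passed as --input-path to quickwit tool local-ingest, as in the reproduction steps. Lines are copied as raw bytes, so the documents themselves are not parsed or modified.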
I don't recall there being one. If there is, it ought to print a message and make the CLI return a non-zero status code.
Rereading your messages, I'm a bit puzzled by a few things. quickwit index describe --index my-indexer-2 says there are no documents and no splits at all, yet the metastore.json clearly shows multiple splits. Are you sure the metastore.json you shared is the one used by the running Quickwit instance?
I also note that in the screenshot of the AWS management console, you are filtering for objects whose name contains "metastore". If you don't filter, do other objects appear?
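One way to answer both questions at once is to cross-check the split IDs in the metastore.json against an unfiltered listing of the bucket. The sketch below assumes split objects are stored as <split_id>.split under the index URI (an assumption to verify against your bucket) and takes the object keys as a plain list, however they were obtained; missing_split_objects is a hypothetical helper, not a Quickwit API.

```python
def missing_split_objects(metastore: dict, object_keys: list[str]) -> list[str]:
    """Return IDs of Published splits with no matching `<split_id>.split` key."""
    # Assumption: the object name is the split ID followed by ".split".
    present = {
        key.rsplit("/", 1)[-1][: -len(".split")]
        for key in object_keys
        if key.endswith(".split")
    }
    return [
        split["split_id"]
        for split in metastore.get("splits", [])
        if split["split_state"] == "Published" and split["split_id"] not in present
    ]
```

An empty result would suggest the Published splits recorded in the metastore do exist in storage; a non-empty one would point at splits that were recorded but never uploaded, or uploaded elsewhere.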