quickwit-oss/quickwit

local file ingestion success, 0 published documents/splits


Describe the bug
I am trying to ingest about 1.2B documents (1,900 files of about 1.8 GB each) using a single-node configuration deployed on Amazon ECS.
quickwit tool local-ingest appears to succeed, but there are no split files in the S3 bucket and search queries return nothing. quickwit index describe also shows nothing. Yet when I look at the metastore.json file, it lists mature splits, published docs, and so on. Could someone from the team point out what I am missing here?

local-ingest success message:

❯ download: s3://my-bucket/.../896.json.gz to ./896.json.gz
-rw-r--r-- 1 root root 258M Apr  2 20:03 896.json.gz
❯ unzipping 896.json.gz
-rw-r--r-- 1 root root 1.7G Apr  2 20:03 896.json
❯ making jsonl from 896.json
-rw-r--r-- 1 root root 1.6G Apr 15 11:42 896.jsonl
❯ Ingesting documents locally...

--------------------------------------------------
 Connectivity checklist 
 ✔ metastore storage
 ✔ metastore
 ✔ index storage
 ✔ _ingest-cli-source

 Num docs    2444 Parse errs     0 PublSplits   0 Input size     5MB Thrghput  2.75MB/s Time 00:00:02
 Num docs   17136 Parse errs     0 PublSplits   0 Input size    38MB Thrghput 12.86MB/s Time 00:00:03
 Num docs   36729 Parse errs     0 PublSplits   0 Input size    82MB Thrghput 20.66MB/s Time 00:00:04
 ...
 Num docs  761206 Parse errs     0 PublSplits   0 Input size  1712MB Thrghput  0.00MB/s Time 00:01:17
 Num docs  761206 Parse errs     0 PublSplits   0 Input size  1712MB Thrghput  0.00MB/s Time 00:01:18
 Num docs  761206 Parse errs     0 PublSplits   1 Input size  1712MB Thrghput  0.00MB/s Time 00:01:19

Indexed 761,206 documents in 1m 19s.
2024-04-15T11:38:28.350Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=MergeSplitDownloader-empty-hLbL exit_status=DownstreamClosed
Now, you can query the index with the following command:
quickwit index search --index my-indexer-2 --config ./config/quickwit.yaml --query "my query"
Clearing local cache directory...
✔ Local cache directory cleared.
✔ Documents successfully indexed.

quickwit index describe --index my-indexer-2:

  General Information
 --------------------------------------------+--------------------------------------------------------------------------------------------- 
  Index ID                                   | my-indexer-2
  Index URI                                  | s3://my-bucket/qw_index/my-indexer-2  
  Number of published documents              | 0 (0)
  Size of published documents (uncompressed) | 0 B
  Number of published splits                 | 0
  Size of published splits                   | 0 B
  Timestamp field                            | "timestamp"
  Timestamp range start                      | Timestamp does not exist for the index.
  Timestamp range end                        | Timestamp does not exist for the index.

quickwit index search --index my-indexer-2 --query "*":

{
  "num_hits": 0,
  "hits": [],
  "elapsed_time_micros": 821,
  "errors": []
}

metastore.json:

{
  "version": "0.7",
  "index": {
    "version": "0.7",
    "index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
    "index_config": {
      "version": "0.7",
      "index_id": "my-indexer-2",
      "index_uri": "s3://my-bucket/qw_index/my-indexer-2",
      "doc_mapping": {
        "field_mappings": [
          ...
        ],
        "tag_fields": [
          "some_url"
        ],
        "store_source": false,
        "index_field_presence": false,
        "timestamp_field": "timestamp",
        "mode": "dynamic",
        "dynamic_mapping": {
          ...
        },
        "partition_key": "job_id",
        "max_num_partitions": 10000,
        "tokenizers": []
      },
      "indexing_settings": {
        "commit_timeout_secs": 300,
        "docstore_compression_level": 8,
        "docstore_blocksize": 1000000,
        "split_num_docs_target": 1000000,
        "merge_policy": {
          "type": "stable_log",
          "min_level_num_docs": 100000,
          "merge_factor": 10,
          "max_merge_factor": 12,
          "maturation_period": "6h"
        },
        "resources": {
          "heap_size": "4.0 GB"
        }
      },
      "search_settings": {
        "default_search_fields": [
          ...
        ]
      },
      "retention": null
    },
    "checkpoint": {
      "_ingest-api-source": {},
      "_ingest-cli-source": {
        "file:///quickwit/0.jsonl": "00000000001346960658",
        "file:///quickwit/1.jsonl": "00000000001672670326",
        "file:///quickwit/10.jsonl": "00000000001494934641",
        "file:///quickwit/100.jsonl": "00000000001370648178",
        "file:///quickwit/1000.jsonl": "00000000001654146853",
        ...
        "file:///quickwit/895.jsonl": "00000000001734691539",
        "file:///quickwit/896.jsonl": "00000000001701627756",
        "file:///quickwit/897.jsonl": "00000000001788978023",
        "file:///quickwit/898.jsonl": "00000000001726562370",
        "file:///quickwit/899.jsonl": "00000000001798043217",
        "file:///quickwit/9.jsonl": "00000000001432854516"
      },
      "_ingest-source": {}
    },
    "create_timestamp": 1712926950,
    "sources": [
      {
        "version": "0.7",
        "source_id": "_ingest-cli-source",
        "max_num_pipelines_per_indexer": 1,
        "desired_num_pipelines": 1,
        "enabled": true,
        "source_type": "ingest-cli",
        "input_format": "json"
      },
      {
        "version": "0.7",
        "source_id": "_ingest-source",
        "max_num_pipelines_per_indexer": 1,
        "desired_num_pipelines": 1,
        "enabled": false,
        "source_type": "ingest",
        "input_format": "json"
      },
      {
        "version": "0.7",
        "source_id": "_ingest-api-source",
        "max_num_pipelines_per_indexer": 1,
        "desired_num_pipelines": 1,
        "enabled": true,
        "source_type": "ingest-api",
        "input_format": "json"
      }
    ]
  },
  "splits": [
    {
      "split_state": "MarkedForDeletion",
      "update_timestamp": 1712927482,
      "publish_timestamp": 1712927337,
      "version": "0.7",
      "split_id": "01HV96QF4JB9RK7WAK8F88BAD9",
      "index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
      "partition_id": 15543552982874619885,
      "source_id": "_ingest-cli-source",
      "node_id": "indexer-0",
      "num_docs": 743311,
      "uncompressed_docs_size_in_bytes": 1672670326,
      "time_range": {
        "start": 1702718790,
        "end": 1705296854
      },
      "create_timestamp": 1712927332,
      "maturity": {
        "type": "immature",
        "maturation_period_millis": 21600000
      },
      "tags": [
        "job_id!",
        "job_id:ff8fb966f1c74a769437a5d09eabd1f4"
      ],
      "footer_offsets": {
        "start": 663139484,
        "end": 663563079
      },
      "delete_opstamp": 0,
      "num_merge_ops": 0
    },
    {
      "split_state": "MarkedForDeletion",
      "update_timestamp": 1712927482,
      "publish_timestamp": 1712927184,
      "version": "0.7",
      "split_id": "01HV96K8RFST0Z2SJS7ASF1RRN",
      "index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
      "partition_id": 15543552982874619885,
      "source_id": "_ingest-cli-source",
      "node_id": "indexer-0",
      "num_docs": 600643,
      "uncompressed_docs_size_in_bytes": 1346960658,
      "time_range": {
        "start": 1702543504,
        "end": 1706364084
      },
      "create_timestamp": 1712927180,
      "maturity": {
        "type": "immature",
        "maturation_period_millis": 21600000
      },
      "tags": [
        "job_id!",
        "job_id:ff8fb966f1c74a769437a5d09eabd1f4"
      ],
      "footer_offsets": {
        "start": 541671361,
        "end": 542018700
      },
      "delete_opstamp": 0,
      "num_merge_ops": 0
    },
    {
      "split_state": "Published",
      "update_timestamp": 1712927482,
      "publish_timestamp": 1712927482,
      "version": "0.7",
      "split_id": "01HV96W77N5G1V8V6G9WXE292R",
      "index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
      "partition_id": 15543552982874619885,
      "source_id": "_ingest-cli-source",
      "node_id": "indexer-0",
      "num_docs": 1343954,
      "uncompressed_docs_size_in_bytes": 3019630984,
      "time_range": {
        "start": 1702543504,
        "end": 1706364084
      },
      "create_timestamp": 1712927475,
      "maturity": {
        "type": "mature"
      },
      "tags": [
        "job_id!",
        "job_id:ff8fb966f1c74a769437a5d09eabd1f4"
      ],
      "footer_offsets": {
        "start": 1204412102,
        "end": 1205130771
      },
      "delete_opstamp": 0,
      "num_merge_ops": 1
    },
    {
      "split_state": "Published",
      "update_timestamp": 1712927761,
      "publish_timestamp": 1712927761,
      "version": "0.7",
      "split_id": "01HV974Q8KPMA5HH8YCC7037CS",
      "index_uid": "my-indexer-2:01HV96E19YCPBGS1EJ82BSMY6M",
      "partition_id": 15543552982874619885,
      "source_id": "_ingest-cli-source",
      "node_id": "indexer-0",
      "num_docs": 1273644,
      "uncompressed_docs_size_in_bytes": 2865582819,
      "time_range": {
        "start": 1702585310,
        "end": 1706069131
      },
      "create_timestamp": 1712927748,
      "maturity": {
        "type": "mature"
      },
      "tags": [
        "job_id!",
        "job_id:ff8fb966f1c74a769437a5d09eabd1f4"
      ],
      "footer_offsets": {
        "start": 1151113177,
        "end": 1151796275
      },
      "delete_opstamp": 0,
      "num_merge_ops": 1
    },        
    ...        
  ],
  "delete_tasks": []
}
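To compare what metastore.json records with what quickwit index describe reports, the splits can be tallied per split_state. The following is a minimal sketch, not part of Quickwit: the function name summarize_splits is hypothetical, and the sample data contains only the four splits visible in the excerpt above (the elided entries are left out).

```python
from collections import Counter


def summarize_splits(metastore: dict) -> dict:
    """Count splits and documents per split_state in a parsed metastore.json."""
    splits_per_state = Counter()
    docs_per_state = Counter()
    for split in metastore.get("splits", []):
        state = split["split_state"]
        splits_per_state[state] += 1
        docs_per_state[state] += split["num_docs"]
    return {
        state: {"splits": splits_per_state[state], "num_docs": docs_per_state[state]}
        for state in splits_per_state
    }


# Only the fields needed for the tally, copied from the excerpt above.
sample_metastore = {
    "splits": [
        {"split_id": "01HV96QF4JB9RK7WAK8F88BAD9", "split_state": "MarkedForDeletion", "num_docs": 743311},
        {"split_id": "01HV96K8RFST0Z2SJS7ASF1RRN", "split_state": "MarkedForDeletion", "num_docs": 600643},
        {"split_id": "01HV96W77N5G1V8V6G9WXE292R", "split_state": "Published", "num_docs": 1343954},
        {"split_id": "01HV974Q8KPMA5HH8YCC7037CS", "split_state": "Published", "num_docs": 1273644},
    ]
}

for state, totals in summarize_splits(sample_metastore).items():
    print(state, totals)
```

For a real file, the dict would come from json.load on a downloaded copy of metastore.json. In the excerpt, the two splits marked for deletion add up to exactly the 1,343,954 documents of the first published split, which is consistent with them having been merged into it.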

Steps to reproduce (if applicable)
Steps to reproduce the behavior:

  1. Create an index: curl -X POST http://my-indexer:7280/api/v1/indexes -H "Content-Type: application/yaml" --data-binary @index_config.yaml
  2. Run the local file ingest command: quickwit tool local-ingest --index my-index-2 --input-path filename.jsonl
  3. Run a query: quickwit index search --index my-index-2 --query "*"

Expected behavior
Ingested documents show up in search results.

Configuration:

quickwit --version:
Quickwit v0.7.1 (01c2c7f 2024-01-23T01:49:36Z)

cat index_config.yaml:

version: 0.7
index_id: my-index-2
index_uri: "s3://my-bucket/qw_index/my-index-2"

doc_mapping:
  mode: dynamic
  dynamic_mapping:
    indexed: true
    stored: true
    tokenizer: default
    record: basic
    expand_dots: true
    fast: true
  field_mappings:
    - name: ob_id
      type: text
      fast: true
      tokenizer: raw
    - name: some_text
      type: text
      fast: true
      tokenizer: default
    - name: another_text
      type: text
      fast: true
      tokenizer: default
    - name: url
      type: text
      fast: true
      tokenizer: default
    - name: some_url
      type: text
      fast: true
      tokenizer: raw
    - name: origin_url
      type: text
      fast: true
      tokenizer: default
    - name: file_extension
      type: text
      fast: true
      tokenizer: raw
    - name: resource_type
      type: text
      fast: true
      tokenizer: raw
    - name: timestamp
      type: datetime
      input_formats:
        - unix_timestamp
      output_format: unix_timestamp_secs
      fast_precision: seconds
      fast: true
  tag_fields: ["some_url"] 
  timestamp_field: timestamp
  partition_key: "job_id"
  max_num_partitions: 10000 

indexing_settings:
  commit_timeout_secs: 300
  split_num_docs_target: 1000000
  merge_policy:
    type: "stable_log"
    merge_factor: 10
    max_merge_factor: 12
    maturation_period: 6h
  resources:
    heap_size: 4000000000 

search_settings:
  default_search_fields: [some_text, another_text]

cat node_config.yaml

version: 0.7

cluster_id: ${QW_CLUSTER_ID}
node_id: ${QW_NODE_ID}
listen_address: ${QW_LISTEN_ADDRESS}

metastore_uri: s3://${S3_BUCKET}/qw_index
default_index_root_uri: s3://${S3_BUCKET}/qw_index

rest:
  listen_port: ${QW_LISTEN_PORT:-7280}

grpc:
  max_message_size: 200MiB

storage:
  s3:
    region: ${AWS_REGION:-us-east-1}
    endpoint: https://s3.${AWS_REGION:-us-east-1}.amazonaws.com

indexer:
  split_store_max_num_bytes: ${QW_INDEX_STORAGE_SIZE:-5GiB}
  split_store_max_num_splits: 1000
  max_concurrent_split_uploads: 12
  cpu_capacity: 4

ingest_api:
  max_queue_memory_usage: ${QW_INGEST_MEMORY_SIZE:-4GiB}
  max_queue_disk_usage: ${QW_INGEST_STORAGE_SIZE:-4GiB}

searcher:
  fast_field_cache_capacity: 2G
  split_footer_cache_capacity: 1G
  partial_request_cache_capacity: 512M
  max_num_concurrent_split_searches: 512
  max_num_concurrent_split_streams: 512
  split_cache:
    max_num_bytes: ${QW_SPLIT_CACHE_STORAGE_SIZE:-5GiB}
    num_concurrent_downloads: 16

jaeger:
  enable_endpoint: false

ecs_task_definition:

8 vCPU
16GB memory
100 GB storage

Can you check whether any file was created on S3?
The S3 metastore is not safe for concurrent writers. I suspect the documents got indexed and a split was created, but its entry in the metastore was overwritten by the node running the metastore service.
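The kind of lost update suspected here can be illustrated with a toy model. This is a simplified sketch and not Quickwit's implementation: it assumes each writer loads the whole metastore once, keeps its own in-memory copy, and writes the entire object back on every publish. The names ObjectStore and Writer are hypothetical.

```python
import copy


class ObjectStore:
    """Stand-in for a single S3 object that holds the whole metastore."""

    def __init__(self):
        self.blob = {"splits": []}

    def get(self):
        return copy.deepcopy(self.blob)

    def put(self, blob):
        # A plain overwrite: no versioning or compare-and-swap.
        self.blob = copy.deepcopy(blob)


class Writer:
    """A process that caches the metastore and rewrites it on each publish."""

    def __init__(self, store):
        self.store = store
        self.state = store.get()  # loaded once, then kept in memory

    def publish(self, split_id):
        self.state["splits"].append(split_id)
        self.store.put(self.state)


store = ObjectStore()
server = Writer(store)  # models the long-running node
cli = Writer(store)     # models a separate local-ingest process

cli.publish("split-from-cli")
server.publish("split-from-server")  # written from a stale copy

print(store.get())  # the entry published by the CLI writer is gone
```

In this model the last writer wins, so whichever process writes from a stale copy silently discards the other's entries. Whether this is what happened in the report is only a hypothesis at this point in the thread.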

There is only the metastore.json file in the S3 bucket.

Screenshot from 2024-04-15 21-28-31

And the metastore.json lists mature splits and published docs, which confuses me even more.

@trinity-1686a it seems each file was too big (1.8G, 600K lines). I split the file into chunks of 100K lines and they were ingested without trouble.
Is there a maximum input size when using quickwit tool local-ingest?
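For reference, a workaround of this kind could be scripted roughly as follows. This is a sketch under assumptions: the function name split_jsonl and the .partNNNN naming scheme are hypothetical, and the exact method used above is not stated in the thread.

```python
from pathlib import Path


def split_jsonl(src, lines_per_chunk=100_000):
    """Split a JSONL file into chunk files of at most lines_per_chunk lines.

    Chunks are written next to the source as <stem>.partNNNN<suffix>.
    Returns the list of chunk paths in order.
    """
    src = Path(src)
    chunk_paths = []
    out = None
    try:
        # Binary mode copies each line byte for byte, with no re-encoding.
        with src.open("rb") as infile:
            for line_no, line in enumerate(infile):
                if line_no % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    chunk_name = f"{src.stem}.part{len(chunk_paths):04d}{src.suffix}"
                    chunk_path = src.with_name(chunk_name)
                    out = chunk_path.open("wb")
                    chunk_paths.append(chunk_path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

Calling split_jsonl("896.jsonl") would produce 896.part0000.jsonl, 896.part0001.jsonl, and so on, each of which can then be passed to quickwit tool local-ingest through --input-path.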

I don't recall there being one. If there is, it ought to print a message and make the CLI return a non-zero status code.

Rereading your messages, I'm a bit puzzled by a few things. quickwit index describe --index my-indexer-2 says there is no document, no split, nothing at all. Yet the metastore.json clearly shows multiple splits. Are you sure the metastore.json you shared is the one used by the running Quickwit instance?

I also note that in the screenshot of the AWS management console, you are filtering for objects whose name contains "metastore". If you don't filter, do other objects appear?