qdrant/qdrant

[question] Search API times out, collection is yellow; not sure when it will turn green and want to know why

Closed this issue · 4 comments

no7dw commented

Current Behavior

Search API times out and the collection is yellow.
Qdrant v1.9 in Docker on Linux.

2024-04-30T12:35:56.820282Z  INFO actix_web::middleware::logger: 10.0.8.35 "GET /collections/new_text HTTP/1.1" 200 681 "-" "python-httpx/0.27.0" 0.000285
2024-04-30T12:36:33.480292Z  INFO actix_web::middleware::logger: 10.0.8.35 "GET /collections HTTP/1.1" 200 196 "-" "python-httpx/0.27.0" 0.000168
2024-04-30T12:36:33.482359Z  INFO actix_web::middleware::logger: 10.0.8.35 "GET /collections/new_text HTTP/1.1" 200 686 "-" "python-httpx/0.27.0" 0.000513
2024-04-30T12:36:33.485410Z  INFO actix_web::middleware::logger: 10.0.8.35 "GET /collections/new_text HTTP/1.1" 200 687 "-" "python-httpx/0.27.0" 0.000261
2024-04-30T12:36:37.642380Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:36:37.642405Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:37.642645Z  WARN qdrant::actix::helpers: error processing request: 1 of 1 read operations failed:
  Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:37.642741Z  INFO actix_web::middleware::logger: 10.0.8.35 "POST /collections/new_text/points/search HTTP/1.1" 500 144 "-" "python-httpx/0.27.0" 60.002235
2024-04-30T12:36:57.090142Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:36:57.090169Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:57.090390Z  WARN qdrant::actix::helpers: error processing request: 1 of 1 read operations failed:
  Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:57.090510Z  INFO actix_web::middleware::logger: 10.0.8.35 "POST /collections/new_text/points/search HTTP/1.1" 500 146 "-" "python-httpx/0.27.0" 60.002175
2024-04-30T12:37:33.791869Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:37:33.791894Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:37:33.792126Z  WARN qdrant::actix::helpers: error processing request: 1 of 1 read operations failed:
  Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:37:33.792229Z  INFO actix_web::middleware::logger: 10.0.8.35 "POST /collections/new_text/points/search HTTP/1.1" 500 145 "-" "python-httpx/0.27.0" 60.002023

Steps to Reproduce

No

Expected Behavior

Expected performance as before: searches usually took <2 seconds.

Possible Solution

Context (Environment)

Qdrant v1.9 running in Docker on Linux
collection points: ~22 million (23,511,278 per the config below)

collection config:

{
  "result": {
    "status": "yellow",
    "optimizer_status": "ok",
    "indexed_vectors_count": 17850651,
    "points_count": 23511278,
    "segments_count": 14,
    "config": {
      "params": {
        "vectors": {
          "gemini:models/embedding-001": {
            "size": 768,
            "distance": "Cosine"
          },
          "huggingface:jinaai/jina-embeddings-v2-base-en": {
            "size": 768,
            "distance": "Cosine"
          },
          "mistral:mistral-embed": {
            "size": 1024,
            "distance": "Cosine"
          },
          "openai:text-embedding-ada-002": {
            "size": 1536,
            "distance": "Cosine",
            "on_disk": true
          }
        },
        "shard_number": 1,
        "replication_factor": 1,
        "write_consistency_factor": 1,
        "on_disk_payload": true
      },
      "hnsw_config": {
        "m": 16,
        "ef_construct": 100,
        "full_scan_threshold": 10000,
        "max_indexing_threads": 0,
        "on_disk": false
      },
      "optimizer_config": {
        "deleted_threshold": 0.2,
        "vacuum_min_vector_number": 1000,
        "default_segment_number": 0,
        "max_segment_size": null,
        "memmap_threshold": 120000,
        "indexing_threshold": 20000,
        "flush_interval_sec": 5,
        "max_optimization_threads": 16
      },
      "wal_config": {
        "wal_capacity_mb": 32,
        "wal_segments_ahead": 0
      },
      "quantization_config": {
        "binary": {
          "always_ram": true
        }
      }
    },
    "payload_schema": {
      "created_at": {
        "data_type": "integer",
        "points": 23511278
      },
      "content": {
        "data_type": "text",
        "points": 23511278
      },
      "website_profile": {
        "data_type": "keyword",
        "points": 23511278
      },
      "language": {
        "data_type": "keyword",
        "points": 300057
      },
      "source": {
        "data_type": "keyword",
        "points": 23511278
      },
      "headline": {
        "data_type": "text",
        "points": 23511278
      },
      "website_url": {
        "data_type": "keyword",
        "points": 23511278
      },
      "published_at": {
        "data_type": "integer",
        "points": 21310307
      }
    }
  },
  "status": "ok",
  "time": 0.000178057
}

CPU & memory seem busier than before.
htop: [screenshot]

iotop: [screenshot]

iotop sometimes shows:
6.37 M/s ./qdrant --uri http://10.0.9.90:6335 [rocksdb:high]
3108 be/4 root 0.00 B/s 3.91 K/s barad_agent

Disk performance:
iops: min=38484, max=68776, avg=58462.63, stdev=8412.20, samples=19


The weird thing is: if I add a filter (narrowing the candidate set to fewer than ~0.5 million records), the search does not time out. Previously everything was normal, and the scroll API still works as usual.
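For reference, the filtered search that still responds looks roughly like this (a minimal sketch with httpx against the REST search endpoint; the host, the choice of named vector, and the "example-source" filter value are placeholders rather than the exact production query):

# Sketch of a filtered search on the "new_text" collection (placeholder values).
import httpx

QDRANT_URL = "http://localhost:6333"   # assumption: default REST port
query_vector = [0.0] * 1536            # placeholder for a real ada-002 embedding

body = {
    # the collection has several named vectors, so the vector name must be given
    "vector": {"name": "openai:text-embedding-ada-002", "vector": query_vector},
    # keyword filter that narrows the candidate set (<0.5M points in our case);
    # "example-source" is a hypothetical value
    "filter": {"must": [{"key": "source", "match": {"value": "example-source"}}]},
    "limit": 10,
    "with_payload": True,
}

resp = httpx.post(f"{QDRANT_URL}/collections/new_text/points/search",
                  json=body, timeout=70.0)
resp.raise_for_status()
print(resp.json()["result"])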

Almost 8 hours have passed and searches still time out.

According to the debug log:

2024-04-30T11:40:10.103268Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field created_at
2024-04-30T11:40:43.880405Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field website_url
2024-04-30T11:40:43.883697Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field headline
2024-04-30T11:41:32.322913Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field website_url
2024-04-30T11:41:36.693755Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field source
2024-04-30T11:41:45.083401Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field content
2024-04-30T11:41:48.850779Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field language
2024-04-30T11:41:48.850810Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field source
2024-04-30T11:41:49.240156Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field content
2024-04-30T11:46:15.715200Z DEBUG reqwest::connect: starting new connection: https://telemetry.qdrant.io/
2024-04-30T11:54:09.682583Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field website_profile
2024-04-30T11:54:10.705628Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field published_at
2024-04-30T11:54:25.977056Z DEBUG segment::index::hnsw_index::hnsw: finish additional payload field indexing
2024-04-30T11:54:25.982247Z DEBUG segment::index::hnsw_index::hnsw: building HNSW for 1415328 vectors with 8 CPUs
2024-04-30T11:56:18.476187Z DEBUG segment::index::hnsw_index::hnsw: finish main graph
2024-04-30T11:56:18.515158Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field created_at
2024-04-30T11:58:57.732872Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field website_url
2024-04-30T11:58:58.071999Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field headline
2024-04-30T12:10:44.927503Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field language
2024-04-30T12:10:44.927531Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field source
2024-04-30T12:10:47.917964Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field content
2024-04-30T12:29:56.423444Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:29:56.423478Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:30:05.452039Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:30:05.452065Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:30:31.193711Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:30:31.193737Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:30:40.234053Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:30:40.234081Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:37.642380Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:36:37.642405Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:57.090142Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:36:57.090169Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:37:33.791869Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:37:33.791894Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:42:27.238580Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:42:27.238607Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:46:17.535698Z DEBUG reqwest::connect: starting new connection: https://telemetry.qdrant.io/
[root@qdrant1 ~]# docker logs  -n 500 f8763fa56e06 | grep finish
2024-04-30T06:46:06.245446Z DEBUG qdrant: Waiting for thread grpc_internal to finish
2024-04-30T11:26:29.712004Z DEBUG segment::index::hnsw_index::hnsw: finish main graph
2024-04-30T11:32:38.957801Z DEBUG segment::index::hnsw_index::hnsw: finish main graph
2024-04-30T11:40:09.802142Z DEBUG segment::index::hnsw_index::hnsw: finish additional payload field indexing
2024-04-30T11:40:10.050789Z DEBUG segment::index::hnsw_index::hnsw: finish main graph
2024-04-30T11:54:25.977056Z DEBUG segment::index::hnsw_index::hnsw: finish additional payload field indexing
2024-04-30T11:56:18.476187Z DEBUG segment::index::hnsw_index::hnsw: finish main graph

Judging from the above, it seems the index is still being built after startup.

Possible Implementation

I want to know why first, how we can avoid this phenomenon, and when the collection will turn green (i.e. how I could estimate when it will turn green and stop timing out, based on any clue from the logs or the htop/iotop metrics).

generall commented

Hey @no7dw, thanks for the detailed report!

The request timeouts can definitely be explained by the yellow status of the collection.
According to your collection info, there are ~6 million un-indexed vectors.
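To estimate when the collection will turn green, one option is to poll the collection info and extrapolate from the indexing rate. A rough sketch (assuming the default REST port and httpx; the ETA is only a heuristic, since indexing speed varies per segment):

# Poll indexed_vectors_count and extrapolate a rough ETA until status turns green.
import time
import httpx

QDRANT_URL = "http://localhost:6333"   # assumption: default REST port
COLLECTION = "new_text"

def collection_info():
    resp = httpx.get(f"{QDRANT_URL}/collections/{COLLECTION}", timeout=10.0)
    resp.raise_for_status()
    return resp.json()["result"]

prev, prev_t = collection_info(), time.time()
while prev["status"] != "green":
    time.sleep(60)
    cur, cur_t = collection_info(), time.time()
    rate = (cur["indexed_vectors_count"] - prev["indexed_vectors_count"]) / (cur_t - prev_t)
    remaining = cur["points_count"] - cur["indexed_vectors_count"]
    eta_min = remaining / rate / 60 if rate > 0 else float("inf")
    print(f"status={cur['status']} indexed={cur['indexed_vectors_count']} "
          f"remaining={remaining} rate={rate:.0f}/s eta~{eta_min:.0f} min")
    prev, prev_t = cur, cur_t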

Regarding your indexing time, it is worth some investigation. I notice that the memory usage of the qdrant process is 200+ GB, while I would expect binary quantization + the HNSW index together to contribute only up to ~32 GB of RAM.

To make sure, could you please share the output of cat /proc/<qdrant-pid>/status | grep Rss, where <qdrant-pid> is the PID of the qdrant process.

My hypothesis would be that your payload index consumes a lot of memory. Especially the

"content": {
       "data_type": "text",
       "points": 23511278
     },

And high memory usage may, in theory, slow down all other memory-related operations, which includes indexing.
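One way to test this hypothesis is to drop the full-text index on content and watch RSS afterwards. A sketch against the REST payload-index endpoint (only reasonable if full-text filters on that field are not critical, since they would then fall back to an unindexed scan):

# Sketch: drop the full-text payload index on "content" to test the memory hypothesis.
import httpx

QDRANT_URL = "http://localhost:6333"   # assumption: default REST port
resp = httpx.delete(f"{QDRANT_URL}/collections/new_text/index/content", timeout=60.0)
resp.raise_for_status()
print(resp.json())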

no7dw commented
[root@qdrant1 ~]# ps -ef| grep qdrant
root     1928203 1928176  3 Apr30 ?        00:19:41 ./qdrant --uri http://10.0.9.90:6335
root     2324591 2324450  0 10:33 pts/3    00:00:00 grep --color=auto qdrant

[root@qdrant1 ~]# cat /proc/1928203/status | grep -i rss
VmRSS:	212186196 kB
RssAnon:	212070748 kB
RssFile:	  115448 kB
RssShmem:	       0 kB

Your hypothesis seems to make sense; the text index was added to support a feature (full-text search), if my memory serves. Will get back to you with more details about this.

no7dw commented

After a few config changes and experiments, we managed to decrease memory usage from ~200 GB previously to ~30 GB:

  • recreated a new collection to avoid this issue (#4112) (#4131)
  • removed some of the payload indexes when creating the new collection, following @generall's hypothesis that the content index consumes too much memory (in our case the content (mostly) and headline fields), as sketched below
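For completeness, a sketch of what the recreated setup looks like conceptually (hypothetical collection name new_text_v2; vector params and quantization copied from the config above, with only lightweight keyword/integer payload indexes instead of text indexes on content and headline):

# Sketch: recreate the collection without the heavy full-text payload indexes.
import httpx

QDRANT_URL = "http://localhost:6333"   # assumption: default REST port
NEW_COLLECTION = "new_text_v2"         # hypothetical name

vectors = {
    "gemini:models/embedding-001": {"size": 768, "distance": "Cosine"},
    "huggingface:jinaai/jina-embeddings-v2-base-en": {"size": 768, "distance": "Cosine"},
    "mistral:mistral-embed": {"size": 1024, "distance": "Cosine"},
    "openai:text-embedding-ada-002": {"size": 1536, "distance": "Cosine", "on_disk": True},
}

httpx.put(f"{QDRANT_URL}/collections/{NEW_COLLECTION}", timeout=60.0, json={
    "vectors": vectors,
    "on_disk_payload": True,
    "quantization_config": {"binary": {"always_ram": True}},
}).raise_for_status()

# Only lightweight payload indexes; content/headline get no "text" index this time.
for field, schema in {
    "created_at": "integer",
    "published_at": "integer",
    "source": "keyword",
    "website_profile": "keyword",
    "website_url": "keyword",
    "language": "keyword",
}.items():
    httpx.put(f"{QDRANT_URL}/collections/{NEW_COLLECTION}/index", timeout=60.0,
              json={"field_name": field, "field_schema": schema}).raise_for_status()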

generall commented

we managed to decrease memory usage from ~200 GB previously to ~30 GB

Fantastic!

Assuming we can consider this solved, I'm closing this issue now. Feel free to open it again if you have any further questions.