[question] Search API times out while the collection is yellow; when will it turn green, and why?
Closed this issue · 4 comments
Current Behavior
Search API requests time out while the collection status is yellow.
Qdrant v1.9 in Docker on Linux.
2024-04-30T12:35:56.820282Z INFO actix_web::middleware::logger: 10.0.8.35 "GET /collections/new_text HTTP/1.1" 200 681 "-" "python-httpx/0.27.0" 0.000285
2024-04-30T12:36:33.480292Z INFO actix_web::middleware::logger: 10.0.8.35 "GET /collections HTTP/1.1" 200 196 "-" "python-httpx/0.27.0" 0.000168
2024-04-30T12:36:33.482359Z INFO actix_web::middleware::logger: 10.0.8.35 "GET /collections/new_text HTTP/1.1" 200 686 "-" "python-httpx/0.27.0" 0.000513
2024-04-30T12:36:33.485410Z INFO actix_web::middleware::logger: 10.0.8.35 "GET /collections/new_text HTTP/1.1" 200 687 "-" "python-httpx/0.27.0" 0.000261
2024-04-30T12:36:37.642380Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:36:37.642405Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:37.642645Z WARN qdrant::actix::helpers: error processing request: 1 of 1 read operations failed:
Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:37.642741Z INFO actix_web::middleware::logger: 10.0.8.35 "POST /collections/new_text/points/search HTTP/1.1" 500 144 "-" "python-httpx/0.27.0" 60.002235
2024-04-30T12:36:57.090142Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:36:57.090169Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:57.090390Z WARN qdrant::actix::helpers: error processing request: 1 of 1 read operations failed:
Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:57.090510Z INFO actix_web::middleware::logger: 10.0.8.35 "POST /collections/new_text/points/search HTTP/1.1" 500 146 "-" "python-httpx/0.27.0" 60.002175
2024-04-30T12:37:33.791869Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:37:33.791894Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:37:33.792126Z WARN qdrant::actix::helpers: error processing request: 1 of 1 read operations failed:
Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:37:33.792229Z INFO actix_web::middleware::logger: 10.0.8.35 "POST /collections/new_text/points/search HTTP/1.1" 500 145 "-" "python-httpx/0.27.0" 60.002023
Steps to Reproduce
Not available.
Expected Behavior
Searches previously completed in under 2 seconds.
Possible Solution
Context (Environment)
Running Qdrant v1.9 in Docker on Linux.
Collection size: ~23.5 million points.
Collection config:
{
"result": {
"status": "yellow",
"optimizer_status": "ok",
"indexed_vectors_count": 17850651,
"points_count": 23511278,
"segments_count": 14,
"config": {
"params": {
"vectors": {
"gemini:models/embedding-001": {
"size": 768,
"distance": "Cosine"
},
"huggingface:jinaai/jina-embeddings-v2-base-en": {
"size": 768,
"distance": "Cosine"
},
"mistral:mistral-embed": {
"size": 1024,
"distance": "Cosine"
},
"openai:text-embedding-ada-002": {
"size": 1536,
"distance": "Cosine",
"on_disk": true
}
},
"shard_number": 1,
"replication_factor": 1,
"write_consistency_factor": 1,
"on_disk_payload": true
},
"hnsw_config": {
"m": 16,
"ef_construct": 100,
"full_scan_threshold": 10000,
"max_indexing_threads": 0,
"on_disk": false
},
"optimizer_config": {
"deleted_threshold": 0.2,
"vacuum_min_vector_number": 1000,
"default_segment_number": 0,
"max_segment_size": null,
"memmap_threshold": 120000,
"indexing_threshold": 20000,
"flush_interval_sec": 5,
"max_optimization_threads": 16
},
"wal_config": {
"wal_capacity_mb": 32,
"wal_segments_ahead": 0
},
"quantization_config": {
"binary": {
"always_ram": true
}
}
},
"payload_schema": {
"created_at": {
"data_type": "integer",
"points": 23511278
},
"content": {
"data_type": "text",
"points": 23511278
},
"website_profile": {
"data_type": "keyword",
"points": 23511278
},
"language": {
"data_type": "keyword",
"points": 300057
},
"source": {
"data_type": "keyword",
"points": 23511278
},
"headline": {
"data_type": "text",
"points": 23511278
},
"website_url": {
"data_type": "keyword",
"points": 23511278
},
"published_at": {
"data_type": "integer",
"points": 21310307
}
}
},
"status": "ok",
"time": 0.000178057
}
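The yellow status is visible in these numbers: the gap between `points_count` and `indexed_vectors_count` is the backlog of vectors still waiting for HNSW indexing. A minimal sketch of the arithmetic, using the figures from the response above (with multiple named vectors per point the two counters are not strictly comparable, so treat this as a rough estimate only):

```python
# Compute the indexing backlog from a Qdrant collection-info response.
# The dict below mirrors only the relevant fields of the JSON shown above.

def indexing_backlog(info: dict) -> dict:
    """Return the number of un-indexed vectors and the indexed fraction."""
    result = info["result"]
    total = result["points_count"]
    indexed = result["indexed_vectors_count"]
    return {
        "unindexed": total - indexed,
        "indexed_fraction": indexed / total,
    }

info = {
    "result": {
        "status": "yellow",
        "points_count": 23511278,
        "indexed_vectors_count": 17850651,
    }
}

print(indexing_backlog(info)["unindexed"])  # 5660627 (~5.7M vectors still to index)
```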
CPU and memory are busier than before.
htop
iotop sometimes shows:
6.37 M/s ./qdrant --uri http://10.0.9.90:6335 [rocksdb:high]
3108 be/4 root 0.00 B/s 3.91 K/s barad_agent
Disk performance:
iops : min=38484, max=68776, avg=58462.63, stdev=8412.20, samples=19
The weird thing is: adding a filter that narrows the search to fewer than 0.5 million records does not time out. Searches were previously quite normal, and the scroll API still works as usual.
Almost 8 hours have passed, and searches still time out.
According to the debug log:
2024-04-30T11:40:10.103268Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field created_at
2024-04-30T11:40:43.880405Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field website_url
2024-04-30T11:40:43.883697Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field headline
2024-04-30T11:41:32.322913Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field website_url
2024-04-30T11:41:36.693755Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field source
2024-04-30T11:41:45.083401Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field content
2024-04-30T11:41:48.850779Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field language
2024-04-30T11:41:48.850810Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field source
2024-04-30T11:41:49.240156Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field content
2024-04-30T11:46:15.715200Z DEBUG reqwest::connect: starting new connection: https://telemetry.qdrant.io/
2024-04-30T11:54:09.682583Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field website_profile
2024-04-30T11:54:10.705628Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field published_at
2024-04-30T11:54:25.977056Z DEBUG segment::index::hnsw_index::hnsw: finish additional payload field indexing
2024-04-30T11:54:25.982247Z DEBUG segment::index::hnsw_index::hnsw: building HNSW for 1415328 vectors with 8 CPUs
2024-04-30T11:56:18.476187Z DEBUG segment::index::hnsw_index::hnsw: finish main graph
2024-04-30T11:56:18.515158Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field created_at
2024-04-30T11:58:57.732872Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field website_url
2024-04-30T11:58:58.071999Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field headline
2024-04-30T12:10:44.927503Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field language
2024-04-30T12:10:44.927531Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field source
2024-04-30T12:10:47.917964Z DEBUG segment::index::hnsw_index::hnsw: building additional index for field content
2024-04-30T12:29:56.423444Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:29:56.423478Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:30:05.452039Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:30:05.452065Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:30:31.193711Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:30:31.193737Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:30:40.234053Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:30:40.234081Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:37.642380Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:36:37.642405Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:36:57.090142Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:36:57.090169Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:37:33.791869Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:37:33.791894Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:42:27.238580Z DEBUG collection::shards::local_shard::shard_ops: Search timeout reached: 60 seconds
2024-04-30T12:42:27.238607Z DEBUG collection::shards::replica_set::execute_read_operation: Read operation failed: Timeout error: Operation 'Search' timed out after 60 seconds
2024-04-30T12:46:17.535698Z DEBUG reqwest::connect: starting new connection: https://telemetry.qdrant.io/
[root@qdrant1 ~]# docker logs -n 500 f8763fa56e06 | grep finish
2024-04-30T06:46:06.245446Z DEBUG qdrant: Waiting for thread grpc_internal to finish
2024-04-30T11:26:29.712004Z DEBUG segment::index::hnsw_index::hnsw: finish main graph
2024-04-30T11:32:38.957801Z DEBUG segment::index::hnsw_index::hnsw: finish main graph
2024-04-30T11:40:09.802142Z DEBUG segment::index::hnsw_index::hnsw: finish additional payload field indexing
2024-04-30T11:40:10.050789Z DEBUG segment::index::hnsw_index::hnsw: finish main graph
2024-04-30T11:54:25.977056Z DEBUG segment::index::hnsw_index::hnsw: finish additional payload field indexing
2024-04-30T11:56:18.476187Z DEBUG segment::index::hnsw_index::hnsw: finish main graph
Judging from the above, indexes are apparently still being built after startup.
Possible Implementation
I'd first like to understand why this happens, how to avoid it, and when the collection will turn green. Is there any clue (logs, htop/iotop metrics) I could use to estimate when it will turn green and stop timing out?
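One rough way to answer the "when will it turn green" question is to poll `GET /collections/new_text` periodically and extrapolate from the growth of `indexed_vectors_count`. This linear extrapolation is my own assumption, not an official method; indexing speed varies with segment size and I/O load, so treat the result as an order-of-magnitude estimate:

```python
def eta_seconds(t0: float, n0: int, t1: float, n1: int, total: int):
    """Estimate seconds until `total` vectors are indexed, given two samples
    (timestamp, indexed_vectors_count) taken from the collection-info endpoint
    (GET /collections/new_text -> result.indexed_vectors_count)."""
    rate = (n1 - n0) / (t1 - t0)   # vectors indexed per second
    if rate <= 0:
        return None                # no visible progress between samples
    return (total - n1) / rate

# Example with hypothetical samples taken 10 minutes (600 s) apart:
eta = eta_seconds(0, 17_000_000, 600, 17_850_651, 23_511_278)
# eta is ~4000 s, i.e. roughly an hour at that indexing rate
```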
Hey @no7dw, thanks for the detailed report!
The request timeouts can definitely be explained by the yellow status of the collection.
According to your collection info, there are ~6 million un-indexed vectors.
The indexing time, however, is worth some investigation: I notice the memory usage of the qdrant process is 200+ GB, while I would expect binary quantization + the HNSW index together to contribute at most ~32 GB of RAM.
To confirm, could you please share the output of `cat /proc/<qdrant-pid>/status | grep Rss`, where `<qdrant-pid>` is the PID of the qdrant process?
My hypothesis would be that your payload index consumes a lot of memory, especially this one:
"content": {
"data_type": "text",
"points": 23511278
},
High memory usage may, in theory, slow down all other memory-bound operations, including indexing.
[root@qdrant1 ~]# ps -ef| grep qdrant
root 1928203 1928176 3 Apr30 ? 00:19:41 ./qdrant --uri http://10.0.9.90:6335
root 2324591 2324450 0 10:33 pts/3 00:00:00 grep --color=auto qdrant
[root@qdrant1 ~]# cat /proc/1928203/status | grep -i rss
VmRSS: 212186196 kB
RssAnon: 212070748 kB
RssFile: 115448 kB
RssShmem: 0 kB
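Converting the kB values from `/proc/<pid>/status` into GiB makes the comparison with the expected ~32 GB figure easier; a tiny helper using the `VmRSS` value above:

```python
def kb_to_gib(kb: int) -> float:
    """Convert a kB value from /proc/<pid>/status to GiB."""
    return kb / (1024 * 1024)

vm_rss_kb = 212_186_196                  # VmRSS from the output above
print(round(kb_to_gib(vm_rss_kb), 1))    # 202.4 -> ~202 GiB resident, almost all anonymous (RssAnon)
```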
Your hypothesis makes sense; the text index was added to support full-text search, if my memory serves. I'll get back to you with more details.
After a few config changes and experiments, we managed to reduce memory usage from ~200 GB to ~30 GB.
> we managed to reduce memory usage from ~200 GB to ~30 GB
Fantastic!
Assuming we can consider this solved, I'm closing this issue now. Feel free to reopen it if you have any further questions.