metrico/qryn

Qryn crash under load - ERR_STREAM_PREMATURE_CLOSE

jpsfs opened this issue · 6 comments

jpsfs commented

Hi!

I'm facing an issue while using PromQL through qryn.

The query is the following:

histogram_quantile(0.50, sum(rate(http_server_request_duration_seconds_bucket{service_namespace=~"$environment", service_name=~"$component", instance=~"$instance", http_route=~"$route", http_request_method=~"$method"}[$__rate_interval])) by (le))

It translates to this ClickHouse query:

WITH idx AS (select `fingerprint` from `qryn`.`time_series_gin` as `time_series_gin` where ((((`key` = 'service_namespace') and (match(val, '.+') = 1)) or ((`key` = 'service_name') and (match(val, '.+') = 1)) or ((`key` = 'instance') and (match(val, '.+') = 1)) or ((`key` = 'http_route') and (match(val, '.+') = 1)) or ((`key` = 'http_request_method') and (match(val, '.+') = 1)) or ((`key` = '__name__') and (`val` = 'http_server_request_duration_seconds_bucket'))) and (`date` >= toDate(fromUnixTimestamp(1718310540))) and (`date` <= toDate(fromUnixTimestamp(1718312340))) and (`type` in (0,0))) group by `fingerprint` having (groupBitOr(bitShiftLeft(((`key` = 'service_namespace') and (match(val, '.+') = 1))::UInt64, 0)+bitShiftLeft(((`key` = 'service_name') and (match(val, '.+') = 1))::UInt64, 1)+bitShiftLeft(((`key` = 'instance') and (match(val, '.+') = 1))::UInt64, 2)+bitShiftLeft(((`key` = 'http_route') and (match(val, '.+') = 1))::UInt64, 3)+bitShiftLeft(((`key` = 'http_request_method') and (match(val, '.+') = 1))::UInt64, 4)+bitShiftLeft(((`key` = '__name__') and (`val` = 'http_server_request_duration_seconds_bucket'))::UInt64, 5)) = 63)), raw AS (select argMaxMerge(last) as `value`,`fingerprint`,intDiv(timestamp_ns, 15000000000) * 15000 as `timestamp_ms` from `metrics_15s` as `metrics_15s` where ((`fingerprint` in (idx)) and (`timestamp_ns` >= 1718310540000000000) and (`timestamp_ns` <= 1718312340000000000) and (`type` in (0,0))) group by `fingerprint`,`timestamp_ms` order by `fingerprint`,`timestamp_ms`), timeSeries AS (select `fingerprint`,arraySort(JSONExtractKeysAndValues(labels, 'String')) as `labels` from `qryn`.`time_series` where ((`fingerprint` in (idx)) and (`type` in (0,0)))) select any(labels) as `stream`,arraySort(groupArray((raw.timestamp_ms, raw.value))) as `values` from raw as `raw` any left join timeSeries as time_series on `time_series`.`fingerprint` = raw.fingerprint group by `raw`.`fingerprint` order by `raw`.`fingerprint`
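
A note on reading the generated SQL: the `idx` CTE appears to keep only the series that satisfy all six label conditions. Each condition is shifted to its own bit, the bits are OR-ed per `fingerprint` with `groupBitOr`, and the result is compared with 63 (all six bits set). The sketch below mimics that logic in plain JavaScript. It is only an illustration, not qryn's code: the `date` and `type` filters are omitted, the sample label values are made up, and `matchingFingerprints` and `toRows` are hypothetical names.

```javascript
// Label conditions, in the same order as the bit positions in the HAVING clause.
const regexKeys = [
  'service_namespace', 'service_name', 'instance', 'http_route', 'http_request_method',
];
const conditions = [
  ...regexKeys.map((key) => (row) => row.key === key && /.+/.test(row.val)),
  (row) => row.key === '__name__' &&
    row.val === 'http_server_request_duration_seconds_bucket',
];
const ALL_MATCHED = (1 << conditions.length) - 1; // 6 conditions -> 0b111111 = 63

// rows: one { fingerprint, key, val } object per label of each series.
function matchingFingerprints(rows) {
  const masks = new Map();
  for (const row of rows) {
    let bits = 0;
    conditions.forEach((cond, i) => {
      if (cond(row)) bits |= 1 << i; // bitShiftLeft(condition, i)
    });
    if (bits === 0) continue; // the WHERE clause keeps only rows matching some condition
    masks.set(row.fingerprint, (masks.get(row.fingerprint) || 0) | bits); // groupBitOr
  }
  // HAVING ... = 63: keep fingerprints for which every condition matched.
  return [...masks].filter(([, mask]) => mask === ALL_MATCHED).map(([fp]) => fp);
}

// Hypothetical helper: expand a label set into index rows.
const toRows = (fingerprint, labels) =>
  Object.entries(labels).map(([key, val]) => ({ fingerprint, key, val }));

// Made-up sample series: one with all six labels, one without http_route.
const complete = {
  __name__: 'http_server_request_duration_seconds_bucket',
  service_namespace: 'prod', service_name: 'api', instance: 'host-1',
  http_route: '/users', http_request_method: 'GET',
};
const { http_route, ...withoutRoute } = complete;

// Only fingerprint 1 carries all six labels, so only it is returned.
console.log(matchingFingerprints([...toRows(1, complete), ...toRows(2, withoutRoute)]));
```

Because every `=~"$variable"` matcher was expanded to `.+` here, the index lookup selects every series of this metric that has all five labels set, which may explain why the result grows with the size of the database.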

After a few seconds, qryn crashes with the following error:

Error [ERR_STREAM_PREMATURE_CLOSE]: Premature close
    at Gunzip.onclose (node:internal/streams/end-of-stream:154:30)
    at Gunzip.emit (node:events:531:35)
    at emitCloseNT (node:internal/streams/destroy:147:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21)
Emitted 'error' event on Readable instance at:
    at emitErrorNT (node:internal/streams/destroy:169:8)
    at emitErrorCloseNT (node:internal/streams/destroy:128:3)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
  code: 'ERR_STREAM_PREMATURE_CLOSE'
}
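
For context on the trace above: the `end-of-stream` frames suggest that a `Gunzip` stream emitted 'close' before it reached 'end', and the "Emitted 'error' event on Readable instance" line indicates that the resulting error reached a stream with no 'error' listener, which is what terminates a Node.js process. The following is a minimal sketch under those assumptions, not qryn's code; `closeBeforeEnd` is a hypothetical name. It reproduces the same error code with `stream.finished` and shows it arriving in a callback instead of being thrown.

```javascript
const { finished } = require('node:stream');
const zlib = require('node:zlib');

// Resolves with the error code reported when a gunzip stream is closed
// before it has ended, or with null if it terminated cleanly.
function closeBeforeEnd() {
  return new Promise((resolve) => {
    const gunzip = zlib.createGunzip();

    // finished() reports how the stream terminated. A 'close' that arrives
    // before 'end' is reported as ERR_STREAM_PREMATURE_CLOSE.
    finished(gunzip, (err) => resolve(err ? err.code : null));

    // Simulate the compressed response being cut off mid-transfer.
    gunzip.destroy();
  });
}

closeBeforeEnd().then((code) => {
  // The error is observed here, so the process keeps running.
  console.log('stream closed with:', code);
});
```

When the error is delivered to a callback or an 'error' listener like this, the failed request can be reported to the client without taking the whole process down.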

When I run this query directly in ClickHouse, it returns in around 150 ms, with a total result size of 50 MiB.
Any pointers on what I should do to overcome this?

Best,
José

Thanks for the report @jpsfs
Do you see any logs or errors from ClickHouse as this query fails?

jpsfs commented

Thank you for the follow-up @lmangani !
I only see normal ClickHouse logs. The query itself doesn't seem to fail on ClickHouse, and if I execute it manually it succeeds quite fast.

I forgot to mention that I tested this on the latest version (released today) as well as on the previous two versions.

If the database is smaller (less data), the query succeeds on qryn as well.

Best,
José

@jpsfs do you see a "timeout" error message in Grafana when you request histogram_quantile(0.50, ....)?

jpsfs commented

@jpsfs could you please retest using the latest release and provide any feedback?

EXPERIMENTAL_PROMQL_OPTIMIZE=1
akvlad commented

Hello @jpsfs. Version 3.2.24 is released.

  • The sum and rate functions were optimized using ClickHouse functions.

Please set the env var EXPERIMENTAL_PROMQL_OPTIMIZE=1 before use.
Please share your experience of using the sum and rate functions (as in your histogram_quantile.... request) so we can decide whether further optimizations are worth doing.