[kube-prometheus-stack] Replaying WAL does not complete due to a restart

Question

shubhjoshi-dgs opened this issue 5 months ago · 5 comments

Is your feature request related to a problem?

Yes. The liveness and readiness probes fail because "Replaying WAL, this may take a while" takes a long time, and the replay also exhausts the node's memory. In addition, I am not able to set an initial delay for the health checks so that the WAL replay can complete.

Describe the solution you'd like.

Add an initial delay seconds option to the chart values to account for the time the WAL replay takes to complete.

Describe alternatives you've considered.

I tried editing the statefulset, but it doesn't let me.

Additional context.

ts=2024-06-03T12:03:16.735Z caller=main.go:556 level=info msg="Starting Prometheus Server" mode=server version="(version=2.41.0, branch=HEAD, revision=c0d8a56c69014279464c0e15d8bfb0e153af0dab)"
ts=2024-06-03T12:03:16.736Z caller=main.go:561 level=info build_context="(go=go1.19.4, platform=linux/amd64, user=root@d20a03e77067, date=20221220-10:40:45)"
ts=2024-06-03T12:03:16.736Z caller=main.go:562 level=info host_details="(Linux 5.15.146+ #1 SMP Fri Mar 8 13:04:16 UTC 2024 x86_64 prometheus-prometheus-kube-prometheus-prometheus-0 (none))"
ts=2024-06-03T12:03:16.736Z caller=main.go:563 level=info fd_limits="(soft=1048576, hard=1048576)"
ts=2024-06-03T12:03:16.736Z caller=main.go:564 level=info vm_limits="(soft=unlimited, hard=unlimited)"
ts=2024-06-03T12:03:16.750Z caller=web.go:559 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090
ts=2024-06-03T12:03:16.751Z caller=main.go:993 level=info msg="Starting TSDB ..."
ts=2024-06-03T12:03:16.819Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1704304800026 maxt=1704693600000 ulid=01HR8T9YYPTRAVHPS5XK9BA3WZ
ts=2024-06-03T12:03:16.809Z caller=tls_config.go:232 level=info component=web msg="Listening on" address=[::]:9090
ts=2024-06-03T12:03:16.819Z caller=tls_config.go:271 level=info component=web msg="TLS is disabled." http2=false address=[::]:9090
ts=2024-06-03T12:03:16.819Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1704693600012 maxt=1705276800000 ulid=01HR8TR2XBF8KAKW224ZQT3GRQ
ts=2024-06-03T12:03:16.820Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1705276800021 maxt=1705860000000 ulid=01HR8VFQG7S5GKM40BAYPX0ZKD
ts=2024-06-03T12:03:16.820Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1705860000006 maxt=1706443200000 ulid=01HR8W880D60T9WPMV0G6QKT0V
ts=2024-06-03T12:03:16.831Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1706443200061 maxt=1707026400000 ulid=01HR8WZ3XCT6KQSX3FC5EQ9NHT
ts=2024-06-03T12:03:16.852Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1707026400001 maxt=1707609600000 ulid=01HR8XNRPWXS6GCKJPD5Q69DG8
ts=2024-06-03T12:03:16.862Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1707609600068 maxt=1708192800000 ulid=01HR8YA8VA1HS9RSG1GE4S6R1Q
ts=2024-06-03T12:03:16.877Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1708192800007 maxt=1708776000000 ulid=01HR8YZNHYYS4G2GQVJP8DMTKX
ts=2024-06-03T12:03:16.908Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1708776000000 maxt=1709359200000 ulid=01HR8ZMRA2K9EJ5XFXWZR9V39W
ts=2024-06-03T12:03:16.917Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1709359200037 maxt=1709942400000 ulid=01HRGSAN8YN38HR572Q8Z25GD6
ts=2024-06-03T12:03:16.918Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1709942400014 maxt=1710525600000 ulid=01HS25GCC5G5DHMRXSHFX8WD3S
ts=2024-06-03T12:03:16.918Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1710525600022 maxt=1711108800000 ulid=01HSKHPMNXT6YPKCY5874W11W5
ts=2024-06-03T12:03:16.930Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1711108800096 maxt=1711692000000 ulid=01HT4Y9HPR2SARKA30CH04ZENJ
ts=2024-06-03T12:03:16.960Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1711692000106 maxt=1712275200000 ulid=01HTPANCDS3AQZVHCZPJX5A6V1
ts=2024-06-03T12:03:16.973Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1712275200012 maxt=1712858400000 ulid=01HV7P8846M00NQY0F1EPCRMVA
ts=2024-06-03T12:03:16.995Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1712858400000 maxt=1713441600000 ulid=01HVRWGCSJK0RJTM80AS8TJPHX
ts=2024-06-03T12:03:17.016Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1713441600000 maxt=1713636000000 ulid=01HVYNCT9HS6WHP9X9ANXZW3HK
ts=2024-06-03T12:03:17.017Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1713636000000 maxt=1713830400000 ulid=01HW4EN29FQ67VWA2W4QC5RZWG
ts=2024-06-03T12:03:17.030Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1713830400016 maxt=1713895200000 ulid=01HW6K4NCFMTHE3CV1PS2VT2RW
ts=2024-06-03T12:03:17.031Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1713895200000 maxt=1713960000000 ulid=01HW8A4NHGTFSMSBF37G1S4YMH
ts=2024-06-03T12:03:17.048Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1713960000000 maxt=1714024800000 ulid=01HWA7XB2W39GQJMFBRDRD64DR
ts=2024-06-03T12:03:17.057Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714024800000 maxt=1714608000000 ulid=01HWVN2VX7MXMXMGNGQ56FCQ9G
ts=2024-06-03T12:03:17.072Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714608000000 maxt=1714802400000 ulid=01HX1DT439AVXYXCJAPAR0KFHT
ts=2024-06-03T12:03:17.089Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714802400000 maxt=1714996800000 ulid=01HX776JFX45FC2XKKXSN0B3P2
ts=2024-06-03T12:03:17.091Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1714996800000 maxt=1715061600000 ulid=01HX94M9JDF71DXDQC8HQKXMSP
ts=2024-06-03T12:03:17.106Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715061600000 maxt=1715126400000 ulid=01HXB2B981ZKQQCY7N2PQXGZ0F
ts=2024-06-03T12:03:17.106Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715126400013 maxt=1715148000000 ulid=01HXBXVS0TN1HNF6JE8FXKA0ZE
ts=2024-06-03T12:03:17.107Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715148000000 maxt=1715169600000 ulid=01HXCBJDHDNX40MHKJF54K2VPJ
ts=2024-06-03T12:03:17.108Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715169600000 maxt=1715191200000 ulid=01HXD04DHFDAPZJ0ABRFWH9Z46
ts=2024-06-03T12:03:17.138Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715191200048 maxt=1715256000000 ulid=01HXF4TYWQ9866TAQXM8T3EMEP
ts=2024-06-03T12:03:17.155Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715320800000 maxt=1715328000000 ulid=01HXGW1986XFPRA39CCPRMTV8R
ts=2024-06-03T12:03:17.166Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715256000000 maxt=1715320800000 ulid=01HXGW5GCMVVM1C4YW8M9TD57H
ts=2024-06-03T12:03:17.175Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715328000000 maxt=1715335200000 ulid=01HXH2HVGMQP6TPT0S6QVXJ2MP
ts=2024-06-03T12:03:17.192Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715335200000 maxt=1715342400000 ulid=01HXH9DJRER4D9ERN4EY6WQ2DB
ts=2024-06-03T12:03:17.204Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715342400000 maxt=1715349600000 ulid=01HXHG9A0F34YATJ7YR008XBHF
ts=2024-06-03T12:03:17.219Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715349600000 maxt=1715356800000 ulid=01HXHQ518D5K1WN8QVW6CY9Z8R
ts=2024-06-03T12:03:17.220Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715356800000 maxt=1715364000000 ulid=01HXHY0RGDYCDT34YJN49ZZ1JF
ts=2024-06-03T12:03:17.254Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715364000000 maxt=1715385600000 ulid=01HXV1J2HPCJDZ8PV1DR4380B7
ts=2024-06-03T12:03:17.259Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715385600000 maxt=1715400000000 ulid=01HXVCJHDZJW22J462DJVB9KMH
ts=2024-06-03T12:03:17.274Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715670306103 maxt=1715774400000 ulid=01HXYKCE1SZWT4YB96CSXX5RFV
ts=2024-06-03T12:03:17.293Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1715774400000 maxt=1716357600000 ulid=01HYFT5W9N3B859JB79Y6KFC2X
ts=2024-06-03T12:03:17.302Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716357600000 maxt=1716552000000 ulid=01HYNJ8STJ9778ZFBYX8E1M4AN
ts=2024-06-03T12:03:17.317Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716552000000 maxt=1716746400000 ulid=01HYVBCB4DC97RTSD4XEBYVFAK
ts=2024-06-03T12:03:17.334Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716746400000 maxt=1716940800000 ulid=01HZ14YK601SWSRE9F205QA3ZX
ts=2024-06-03T12:03:17.342Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1716940800029 maxt=1717135200000 ulid=01HZ751T69EMGVES8ND6A0XM7Z
ts=2024-06-03T12:03:17.343Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1717135200000 maxt=1717200000000 ulid=01HZ8VVEE3KFDD9Y5J4E1Y24F1
ts=2024-06-03T12:03:17.344Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1717200000005 maxt=1717264800000 ulid=01HZB0GZHABBADPKBZ3S1WDX6V
ts=2024-06-03T12:03:17.345Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1717264800004 maxt=1717286400000 ulid=01HZBN4E5GVTTX27TV2YEQMD3M
ts=2024-06-03T12:03:17.346Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1717286400000 maxt=1717308000000 ulid=01HZC2ZW07VTCZ8TT75HYPWHHN
ts=2024-06-03T12:03:17.361Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1717329600000 maxt=1717336800000 ulid=01HZCQDWG23Z5K8CSYVVHFXFZ4
ts=2024-06-03T12:03:17.362Z caller=repair.go:56 level=info component=tsdb msg="Found healthy block" mint=1717308000000 maxt=1717329600000 ulid=01HZCQQ6NHDPKCHEGPV078AHND
ts=2024-06-03T12:03:17.393Z caller=dir_locker.go:77 level=warn component=tsdb msg="A lockfile from a previous execution already existed. It was replaced" file=/prometheus/lock
ts=2024-06-03T12:06:50.100Z caller=head.go:562 level=info component=tsdb msg="Replaying on-disk memory mappable chunks if any"
ts=2024-06-03T12:06:52.058Z caller=head.go:606 level=info component=tsdb msg="On-disk memory mappable chunks replay completed" duration=1.95749232s
ts=2024-06-03T12:06:52.058Z caller=head.go:612 level=info component=tsdb msg="Replaying WAL, this may take a while"
ts=2024-06-03T12:08:45.660Z caller=main.go:828 level=warn msg="Received SIGTERM, exiting gracefully..."
ts=2024-06-03T12:09:03.765Z caller=main.go:1197 level=info msg="Loading configuration file" filename=/etc/prometheus/config_out/prometheus.env.yaml
ts=2024-06-03T12:09:09.042Z caller=main.go:852 level=info msg="Stopping scrape discovery manager..."
ts=2024-06-03T12:09:09.124Z caller=manager.go:953 level=info component="rule manager" msg="Starting rule manager..."

Answer 1 · 2024-06-03T19:37:21.000Z

In recent chart releases, one can give Prometheus more time to start up by setting prometheus.prometheusSpec.maximumStartupDurationSeconds (introduced in chart release 56.3):

Defines the maximum time that the prometheus container’s startup probe will wait before being considered failed. The startup probe will return success after the WAL replay is complete. If set, the value should be greater than 60 (seconds). Otherwise it will be equal to 900 seconds (15 minutes).
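
As a rough sketch of the setting described above, a values fragment for the chart might look like the following. The 1800-second figure is an arbitrary assumption for illustration, not a recommended value; choose something comfortably longer than the WAL replay time you observe in the logs.

```yaml
# values.yaml fragment (sketch; requires chart release 56.3 or later)
prometheus:
  prometheusSpec:
    # Assumed value for illustration: allow up to 30 minutes for startup,
    # including WAL replay, before the startup probe is considered failed.
    maximumStartupDurationSeconds: 1800
```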

If you wish to modify the statefulset itself directly, you need to pause the prometheus CR first by setting prometheus.prometheusSpec.paused=true:

When a Prometheus deployment is paused, no actions except for deletion will be performed on the underlying objects.
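
For illustration, pausing the CR through the chart values could be sketched as below. This only shows the flag named above. Once the manual statefulset edit is no longer needed, the flag would presumably be set back to false so the operator resumes managing the statefulset, at which point it can be expected to reconcile manual changes away.

```yaml
# values.yaml fragment (sketch)
prometheus:
  prometheusSpec:
    # While paused, the operator performs no actions except deletion
    # on the underlying objects, so a manual statefulset edit is left alone.
    paused: true
```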

Alternatively, with older releases of the chart, you can patch the startup probe of the prometheus container, e.g.

prometheus:
  prometheusSpec:
    containers:
      - name: prometheus
        startupProbe:
          # Up to 60 failed checks, 30 s apart: roughly 30 minutes for startup
          failureThreshold: 60
          periodSeconds: 30

Answer 2 · 2024-06-04T07:46:40.000Z

@zeritti thanks for the quick response. The above solutions have solved the problem with the health check duration. However, the prometheus pod now allocates the entire node's memory and then fails. Is there any way to skip the "Replaying WAL, this may take a while" step? Of course, we can move the prometheus pod to a new node with more memory, but is there any way to at least start the container once and do the optimisation later?

Answer 3 · 2024-06-09T05:34:53.000Z

You should be able to delete the WAL files. You will lose the metrics they contain, but those are probably useless by now anyway.