prometheus/alertmanager

Optimise alert ingest path

gouthamve opened this issue · 13 comments

From @brancz:
While benchmarking & profiling Alertmanager, it quickly became obvious that the ingest path blocks a lot.

Can you share some numbers about lock wait times?

I need to run the benchmarks again for exact numbers. I remember that the channel that alerts are sent to in the dispatcher had the longest wait times. Increasing the channel buffer improved this a bit, but obviously that only pushes the issue out.
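
To illustrate the "only pushes the issue out" point: a buffered channel absorbs a burst only up to its capacity; once a slow consumer lets the buffer fill, senders block exactly as they would on an unbuffered channel. A minimal stand-alone sketch (hypothetical names, not Alertmanager's actual dispatcher code):

```go
// Sketch: a bigger channel buffer only delays back-pressure, it does not remove it.
package main

import (
	"fmt"
	"time"
)

type alert struct{ name string }

func main() {
	ch := make(chan alert, 64) // the buffer absorbs the first 64 sends

	// Slow consumer standing in for aggregation/notification work.
	done := make(chan struct{})
	go func() {
		for range ch {
			time.Sleep(10 * time.Millisecond)
		}
		close(done)
	}()

	start := time.Now()
	for i := 0; i < 256; i++ {
		// Once the 64 slots are full, each send blocks until the consumer drains one.
		ch <- alert{name: fmt.Sprintf("alert-%d", i)}
	}
	elapsed := time.Since(start)
	close(ch)
	<-done
	fmt.Printf("enqueued 256 alerts in %s; most of that was spent blocked on a full buffer\n", elapsed)
}
```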

A question we also need to ask ourselves is: what kind of load do we expect Alertmanager to handle? (Regardless, the limit should be resource-bound, not a technical limitation.)

I think a million active alerts is a reasonable starting point. See also prometheus/prometheus#2585

Hey! I am a graduate student who wants to apply for GSoC this summer. I have some prior Go experience and I am looking for a project related to Go performance optimization.

Is the problem here that the dispatcher cannot send out alert messages to all kinds of clients efficiently, or that data cannot be written into the dispatcher efficiently?

It's everything before the dispatcher that needs optimisation.

> I think a million active alerts is a reasonable starting point.

What sort of request rate are you expecting for these 1 million alerts? In what little testing I've done, AM had no issue maintaining several thousand active alerts (< 50,000), but it had issues processing incoming alert batches when the rate hit ~30 requests/sec (the issue could occur at a lower rate; this was just the rate at which I noticed AM locking up).

With a 10s eval interval that'd be 100k alerts/s, which with a batch size of 64 would be ~1.5k requests/s. prometheus/prometheus#2585 can bring that down to ~250 requests/s.
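
Just to make the arithmetic explicit (editor's sketch; the ~400 alerts/request figure is only back-computed from the ~250/s number above, not something stated in prometheus/prometheus#2585):

```go
// Back-of-the-envelope check of the rates quoted above.
package main

import "fmt"

func main() {
	const (
		activeAlerts = 1_000_000.0 // target discussed above
		evalInterval = 10.0        // seconds between evaluations
		batchSize    = 64.0        // alerts per request today
	)

	alertsPerSec := activeAlerts / evalInterval // 100,000 alerts/s
	requestsPerSec := alertsPerSec / batchSize  // ~1,562 requests/s
	impliedBatch := alertsPerSec / 250.0        // ~400 alerts/request to reach ~250 requests/s

	fmt.Printf("%.0f alerts/s -> %.0f requests/s at batch=64; a batch of ~%.0f would give ~250 requests/s\n",
		alertsPerSec, requestsPerSec, impliedBatch)
}
```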

Hey, I tried to use Go pprof to profile Alertmanager, but that only seems to help locate inefficient function implementations. The whole workflow of Alertmanager includes deduplicating, grouping, inhibition, and routing; if we want to find which stage is the bottleneck under high concurrency, it seems we need to manually write code to track and time functions?

(screenshot: pprof005)

@brancz Do you have some benchmark code to share?
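
A general Go note on the question above: stage-by-stage timing usually isn't necessary, because the runtime's block and mutex profiles attribute blocked time and lock contention to stack traces (dispatcher, inhibitor, silences, ...). Both are disabled by default, which is one reason a mutex profile can come back empty. A minimal sketch, assuming you wire this into your own test binary rather than relying on whatever a stock Alertmanager build exposes:

```go
// Minimal sketch of enabling block and mutex profiling in a Go binary
// (generic Go technique; not Alertmanager-specific code).
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
	"runtime"
)

func main() {
	// Both profiles are off by default: without these calls the mutex
	// profile stays empty and the block profile records nothing.
	runtime.SetBlockProfileRate(1)     // record every blocking event (expensive; fine for benchmarks)
	runtime.SetMutexProfileFraction(1) // record every mutex contention event

	// Expose the profiles over HTTP, then fetch them with e.g.
	//   go tool pprof http://localhost:6060/debug/pprof/block
	//   go tool pprof http://localhost:6060/debug/pprof/mutex
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```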

@starsdeep I had previously built and used this; I remember taking mutex profiles and seeing a lot of lock contention starting at the dispatcher.

@brancz I tried to use the ambench code to launch a benchmark. There are some goroutine block events, but no mutex contention events:

(profile screenshot)

PS: I launch Alertmanager instances using `goreman start` with the `DEBUG=true` environment variable, and run the load test with `./ambench -alertmanagers=http://localhost:9093,http://localhost:9094,http://localhost:9095`.

what does the block profile look like? 🙂

I would look closer at that 90s on the Dispatcher; it's in the ingest path of alerts being added through the API.
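
For a quick answer to "what does the block profile look like" without going through go tool pprof, the accumulated block profile can also be dumped in its human-readable debug form from inside the process. Again a generic Go sketch (not Alertmanager code), assuming block profiling is enabled as in the earlier sketch:

```go
// Dump the accumulated block profile as text; each entry shows the time spent
// blocked and the stack that blocked, which is how a 90s Dispatcher entry shows up.
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	runtime.SetBlockProfileRate(1)

	// Produce one obvious blocking event so the profile has something to show.
	ch := make(chan struct{})
	go func() { time.Sleep(200 * time.Millisecond); close(ch) }()
	<-ch

	// debug=1 writes the legacy text format with symbolized stacks.
	if err := pprof.Lookup("block").WriteTo(os.Stdout, 1); err != nil {
		log.Fatal(err)
	}
}
```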

> I think a million active alerts is a reasonable starting point.
>
> What sort of request rate are you expecting for these 1 million alerts? In what little testing I've done, AM had no issue maintaining several thousand active alerts (< 50,000), but it had issues processing incoming alert batches when the rate hit ~30 requests/sec (the issue could occur at a lower rate; this was just the rate at which I noticed AM locking up).

@stuartnelson3 Hi, about "AM had no issue maintaining several thousand active alerts (< 50,000)": do you have the corresponding benchmark data?