graphite-project/carbon

carbon-aggregate 100% CPU

rudybroersma opened this issue · 2 comments

Hi,

We have two boxes with optical network taps, and 'fastnetmon' runs on each of them. fastnetmon sends data to graphite (graphite runs on one box), and we use carbon-aggregate to create totals. Our aggregation-rules.conf looks like this:

all.hosts.<ip>.incoming.average.pps (60) = sum fastnetmon*.hosts.<ip>.incoming.average.pps
all.hosts.<ip>.outgoing.average.pps (60) = sum fastnetmon*.hosts.<ip>.outgoing.average.pps
all.hosts.<ip>.incoming.average.bps (60) = sum fastnetmon*.hosts.<ip>.incoming.average.bps
all.hosts.<ip>.outgoing.average.bps (60) = sum fastnetmon*.hosts.<ip>.outgoing.average.bps

all.total.incoming.bps (60) = sum fastnetmon*.total.incoming.bps
all.total.outgoing.bps (60) = sum fastnetmon*.total.outgoing.bps
all.total.incoming.pps (60) = sum fastnetmon*.total.incoming.pps
all.total.outgoing.pps (60) = sum fastnetmon*.total.outgoing.pps
all.total.incoming.flows (60) = sum fastnetmon*.total.incoming.flows
all.total.outgoing.flows (60) = sum fastnetmon*.total.outgoing.flows
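
For readers unfamiliar with the rule syntax, here is a rough Python sketch of how a rule of the form `output (frequency) = method input` maps an incoming metric name to its aggregate name. It is a simplified illustration, not carbon's actual implementation: `build_regex` and `aggregate_name` are hypothetical helper names, and the host name `fastnetmon1` and the IP segment `10_0_0_1` are made-up example values.

```python
import re


def build_regex(input_pattern):
    """Translate a rule's input pattern into an anchored regex (simplified)."""
    parts = []
    for part in input_pattern.split("."):
        if part.startswith("<") and part.endswith(">"):
            # <name> captures exactly one dot-free path segment.
            parts.append("(?P<%s>[^.]+)" % part[1:-1])
        else:
            # '*' matches any run of characters inside a single segment.
            parts.append(re.escape(part).replace(r"\*", "[^.]*"))
    return re.compile("^" + r"\.".join(parts) + "$")


def aggregate_name(input_pattern, output_pattern, metric):
    """Return the aggregate metric name, or None if the rule does not match."""
    match = build_regex(input_pattern).match(metric)
    if match is None:
        return None
    name = output_pattern
    for field, value in match.groupdict().items():
        # Substitute each captured field into the output pattern.
        name = name.replace("<%s>" % field, value)
    return name


# Hypothetical metric name, checked against the first rule above.
print(aggregate_name(
    "fastnetmon*.hosts.<ip>.incoming.average.pps",
    "all.hosts.<ip>.incoming.average.pps",
    "fastnetmon1.hosts.10_0_0_1.incoming.average.pps",
))
```

With 50k IPs, every incoming per-host datapoint has to be tested against patterns like these, which gives a sense of why the rule matching itself can become CPU-bound.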

To give an idea of the traffic: we do about 4 to 5 Gbit/s (in+out) across 50k IPs.

Our carbon-aggregate service consistently uses 100% CPU. We also see log lines like:

29/01/2020 11:45:54 :: CarbonClientProtocol(127.0.0.1:2004:None) send queue has space available
29/01/2020 11:45:56 :: CarbonClientFactory(127.0.0.1:2004:None) send queue is full (20000 datapoints)

What can I do to lower the load on carbon-aggregate? Can I load-balance this process across multiple hosts?

Hi @rudybroersma,

You need to switch to RELAY_METHOD = aggregated-consistent-hashing. Carbon will then distribute metrics across carbon caches using the aggregation rules. See #865 or #32 for details. Please note, though, that it probably has some issues, like #325.
Another option is to try the aggregators in https://github.com/grobian/carbon-c-relay or https://github.com/grafana/carbon-relay-ng. They are also single-threaded, but may be faster because they are written in C and Go, respectively.
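
To illustrate the idea behind the aggregated-consistent-hashing relay method mentioned above, here is a minimal Python sketch. As I understand it, the routing key is the aggregate name a metric maps to rather than the raw metric name, so every input of a given rule lands on the same destination. This is not carbon's code: it uses a plain hash-modulo choice instead of a real consistent-hash ring, and the rule table, the destination list, and the names `routing_key` and `pick_destination` are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical rule table: (compiled input regex, output template).
# It mirrors one of the aggregation rules from the question.
RULES = [
    (
        re.compile(
            r"^fastnetmon[^.]*\.hosts\.(?P<ip>[^.]+)\.incoming\.average\.pps$"
        ),
        "all.hosts.{ip}.incoming.average.pps",
    ),
]

# Hypothetical downstream aggregator instances (host:port:instance).
DESTINATIONS = ["127.0.0.1:2023:a", "127.0.0.1:2123:b"]


def routing_key(metric):
    """Use the aggregate name as the key when a rule matches."""
    for regex, template in RULES:
        match = regex.match(metric)
        if match:
            return template.format(**match.groupdict())
    # Metrics that match no rule are routed by their own name.
    return metric


def pick_destination(metric, destinations=DESTINATIONS):
    """Choose a destination by hashing the routing key (simplified, no ring)."""
    digest = hashlib.sha256(routing_key(metric).encode("utf-8")).hexdigest()
    return destinations[int(digest, 16) % len(destinations)]


# Both taps report the same (made-up) IP, so both go to the same aggregator.
a = pick_destination("fastnetmon1.hosts.10_0_0_1.incoming.average.pps")
b = pick_destination("fastnetmon2.hosts.10_0_0_1.incoming.average.pps")
print(a == b)
```

Because the key depends only on the aggregate name, the per-host rules can be spread over several aggregator processes without splitting the inputs of any single sum.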

stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.