carbon-aggregate 100% CPU
rudybroersma opened this issue · 2 comments
Hi,
We have two boxes with optical network taps, each running fastnetmon. fastnetmon sends its data to Graphite (which runs on one of the boxes), and we use carbon-aggregate to create totals. Our aggregation-rules.conf looks like this:
all.hosts.<ip>.incoming.average.pps (60) = sum fastnetmon*.hosts.<ip>.incoming.average.pps
all.hosts.<ip>.outgoing.average.pps (60) = sum fastnetmon*.hosts.<ip>.outgoing.average.pps
all.hosts.<ip>.incoming.average.bps (60) = sum fastnetmon*.hosts.<ip>.incoming.average.bps
all.hosts.<ip>.outgoing.average.bps (60) = sum fastnetmon*.hosts.<ip>.outgoing.average.bps
all.total.incoming.bps (60) = sum fastnetmon*.total.incoming.bps
all.total.outgoing.bps (60) = sum fastnetmon*.total.outgoing.bps
all.total.incoming.pps (60) = sum fastnetmon*.total.incoming.pps
all.total.outgoing.pps (60) = sum fastnetmon*.total.outgoing.pps
all.total.incoming.flows (60) = sum fastnetmon*.total.incoming.flows
all.total.outgoing.flows (60) = sum fastnetmon*.total.outgoing.flows
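To make the rule syntax concrete, here is a minimal Python sketch of how a rule like the first one above could be matched against an incoming metric name. It is an illustration under assumptions, not carbon's actual code: the helpers `rule_to_regex` and `aggregate_name` are hypothetical, `<ip>` is taken to capture one dot-free path segment, `*` is taken to match any run of non-dot characters, and the sample metric names (with the IP written using underscores) are made up.

```python
import re

def rule_to_regex(input_pattern):
    """Translate an aggregation-rule input pattern into a compiled regex."""
    parts = []
    for part in input_pattern.split("."):
        # <name> becomes a named group capturing one dot-free segment.
        part = re.sub(r"<(\w+)>", r"(?P<\1>[^.]+)", part)
        # A * matches any characters inside the same segment.
        part = part.replace("*", "[^.]*")
        parts.append(part)
    # Literal segments are not escaped; fine for plain names like these.
    return re.compile(r"\.".join(parts) + "$")

def aggregate_name(output_pattern, input_regex, metric):
    """Return the aggregate metric name for `metric`, or None if no match."""
    match = input_regex.match(metric)
    if match is None:
        return None
    name = output_pattern
    for field, value in match.groupdict().items():
        name = name.replace("<%s>" % field, value)
    return name

regex = rule_to_regex("fastnetmon*.hosts.<ip>.incoming.average.pps")
out = "all.hosts.<ip>.incoming.average.pps"

# A per-host metric maps to its aggregate; a total metric does not match.
print(aggregate_name(out, regex, "fastnetmon1.hosts.10_0_0_1.incoming.average.pps"))
print(aggregate_name(out, regex, "fastnetmon1.total.incoming.pps"))
```

With this reading of the syntax, the same series reported by both nodes maps to a single `all.hosts.<ip>...` aggregate.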
To give an idea of the traffic: we handle about 4 to 5 Gbit/s (in + out) across roughly 50k IPs.
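For a rough sense of scale, these numbers can be turned into a series count. This is back-of-the-envelope arithmetic in Python, assuming every one of the 50k IPs is reported by both nodes for all four per-host metrics matched by the rules. The real figure depends on what fastnetmon actually sends.

```python
# Figures taken from the description above.
nodes = 2            # boxes running fastnetmon
ips = 50_000         # tracked IP addresses
per_host_rules = 4   # incoming/outgoing x pps/bps
total_rules = 6      # incoming/outgoing x bps/pps/flows

# Assumption: each node reports every IP for every per-host rule.
input_series = nodes * (ips * per_host_rules + total_rules)
output_series = ips * per_host_rules + total_rules

print(f"input series matched by rules: {input_series:,}")
print(f"aggregate series produced:     {output_series:,}")
```

Under these assumptions, a single aggregator process has to track a few hundred thousand input series.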
Our carbon-aggregate service consistently uses 100% CPU. We also see log lines like:
29/01/2020 11:45:54 :: CarbonClientProtocol(127.0.0.1:2004:None) send queue has space available
29/01/2020 11:45:56 :: CarbonClientFactory(127.0.0.1:2004:None) send queue is full (20000 datapoints)
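To see how often the send queue fills up, the two messages above can be counted in the log. The following is a minimal Python sketch that assumes the log format quoted here; `count_queue_events` is a hypothetical helper, not part of carbon.

```python
import re
from collections import Counter

# Matches the two queue messages in the format quoted above.
QUEUE_EVENT = re.compile(r"send queue (?P<state>is full|has space available)")

def count_queue_events(lines):
    """Count 'is full' and 'has space available' messages in log lines."""
    counts = Counter()
    for line in lines:
        match = QUEUE_EVENT.search(line)
        if match:
            counts[match.group("state")] += 1
    return counts

sample = [
    "29/01/2020 11:45:54 :: CarbonClientProtocol(127.0.0.1:2004:None) send queue has space available",
    "29/01/2020 11:45:56 :: CarbonClientFactory(127.0.0.1:2004:None) send queue is full (20000 datapoints)",
]
counts = count_queue_events(sample)
print(counts["is full"], counts["has space available"])
```

In practice the helper would be fed the lines of the aggregator's log file rather than the two sample lines.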
What can I do to lower the load on carbon-aggregate? Can I load-balance this process across multiple hosts?
Hi @rudybroersma,
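You need to switch to RELAY_METHOD = aggregated-consistent-hashing on a carbon-relay placed in front of several carbon-aggregator instances. The relay then uses the aggregation rules when distributing metrics, so that all metrics feeding a given aggregate are sent to the same instance. See #865 or #32 for details. Please note that it probably has some issues, such as #325.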
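The idea of routing by aggregation rule can be sketched in Python as follows. This is a simplified illustration, not carbon's implementation: it uses a plain hash-modulo choice rather than a consistent hash ring, handles only one rule, and the instance names and the helpers `aggregate_key` and `pick_instance` are hypothetical. The point it tries to show is that the routing key is the aggregate's name rather than the incoming metric's name.

```python
import hashlib
import re

# One rule from the question, hand-translated to a regex for this sketch.
INPUT = re.compile(r"fastnetmon[^.]*\.hosts\.(?P<ip>[^.]+)\.incoming\.average\.pps$")
OUTPUT = "all.hosts.{ip}.incoming.average.pps"

# Hypothetical aggregator instances sitting behind the relay.
INSTANCES = ["aggregator-a", "aggregator-b", "aggregator-c"]

def aggregate_key(metric):
    """Use the aggregate name as routing key; fall back to the metric itself."""
    match = INPUT.match(metric)
    return OUTPUT.format(**match.groupdict()) if match else metric

def pick_instance(metric):
    """Hash the routing key and map it onto one instance (simple modulo)."""
    digest = hashlib.sha256(aggregate_key(metric).encode("utf-8")).hexdigest()
    return INSTANCES[int(digest, 16) % len(INSTANCES)]

# The same IP reported by two different nodes shares one routing key.
a = pick_instance("fastnetmon1.hosts.10_0_0_1.incoming.average.pps")
b = pick_instance("fastnetmon2.hosts.10_0_0_1.incoming.average.pps")
print(a == b)
```

Because both nodes' series for one IP share a routing key, they land on the same instance, which can then compute the sum on its own while other IPs are handled elsewhere.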
Another option is to try the aggregators in https://github.com/grobian/carbon-c-relay or https://github.com/grafana/carbon-relay-ng. They are also single-threaded, but may be faster because they are written in C and Go respectively.