Baseline agent memory usage has increased in ECK integration tests due to agentbeat
cmacknz opened this issue · 6 comments
See elastic/cloud-on-k8s#7790 and the following comments. The agentbeat instance implementing the filestream-monitoring component was being OOMKilled.
A jump of at least 85Mi in memory usage occurs in 8.14.0+8.15.9 but not in 8.13.0, causing the ECK fleet tests to fail. ECK uses a 350Mi memory limit, which is lower than the default 700Mi provided in the agent reference configuration for k8s.
**8.13.0**

```
$ kubectl top pod test-agent-system-int-sf6b-agent-6n7vm -n e2e-mercury
NAME                                     CPU(cores)   MEMORY(bytes)
test-agent-system-int-sf6b-agent-6n7vm   83m          265Mi
```
**8.14.0+**

```
$ kubectl top pod test-agent-system-int-vlpc-agent-28vhc -n e2e-mercury
NAME                                     CPU(cores)   MEMORY(bytes)
test-agent-system-int-vlpc-agent-28vhc   171m         349Mi
```
The heap profiles from agent diagnostics when the process was being OOMKilled were not revealing, but they may not have been captured at the ideal time.
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
I have been poking around looking for the root cause of this and I haven't found a super obvious one yet. It might be more of a "death by 1000 papercuts" situation.
The heap sizes I've looked at are each ~10 MB higher, which partially explains this. I think the largest contributor is the increased number of `.init` sections, given that all of the Beat modules are now present in agentbeat.
Here is an 8.14.0 agentbeat in_use heap:
Here is an 8.13.4 metricbeat in_use heap:
Both profiles are from instances of the `http-metrics-monitoring` component.
We can look at reducing the number of `func init()` calls by switching to `func InitializeModule`, invoked only when the agentbeat subcommand that needs the module is actually being run. Another option is to see if we can reduce the heap usage of each `func init()` and make the improvement across the board. Or do both.
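The shape of that change can be sketched as follows. The names here (`registry`, `InitializeModule`, `setupGCP`) are illustrative placeholders, not the real Beats APIs:

```go
// Sketch of moving per-module setup out of func init() and into an
// explicit InitializeModule call. With func init(), every module pays
// its setup cost as soon as its package is imported into agentbeat;
// with explicit initialization, importing the package is cheap and
// only the subcommand that needs the module pays for it.
package main

import "fmt"

// registry maps module names to their (deferred) setup functions.
var registry = map[string]func() error{}

// Before: setup ran at import time for every linked-in module.
//
//	func init() { registry["gcp"] = setupGCP }
//
// After: registration is explicit and on-demand.
func InitializeModule(name string, setup func() error) {
	registry[name] = setup
}

func setupGCP() error {
	// Imagine expensive lookup-table or schema construction here.
	return nil
}

func main() {
	// Only the subcommand that actually uses the module initializes it.
	InitializeModule("gcp", setupGCP)
	fmt.Println("registered modules:", len(registry))
}
```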
Looking at the worst offender, `github.com/goccy/go-json` at 9.4 MB and 4.6 MB of heap: it is only used in the Filebeat `cel` input as a dependency of a dependency. This previously only affected Filebeat processes, but now it affects every agent process because it is always imported into agentbeat.
```
❯ go mod why github.com/goccy/go-json
# github.com/goccy/go-json
github.com/elastic/beats/v7/x-pack/filebeat/input/cel
github.com/lestrrat-go/jwx/v2/jwt
github.com/lestrrat-go/jwx/v2/internal/json
github.com/goccy/go-json
```
There isn't an easy fix for this. We'd have to improve it upstream, or move that input to a different JWT library. For example, https://github.com/golang-jwt/jwt has no dependencies, but I have no idea if it covers all the necessary use cases.
Interestingly goccy/go-json is supposed to be optional: https://github.com/lestrrat-go/jwx/blob/develop/v2/docs/20-global-settings.md#switching-to-a-faster-json-library
I don't see us explicitly opting in to that, hmm.
Ah, `go mod why` only tells me the module with the requirement for the newest version. Looking at `go mod graph`, it shows me we have more things depending on goccy, which explains why it is compiled in.
If I just delete the httpjson and cel inputs from the tree, for example, I then get:

```
❯ go mod why github.com/goccy/go-json
# github.com/goccy/go-json
github.com/elastic/beats/v7/x-pack/metricbeat/module/gcp/billing
cloud.google.com/go/bigquery
github.com/apache/arrow/go/v12/arrow/array
github.com/goccy/go-json
```
Anyway, that is the worst offender but it has always been there.