elastic/elastic-agent

Baseline agent memory usage has increased in ECK integration tests due to agentbeat

cmacknz opened this issue · 6 comments

See elastic/cloud-on-k8s#7790 and the following comments. The agentbeat instance implementing the filestream-monitoring component was being OOMKilled.

A jump of at least 85Mi in memory usage occurs in 8.14.0/8.15.0 but not in 8.13.0, causing the ECK Fleet tests to fail. ECK uses a 350Mi memory limit, which is lower than the default 700Mi provided in the agent reference configuration for k8s.

8.13.0

kubectl top pod test-agent-system-int-sf6b-agent-6n7vm -n e2e-mercury
NAME                                     CPU(cores)   MEMORY(bytes)
test-agent-system-int-sf6b-agent-6n7vm   83m          265Mi

8.14.0+

kubectl top pod test-agent-system-int-vlpc-agent-28vhc -n e2e-mercury
NAME                                     CPU(cores)   MEMORY(bytes)
test-agent-system-int-vlpc-agent-28vhc   171m         349Mi

The heap profiles from agent diagnostics when the process was being OOMKilled were not revealing, but they may not have been captured at the ideal time.

[Screenshot 2024-05-09 at 11:37:58 AM]
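For future runs it may be easier to grab heap snapshots on demand from the live process instead of relying on the diagnostics bundle being captured at the right moment. Below is a minimal sketch of the stock net/http/pprof mechanism for doing that; the agent exposes pprof through its own monitoring settings rather than code like this, so the address and wiring here are illustrative only.

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With this running, an in_use heap snapshot can be pulled at any chosen
	// moment with:
	//   go tool pprof -sample_index=inuse_space http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}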

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

I have been poking around looking for the root cause of this and haven't found a single obvious culprit yet. It might be more of a "death by 1000 papercuts" situation.

The heap sizes I've looked at are each ~10 MB higher, which can partially explain this. I think the largest contributor is the increased number of .init sections, given that all of the Beat modules are now present in agentbeat.
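To make the mechanism concrete, here is a toy illustration (not Beats code) of why import-time init() work raises the baseline heap of every process that links a package, whether or not that package is ever used:

package main

import (
	"fmt"
	"runtime"
)

// Stand-in for the kind of package-level state a Beat module might build in
// init(): registries, lookup tables, compiled schemas, and so on.
var table map[int][]byte

func init() {
	table = make(map[int][]byte, 4096)
	for i := 0; i < 4096; i++ {
		table[i] = make([]byte, 1024) // ~4 MiB held for the life of the process
	}
}

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("heap in use after init: %d KiB\n", m.HeapInuse/1024)
	_ = table // never otherwise used, but the memory is still resident
}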

Here is an 8.14.0 agentbeat in_use heap:
[Screenshot 2024-05-10 at 4:54:31 PM]

Here is an 8.13.4 metricbeat in_use heap:
[Screenshot 2024-05-10 at 4:55:13 PM]

Both of these were instances of the http-metrics-monitoring component.

We can look at reducing the number of func init() calls by switching to func InitializeModule, so that module initialization only happens when the corresponding agentbeat subcommand is actually being run.

Another option is to see if we can reduce the heap usage of each func init() and make the improvement across the board. Or do both.
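As a rough sketch of the first option, the idea is to replace import-time registration with an explicit call that each agentbeat subcommand makes for its own modules. Everything below (the registry, the module name, the dispatch) is illustrative rather than the actual Beats APIs:

package main

import "fmt"

// Toy stand-in for the Beats factory registries that func init() blocks
// currently populate at import time.
var registry = map[string]func() string{}

// Before: this work runs in every agentbeat process, regardless of which
// subcommand was invoked:
//
//	func init() {
//		registry["gcp/billing"] = newGCPBillingMetricSet
//	}
//
// After: the same registration is explicit, so only the subcommand that needs
// the module pays the init cost.
func InitializeMetricbeatModules() {
	registry["gcp/billing"] = newGCPBillingMetricSet
}

func newGCPBillingMetricSet() string { return "gcp/billing metricset" }

func main() {
	subcommand := "metricbeat" // in agentbeat this would come from os.Args

	// Only initialize the modules for the subcommand that is actually running.
	if subcommand == "metricbeat" {
		InitializeMetricbeatModules()
	}
	fmt.Println(len(registry), "module(s) registered")
}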

Looking at the worst offender, github.com/goccy/go-json at 9.4 MB and 4.6 MB of heap, it is only used by the Filebeat cel input as a dependency of a dependency. This previously only affected Filebeat processes, but now it affects every agent process because it is always imported into agentbeat.

❯ go mod why github.com/goccy/go-json
# github.com/goccy/go-json
github.com/elastic/beats/v7/x-pack/filebeat/input/cel
github.com/lestrrat-go/jwx/v2/jwt
github.com/lestrrat-go/jwx/v2/internal/json
github.com/goccy/go-json

There isn't a straightforward fix for this. We'd have to improve it upstream, or move that input to a different JWT library. For example, https://github.com/golang-jwt/jwt has no dependencies, but I have no idea if it covers all the necessary use cases.
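To show the shape of that alternative, here is a minimal sketch of issuing and validating a token with golang-jwt/jwt/v5. It is only meant to illustrate the API surface of the dependency-free library, not to claim it covers everything the cel and httpjson inputs currently get from jwx:

package main

import (
	"fmt"
	"time"

	"github.com/golang-jwt/jwt/v5"
)

func main() {
	secret := []byte("example-shared-secret") // illustrative only

	// Build and sign a token with a couple of registered claims.
	tok := jwt.NewWithClaims(jwt.SigningMethodHS256, jwt.MapClaims{
		"iss": "agentbeat-example",
		"exp": time.Now().Add(time.Hour).Unix(),
	})
	signed, err := tok.SignedString(secret)
	if err != nil {
		panic(err)
	}

	// Parse and validate it again.
	parsed, err := jwt.Parse(signed, func(t *jwt.Token) (interface{}, error) {
		return secret, nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("token valid:", parsed.Valid)
}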

Interestingly goccy/go-json is supposed to be optional: https://github.com/lestrrat-go/jwx/blob/develop/v2/docs/20-global-settings.md#switching-to-a-faster-json-library

I don't see us explicitly opting in to that, hmm.
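For what it's worth, that doc selects the faster backend at build time with a build tag (something like -tags jwx_goccy, if I'm reading it right) rather than through an API call, which would explain why there's no explicit opt-in to find in our code. The wiring is roughly a build-constrained file like the sketch below; this illustrates the mechanism and is not jwx's actual source:

//go:build jwx_goccy

// Illustrative only: this file compiles into the binary only when built with
// `go build -tags jwx_goccy`. A sibling file guarded by `//go:build !jwx_goccy`
// would provide the encoding/json fallback.
package fastjson

import "github.com/goccy/go-json" // package name is json

// Marshal delegates to goccy/go-json when the build tag is set.
func Marshal(v any) ([]byte, error) { return json.Marshal(v) }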

Ah, go mod why is only showing me a single path to the module. Looking at go mod graph, I can see that more things depend on goccy, which explains why it is compiled in.

If I just delete the httpjson and cel inputs from the tree, for example, I then get:

go mod why github.com/goccy/go-json
# github.com/goccy/go-json
github.com/elastic/beats/v7/x-pack/metricbeat/module/gcp/billing
cloud.google.com/go/bigquery
github.com/apache/arrow/go/v12/arrow/array
github.com/goccy/go-json

Anyway, that is the worst offender but it has always been there.