[Self-Managed]: Fleet Server permanently goes offline and memory consumption increases on changing logging level to debug.

Question


amolnater-qasource opened this issue 7 months ago · 12 comments

amolnater-qasource commented 7 months ago

Kibana Build details:

VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470

Artifact Link: https://staging.elastic.co/8.14.0-a40d088a/summary-8.14.0.html

Host OS: All

Preconditions:

8.14.0-BC1 Kibana self-managed environment should be available.
Fleet Server should be installed.

Steps to reproduce:

Navigate to Fleet > Agents, open the Fleet Server agent, and go to the Logs tab.
Update the agent logging level to debug.
Observe that Fleet Server goes offline permanently and its memory consumption increases.

Expected Result:
Fleet Server should remain Healthy on changing logging level to debug.

Logs:
elastic-agent-diagnostics-2024-04-23T04-48-12Z-00.zip

Note:

Issue is consistently reproducible at our end.

Answer 1 · 2024-04-23T05:17:35.000Z

@manishgupta-qasource Please review.

Answer 2 · 2024-04-23T05:25:48.000Z

Secondary review for this ticket is done.

Answer 3 · 2024-04-23T19:36:22.000Z

components:
    - id: fleet-server-default
      state:
        component:
            apmconfig: null
            limits:
                gomaxprocs: 0
                source:
                    fields:
                        go_max_procs:
                            kind:
                                numbervalue: 0
        component_idx: 2
        features_idx: 2
        message: 'Healthy: communicating with pid ''6060'''
        state: 2
        units:
            input-fleet-server-default-fleet-server-fleet_server-a4eeee2f-bf68-436c-8c3f-f860be6f8299:
                message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
                state: 4
            output-fleet-server-default:
                message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
                state: 4
        version_info:
            build_hash: "11861004"
            meta:
                build_time: 2024-04-18 09:05:58 +0000 UTC
                commit: "11861004"
            name: fleet-server

fleet_message: |+
    fail to checkin to fleet-server: all hosts failed: 1 error occurred:
    	* requester 0/1 to host https://localhost:8221/ errored: Post "https://localhost:8221/api/fleet/agents/f9489d84-c941-40ef-84eb-e07adcf4b37c/checkin?": dial tcp 127.0.0.1:8221: connectex: No connection could be made because the target machine actively refused it.

fleet_state: 4
log_level: debug
message: 1 or more components/units in a failed state
state: 3
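
To make the failing units in the state above easier to spot, here is a minimal sketch that walks an already-parsed copy of that structure and lists every unit reporting state 4. It assumes that state code 4 means "failed", which the unit messages above suggest but which is not confirmed in this thread. The `failed_units` helper is a hypothetical name, and the messages in the sample dict are shortened copies of the ones above.

```python
# Assumption: state code 4 means "failed", as the unit messages above suggest.
FAILED = 4


def failed_units(state):
    """Return (component_id, unit_id, message) for every unit in state FAILED."""
    found = []
    for component in state.get("components", []):
        units = component.get("state", {}).get("units") or {}
        for unit_id, unit in units.items():
            if unit.get("state") == FAILED:
                found.append((component.get("id"), unit_id, unit.get("message", "")))
    return found


# Abbreviated copy of the state shown above; messages are shortened for readability.
state = {
    "components": [
        {
            "id": "fleet-server-default",
            "state": {
                "message": "Healthy: communicating with pid '6060'",
                "state": 2,
                "units": {
                    "input-fleet-server-default-fleet-server-fleet_server-a4eeee2f-bf68-436c-8c3f-f860be6f8299": {
                        "message": "Error - could not start the HTTP server for the API",
                        "state": 4,
                    },
                    "output-fleet-server-default": {
                        "message": "Error - could not start the HTTP server for the API",
                        "state": 4,
                    },
                },
            },
        }
    ],
    "fleet_state": 4,
    "log_level": "debug",
    "state": 3,
}

for component_id, unit_id, message in failed_units(state):
    print(f"{component_id} / {unit_id}: {message}")
```

In this snippet the component itself reports state 2 with a "Healthy" message while both of its units report state 4, so checking only the component-level state would miss the failure.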

Answer 4 · 2024-04-23T20:14:46.000Z

I see logs like this frequently repeating:

{"log.level":"info","@timestamp":"2024-04-23T04:47:39.249Z","message":"Error - could not start the HTTP server for the API: failed to listen on the named pipe \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","service.type":"fleet-server","state":"FAILED","ecs.version":"1.6.0","ecs.version":"1.6.0"}
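
To quantify how often a line like this repeats, a rough sketch along these lines can count messages in newline-delimited JSON logs. It assumes each log line is one JSON object with `message` and optional `state` fields, as in the line above. The `count_messages` helper and the sample lines are hypothetical stand-ins, not taken from the attached diagnostics.

```python
import json
from collections import Counter


def count_messages(lines, state=None):
    """Count occurrences of each "message" value in NDJSON log lines.

    If `state` is given, only entries whose "state" field equals it are counted.
    Lines that are blank or not valid JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON lines mixed into the log
        if not isinstance(entry, dict):
            continue
        if state is not None and entry.get("state") != state:
            continue
        counts[entry.get("message", "")] += 1
    return counts


# Hypothetical sample lines, shortened from the format shown above.
sample = [
    '{"log.level":"info","message":"Error - could not start the HTTP server for the API","state":"FAILED"}',
    '{"log.level":"info","message":"Error - could not start the HTTP server for the API","state":"FAILED"}',
    '{"log.level":"info","message":"some other message"}',
    "not a json line",
]

for message, n in count_messages(sample, state="FAILED").most_common():
    print(n, message)
```

To run it against real logs, pass an open file handle instead of the sample list, for example `count_messages(open(path, encoding="utf-8"), state="FAILED")`, where `path` points to a log file extracted from the diagnostics archive.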

Answer 5 · 2024-04-23T22:30:47.000Z

This error occurs when fleet-server tries to start the local metrics server, specifically in the github.com/elastic/elastic-agent-libs/api package; see https://github.com/elastic/elastic-agent-libs/blob/main/api/routes.go#L39

Answer 6 · 2024-04-23T22:44:41.000Z

The changes in elastic-agent-api are:

elastic/elastic-agent-libs#192
elastic/elastic-agent-libs#193
But these should not affect the method call we use.

Answer 7 · 2024-04-23T22:50:03.000Z

Is this reproducible on any other OS, or only on Windows?

Answer 8 · 2024-04-24T07:31:50.000Z

Hi @michel-laterman

Thank you for looking into this issue.

We revalidated this issue with a Linux Fleet Server on an 8.14.0 BC1 Kibana cloud environment and have the following observations:

Observations:

The Linux Fleet Server goes offline for some time after the logging level is set to debug.
However, it returns to Healthy, and its memory consumption does not increase the way it does on the Windows Fleet Server.

Logs for Linux fleet-server:
elastic-agent-diagnostics-2024-04-24T06-12-49Z-00 (1).zip

Build details:
VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470

Please let us know if anything else is required from our end.
Thanks!

Answer 9 · 2024-04-24T18:07:32.000Z

From what I can see, this could have been caused by the policy output reload work we tried to add. The PRs have been reverted in 8.15 and 8.14 as of this morning.

Answer 10 · 2024-04-24T18:12:19.000Z

Thanks Michel. @amolnater-qasource can we retest when the next BC is available? There should be one built tomorrow April 25.

Answer 11 · 2024-05-10T21:21:24.000Z

Hi @amolnater-qasource did you get a chance to retest this one? Thanks!

Answer 12 · 2024-05-13T06:32:47.000Z

Hi Team,

We have revalidated this issue on the latest 8.14.0 BC4 Kibana self-managed environment and found it fixed:

Observations:

Fleet Server remains Healthy on changing logging level to debug.

Logs:
elastic-agent-diagnostics-2024-05-13T06-30-55Z-00.zip

Build details:
VERSION: 8.14.0 BC4
BUILD: 73836
COMMIT: 23ed1207772b3ae958cb05bc4cdbe39b83507707

Hence, we are closing this issue and marking it as QA:Validated.

Thanks!