elastic/fleet-server

[Self-Managed]: Fleet Server permanently goes offline and memory consumption increases on changing logging level to debug.

amolnater-qasource opened this issue · 12 comments

Kibana Build details:

VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470

Artifact Link: https://staging.elastic.co/8.14.0-a40d088a/summary-8.14.0.html

Host OS: All

Preconditions:

  1. 8.14.0-BC1 Kibana self-managed environment should be available.
  2. Fleet Server should be installed.

Steps to reproduce:

  1. Navigate to the Fleet > Agents > Agent logs tab.
  2. Change the logging level to debug.
  3. Observe that Fleet Server goes offline permanently and its memory consumption increases.

Expected Result:
Fleet Server should remain Healthy when the logging level is changed to debug.

Logs:
elastic-agent-diagnostics-2024-04-23T04-48-12Z-00.zip

Screenshot:
image

Note:

  • The issue is consistently reproducible on our end.

Secondary review for this ticket is Done

components:
    - id: fleet-server-default
      state:
        component:
            apmconfig: null
            limits:
                gomaxprocs: 0
                source:
                    fields:
                        go_max_procs:
                            kind:
                                numbervalue: 0
        component_idx: 2
        features_idx: 2
        message: 'Healthy: communicating with pid ''6060'''
        state: 2
        units:
            input-fleet-server-default-fleet-server-fleet_server-a4eeee2f-bf68-436c-8c3f-f860be6f8299:
                message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
                state: 4
            output-fleet-server-default:
                message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
                state: 4
        version_info:
            build_hash: "11861004"
            meta:
                build_time: 2024-04-18 09:05:58 +0000 UTC
                commit: "11861004"
            name: fleet-server

fleet_message: |+
    fail to checkin to fleet-server: all hosts failed: 1 error occurred:
    	* requester 0/1 to host https://localhost:8221/ errored: Post "https://localhost:8221/api/fleet/agents/f9489d84-c941-40ef-84eb-e07adcf4b37c/checkin?": dial tcp 127.0.0.1:8221: connectex: No connection could be made because the target machine actively refused it.

fleet_state: 4
log_level: debug
message: 1 or more components/units in a failed state
state: 3
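
For anyone triaging a similar state dump, the failed units can be pulled out with a few lines of Python. This is only a sketch: it assumes the YAML has already been parsed into a dict (the `sample` below is a hand-copied subset of the dump above), and it treats state code 4 as "failed" purely because that is the code shown next to the "Error" messages here. The helper name `failed_units` is made up for illustration.

```python
# Assumption: 4 means "failed". This is inferred from the dump above, where
# the units reporting "Error - ..." carry state 4 and the component reporting
# "Healthy" carries state 2.
FAILED = 4


def failed_units(state):
    """Return (component_id, unit_id, message) for every unit in the failed state."""
    failures = []
    for component in state.get("components", []):
        units = component.get("state", {}).get("units", {})
        for unit_id, unit in units.items():
            if unit.get("state") == FAILED:
                failures.append((component["id"], unit_id, unit.get("message", "")))
    return failures


# Error message copied from the dump above (raw string keeps the backslashes).
PIPE_ERROR = (
    r"Error - could not start the HTTP server for the API: failed to listen on "
    r"the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open "
    r"\\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied."
)

# Hand-copied subset of the dump, shaped the way a YAML parser would return it.
sample = {
    "components": [
        {
            "id": "fleet-server-default",
            "state": {
                "message": "Healthy: communicating with pid '6060'",
                "state": 2,
                "units": {
                    "input-fleet-server-default-fleet-server-fleet_server-"
                    "a4eeee2f-bf68-436c-8c3f-f860be6f8299": {
                        "message": PIPE_ERROR,
                        "state": 4,
                    },
                    "output-fleet-server-default": {
                        "message": PIPE_ERROR,
                        "state": 4,
                    },
                },
            },
        }
    ],
    "state": 3,
}

for component_id, unit_id, message in failed_units(sample):
    print(f"{component_id} / {unit_id}: {message}")
```

Note that the component itself reports Healthy while both of its units are failed, so checking only the component-level state would miss this condition.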

I see log entries like this repeating frequently:

{"log.level":"info","@timestamp":"2024-04-23T04:47:39.249Z","message":"Error - could not start the HTTP server for the API: failed to listen on the named pipe \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","service.type":"fleet-server","state":"FAILED","ecs.version":"1.6.0","ecs.version":"1.6.0"}
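
To get a rough idea of how often an entry like this repeats, something along these lines could be run over the JSON log lines. It is a sketch under a few assumptions: the logs are newline-delimited JSON, failed entries carry `"state":"FAILED"` as in the line above, and the sample entry below is a shortened stand-in rather than real output. The function name `count_failed_messages` is hypothetical.

```python
import json
from collections import Counter


def count_failed_messages(lines):
    """Count the messages of log entries whose "state" field is "FAILED"."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if isinstance(entry, dict) and entry.get("state") == "FAILED":
            counts[entry.get("message", "")] += 1
    return counts


# Shortened stand-in for the entry above; only the fields used here are kept.
sample_entry = {
    "log.level": "info",
    "@timestamp": "2024-04-23T04:47:39.249Z",
    "message": "Error - could not start the HTTP server for the API: Access is denied.",
    "state": "FAILED",
}

sample_lines = [json.dumps(sample_entry)] * 3 + [
    '{"log.level":"info","message":"unrelated entry"}',
    "not a json line",
]

for message, count in count_failed_messages(sample_lines).most_common():
    print(f"{count}x {message}")
```

Any iterable of strings works as input, so an open log file from the diagnostics archive can be passed in directly.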

This error occurs when fleet-server tries to start the local metrics server, specifically in github.com/elastic/elastic-agent-libs/api, at https://github.com/elastic/elastic-agent-libs/blob/main/api/routes.go#L39

The changes in elastic-agent-api are:

Is this reproducible on any other OS, or only on Windows?

Hi @michel-laterman

Thank you for looking into this issue.

We have revalidated this issue with a Linux Fleet Server on an 8.14.0 BC1 Kibana cloud environment and have the following observations:

Observations:

  • The Linux Fleet Server goes offline for some time when the logging level is set to debug.
  • However, it returns to Healthy, and its memory consumption does not increase as it does on the Windows Fleet Server.

Logs for Linux fleet-server:
elastic-agent-diagnostics-2024-04-24T06-12-49Z-00 (1).zip

Build details:
VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470

Screenshot:
image

Please let us know if anything else is required from our end.
Thanks!

From what I can see, this could have been caused by the policy output reload work we tried to add. The PRs have been reverted in 8.15 and 8.14 as of this morning.

Thanks, Michel. @amolnater-qasource, can we retest when the next BC is available? One should be built tomorrow, April 25.

Hi @amolnater-qasource did you get a chance to retest this one? Thanks!

Hi Team,

We have revalidated this issue on the latest 8.14.0 BC4 Kibana self-managed environment and found it fixed:

Observations:

  • Fleet Server remains Healthy when the logging level is changed to debug.

Logs:
elastic-agent-diagnostics-2024-05-13T06-30-55Z-00.zip

Screenshot:
image

Build details:
VERSION: 8.14.0 BC4
BUILD: 73836
COMMIT: 23ed1207772b3ae958cb05bc4cdbe39b83507707

Hence, we are closing this issue and marking it as QA:Validated.

Thanks!