[Self-Managed]: Fleet Server goes offline permanently and memory consumption increases when the logging level is changed to debug.
amolnater-qasource opened this issue · 12 comments
Kibana Build details:
VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470
Artifact Link: https://staging.elastic.co/8.14.0-a40d088a/summary-8.14.0.html
Host OS: All
Preconditions:
- 8.14.0-BC1 Kibana self-managed environment should be available.
- Fleet Server should be installed.
Steps to reproduce:
- Navigate to Fleet > Agents and open the Logs tab for the Fleet Server agent.
- Update the logging level to debug (see the API sketch after these steps).
- Observe that the fleet-server goes offline permanently and its memory consumption increases.
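For reference, the log-level change performed in the UI can also be triggered against the Kibana Fleet API. This is a minimal sketch only, assuming the POST /api/fleet/agents/{agent_id}/actions endpoint with a SETTINGS action body; the Kibana URL, credentials, and agent ID are placeholders.

// Minimal sketch: set an agent's log level to debug via the Kibana Fleet API.
// Assumption: the POST /api/fleet/agents/{agent_id}/actions endpoint accepts a
// SETTINGS action; kibanaURL, credentials, and agentID are placeholders.
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	kibanaURL := "http://localhost:5601"              // placeholder
	agentID := "f9489d84-c941-40ef-84eb-e07adcf4b37c" // agent ID, placeholder
	body := []byte(`{"action":{"type":"SETTINGS","data":{"log_level":"debug"}}}`)

	req, err := http.NewRequest(http.MethodPost,
		fmt.Sprintf("%s/api/fleet/agents/%s/actions", kibanaURL, agentID),
		bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("kbn-xsrf", "true")      // Kibana requires this header on non-GET requests
	req.SetBasicAuth("elastic", "changeme") // placeholder credentials

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(out))
}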
Expected Result:
Fleet Server should remain Healthy when the logging level is changed to debug.
Logs:
elastic-agent-diagnostics-2024-04-23T04-48-12Z-00.zip
Note:
- The issue is consistently reproducible at our end.
@manishgupta-qasource Please review.
Secondary review for this ticket is Done
components:
  - id: fleet-server-default
    state:
      component:
        apmconfig: null
        limits:
          gomaxprocs: 0
          source:
            fields:
              go_max_procs:
                kind:
                  numbervalue: 0
      component_idx: 2
      features_idx: 2
      message: 'Healthy: communicating with pid ''6060'''
      state: 2
      units:
        input-fleet-server-default-fleet-server-fleet_server-a4eeee2f-bf68-436c-8c3f-f860be6f8299:
          message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
          state: 4
        output-fleet-server-default:
          message: 'Error - could not start the HTTP server for the API: failed to listen on the named pipe \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.'
          state: 4
      version_info:
        build_hash: "11861004"
        meta:
          build_time: 2024-04-18 09:05:58 +0000 UTC
          commit: "11861004"
        name: fleet-server
fleet_message: |+
  fail to checkin to fleet-server: all hosts failed: 1 error occurred:
    * requester 0/1 to host https://localhost:8221/ errored: Post "https://localhost:8221/api/fleet/agents/f9489d84-c941-40ef-84eb-e07adcf4b37c/checkin?": dial tcp 127.0.0.1:8221: connectex: No connection could be made because the target machine actively refused it.
fleet_state: 4
log_level: debug
message: 1 or more components/units in a failed state
state: 3
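For reference when reading the dump above: the numeric state values line up with the messages next to them (2 where the component reports Healthy, 4 on the failed units, 3 on the degraded agent). A minimal sketch of that mapping, assuming the values follow the elastic-agent control-protocol state enum:

// Hedged mapping of the numeric state values in the diagnostics dump above.
// Assumption: they follow the elastic-agent control-protocol state enum; only
// the three values corroborated by the messages in the dump are listed.
package main

import "fmt"

var stateNames = map[int]string{
	2: "HEALTHY",  // 'Healthy: communicating with pid ...'
	3: "DEGRADED", // '1 or more components/units in a failed state'
	4: "FAILED",   // 'Error - could not start the HTTP server for the API: ...'
}

func main() {
	for _, s := range []int{2, 3, 4} {
		fmt.Printf("state %d => %s\n", s, stateNames[s])
	}
}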
I see log entries like this repeating frequently:
{"log.level":"info","@timestamp":"2024-04-23T04:47:39.249Z","message":"Error - could not start the HTTP server for the API: failed to listen on the named pipe \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: open \\\\.\\pipe\\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock: Access is denied.","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"service.name":"fleet-server","service.type":"fleet-server","state":"FAILED","ecs.version":"1.6.0","ecs.version":"1.6.0"}
This error comes from the fleet-server trying to start the local metrics server, specifically in github.com/elastic/elastic-agent-libs/api; see https://github.com/elastic/elastic-agent-libs/blob/main/api/routes.go#L39
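To illustrate what fails here: on Windows this local endpoint is served over a named pipe, and the "Access is denied" error is what surfaces when the pipe cannot be opened for listening (for example, if an instance of the pipe already exists under a different security descriptor). A minimal sketch of the failing operation, assuming the listen ultimately goes through go-winio's ListenPipe; the pipe name is the one from the log above.

//go:build windows

// Minimal sketch of the operation that fails in the logs above: listening on a
// Windows named pipe for the local metrics/API endpoint. Assumption: the listen
// ultimately goes through go-winio's ListenPipe; the pipe name is taken from
// the error message.
package main

import (
	"fmt"
	"net/http"

	"github.com/Microsoft/go-winio"
)

func main() {
	pipe := `\\.\pipe\UwGGXFL1il700DVAc6q-T-1Z9J1UjGMU.sock`

	// ListenPipe fails with "Access is denied" if the caller is not allowed
	// to create (another) instance of the pipe at this path.
	ln, err := winio.ListenPipe(pipe, nil)
	if err != nil {
		fmt.Printf("could not start the HTTP server for the API: %v\n", err)
		return
	}
	defer ln.Close()

	// Serve the metrics/API handlers over the pipe listener.
	_ = http.Serve(ln, http.NewServeMux())
}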
The recent changes in elastic-agent-libs are:
- elastic/elastic-agent-libs#192
- elastic/elastic-agent-libs#193
But these should not have an impact on the method call we use.
Is this reproducible on any other OS, or is it only on Windows?
Thank you for looking into this issue.
We have revalidated this issue for a Linux Fleet Server on an 8.14.0 BC1 Kibana cloud environment and had the following observations:
Observations:
- The Linux Fleet Server goes offline for some time when the logging level is set to debug.
- However, it returns to Healthy, and memory consumption does not increase the way it does for the Windows fleet-server.
Logs for Linux fleet-server:
elastic-agent-diagnostics-2024-04-24T06-12-49Z-00 (1).zip
Build details:
VERSION: 8.14.0 BC1
BUILD: 73520
COMMIT: c1513cd7e5a00eab209ba02d30cafd6945d75470
Please let us know if anything else is required from our end.
Thanks!
From what I can see, this could have been caused by the policy output reload work we tried to add; the PRs have been reverted in 8.15 and 8.14 as of this morning.
Thanks Michel. @amolnater-qasource can we retest when the next BC is available? There should be one built tomorrow April 25.
Hi @amolnater-qasource did you get a chance to retest this one? Thanks!
Hi Team,
We have revalidated this issue on the latest 8.14.0 BC4 Kibana self-managed environment and found it fixed:
Observations:
- Fleet Server remains Healthy when the logging level is changed to debug.
Logs:
elastic-agent-diagnostics-2024-05-13T06-30-55Z-00.zip
Build details:
VERSION: 8.14.0 BC4
BUILD: 73836
COMMIT: 23ed1207772b3ae958cb05bc4cdbe39b83507707
Hence we are closing and marking this issue as QA:Validated.
Thanks!