After an mqtt failure, the agent no longer runs pktvisor and enters an error loop
manrodrigues opened this issue · 0 comments
manrodrigues commented
I had an agent running for a while (I was testing netprobe) and, at a certain point, the agent had an mqtt failure and after that, it ended pktvisor, but it didn't put pktvisor to run anymore. For this reason, the agent entered an eternal loop of errors and the data stopped being scraped.
{"level":"info","ts":"2022-12-17T12:49:59.819Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:49:59.819] [visor] [info] REQUEST: GET /api/v1/metrics/app 200"}
{"level":"error","ts":"2022-12-17T12:50:08.809Z","caller":"agent/logging.go:49","msg":"ERROR mqtt log","payload":["[client] ","Connect comms goroutine - error triggered",{"Op":"read","Net":"tcp","Source":{"IP":"192.168.1.177","Port":42050,"Zone":""},"Addr":{"IP":"34.208.164.162","Port":8883,"Zone":""},"Err":{"Syscall":"read","Err":104}}]}
{"level":"error","ts":"2022-12-17T12:50:08.809Z","caller":"agent/comms.go:28","msg":"connection to mqtt lost","error":"read tcp 192.168.1.177:42050->34.208.164.162:8883: read: connection reset by peer"}
{"level":"info","ts":"2022-12-17T12:50:08.809Z","caller":"agent/comms.go:29","msg":"reconnecting...."}
{"level":"error","ts":"2022-12-17T12:50:08.809Z","caller":"agent/heartbeats.go:138","msg":"error sending heartbeat","error":"connection lost before Publish completed"}
{"level":"info","ts":"2022-12-17T12:50:08.809Z","caller":"cloud_config/cloud_config.go:173","msg":"using explicitly specified cloud configuration","address":"tls://agents.stg.orb.live:8883","id":"3398195e-abb1-49ea-8912-94fa6930a742"}
{"level":"error","ts":"2022-12-17T12:50:10.056Z","caller":"agent/logging.go:49","msg":"ERROR mqtt log","payload":["[client] ","Connect comms goroutine - error triggered",{}]}
{"level":"error","ts":"2022-12-17T12:50:10.056Z","caller":"agent/comms.go:28","msg":"connection to mqtt lost","error":"EOF"}
{"level":"info","ts":"2022-12-17T12:50:10.056Z","caller":"agent/comms.go:29","msg":"reconnecting...."}
{"level":"info","ts":"2022-12-17T12:50:10.056Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.056] [visor] [info] REQUEST: GET /api/v1/metrics/app 200"}
{"level":"error","ts":"2022-12-17T12:50:10.056Z","caller":"agent/comms.go:86","msg":"failed to subscribe to agent control plane RPC topic","topic":"channels/bd370678-c9c2-428d-ba8b-168e8c31a1a6/messages/fromcore","error":"not Connected"}
{"level":"error","ts":"2022-12-17T12:50:10.056Z","caller":"agent/comms.go:87","msg":"critical failure: unable to subscribe to control plane"}
{"level":"info","ts":"2022-12-17T12:50:10.056Z","caller":"agent/agent_prof.go:205","msg":"routine call for stop agent","routine":"comms"}
{"level":"info","ts":"2022-12-17T12:50:10.056Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.056] [visor] [info] REQUEST: GET /api/v1/metrics/app 200"}
{"level":"info","ts":"2022-12-17T12:50:10.057Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.056] [visor] [info] REQUEST: GET /api/v1/metrics/app 200"}
{"level":"info","ts":"2022-12-17T12:50:10.097Z","caller":"pktvisor/pktvisor.go:268","msg":"routine call to stop pktvisor","routine":"comms"}
{"level":"info","ts":"2022-12-17T12:50:10.097Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.097] [visor] [info] Shutting down"}
{"level":"info","ts":"2022-12-17T12:50:10.097Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.097] [visor] [info] policy [policy_probe_without_histog]: stopping"}
{"level":"info","ts":"2022-12-17T12:50:10.097Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.097] [visor] [info] policy [policy_probe_without_histog]: input instance default_netprobe-59a9df65523e4634 not stopped because it is in use by another policy."}
{"level":"info","ts":"2022-12-17T12:50:10.097Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.097] [visor] [info] policy [policy_probe_5packet_2targets]: stopping"}
{"level":"info","ts":"2022-12-17T12:50:10.097Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.097] [visor] [info] policy [policy_probe_5packet_2targets]: input instance default_netprobe-ac5e85ca40cba6b1 not stopped because it is in use by another policy."}
{"level":"info","ts":"2022-12-17T12:50:10.097Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.097] [visor] [info] policy [default_netprobe-ac5e85ca40cba6b1-resources]: stopping"}
{"level":"info","ts":"2022-12-17T12:50:10.100Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.100] [visor] [info] policy [policy_probe_50packet_2targets_copy]: stopping"}
{"level":"info","ts":"2022-12-17T12:50:10.100Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.100] [visor] [info] policy [policy_probe_50packet_2targets_copy]: input instance default_netprobe-f753ca6c3b6befb9 not stopped because it is in use by another policy."}
{"level":"info","ts":"2022-12-17T12:50:10.100Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.100] [visor] [info] policy [policy_probe]: stopping"}
{"level":"info","ts":"2022-12-17T12:50:10.100Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.100] [visor] [info] policy [policy_probe]: input instance default_netprobe-59a9df65523e4634 not stopped because it is in use by another policy."}
{"level":"info","ts":"2022-12-17T12:50:10.100Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.100] [visor] [info] policy [default_netprobe-59a9df65523e4634-resources]: stopping"}
{"level":"info","ts":"2022-12-17T12:50:10.100Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.100] [visor] [info] policy [default_netprobe-f753ca6c3b6befb9-resources]: stopping"}
{"level":"warn","ts":"2022-12-17T12:50:10.306Z","caller":"internal/transaction.go:111","msg":"Failed to scrape Prometheus endpoint","scrape_timestamp":1671281410306,"target_labels":"{__name__=\"up\", instance=\"localhost:10853\", job=\"pktvisor_prometheus/policy_probe\"}"}
{"level":"info","ts":"2022-12-17T12:50:10.306Z","caller":"otlpmqttexporter/otlp.go:223","msg":"scraped metrics for policy","policy":"policy_probe","policy_id":"43817bc4-5d65-442f-8a6b-70ab74a195b7"}
{"level":"info","ts":"2022-12-17T12:50:10.307Z","caller":"otlpmqttexporter/otlp.go:230","msg":"request metrics count per policyID","policyID":"43817bc4-5d65-442f-8a6b-70ab74a195b7","metric_count":12}
{"level":"error","ts":"2022-12-17T12:50:10.334Z","caller":"otlpmqttexporter/otlp.go:247","msg":"error sending metrics RPC","topic":"channels/bd370678-c9c2-428d-ba8b-168e8c31a1a6/messages/otlp/pktvisor/m/3","error":"not Connected"}
{"level":"info","ts":"2022-12-17T12:50:10.334Z","caller":"exporterhelper/queued_retry.go:427","msg":"Exporting failed. Will retry the request after interval.","error":"not Connected","interval":"4.091047216s"}
{"level":"info","ts":"2022-12-17T12:50:10.384Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.384] [visor] [info] REQUEST: GET /api/v1/metrics/app 200"}
{"level":"info","ts":"2022-12-17T12:50:10.385Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.385] [visor] [info] REQUEST: GET /api/v1/taps 200"}
{"level":"info","ts":"2022-12-17T12:50:10.385Z","caller":"agent/rpc_to.go:51","msg":"sending capabilities","value":"{\"schema_version\":\"1.0\",\"orb_agent\":{\"version\":\"0.22.0-develop-f407fb3\"},\"agent_tags\":null,\"backends\":{\"pktvisor\":{\"version\":\"4.3.0-develop-891adf5\",\"data\":{\"taps\":{\"default_netprobe\":{\"config\":{\"test_type\":\"ping\"},\"input_type\":\"netprobe\",\"interface\":\"visor.module.input/1.0\",\"tags\":null}}}}}}"}
{"level":"info","ts":"2022-12-17T12:50:10.385Z","caller":"pktvisor/pktvisor.go:202","msg":"pktvisor stdout","log":"[2022-12-17 12:50:10.385] [visor] [info] exit with success"}
{"level":"info","ts":"2022-12-17T12:50:10.841Z","caller":"agent/rpc_to.go:93","msg":"execution period of the re-request of retryGroupMembership","waiting period":0.000001966}
{"level":"info","ts":"2022-12-17T12:50:11.065Z","caller":"agent/comms.go:123","msg":"completed RPC unsubscription to group","group_id":"0c6e438b-c8be-4c24-886e-f3ff3ef5df1c","group_name":"group-probe","topic":"channels/a8f227c9-d67b-4d72-a799-98e3434afcb6/messages/fromcore"}
{"level":"error","ts":"2022-12-17T12:50:11.066Z","caller":"agent/logging.go:49","msg":"ERROR mqtt log","payload":["[client] ","Connect comms goroutine - error triggered",{}]}
{"level":"error","ts":"2022-12-17T12:50:11.066Z","caller":"agent/comms.go:28","msg":"connection to mqtt lost","error":"EOF"}
{"level":"error","ts":"2022-12-17T12:50:11.066Z","caller":"agent/comms.go:172","msg":"failed to subscribe to group channel/topic","group_id":"0c6e438b-c8be-4c24-886e-f3ff3ef5df1c","group_name":"group-probe","topic":"channels/a8f227c9-d67b-4d72-a799-98e3434afcb6/messages/fromcore","error":"connection lost before Subscribe completed"}
{"level":"info","ts":"2022-12-17T12:50:11.066Z","caller":"agent/comms.go:29","msg":"reconnecting...."}
{"level":"info","ts":"2022-12-17T12:50:11.066Z","caller":"agent/rpc_to.go:153","msg":"execution period of the re-request of retryAgentPolicy","period":0.000003161}
{"level":"error","ts":"2022-12-17T12:50:11.066Z","caller":"agent/rpc_from.go:24","msg":"failed to send agent policies request","error":"not Connected"}
{"level":"error","ts":"2022-12-17T12:50:11.076Z","caller":"pktvisor/utils.go:43","msg":"received error from payload","error":"Get \"http://localhost:10853/api/v1/metrics/app\": dial tcp 127.0.0.1:10853: connect: connection refused"}
{"level":"info","ts":"2022-12-17T12:50:11.076Z","caller":"agent/comms.go:48","msg":"waiting until a backend is in running state","backend":"pktvisor","current state":"process running, REST API unavailable","wait time":"2s"}
{"level":"error","ts":"2022-12-17T12:50:12.461Z","caller":"agent/logging.go:49","msg":"ERROR mqtt log","payload":["[client] ","Connect comms goroutine - error triggered",{}]}
{"level":"error","ts":"2022-12-17T12:50:12.461Z","caller":"agent/comms.go:28","msg":"connection to mqtt lost","error":"EOF"}
{"level":"info","ts":"2022-12-17T12:50:12.461Z","caller":"agent/comms.go:29","msg":"reconnecting...."}
Expected behavior:
Whenever the agent has an error and ends pktvisor, it must restart pktvisor and check its full operation before doing anything else. If it is not possible to restore pktvisor to work, the agent must terminate.
Full logs:
https://app.zenhub.com/files/340174841/b476970a-9354-4ce4-8133-5d3d99e38c31/download