Update to CF v8.1.0: job and index data missing in loggregator metrics
jochenehret opened this issue · 5 comments
Hi Loggregator Team,
we have just evaluated the update from cf-deployment v7.9.0 to v8.1.0 and we've noticed that some loggregator metrics are incomplete. We haven't found the exact root cause, but it's most likely related to the changes introduced in loggregator-release v105.1: https://github.com/cloudfoundry/loggregator-release/releases/tag/v105.1
We are using a firehose nozzle to process all events. In CF v7.9.0 we see for example:
DEBUG 2019-05-02 11:29:40 counter_processor.go:15:Process CounterProcessor.Process: origin:"gorouter" eventType:CounterEvent timestamp:1556796580870895345 deployment:"cf" job:"router" index:"75d4a99a-7d2a-46c7-96b2-5fbf03bc193b" ip:"10.1.1.8" tags:<key:"source_id" value:"gorouter" > counterEvent:<name:"responses.xxx" delta:0 total:8 >
DEBUG 2019-05-02 11:29:42 counter_processor.go:15:Process CounterProcessor.Process: origin:"loggregator.metron" eventType:CounterEvent timestamp:1556796582249425497 deployment:"cf" job:"nats" index:"b93b69d3-6116-4630-b6c7-2c6740ce0151" ip:"10.1.1.3" tags:<key:"direction" value:"egress" > tags:<key:"metric_version" value:"2.0" > tags:<key:"source_id" value:"metron" > counterEvent:<name:"dropped" delta:0 total:60 >
After the update to CF v8.1.0 we see:
DEBUG 2019-05-02 11:17:35 counter_processor.go:15:Process CounterProcessor.Process: origin:"gorouter" eventType:CounterEvent timestamp:1556795854963439469 deployment:"" job:"" index:"" ip:"" tags:<key:"source_id" value:"gorouter" > counterEvent:<name:"backend_exhausted_conns" delta:0 total:0 >
DEBUG 2019-05-02 11:31:29 counter_processor.go:15:Process CounterProcessor.Process: origin:"policy-server" eventType:CounterEvent timestamp:1556796689328565213 deployment:"" job:"" index:"" ip:"" tags:<key:"source_id" value:"policy-server" > counterEvent:<name:"UptimeRequestCount" delta:1 total:1524 >
DEBUG 2019-05-02 11:17:35 counter_processor.go:15:Process CounterProcessor.Process: origin:"loggregator.metron" eventType:CounterEvent timestamp:1556795855406614256 deployment:"cf" job:"cc-worker" index:"26af209c-b94f-4b18-b1b3-e20a72451eb1" ip:"10.0.65.69" tags:<key:"direction" value:"egress" > tags:<key:"metric_version" value:"2.0" > tags:<key:"source_id" value:"metron" > counterEvent:<name:"dropped" delta:0 total:9 >
In the "gorouter" and "policy-server" events the fields "job" and "index" are empty. Other event sources, like "loggregator.metron", are not affected. We have not seen any significant changes in the gorouter components, so we suspect one of the loggregator components.
The events above were observed with a nozzle using the loggregator v1 API. We see the same problem with the v2 API:
DEBUG 2019-05-02 12:35:00 processor.go:14:ProcessEnvelope CounterProcessor.Process: timestamp:1556800499963550599 source_id:"gorouter" tags:<key:"__v1_type" value:"CounterEvent" > tags:<key:"deployment" value:"" > tags:<key:"index" value:"" > tags:<key:"ip" value:"" > tags:<key:"job" value:"" > tags:<key:"origin" value:"gorouter" > counter:<name:"responses.3xx" total:419 >
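To see which origins are affected across a larger capture, the debug output can be scanned mechanically. The following is a minimal sketch, not part of the nozzle: it assumes the lines look exactly like the v1 and v2 samples above, and the names `envelope_fields`, `empty_fields` and `affected_origins` are hypothetical helpers introduced here for illustration.

```python
import re
from collections import Counter

# v1 lines print envelope fields as  job:"router"
# v2 lines print them as             tags:<key:"job" value:"router" >
# Both patterns are modelled on the sample output in this issue (assumption).
V1_FIELD_RE = re.compile(r'\b(origin|deployment|job|index|ip):"([^"]*)"')
V2_TAG_RE = re.compile(r'tags:<key:"([^"]*)" value:"([^"]*)" >')

REQUIRED = ("deployment", "job", "index", "ip")


def envelope_fields(line):
    """Collect tags and top-level fields of one debug line into a dict."""
    fields = dict(V2_TAG_RE.findall(line))      # tags:<key:... value:... >
    fields.update(V1_FIELD_RE.findall(line))    # v1 top-level fields, if any
    return fields


def empty_fields(line):
    """Return the required fields that are empty or absent in this line."""
    fields = envelope_fields(line)
    return [name for name in REQUIRED if not fields.get(name)]


def affected_origins(lines):
    """Count, per origin, the events that arrived without job or index."""
    counts = Counter()
    for line in lines:
        fields = envelope_fields(line)
        origin = fields.get("origin")
        if not origin:
            continue  # not an envelope line, skip it
        if not fields.get("job") or not fields.get("index"):
            counts[origin] += 1
    return counts
```

Feeding the nozzle log through `affected_origins` gives a per-origin count of incomplete events, which makes it easy to confirm that only components such as "gorouter" and "policy-server" are affected.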
Can you please check? Thanks for your assistance!
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/165764978
The labels on this github issue will be updated when the story is started.
Hi @jochenehret! This was actually a bug in loggregator-agent-release that was fixed in version 3.11 (https://github.com/cloudfoundry/loggregator-agent-release/releases/tag/v3.11).
However, it's considered a breaking change in cf-deployment due to changing over from expvar forwarder to Prometheus. It won't get included until v9. If you don't use expvar endpoints, you can bump loggregator-agent-release to 3.11. Otherwise, you'll lose metrics from loggregator agent.
Hi @heycait,
thanks for your fast response. An update to loggregator-agent v3.11 fixed the missing tags.
However, we now have the problem that the "loggregator.metron" metrics are missing, like this one:
origin:"loggregator.metron" eventType:CounterEvent timestamp:1556873327970778067 deployment:"cf" job:"diego-cell" index:"e0753b0f-d805-45c3-a857-0852df484b82" ip:"10.0.73.0" tags:<key:"metric_version" value:"2.0" > tags:<key:"source_id" value:"metron" > counterEvent:<name:"egress" delta:0 total:1647015 >
We use these to calculate the total sum of messages over all agents. Has this metric been renamed or been removed?
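For context, a sum like this can be derived from the counter events roughly as follows. This is only a sketch of one plausible approach, not the actual nozzle code: it assumes the v1 debug format shown above, assumes each agent instance is identified by its `index`, and assumes one "egress" counter per agent whose `total` is cumulative. `total_egress` is a hypothetical helper name.

```python
import re

# Pulls origin, index and the counter name/total out of a v1 debug line.
# The pattern is modelled on the sample event above (assumption).
EVENT_RE = re.compile(
    r'origin:"(?P<origin>[^"]*)".*?'
    r'index:"(?P<index>[^"]*)".*?'
    r'counterEvent:<name:"(?P<name>[^"]*)" delta:\d+ total:(?P<total>\d+) >'
)


def total_egress(lines, origin="loggregator.metron", name="egress"):
    """Sum the most recent cumulative total of one counter over all agents."""
    latest = {}  # agent index -> last total seen for that agent
    for line in lines:
        match = EVENT_RE.search(line)
        if match and match["origin"] == origin and match["name"] == name:
            latest[match["index"]] = int(match["total"])
    return sum(latest.values())
```

Keeping only the latest `total` per agent avoids double counting when the same agent reports the counter repeatedly.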
Loggregator agent is the new name for metron. So these are the metrics that we mentioned you'd lose when upgrading to loggregator-agent-release 3.11 on a cf-deployment version below v9.
Actually, there's another workaround, as the original problem was the loss of tags from the UDP forwarder. Instead of upgrading loggregator-agent-release to 3.11, stay on version 3.9. You can re-enable UDP on the loggregator_agent addon and remove the loggr-udp-forwarder job from api, router, and tcp-router in an ops-file like the one below.
- type: remove
  path: /instance_groups/name=api/jobs/name=loggr-udp-forwarder
- type: remove
  path: /instance_groups/name=router/jobs/name=loggr-udp-forwarder
- type: remove
  path: /instance_groups/name=tcp-router/jobs/name=loggr-udp-forwarder
- type: replace
  path: /addons/name=loggregator_agent/jobs/name=loggregator_agent/properties/disable_udp
  value: false
Thanks @heycait for the clarification and the proposed fix!