cloudfoundry/loggregator

Update to CF v8.1.0: job and index data missing in loggregator metrics

jochenehret opened this issue · 5 comments

Hi Loggregator Team,

we have just evaluated the update from cf-deployment v7.9.0 to v8.1.0 and noticed that some loggregator metrics are incomplete. We haven't found the exact root cause, but it is most likely related to the changes introduced in loggregator-release v105.1: https://github.com/cloudfoundry/loggregator-release/releases/tag/v105.1

We are using a firehose nozzle to process all events. In CF v7.9.0 we see for example:

DEBUG 2019-05-02 11:29:40 counter_processor.go:15:Process CounterProcessor.Process: origin:"gorouter" eventType:CounterEvent timestamp:1556796580870895345 deployment:"cf" job:"router" index:"75d4a99a-7d2a-46c7-96b2-5fbf03bc193b" ip:"10.1.1.8" tags:<key:"source_id" value:"gorouter" > counterEvent:<name:"responses.xxx" delta:0 total:8 >
DEBUG 2019-05-02 11:29:42 counter_processor.go:15:Process CounterProcessor.Process: origin:"loggregator.metron" eventType:CounterEvent timestamp:1556796582249425497 deployment:"cf" job:"nats" index:"b93b69d3-6116-4630-b6c7-2c6740ce0151" ip:"10.1.1.3" tags:<key:"direction" value:"egress" > tags:<key:"metric_version" value:"2.0" > tags:<key:"source_id" value:"metron" > counterEvent:<name:"dropped" delta:0 total:60 >

After the update to CF v8.1.0 we see:

DEBUG 2019-05-02 11:17:35 counter_processor.go:15:Process CounterProcessor.Process: origin:"gorouter" eventType:CounterEvent timestamp:1556795854963439469 deployment:"" job:"" index:"" ip:"" tags:<key:"source_id" value:"gorouter" > counterEvent:<name:"backend_exhausted_conns" delta:0 total:0 >
DEBUG 2019-05-02 11:31:29 counter_processor.go:15:Process CounterProcessor.Process: origin:"policy-server" eventType:CounterEvent timestamp:1556796689328565213 deployment:"" job:"" index:"" ip:"" tags:<key:"source_id" value:"policy-server" > counterEvent:<name:"UptimeRequestCount" delta:1 total:1524 >
DEBUG 2019-05-02 11:17:35 counter_processor.go:15:Process CounterProcessor.Process: origin:"loggregator.metron" eventType:CounterEvent timestamp:1556795855406614256 deployment:"cf" job:"cc-worker" index:"26af209c-b94f-4b18-b1b3-e20a72451eb1" ip:"10.0.65.69" tags:<key:"direction" value:"egress" > tags:<key:"metric_version" value:"2.0" > tags:<key:"source_id" value:"metron" > counterEvent:<name:"dropped" delta:0 total:9 >

In the "gorouter" and "policy-server" events the fields "job" and "index" are empty (as are "deployment" and "ip"). Other event sources, like "loggregator.metron", are not affected. We have not seen any significant changes in the gorouter components, so we suspect one of the loggregator components.
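
To spot affected events in a nozzle, a check like the following could flag envelopes whose metadata fields arrive empty. This is only a minimal sketch: the `Envelope` struct and the `MissingFields` helper are hypothetical stand-ins for the fields visible in the log lines above, not the actual loggregator types.

```go
package main

import "fmt"

// Envelope is a hypothetical, simplified stand-in for the v1 event
// fields shown in the log lines above (not the real loggregator type).
type Envelope struct {
	Origin     string
	Deployment string
	Job        string
	Index      string
	IP         string
}

// MissingFields returns the names of the metadata fields that are empty,
// in the fixed order deployment, job, index, ip.
func MissingFields(e Envelope) []string {
	fields := []struct{ name, value string }{
		{"deployment", e.Deployment},
		{"job", e.Job},
		{"index", e.Index},
		{"ip", e.IP},
	}
	var missing []string
	for _, f := range fields {
		if f.value == "" {
			missing = append(missing, f.name)
		}
	}
	return missing
}

func main() {
	// Mirrors the gorouter event seen after the update: all metadata empty.
	e := Envelope{Origin: "gorouter"}
	if missing := MissingFields(e); len(missing) > 0 {
		fmt.Println(e.Origin, "is missing", missing)
	}
}
```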

The events above were observed with a nozzle using the loggregator v1 API. We see the same problem with the v2 API:

DEBUG 2019-05-02 12:35:00 processor.go:14:ProcessEnvelope CounterProcessor.Process: timestamp:1556800499963550599 source_id:"gorouter" tags:<key:"__v1_type" value:"CounterEvent" > tags:<key:"deployment" value:"" > tags:<key:"index" value:"" > tags:<key:"ip" value:"" > tags:<key:"job" value:"" > tags:<key:"origin" value:"gorouter" > counter:<name:"responses.3xx" total:419 >

Can you please check? Thanks for your assistance!

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/165764978

The labels on this github issue will be updated when the story is started.

Hi @jochenehret! This was actually a bug in loggregator-agent-release that was fixed in version 3.11 (https://github.com/cloudfoundry/loggregator-agent-release/releases/tag/v3.11).

However, it's considered a breaking change in cf-deployment because it switches from the expvar forwarder to Prometheus, so it won't be included until v9. If you don't use expvar endpoints, you can bump loggregator-agent-release to 3.11. Otherwise, you'll lose metrics from the loggregator agent.

Hi @heycait ,

thanks for your fast response. An update to loggregator-agent v3.11 fixed the missing tags.

However, we now have the problem that the "loggregator.metron" metrics are missing, like this one:

origin:"loggregator.metron" eventType:CounterEvent timestamp:1556873327970778067 deployment:"cf" job:"diego-cell" index:"e0753b0f-d805-45c3-a857-0852df484b82" ip:"10.0.73.0" tags:<key:"metric_version" value:"2.0" > tags:<key:"source_id" value:"metron" > counterEvent:<name:"egress" delta:0 total:1647015 >

We use these to calculate the total sum of messages over all agents. Has this metric been renamed or been removed?
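
For context, the aggregation described here can be sketched roughly as follows. It assumes each agent is identified by its "job" and "index" fields and that "total" is a running counter per agent, so only the latest value per agent is kept; the `AgentKey` and `EgressSum` names are hypothetical and not part of any loggregator library.

```go
package main

import "fmt"

// AgentKey identifies one agent by the job and index fields of its events.
// Treating job+index as the agent identity is an assumption of this sketch.
type AgentKey struct {
	Job   string
	Index string
}

// EgressSum remembers the latest "egress" total reported by each agent.
type EgressSum struct {
	latest map[AgentKey]uint64
}

// NewEgressSum returns an empty aggregator.
func NewEgressSum() *EgressSum {
	return &EgressSum{latest: make(map[AgentKey]uint64)}
}

// Observe records a counter event. Events that are not the "egress"
// counter from origin "loggregator.metron" are ignored.
func (s *EgressSum) Observe(origin, name, job, index string, total uint64) {
	if origin != "loggregator.metron" || name != "egress" {
		return
	}
	s.latest[AgentKey{Job: job, Index: index}] = total
}

// Total sums the latest totals over all agents seen so far.
func (s *EgressSum) Total() uint64 {
	var sum uint64
	for _, t := range s.latest {
		sum += t
	}
	return sum
}

func main() {
	s := NewEgressSum()
	// Values taken from the event quoted above.
	s.Observe("loggregator.metron", "egress", "diego-cell",
		"e0753b0f-d805-45c3-a857-0852df484b82", 1647015)
	fmt.Println(s.Total())
}
```

Note that a sum built this way silently shrinks when the "loggregator.metron" events stop arriving, which is how the missing metrics become visible.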

Loggregator agent is the new name for metron. So these are the metrics we mentioned you'd lose when upgrading to loggregator-agent-release 3.11 on a cf-deployment version below v9.

Actually, there's another workaround, since the original problem was the loss of tags in the UDP forwarder. Instead of upgrading loggregator-agent-release to 3.11, stay on version 3.9. You can re-enable UDP on the loggregator_agent addon and remove the loggr-udp-forwarder job from the api, router, and tcp-router instance groups with an ops-file like the one below.

- type: remove
  path: /instance_groups/name=api/jobs/name=loggr-udp-forwarder

- type: remove
  path: /instance_groups/name=router/jobs/name=loggr-udp-forwarder

- type: remove
  path: /instance_groups/name=tcp-router/jobs/name=loggr-udp-forwarder

- type: replace
  path: /addons/name=loggregator_agent/jobs/name=loggregator_agent/properties/disable_udp
  value: false

Thanks @heycait for the clarification and the proposed fix!