AxonIQ/reference-guide

EventProcessorLatencyMonitor behaviour does not match documentation

cula011 opened this issue · 6 comments

From https://docs.axoniq.io/reference-guide/v/4.6/axon-framework/monitoring/metrics:

EventProcessorLatencyMonitor - Measures the difference in message timestamps between the last ingested and the last processed event message.

Looking at the code https://github.com/AxonFramework/AxonFramework/blob/axon-4.6.x/metrics-micrometer/src/main/java/org/axonframework/micrometer/EventProcessorLatencyMonitor.java#L121, it's comparing the ingested message timestamp against now:
actualCounter.set(clock.wallTime() - message.getTimestamp().toEpochMilli());

This becomes especially apparent during a replay. In our case, we're replaying an event log where the first ingested event is over a year old, which spikes the latency metric to a value greater than one year. This seems incorrect, since the events are being processed correctly and promptly, and it reduces our ability to use this metric for alerting on genuine latency.
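To make the difference concrete, here's a small, standalone Java sketch (the class and variable names are hypothetical, and java.time.Clock stands in for Micrometer's clock) contrasting the documented behaviour with what the linked line actually computes:

    import java.time.Clock;
    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical illustration only, not the framework's actual monitor class.
    class LatencyIllustration {

        public static void main(String[] args) {
            Clock clock = Clock.systemUTC();

            // During a replay, the last ingested event can be very old...
            Instant ingestedTimestamp = Instant.now().minus(Duration.ofDays(400));
            // ...while the previously processed event from the same stream sits close behind it.
            Instant lastProcessedTimestamp = ingestedTimestamp.minusMillis(50);

            // Documented behaviour: difference between the last ingested and the last
            // processed event message timestamps -> stays small during a healthy replay.
            long documented = ingestedTimestamp.toEpochMilli() - lastProcessedTimestamp.toEpochMilli();

            // Current implementation (axon-4.6.x): wall-clock "now" minus the ingested
            // event's timestamp -> spikes to over a year when replaying old events.
            long actual = clock.millis() - ingestedTimestamp.toEpochMilli();

            System.out.println("documented interpretation: " + documented + " ms");
            System.out.println("current implementation:    " + actual + " ms");
        }
    }

Under the documented interpretation a healthy replay stays near zero, while the current implementation reports roughly the age of the replayed event.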

My question here is: "What is expected behaviour of EventProcessorLatencyMonitor?"

smcvb commented

Thanks for filing this issue with us, @cula011.

The monitor has undergone some changes throughout its lifecycle.
Initially, it did describe the difference between the ingested event and the previously processed event, as is present in the Reference Guide.
You can still find hints of that in this pull request (from 2020), for example.

In December 2020, this was altered in this commit, which sadly did not adjust (1) the JavaDoc or (2) the Reference Guide.
However, this most recent adjustment is definitely intended.
Although it shows a spike during replays, it provides a metric for how far behind your event processor is with event processing.

I made a pull request for Axon Framework to make that adjustment, which you can find here.
I'll also make an adjustment to the Reference Guide to provide a more usable description.

cula011 commented

Thanks for the clarification, @smcvb, and I appreciate the quick turnaround on updating the documentation.

A related observation, which came up while testing the behaviour of the metric, is that with multiple nodes in play, we could simulate a situation where one node's metric reports minimal latency while the other reports significantly higher latency.

Take the following sequence as an example:

  • Application running with two nodes
  • An external system integrated via a tracking event processor becomes unavailable
  • Node 1, which up to this point had the lock on processing for the given tracking event processor, errors and releases the lock
  • Node 2 obtains the lock and attempts to process the event, but fails due to the continued unavailability of the external system
  • At this point, Node 1 and Node 2 are in a sort of round robin, each seeing its latency metric grow due to the continued unavailability of the external system
  • External system becomes available again
  • Whichever node is attempting to process events at this point will do so successfully
  • The node that has obtained the lock and processed events will update its latency metric to reflect good health
  • The other node, which is effectively "locked out" for the time being, stops processing events and thus does not update its latency metric to reflect good health
  • As a result, the two nodes will report significantly different values against the latency metric.

I understand one way to address this would be to only look at the latency metric of the active node. I guess my follow-up question is more about whether this is intentional. Keen to hear your feedback.

Thanks

cula011 commented

Hi @smcvb, I realise this issue is closed, so it may not be getting attention, but I would really appreciate your thoughts on the above comment. Thanks again.

smcvb commented

Hi @cula011, that it's closed is not why I/we didn't reply. I'm just very busy at the moment. 😅
I had the issue pending for reply (as one of my browser tabs), so I would have reached it eventually. :-)


Concerning the follow-up observation you shared: well, that's indeed a possibility.
The MessageMonitor setup was not designed to make this happen intentionally.

The interfaces are drafted to work at a generic level for any "message processor."
Hence, they need to work for Command Buses and Event Processors alike, even though this on/off switching behaviour can lead to odd metrics.

The best you can do is what you've deduced yourself: check the latency metrics of the node holding an actual claim on a token.
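
As a rough sketch of what that could look like (the helper class is hypothetical; it assumes Axon 4.5 or later, where StreamingEventProcessor#processingStatus() is available and only contains entries for segments this node currently claims), you could gate your alerting on whether the local node holds a claim:

    import org.axonframework.config.Configuration;
    import org.axonframework.eventhandling.StreamingEventProcessor;

    // Hypothetical helper: only treat this node's latency gauge as meaningful
    // when the node actually claims at least one segment of the processor.
    class LatencyAlertGate {

        private final Configuration configuration;

        LatencyAlertGate(Configuration configuration) {
            this.configuration = configuration;
        }

        boolean shouldAlertOn(String processorName) {
            return configuration.eventProcessingConfiguration()
                                .eventProcessor(processorName, StreamingEventProcessor.class)
                                // An empty processing status means this node claims no segments,
                                // so its latency gauge is stale and not worth alerting on.
                                .map(processor -> !processor.processingStatus().isEmpty())
                                .orElse(false);
        }
    }

Whether you apply that check inside the application or as a label/filter on the alerting side is up to you; the point is that only the claiming node's gauge reflects actual processing.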


On a slightly related note, I am curious whether you've heard about a recent tool we've been developing at AxonIQ: Inspector Axon.
Inspector Axon is a tool that will inspect your Axon Framework application to provide insights and management tooling.
Management-wise, it, for example, allows you to split tokens, reset event processors, or reprocess dead letters.
On the point of insights, it provides a message flow graph, aggregate metrics, handler metrics, and several more.

The reason we've drafted up this tool is to have an easier way to relay which information about your Axon Framework application is important.
For one, it should give you clearer, out-of-the-box latency metrics for event processing.

If you have not heard about it, you can check it out at inspector.axoniq.io.


Again, sorry to leave you waiting, @cula011.
We aim to monitor all activity on our repositories.
Sadly enough, sometimes you need to wait a little longer than you expected.
Regardless, I hope the above helps you further!

cula011 commented

Hi @smcvb, thanks for your response given the circumstances.

I was not aware of Inspector Axon previously. I've had a brief look at the site and a video from a prior livestream. If I understand correctly, it's built with the assumption that an application would push metrics out to an Axon server remotely, which is unfortunately not something that would be allowed at my current workplace. However, it is interesting to see what kind of information you're observing and try to take away some of those learnings.

smcvb commented

For now, you'd be dealing with an online-only solution, that's true.
Note that Inspector Axon does not capture or store any PII from your applications.
Nonetheless, you're not the only one who'd still be blocked, given the above.
Hence, an on-prem version will happen, of course.

A bit on how it runs:
The communication between an Axon Framework application and the Inspector Axon server happens through RSocket connections.
The communication between the Inspector Axon servers happens through commands, events, and queries, using Axon Server.