Feature Request: Ability to trace message flow between devices
Hi Tango developers
Overview
During development at SKAO we found that the ability to observe the interaction between devices would prove invaluable for debugging problems as well as identifying bottlenecks.
The benefits of this type of distributed tracing have been well proven, and we believe that Tango would greatly benefit from having some kind of mechanism to facilitate it.
For more information on distributed tracing and its benefits, please see OpenTracing.
Proof of concept
As a proof of concept, we integrated with APM and, by working around some limitations (discussed below), were able to get limited functionality going. Below are some screenshots with descriptions.
As we can see below, several devices implement the ConfigureScan command, and this command cascades down these devices when it is executed on the SubarrayNode. Here we can easily identify any missing devices or devices that take too long.
Each of these transactions (commands) can be expanded to display the arguments sent to that command on that particular device. Note the labels.function_arguments.
Limitations encountered
- Knowing who called who
  In the POC (as can be seen in the screenshot above), we used JSON in a String type argument to pass along the identifier parent_id. If this key was not present, the command was assumed to be the first in the chain, and an ID was generated and passed on to the next command. Limiting commands to a String argument is obviously too restrictive in the real world. In the web services world, a request header with a UUID would be added at the ingress service, and this header would then flow through the services as that particular request was fulfilled. This enables tracing of the request throughout the system.
- Knowing how long a command took
  In the POC we added a Python decorator to the command that starts a transaction before the command is executed and ends it afterwards.
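To make the two workarounds above concrete, here is a minimal sketch of what such a decorator could look like. It is not the POC code: it uses plain Python classes in place of Tango devices, records to an in-memory list instead of an APM server, and reflects one possible reading of how parent_id is handed on. The names traced_command, RECORDED_SPANS, span_id and DemoLeafDevice are hypothetical and chosen only for illustration.

```python
import functools
import json
import time
import uuid

# Hypothetical in-memory sink; a real setup would report to a tracing backend.
RECORDED_SPANS = []


def traced_command(func):
    """Time a command and link it to its caller via a JSON "parent_id" key.

    Assumes the command takes a single JSON string argument.
    """

    @functools.wraps(func)
    def wrapper(self, argin):
        payload = json.loads(argin)
        # A missing parent_id means this command is the first in the chain.
        parent_id = payload.get("parent_id")
        span_id = uuid.uuid4().hex
        # Commands called from here will see this command as their parent.
        payload["parent_id"] = span_id
        start = time.perf_counter()
        try:
            return func(self, json.dumps(payload))
        finally:
            # Record the transaction even if the command raised.
            RECORDED_SPANS.append(
                {
                    "span_id": span_id,
                    "parent_id": parent_id,
                    "name": f"{type(self).__name__}.{func.__name__}",
                    "duration_s": time.perf_counter() - start,
                    "function_arguments": argin,
                }
            )

    return wrapper


class DemoLeafDevice:
    """Stand-in for a device at the bottom of the cascade (hypothetical)."""

    @traced_command
    def ConfigureScan(self, argin):
        return "ok"


class SubarrayNode:
    """Stand-in for the device that starts the cascade."""

    def __init__(self, child):
        self.child = child

    @traced_command
    def ConfigureScan(self, argin):
        # argin already carries this command's ID as parent_id.
        return self.child.ConfigureScan(argin)


SubarrayNode(DemoLeafDevice()).ConfigureScan('{"scan_id": 1}')
for span in RECORDED_SPANS:
    print(span["name"], "called by", span["parent_id"])
```

The sketch also shows the restriction mentioned above: it only works because every command accepts a JSON string that the decorator can rewrite.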
Requirements
- At a minimum we need some kind of identifier that would connect the calling device (parent) and called device (child). From this information we'll be able to build the call graph.
- We also require some mechanism to register when a command starts and ends. This gives us the command durations.
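As a rough illustration of why these two pieces of information are sufficient, the sketch below rebuilds a call tree with durations from a flat list of records. The record layout (span_id, parent_id, name, duration_s), the child device names, and the durations are all invented for this example and do not come from the POC.

```python
from collections import defaultdict

# Invented example records; a parent_id of None marks the first command.
EXAMPLE_SPANS = [
    {"span_id": "a", "parent_id": None,
     "name": "SubarrayNode.ConfigureScan", "duration_s": 0.25},
    {"span_id": "b", "parent_id": "a",
     "name": "ChildDeviceA.ConfigureScan", "duration_s": 0.125},
    {"span_id": "c", "parent_id": "a",
     "name": "ChildDeviceB.ConfigureScan", "duration_s": 0.0625},
]


def build_call_graph(spans):
    """Group records by the ID of the command that called them."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)
    return children


def format_call_tree(children, parent_id=None, depth=0):
    """Return one indented line per command, parents before children."""
    lines = []
    for span in children.get(parent_id, []):
        millis = span["duration_s"] * 1000
        lines.append(f"{'  ' * depth}{span['name']} ({millis:.1f} ms)")
        lines.extend(format_call_tree(children, span["span_id"], depth + 1))
    return lines


for line in format_call_tree(build_call_graph(EXAMPLE_SPANS)):
    print(line)
```

A device that never received the command would simply be absent from the tree, and a slow device would stand out by its duration.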
Finally, OpenTracing
If tracing is deemed to be beneficial, it may be worth exploring the OpenTracing specification to see how, or if, its implementation can be facilitated. This would allow us to plug into existing tracing tooling that is available today (Zipkin, Jaeger, Lightstep).