backstage/community-plugins

Plugin: System Latency Heat Map

Opened this issue · 13 comments

🔖 Summary

Utilizing the Backstage graph plugin interface, we are proposing a system latency heat map. OpenTelemetry traces would be used to track pub/sub messaging and service invocations along with their latency. This kind of monitoring is very important in distributed architectures, because detecting contention by digging through Grafana and Prometheus dashboards is time consuming. The C4 model (https://c4model.com/) fits perfectly with a heat map overlay that lets engineers see the progressive nature of latency. See the attached diagram.

latency-heatmap

๐ŸŒ Project website (if applicable)

No response

โœŒ๏ธ Context

The plugin should take advantage of existing plugins as dependencies, such as the graph plugins that ship with Backstage. The system latency heat map is an overlay on top of the graph showing a gradient latency map. Other plugins, such as ones that use and expose OpenTelemetry metrics, are ideal for obtaining timing on endpoint-to-endpoint synchronous calls (REST) or asynchronous messaging (pub/sub) between components, dependencies, resources, and APIs. In addition to phase one as stated, phase two will add RAG AI-predicted latency using a backpropagation model trained on known outputs of tolerable latencies. These tolerances could be annotated in the component, API, and resource catalog-info.yaml files. More to add as we discuss as a team!
Please come and join; everyone's input is invited, and let's have fun!
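
To make the overlay idea a bit more concrete, here is a minimal TypeScript sketch of how an average edge latency pulled from OpenTelemetry traces could be classified and mapped to a gradient color for the catalog graph. All of the type names, thresholds, and the color math are illustrative placeholders for discussion, not a proposed API:

```ts
// Hypothetical sketch: map an observed average latency onto a heat-map color
// for a catalog graph edge. Names and thresholds are illustrative only.

/** Status buckets for an edge or node, based on observed latency. */
type LatencyStatus = 'ok' | 'warning' | 'critical';

/** Hypothetical per-edge measurement aggregated from OpenTelemetry spans. */
interface EdgeLatency {
  sourceRef: string;    // e.g. 'component:default/checkout'
  targetRef: string;    // e.g. 'api:default/payments-api'
  avgLatencyMs: number; // rolling average over some window
}

/** Classify latency against a threshold (e.g. from an annotation or config). */
function classify(avgLatencyMs: number, thresholdMs: number): LatencyStatus {
  if (avgLatencyMs <= thresholdMs) return 'ok';
  if (avgLatencyMs <= thresholdMs * 1.5) return 'warning';
  return 'critical';
}

/** Interpolate a green -> yellow -> red gradient for the overlay. */
function heatColor(avgLatencyMs: number, thresholdMs: number): string {
  // 0 = no latency, 1 = at or beyond 2x the threshold
  const t = Math.min(avgLatencyMs / (thresholdMs * 2), 1);
  const red = Math.round(255 * t);
  const green = Math.round(255 * (1 - t));
  return `rgb(${red}, ${green}, 0)`;
}
```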

👀 Have you spent some time to check if this plugin request has been raised before?

  • I checked and didn't find a similar issue

โœ๏ธ Are you willing to maintain the plugin?

๐Ÿข Have you read the Code of Conduct?

Are you willing to submit a PR?

No, but I'm happy to collaborate on a PR with someone else

I'm thinking of starting out simple: get the OpenTelemetry data first and create an alerts table to be added to a frontend card.

Any thoughts from the community would be great. I can go solo, but I'd prefer collaboration.
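
To sketch what that alerts card could look like (the LatencyAlert shape is made up for discussion; Table, TableColumn, and InfoCard are the existing components from @backstage/core-components):

```tsx
// Rough sketch of the phase-one alerts card; the alert row shape and the
// component name are hypothetical.
import React from 'react';
import { InfoCard, Table, TableColumn } from '@backstage/core-components';

/** Hypothetical alert row produced from OpenTelemetry trace averages. */
type LatencyAlert = {
  entityRef: string;   // e.g. 'component:default/checkout'
  peerRef: string;     // the service or topic on the other end
  avgLatencyMs: number;
  thresholdMs: number;
};

const columns: TableColumn<LatencyAlert>[] = [
  { title: 'Entity', field: 'entityRef' },
  { title: 'Talks to', field: 'peerRef' },
  { title: 'Avg latency (ms)', field: 'avgLatencyMs' },
  { title: 'Threshold (ms)', field: 'thresholdMs' },
];

export const LatencyAlertsCard = ({ alerts }: { alerts: LatencyAlert[] }) => (
  <InfoCard title="Latency alerts">
    <Table<LatencyAlert>
      columns={columns}
      data={alerts}
      options={{ search: false, paging: false }}
    />
  </InfoCard>
);
```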

Really hope this can take off. When I first looked at the catalog graph, I was wondering if we could do something like you presented.

I'd love to see this take off!

In my development of Backstage, I've enjoyed viewing it as a system that reads and presents data really well. Most of the config elements are chosen via catalog annotations by component owners, like DevOps dashboards, GitHub Actions, and the New Relic dashboard.

What do you think about using catalog annotations to decide which service a given component uses to report its status?

For example, we could add a processor that looks for the following annotation:

newrelic.com/APP_ID: 1129082

Something that handles that annotation would then use it to call the New Relic App Reporting API.
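
Roughly this shape, assuming the handler is given the catalog entity. Only the annotation key comes from above; the client interface is a made-up placeholder:

```ts
// Sketch of the annotation-driven pattern: only entities that carry the
// annotation get wired up to the external reporting service.
import { Entity } from '@backstage/catalog-model';

const NEW_RELIC_APP_ID_ANNOTATION = 'newrelic.com/APP_ID';

/** Hypothetical client for whatever reporting API the handler ends up calling. */
interface StatusReportingClient {
  getStatus(appId: string): Promise<{ status: string; latencyMs: number }>;
}

/** True if the entity has opted in via the annotation. */
export const isStatusReportingAvailable = (entity: Entity): boolean =>
  Boolean(entity.metadata.annotations?.[NEW_RELIC_APP_ID_ANNOTATION]);

export async function fetchStatusForEntity(
  entity: Entity,
  client: StatusReportingClient,
) {
  const appId = entity.metadata.annotations?.[NEW_RELIC_APP_ID_ANNOTATION];
  if (!appId) {
    return undefined; // entity has not opted in via the annotation
  }
  return client.getStatus(appId);
}
```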

Hi @Phiph, you must be psychic LOL. Yes, we are planning to do it exactly that way, e.g. an annotation like latencyheatmap/processingthreshold: 300ms, meaning the total acceptable processing time threshold for one thread on a service (component). It will also report the consuming/providing response times between services for both pub/sub and REST invocations. You could set acceptable SLA, SLI, and SLO thresholds in the annotation or in app-config.
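
For example, reading and parsing that annotation could look something like the sketch below; the annotation name is the one proposed above, while the parsing rules and the fallback default are placeholders to be agreed on:

```ts
// Illustrative sketch of reading the proposed processing-threshold annotation.
import { Entity } from '@backstage/catalog-model';

const PROCESSING_THRESHOLD_ANNOTATION = 'latencyheatmap/processingthreshold';

/** Parse values like '300ms' or '2s' into milliseconds. */
function parseDurationMs(value: string): number | undefined {
  const match = /^(\d+(?:\.\d+)?)(ms|s)$/.exec(value.trim());
  if (!match) return undefined;
  const amount = Number(match[1]);
  return match[2] === 's' ? amount * 1000 : amount;
}

/** Threshold for an entity, falling back to a placeholder org-wide default. */
export function getProcessingThresholdMs(
  entity: Entity,
  defaultMs = 300, // placeholder default; could come from app-config instead
): number {
  const raw = entity.metadata.annotations?.[PROCESSING_THRESHOLD_ANNOTATION];
  const parsed = raw ? parseDurationMs(raw) : undefined;
  return parsed ?? defaultMs;
}
```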

Latency in a distributed architecture is often cumulative across services, so it will be important to see it from a graph perspective.

So, we are starting this weekend by setting up the repo in community-plugins, and we will begin with a simple frontend screen with an alerts card. Then we'll look at how we can get the OpenTelemetry data.

Thanks for your thoughts; more are welcome, and we need help!

I would also like to add a small suggestion on the UI aspect of this: instead of a green/red/yellow circle to indicate the status of the connected entity, we could keep it more minimal and change the color of the line and the entity box to indicate the status of the entity based on telemetry.

Ha, thanks @tomfanara. I don't primarily use Backstage for distributed microservice architectures, but there are some components that call other components, which we map out using the catalog graph's [DependsOn] or [consumesAPI] relations, so being able to present a RAG (red/amber/green) status would be ideal for any user looking at the catalog.

I'm not too sure how I feel about letting the component owners set the processingthreshold. It would give teams their own choice of what green means, but from an organisation perspective you may want to set the bar. Maybe there should be sensible defaults that are config driven?

I also use Soundcheck in my instance of Backstage, so having an API that can just give me the information so I can use it for certification would also be of interest.

@Phiph, I have never used Soundcheck. I just checked it out, and it's something we (my company) also need to adopt. We will eventually use scorecards to see how our templates (scaffolding) are used and to govern standards.

Yes, I like your idea of a config with a default global tolerance setting, such as conservative, aggressive, or moderate.
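
A rough idea of how those config-driven defaults might be read, with made-up keys under a hypothetical latencyHeatmap section of app-config and the profile names from above:

```ts
// Sketch only: the config keys and shape are placeholders for discussion.
import { Config } from '@backstage/config';

export interface LatencyDefaults {
  processingThresholdMs: number;
  profile: 'conservative' | 'aggressive' | 'moderate';
}

/** Read org-wide defaults from app-config, falling back to hard-coded values. */
export function readLatencyDefaults(config: Config): LatencyDefaults {
  const section = config.getOptionalConfig('latencyHeatmap');
  return {
    processingThresholdMs:
      section?.getOptionalNumber('processingThresholdMs') ?? 300,
    profile:
      (section?.getOptionalString('profile') as LatencyDefaults['profile']) ??
      'moderate',
  };
}
```

Entity annotations could then override these defaults only where a team needs something stricter or looser than the organisation-wide bar.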

Also, I like @nia-potato's comments above, and I incorporated a rough look and feel into the diagram above. All good!

The following link shows various applications of RAG: https://bizbrolly.com/practical-uses-of-retrieval-augmented-generation-rag-in-ai/. We would be a candidate for data analysis and reporting as a way of predicting latency. However, the first phase of latencyheatmap is to do query averaging on traces, just to see real-time issues.
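
For the phase-one query averaging, something like the sketch below is the idea: group span durations by caller/callee pair and compute an average per graph edge. The span shape here is a simplified stand-in for whatever the OpenTelemetry backend actually returns:

```ts
// Sketch of per-edge query averaging over trace data (all names illustrative).

/** Simplified view of a span pulled from an OpenTelemetry backend. */
interface SpanSample {
  callerRef: string;  // entity emitting the call or message
  calleeRef: string;  // entity (or topic) receiving it
  durationMs: number;
}

interface EdgeAverage {
  callerRef: string;
  calleeRef: string;
  avgLatencyMs: number;
  sampleCount: number;
}

export function averageByEdge(samples: SpanSample[]): EdgeAverage[] {
  const buckets = new Map<
    string,
    { callerRef: string; calleeRef: string; total: number; count: number }
  >();
  for (const s of samples) {
    const key = `${s.callerRef}->${s.calleeRef}`;
    const bucket =
      buckets.get(key) ??
      { callerRef: s.callerRef, calleeRef: s.calleeRef, total: 0, count: 0 };
    bucket.total += s.durationMs;
    bucket.count += 1;
    buckets.set(key, bucket);
  }
  return Array.from(buckets.values(), b => ({
    callerRef: b.callerRef,
    calleeRef: b.calleeRef,
    avgLatencyMs: b.total / b.count,
    sampleCount: b.count,
  }));
}
```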

Also, I think you could ask your service catalog which systems are in latency trouble and have it report back using LLMs for context, then use an SLM (fewer parameters) to search the predicted analytics, thus augmenting the data for retrieval.

This plugin can serve as a good knowledge share on how to apply RAG AI to systems analysis.

As a result of observing concerning latency, we would then look at increasing replicas or scaling the microservice(s) with KEDA. KEDA is a horizontal scaling technology for Kubernetes that creates additional pods to handle throughput. There is also vertical scaling, which adds memory or CPU cores. Typically in microservices, unlimited thread pools or thread loops scale themselves to the number of CPU cores.

It looks like this would be very helpful. Would anyone like to be assigned?

@tomfanara have you got an initial setup ready? I can start the workspace with a backend and front end?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.