bloomberg/goldpinger

Add latencies on graph edges ?

seeker89 opened this issue · 11 comments

I noticed the fun spinoff here https://github.com/vmarchaud/consul-topology-visualizer#inspiration

And in the spirit of cross-pollination I'm wondering - would people find it useful to display the latencies on the edges ? It looks kind of neat, although it will only really be useful for smaller graphs, and it is also redundant with the metrics already exported. It could be a checkbox in the UI.

Thoughts ?

That's a very good idea ! 👍
This could even go further by colorizing the edges depending on the latency.
I don't think it's heavy/redundant, since it's not necessarily the same people who use the UI and the ones who leverage the metrics :-)

We've been working on some network observability tooling at @Shopify, inspired by Microsoft's PingMesh paper. We've come up with this so far:

[image: latency grid]

  • Each row represents a node sending out a ping, each column a node receiving a ping.
  • The dark cross is a failing node.
  • Green, yellow, and red indicate low, medium, and high round trips, respectively. Currently bucketed with fixed thresholds (see the sketch below).
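
For concreteness, here's a minimal Go sketch of that fixed-threshold bucketing. Only the <100 ms "green" bound is mentioned in this thread; the 500 ms "yellow" cutoff is a made-up placeholder:

```go
package main

import (
	"fmt"
	"time"
)

// bucketColor maps a round-trip time to one of the three fixed buckets
// described above. Only the <100 ms "green" bound comes from the thread;
// the 500 ms "yellow" cutoff is an assumption for illustration.
func bucketColor(rtt time.Duration) string {
	switch {
	case rtt < 100*time.Millisecond:
		return "green" // low round trip
	case rtt < 500*time.Millisecond:
		return "yellow" // medium round trip
	default:
		return "red" // high round trip
	}
}

func main() {
	for _, rtt := range []time.Duration{20 * time.Millisecond, 250 * time.Millisecond, time.Second} {
		fmt.Printf("%v -> %s\n", rtt, bucketColor(rtt))
	}
}
```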

We were wondering if this visualization (plus the round trip additions) would be of interest to contribute upstream to resolve this issue?

[EDIT]
I should mention we're actually interested in emitting more than just round trips. We'll probably want to output TLS handshake, connection open, and DNS resolution times.
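
Not speaking for the actual implementation, but a rough sketch of how those extra phase timings could be captured in Go is net/http/httptrace; the target URL below is a placeholder:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptrace"
	"time"
)

// Collect per-phase timings (DNS, connect, TLS) for a single probe.
func main() {
	var dnsStart, connStart, tlsStart time.Time
	var dnsDur, connDur, tlsDur time.Duration

	trace := &httptrace.ClientTrace{
		DNSStart:          func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
		DNSDone:           func(httptrace.DNSDoneInfo) { dnsDur = time.Since(dnsStart) },
		ConnectStart:      func(network, addr string) { connStart = time.Now() },
		ConnectDone:       func(network, addr string, err error) { connDur = time.Since(connStart) },
		TLSHandshakeStart: func() { tlsStart = time.Now() },
		TLSHandshakeDone:  func(tls.ConnectionState, error) { tlsDur = time.Since(tlsStart) },
	}

	// Placeholder target; in practice this would be a peer's ping endpoint.
	req, err := http.NewRequest("GET", "https://example.com/ping", nil)
	if err != nil {
		panic(err)
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	start := time.Now()
	resp, err := http.DefaultTransport.RoundTrip(req)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()

	fmt.Printf("dns=%v connect=%v tls=%v total=%v\n", dnsDur, connDur, tlsDur, time.Since(start))
}
```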

Hey @thegedge thanks a lot for putting this forward.

TL;DR: We really like this idea, and would absolutely welcome the contribution.

A few thoughts, in no particular order:

  • I like how this representation allows for a compact view of clusters larger than what can easily be viewed as a graph,
  • there are probably a lot of tweaks that could be made to it, and that could probably be configurable to suit various use cases. Some ideas that spring to mind:
    • using a continuous scale, instead of bucketing (it would look more like a height map; see the sketch after this list)
    • using grayscale for the whole thing (to allow exporting very small images)
    • wondering if it would be interesting to produce animations that show evolution over time
  • I assume that the image is an artist's impression, otherwise shouldn't the diagonal be consistently green ?
  • measuring TLS handshake, connection open, and DNS resolution times would all expand goldpinger's spectrum of utility, so they're a great idea.
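
To illustrate the continuous-scale idea from the list above, here is a small Go sketch that maps a round trip onto a green-to-red gradient instead of three buckets; the 500 ms ceiling is an arbitrary assumption:

```go
package main

import (
	"fmt"
	"time"
)

// continuousColor maps a round-trip time onto a continuous green-to-red
// gradient instead of fixed buckets. The max value is an illustrative
// assumption, not something from this thread.
func continuousColor(rtt, max time.Duration) string {
	t := float64(rtt) / float64(max)
	if t > 1 {
		t = 1
	}
	r := uint8(255 * t)       // more red as latency grows
	g := uint8(255 * (1 - t)) // less green as latency grows
	return fmt.Sprintf("#%02x%02x00", r, g)
}

func main() {
	max := 500 * time.Millisecond
	for _, rtt := range []time.Duration{10 * time.Millisecond, 120 * time.Millisecond, 600 * time.Millisecond} {
		fmt.Printf("%v -> %s\n", rtt, continuousColor(rtt, max))
	}
}
```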

So to answer your question: yes please. What can we do to help with that ?

using a continuous scale

I've discussed this with my team, and it's definitely another possibility. The hard thing about it (EDIT: "hard" here meaning configurability, in case someone wanted bucketing and someone else wanted continuous) is that these are just static files, so the best I think we'll be able to do is perhaps put some JS constants at the top of the file that people could tweak for their own preferences. Another option would be to set up a make target to compile the static files from templates.
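
For the make-target option, a hypothetical generator along these lines could bake the tweakable constants into a generated JS file; the setting names and defaults here are illustrative only, not part of any existing code:

```go
package main

import (
	"os"
	"text/template"
)

// Hypothetical generator that a make target could run to emit a config.js
// with user-tweakable display settings, instead of hand-editing constants
// in the static assets.
type uiConfig struct {
	GreenThresholdMs   int
	YellowThresholdMs  int
	UseContinuousScale bool
}

var configJS = template.Must(template.New("config").Parse(
	"// generated file - do not edit by hand\n" +
		"const GREEN_THRESHOLD_MS = {{.GreenThresholdMs}};\n" +
		"const YELLOW_THRESHOLD_MS = {{.YellowThresholdMs}};\n" +
		"const USE_CONTINUOUS_SCALE = {{.UseContinuousScale}};\n"))

func main() {
	// A make target would redirect this output to e.g. static/config.js.
	if err := configJS.Execute(os.Stdout, uiConfig{
		GreenThresholdMs:   100,
		YellowThresholdMs:  500,
		UseContinuousScale: false,
	}); err != nil {
		panic(err)
	}
}
```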

We have lots of clusters, so we're actually planning on having a side service to persist/aggregate all of the data, and present a global view.

wondering if it would be interesting to produce animations that show evolution over time

Unfortunately, that would mean persisting the data, or having the JS keep some of it in memory. I'm sure this wouldn't be terribly difficult, but likely outside of the scope of what we'll be doing in goldpinger.

I assume that the image is an artist's impression, otherwise shouldn't the diagonal be consistently green ?

Actually, this is live data from one of our own clusters (minus the black cross, which was an artificially introduced failure). I was pretty surprised also that the diagonal wasn't green. FYI green in that image would mean <100 ms round trip.

So to answer your question: yes please. What can we do to help with that ?

We already have this running internally, with our own fork of goldpinger :)

I'll polish it up a bit, and then get some PRs rolling.

I've discussed this with my team, and it's definitely another possibility. The hard thing about it (EDIT: "hard" here meaning configurability, in case someone wanted bucketing and someone else wanted continuous) is that these are just static files, so the best I think we'll be able to do is perhaps put some JS constants at the top of the file that people could tweak for their own preferences. Another option would be to set up a make target to compile the static files from templates.

I'm not sure I understand that bit. I initially thought it was an actual image being produced - do you mean it's dynamically built HTML + CSS ? Or do you mean SVG or equivalent ?

We could probably just have a dropdown in the top bar of the UI that allows you to pick some options ? Or something along these lines ?

Actually, this is live data from one of our own clusters (minus the black cross, which was an artificially introduced failure). I was pretty surprised also that the diagonal wasn't green. FYI green in that image would mean <100 ms round trip.

This is intriguing. I'd probably assume that something's seriously wrong, if a ping to localhost takes >100ms.

We already have this running internally, with our own fork of goldpinger :)

I'll polish it up a bit, and then get some PRs rolling.

Very sweet, looking forward to taking it for a spin !

I'm not sure I understand that bit. I initially thought it was an actual image being produced - do you mean it's dynamically built HTML + CSS ? Or do you mean SVG or equivalent ?

Yep, it's an <svg> that gets populated by d3.js, using data from the /check_all endpoint.

We could probably just have a dropdown in the top bar of the UI that allows you to pick some options ? Or something along these lines ?

I'll just make these settings be JS variables with hardcoded values for the first PR, with (eventually) a follow-up PR that adds in some UI elements to configure them. How does that sound?

This is intriguing. I'd probably assume that something's seriously wrong, if a ping to localhost takes >100ms.

Agreed, and I'm looking into that. I'm thinking the problem could potentially be the use of wall-clock time with so many goroutines running, although I've seen ~100ms timings from something as simple as echo 'test' | nc localhost 8080 within the container. Maybe a combination of scheduling and some general slowness in the server. Definitely needs some more digging.
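
For reference, a rough Go equivalent of that nc check, timing a single local connect/write/read using time.Now/Since (which rely on the monotonic clock); the port is the same placeholder as above:

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// Mirrors `echo 'test' | nc localhost 8080`: time a local connect, a small
// write, and the first line of the reply. The measurement is immune to
// wall-clock adjustments, though not to scheduling delays.
func main() {
	start := time.Now()

	conn, err := net.Dial("tcp", "localhost:8080")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	fmt.Fprintln(conn, "test")
	if _, err := bufio.NewReader(conn).ReadString('\n'); err != nil {
		panic(err)
	}

	fmt.Println("local round trip:", time.Since(start))
}
```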

Hey @thegedge just checking in how that's going ? Do you need any assistance ?

Sorry for the lack of communication, @seeker89. I've been caught up in some changes for our own internal project, which have been keeping me busy.

Unfortunately this means we are no longer using goldpinger, but I did want to make this visualization available to the project. You can find it here: master...Shopify:add-latency. There's still some work to be done, so I'll hand it off there to someone who would like to take it to the finish line.

That's a real shame. What did you decide to build instead ?

We ended up rebuilding a stripped down pinger, without all the bells and whistles (no API, no swagger, no static file serving). Now we're focused on a federated Prometheus cluster, with a central dashboard to combine all of this data across clusters in a useful visualization (likely something similar to the screenshot I posted above).

Honestly, the primary reason for us making this move is the ability to move faster. Maintaining a fork with internal, experimental, and public work would be too much friction right now.

One other finding I can share: we had very low CPU requirements set up in k8s, so our round-trip times were way off (a combination of goroutine scheduling + cgroups throttling). A simple change that dramatically improved our timing was to do the pings serially instead of spawning goroutines for all of the pings at once. Eventually I plan on staggering the pings, but for now doing everything in serial mostly results in good timings.
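
A minimal sketch of that serial approach, with an optional stagger between probes; the target URLs and the 50 ms gap are placeholders, not the actual implementation:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// pingSerially probes each target one at a time instead of spawning a
// goroutine per target, so a single burst of goroutines can't skew the
// timings under CPU throttling.
func pingSerially(targets []string) {
	for _, target := range targets {
		start := time.Now()
		resp, err := http.Get(target)
		if err != nil {
			fmt.Printf("%s: error: %v\n", target, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s: %v\n", target, time.Since(start))

		// Optional stagger: spread probes out rather than back-to-back.
		time.Sleep(50 * time.Millisecond)
	}
}

func main() {
	// Placeholder peer addresses.
	pingSerially([]string{"http://10.0.0.1:8080/ping", "http://10.0.0.2:8080/ping"})
}
```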

We ended up rebuilding a stripped down pinger, without all the bells and whistles (no API, no swagger, no static file serving).

I would be curious to know why the bells and whistles were a problem ? They don't really add much overhead in any meaningful way, do they ?

Now we're focused on a federated Prometheus cluster, with a central dashboard to combine all of this data across clusters in a useful visualization (likely something similar to the screenshot I posted above).

That's something the community could definitely benefit from. If you keep the Prometheus metrics compatible with goldpinger's, maybe we could reuse the same dashboard !
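
For example (purely hypothetical metric name and labels, not goldpinger's actual schema), keeping the histogram name, labels, and buckets identical on both sides is what would let a single dashboard serve either exporter:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pingRTT is a sketch of a ping-latency histogram; the name and label set
// here are placeholders, chosen only to illustrate the shape of the metric.
var pingRTT = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "pinger_peer_response_time_seconds", // hypothetical name
		Help:    "Round-trip time of pings to peer nodes.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"host"},
)

func main() {
	prometheus.MustRegister(pingRTT)
	pingRTT.WithLabelValues("10.0.0.1").Observe(0.042) // example observation

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```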

Either way, good luck, and keep rocking!