Gossip Glomers are a series of challenges to build familiarity with distributed systems created by Fly.io and Kyle Kingsbury. This repo contains my solutions to these challenges.
The first challenge is a simple echo worker to test the capabilities of the Maelstrom tool.
A UUID library may satisfy this requirement, however if IDs need to be serial, then a timestamp variation such as Snowflake may be appropriate.
The first two parts (3a and 3b) of this challenge establish a network of broadcast nodes by implementing a crude gossip protocol. Each node informs its neighbors of new incoming message IDs. If a node receives word of a message it's already received, it ignores it.
Note: Gossiping is a common way to propagate information across a cluster when you don't need strong consistency guarantees.
The third part of this challenge involves making this system tolerant to network partitions by gracefully handling timeouts and failed broadcasts. My gut instinct is to switch from Send
ing a broadcast to neighbors to using an RPC instead with a timeout context. On failure or timeout, we should queue up an action to retry this after a slight delay with some pseudo-random jitter to prevent piling up outgoing requests.
RPCs can be sent through the SyncRPC()
method, so to avoid blocking, we can also set up a pool of worker goroutines to accept messages to send off of a channel. The SyncRPC()
method also allows us to supply a context.WithTimeout()
, which I've set to 100ms.
For 3d, I introduced a batching method to periodically sync with neighbor nodes, and for 3e (still in progress), I added some jitter to the timer for each node to prevent all of the messages getting sent at the same time. My best case was a medium stable latency of 1.668 seconds, which is still really high.
My first attempt was to use an append-only log, tracking inserted deltas with unique IDs in batches for each neighbor node. Once things were synched, we could calculate the total by traversing the log and summing up each delta once per UUID. This appeared to work, but I switched it to try using the Maelstrom KV store recommended in the challenge description. By using the KV.CompareAndSwap()
method and retrying on failure, it eventually reached consistency.
TODO: Investigate state-based conflict-free replicated data types as an alternative.
In progress...