GlobalId Exercise for Shoreline

This repo has benchmarks and tests. To run either, download dependencies with

$ mix deps.get

Tests can be run with the standard mix test command. Benchmarks are run with

$ mix run bench.exs

FAQ

  1. Please describe your solution to get_id and why it is correct i.e. guaranteed globally unique.

The id is generated by combining a 42-bit timestamp with an 11-bit node id and an 11-bit serial, for 64 bits in total. The timestamp gives us one unique id per millisecond. Adding the node id gives one unique id per node per millisecond. The serial distinguishes values generated on a node within a single millisecond period. The default is to use the Counter agent, which simply returns sequential numbers until it overflows the 11-bit value and resets.
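To make the layout concrete, here is an illustrative sketch in Python (the repo itself is Elixir; the pack_id name and the assertions are inventions of this sketch, not the repo's API):

```python
# 64-bit id layout: 42-bit ms timestamp | 11-bit node id | 11-bit serial.
TIMESTAMP_BITS = 42
NODE_BITS = 11
SERIAL_BITS = 11

def pack_id(timestamp_ms: int, node_id: int, serial: int) -> int:
    """Shift each field into its slot and OR them together."""
    assert 0 <= timestamp_ms < (1 << TIMESTAMP_BITS)
    assert 0 <= node_id < (1 << NODE_BITS)
    assert 0 <= serial < (1 << SERIAL_BITS)
    return (timestamp_ms << (NODE_BITS + SERIAL_BITS)) \
        | (node_id << SERIAL_BITS) \
        | serial
```

Two ids produced by the same node in the same millisecond differ only in the serial field, which is why exhausting the serial space is the failure mode to watch.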

  2. Please explain how your solution achieves the desired performance i.e. 100,000 or more requests per second per node. How did you verify this?

I've created a simple benchmark that compares the performance of the solution with some baseline cases. On my laptop, the solution performs somewhere in the neighborhood of 422K ops/second. Actual performance will depend significantly on the production environment.

One possible wrinkle is that the solution serializes requests when fetching the serial number on each node. As parallelism goes up, performance goes down. There are ways to mitigate this performance impact, but the "correct" way to address this issue also really depends on what the final application/environment looks like.

  3. Please enumerate possible failure cases and describe how your solution correctly handles each case. How did you verify correctness?
  • How do you manage uniqueness after a node crashes and restarts?

Provided that the node doesn't crash and restart within a single millisecond, the advancing timestamp should ensure that new ids are unique.

There is a possible failure case where the Counter agent crashes and restarts within a single millisecond. If the crash and restart happens while the counter value is low, there is a possibility that a duplicate id will be generated. While possible, this failure case is extremely unlikely. The agent process is very unlikely to crash to begin with, and it would have to be supervised to be automatically restarted. The risk could be mitigated by having the process sleep for a millisecond on startup. This would give the timestamp time to advance before handing out the next id.
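The wrap-around and restart behaviour can be sketched with a toy counter (Python for illustration; the real implementation is an Elixir Agent, and this Counter class is a hypothetical stand-in):

```python
SERIAL_MAX = 1 << 11  # 2048 serials per node per millisecond

class Counter:
    """Toy stand-in for the Counter agent: sequential serials, wrapping at 2^11."""
    def __init__(self):
        # A crash-and-restart begins at 0 again -- the source of the
        # (unlikely) duplicate-id risk within a single millisecond.
        self.value = 0

    def next(self) -> int:
        serial = self.value
        self.value = (self.value + 1) % SERIAL_MAX
        return serial
```

Sleeping a millisecond in the constructor would be the sketch-level analogue of the mitigation described above: by the time the restarted counter hands out 0 again, the timestamp has advanced.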

  • How do you manage uniqueness after the entire system fails and restarts?

The failure case for the entire system failing looks very much like the failure case for a single node. As long as the cluster takes more than a millisecond to restart, then the timestamp will increment and new ids will be unique. The mitigation of this corner case is the same, which is having the Counter agent sleep for a millisecond on startup.

  • How do you handle software defects?

I'm using a layered approach to ensuring code quality. The first layer is using Credo to lint the code for obvious problems. The second layer is using Dialyzer to verify types and ferret out subtle errors. Both Credo and Dialyzer run cleanly on the current code.

The final layer is a test suite using property-based tests. The code is quite simple, and I haven't done any coverage analysis. If this were "real" code, I would have measured test coverage.

  • Other failure cases

There is an interesting failure case when the code is running on a very, very fast processor. The requirements call for 100 unique ids per millisecond, and this code is capable of generating 2048 unique ids per millisecond. However, if run on a sufficiently fast processor, with no limit on the number of ids requested per second, then the serial number will overflow and duplicate ids will be generated. Aside from the obvious rate-limiting to mitigate this failure mode, another interesting solution is possible.

The requirements state that node ids are in the range 0..1024 inclusive. This amounts to a total of 1025 possible values, requiring 11 bits to represent. If the requirements for node ids were reduced to 1024 total values, then the node id could be represented in 10 bits, leaving 12 bits for the serial number. This would expand the serial space to the point where it is quite unlikely that any system produced in the foreseeable future would be able to exhaust serial numbers in a millisecond.
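The bit budget can be checked directly (an illustrative Python snippet; the 10/12 split is the hypothetical alternative described above, not what the code currently does):

```python
# Current requirement: node ids 0..1024 inclusive, i.e. 1025 values.
assert (1024).bit_length() == 11   # the max node id, 1024, needs 11 bits
assert (1023).bit_length() == 10   # capping node ids at 0..1023 needs only 10

# Reclaiming that bit for the serial doubles the per-millisecond serial space.
assert 42 + 10 + 12 == 64          # the layout still fits in 64 bits
assert 1 << 12 == 4096             # vs. 2048 serials with the 11-bit layout
```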

There is also a potential performance issue as more requests are made in parallel. The benchmark runs requests serially by default, but as more parallel requests are added the performance suffers. This is likely due to contention for the Counter agent. (Interestingly, the strong_random case shows even worse performance degradation, suggesting that there is resource contention in the OpenSSL library.) One possible way of dealing with this degradation is to run multiple agents handing out serial numbers from non-overlapping domains. You could, for example, run two agents handing out about 1,000 serials each from the 2,048-value serial space without having to worry about synchronizing them. You would have to ensure that requests were relatively evenly balanced between the two, but it should help relieve contention in highly parallel environments.
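The non-overlapping-domains idea can be sketched as follows (toy Python; in the repo each partition would presumably be its own Agent, and the PartitionedCounter name is invented here for illustration):

```python
SERIAL_SPACE = 1 << 11  # 2048 serials per node per millisecond

class PartitionedCounter:
    """Hands out serials from one non-overlapping slice of the serial space."""
    def __init__(self, partition: int, partitions: int):
        self.size = SERIAL_SPACE // partitions
        self.base = partition * self.size  # start of this slice
        self.offset = 0

    def next(self) -> int:
        serial = self.base + self.offset
        self.offset = (self.offset + 1) % self.size
        return serial
```

Two such counters can never produce the same serial, so they need no coordination between them; the caller only has to spread requests reasonably evenly across the partitions.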