This repo has benchmarks and tests. To run either, first download dependencies with
$ mix deps.get
The tests can be run with the standard mix test command. Benchmarks are run with
$ mix run bench.exs
- Please describe your solution to get_id and why it is correct i.e. guaranteed globally unique.
The id is generated by combining the 42 bits of the timestamp with the node id
(11 bits) and the serial (11 bits), for 64 bits total. The timestamp gives us
one unique id per millisecond. Adding the node id gives one unique id per node
per millisecond. The serial distinguishes values generated on a node during the
one-millisecond period. The default is to use the Counter agent, which simply
returns sequential numbers until it overflows the 11-bit value and resets.
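As a sketch of the bit layout described above (the IdSketch module and pack_id function are illustrative names, not the repo's actual API):

```elixir
defmodule IdSketch do
  import Bitwise

  @node_bits 11
  @serial_bits 11

  # Pack the three fields into one 64-bit integer:
  # [ 42-bit timestamp | 11-bit node id | 11-bit serial ]
  def pack_id(timestamp_ms, node_id, serial) do
    ts = timestamp_ms <<< (@node_bits + @serial_bits)
    node = node_id <<< @serial_bits
    bor(bor(ts, node), serial)
  end
end

IdSketch.pack_id(1, 0, 0)  # => 4_194_304 (1 <<< 22)
```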
- Please explain how your solution achieves the desired performance i.e. 100,000 or more requests per second per node. How did you verify this?
I've created a simple benchmark that compares the performance of the solution with some baseline cases. On my laptop, the solution performs in the neighborhood of 422K ops/second. Actual performance will depend significantly on the production environment.
One possible wrinkle is that the solution serializes requests when fetching the serial number on each node. As parallelism goes up, performance goes down. There are ways to mitigate this performance impact, but the "correct" way to address this issue also really depends on what the final application/environment looks like.
- Please enumerate possible failure cases and describe how your solution correctly handles each case. How did you verify correctness?
- How do you manage uniqueness after a node crashes and restarts?
Provided that the node doesn't crash and restart within a single millisecond, the advancing timestamp should ensure that new ids are unique.
There is a possible failure case where the Counter agent crashes and restarts
within a single millisecond. If the crash and restart happen while the counter
value is low, there is a possibility that a duplicate id will be generated.
While possible, this failure case is extremely unlikely. The agent process is
very unlikely to crash to begin with, and it would have to be supervised to be
automatically restarted. The risk could be mitigated by having the process sleep
for a millisecond on startup. This would give the timestamp time to advance
before handing out the next id.
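A minimal sketch of that mitigation, assuming a bare Agent-based counter (the SleepyCounter name is hypothetical; the repo's Counter agent would be analogous):

```elixir
defmodule SleepyCounter do
  use Agent

  @max 2048  # 11-bit serial space: values 0..2047

  # Sleep one millisecond before serving any serials, so that even if the
  # agent crashed and was restarted within the same millisecond, the
  # timestamp has advanced before low serial values are handed out again.
  def start_link(_opts \\ []) do
    Agent.start_link(fn ->
      Process.sleep(1)
      0
    end, name: __MODULE__)
  end

  # Return the next serial, wrapping back to zero at 2048.
  def next do
    Agent.get_and_update(__MODULE__, fn n -> {n, rem(n + 1, @max)} end)
  end
end
```

The cost is a one-millisecond delay on (re)start, which is negligible given how rarely the agent should restart.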
- How do you manage uniqueness after the entire system fails and restarts?
The failure case for the entire system failing looks very much like the failure
case for a single node. As long as the cluster takes more than a millisecond to
restart, then the timestamp will increment and new ids will be unique. The
mitigation of this corner case is the same, which is having the Counter agent
sleep for a millisecond on startup.
- How do you handle software defects?
I'm using a layered approach to ensuring the quality of the code. The first layer is using Credo to lint the code for obvious problems. The second layer is using Dialyzer to verify types and ferret out subtle errors. Both Credo and Dialyzer run cleanly on the current code.
The final layer is a test suite using property-based tests. The code is quite simple, and I haven't done any coverage analysis. If this were "real" code, I would have measured test coverage.
- Other failure cases
There is an interesting failure case when the code is running on a very, very fast processor. The requirements call for 100 unique ids per millisecond, and this code is capable of generating 2048 unique ids per millisecond. However, if run on a sufficiently fast processor, with no limit on the number of ids requested per second, then the serial number will overflow and duplicate ids will be generated. Aside from the obvious rate-limiting to mitigate this failure mode, another interesting solution is possible.
The requirements state that node ids are in the range 0..1024 inclusive. This amounts to a total of 1025 possible values, requiring 11 bits to represent. If the requirements for node ids were reduced to 1024 total values, then the node id could be represented in 10 bits, leaving 12 bits for the serial number. This would expand the serial space to the point where it is quite unlikely that any system produced in the foreseeable future would be able to exhaust serial numbers in a millisecond.
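The bit-width arithmetic can be checked directly (bits_for is just an illustrative helper):

```elixir
# Bits needed to represent n distinct values (0..n-1).
bits_for = fn n -> (n - 1) |> Integer.digits(2) |> length() end

bits_for.(1025)  # node ids 0..1024 -> 11 bits
bits_for.(1024)  # node ids 0..1023 -> 10 bits
```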
There is also a potential performance issue as more requests are made in
parallel. The benchmark runs requests serially by default, but as more parallel
requests are added, performance suffers. This is likely due to contention for
the Counter agent. (Interestingly, the strong_random case shows even worse
performance degradation, suggesting that there is resource contention in the
OpenSSL library.) One possible way of dealing with this degradation is to run
multiple agents handing out serial numbers from non-overlapping domains. You
could, for example, run two agents that each hand out half of the 2048-value
serial space (about 1,000 serials apiece) without having to worry about
synchronizing them. You would have to ensure that requests were relatively
evenly balanced between the two, but this should help relieve contention in
highly parallel environments.
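A sketch of the non-overlapping-domains idea, assuming two Agent-backed partitions over the 11-bit serial space (the PartitionedCounter module and its function names are hypothetical):

```elixir
defmodule PartitionedCounter do
  use Agent

  @serial_space 2048  # the 11-bit serial space

  # Partition i of n owns the range [i * size, (i + 1) * size), so the
  # agents never hand out overlapping serials and never need to coordinate.
  def start_link({partition, partitions}) do
    size = div(@serial_space, partitions)
    base = partition * size
    Agent.start_link(fn -> {base, base, size} end, name: name(partition))
  end

  # Return the next serial from this partition's range, wrapping within it.
  def next(partition) do
    Agent.get_and_update(name(partition), fn {n, base, size} ->
      {n, {base + rem(n + 1 - base, size), base, size}}
    end)
  end

  defp name(partition), do: :"#{__MODULE__}_#{partition}"
end
```

A caller could pick a partition with something like :erlang.phash2(self(), 2) to spread requests roughly evenly across the two agents.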