The idea behind this project is to share the little I learned about GIUD implementations and provide education to others.
Disclaimer: Although I carry good security foundations, I do not consider myself a cryptography expert, please take this data with a grain of salt and build your own opinions.
GUID (Globally Unique Identifier) are generators for identifiers typically used as primary keys in databases. GUID solve problems typically inherent to three tiers architectures, more precisely:
- Scalability - growing hardware vertically has its limits, soon or later multiple instances are required to sustain the load. Centralized id generators have a significant (negative) impact on performance. GUID typically allow distributing generators across multiple instances.
- Resilience - the ability to run multiple instances of a generator is a great way to improve its availabliity and avoid a single point of failure.
- Security - sequences are trivial to predict, in a distributed world different sources of entropy are used to guarantee uniqueness via a random, difficult to predict (long) string.
Sequences is a RDBMS centric solution to ease primary key generation.
DB SEQ | GUID | |
---|---|---|
Bit length | 32 or 64 bits | typically 128+ bits |
Format | number | Strings, with the exception of UUID, which has native binary support in many databases |
Compute load | Trivial +1 but typically runs a single instance or requires coordination | distributed generation & complex: can require IO, random sources and hash calculations, coordination. |
Collisions detection | Trivial and predictable: a database sequence collisions once it overflows | Much harder to predict |
Indexing at rest | trivial: B-TREE indexes were engineered with sequences in mind | Opaque GUID typically performs poorly as database indexes due to their randomness. Time based GUID encode the date as part of the id to get around this limitation. |
There are mainly two categories of GUIDs: Opaque and time-based. Time based GUID trade a few bits of entropy to encode a date, they are therefore more efficient to index at the cost of a shorter randomness range.
Guess Protection | Indexing at REST | Implementations | |
---|---|---|---|
Opaque GUID | hard | poor | ie. Nanoid, CUID2, UUIDv4 |
Time-based GUID | Leaks dates for a shorter randomness | good | ie. ULID, KSUID, UUIDv7 |
The table below provides a comprehensive comparison between the major GUID implementations.
Please click on the image to browse an editable version
Disclaimer, these benchmark are performed against a given architecture (JS/Arm), different implementations/runtimes will likely give different results.
Platform info:
==============
Darwin 22.2.0 arm64
Node.JS: 18.12.1
V8: 10.2.154.15-node.12
CPU: Apple M1 Pro × 8
Memory: 16 GB
Running "GUID Benchmark" suite...
Progress: 100%
uuid v1:
4 460 889 ops/s, ±0.57% | 63.99% slower
uuid v4:
8 870 032 ops/s, ±0.54% | 28.41% slower
uuid v7:
443 205 ops/s, ±0.82% | 96.42% slower
nanoid:
5 300 343 ops/s, ±0.48% | 57.22% slower
cuid:
347 037 ops/s, ±1.15% | 97.2% slower
cuid2:
60 490 ops/s, ±0.52% | slowest, 99.51% slower
ulid (monotonic):
12 389 427 ops/s, ±0.64% | fastest
xid:
3 417 174 ops/s, ±0.41% | 72.42% slower
ksuid:
532 223 ops/s, ±1.02% | 95.7% slower
Finished 9 cases!
Fastest: ulid (monotonic)
Slowest: cuid2
At this point you should have a pretty good idea of the implementations out there and how they compare. Let us put things in context.
Some are, but mostly in (old) browsers... and that's a really good news, because ID generation typically takes place Server side, where implementations are stable and uniform.
Machines run algorithms that approximate randomness by selecting algorithms and trusted entropy sources. A cryptographically secure pseudo-random number generator (CSPRNG) is a algorithm with properties that make it suitable for use in cryptography. On Linux for instance, the kernel gathers noisy data from various devices and transfers them to an internal pool of entropy.
Unfortunate, not all cryptography stacks are equal. For instance, old versions of CSPRNG are famous for their poor implementations in JavaScript or PHP. Algorithms such as SHA1 also proved to be unsecured but are still used today in some GUID generators.
While ensuring that a GUID generator is cryptography certified is important, an ID generator is nothing in view of a proper security architecture.
The answer depends on the use case and the specifics of a GUID implementation
With that said, the advertised Math is probably not a good reflection of reality: In an application, a GUID is used to create a multitude of different entities. These entities by definition do not overlap with one another and are therefore in their own space, further reducing the chances of collisions.
It all depends on requirements.
From a legal standpoint, some will argue that the ID generation date is PII data. While this classification could not be validated, a lawyer friend of mine found the argument difficult to defend in a court without considering the complete security architecture.
From a security standpoint, it depends on your use case, but very unlikely. Zendesk recently wrote an article on the topic, recommended.
This section helps clarify some of the terms we used in this document.
A hash collision is a random match in hash values that occurs when a hashing algorithm produces the same hash value for two distinct pieces of data.
The communication between the different elements of a complex activity so as to enable them to work together effectively.
In Cryptography, refers to the randomness collected by a system for use in algorithms that require random seeds. (jargon word for "randomness")
A monotonic generator ensures ids are always sorted in order. This term is particularly relevant to date based guid generators which are based on time. For a given instance, at instant T, the next id should always be sorted.
An opaque identifier is one that doesn't expose its inner details or structure.
Pagination is a process that is used to divide a large data into smaller discrete pages. Commonly used in customer facing applications, pagination requires the notion of pointers on the dataset.
A random number generator is predictable if, after observing some of its “random” output, we can make accurate predictions about what “random values” are coming up next.
Randomness (entropy) is the cornerstone of cryptography. The more random the numbers, the more secure the cryptographic system. The challenge then, becomes one of generating true randomness.
Feel free to fork this repo and adjust the table content on google sheet. Contribution and discussions are welcome!