/cid

❄️cid ("content id") is a human-friendly (readable/typeable) unique ID function built for distributed/decentralised systems.

Primary LanguageElixirGNU General Public License v2.0GPL-2.0

excid

cid ("content id") is a human-friendly unique ID function used by mobile-first/distributed apps.

Snowflake-intro-image

GitHub Workflow Status codecov.io Hex.pm contributions welcome HitCount

Why?

We needed a way of running our App(s) across multiple servers/devices without fear/risk of collisions in the IDs of records.

How?

  1. Add excid to your list of deps in your mix.exs file:
def deps do
  [
    {:excid, "~> 1.0.1"}
  ]
end

then run mix deps.get to download.

  1. Define in your mix configuration which base you want to use to create the cid. The value of the base should be the either :base32 or :base58. If the base is not defined then excid will use the base32 by default.
config :excid, base: :base32
  1. Call the cid/1 function with a String, Map, or Struct as the single argument:
Cid.cid("hello")
"zb2rhZfjRh2FHHB2RkHVEvL2vJnCTcu7kwRqgVsf9gpkLgteo"

Cid.cid(%{key: "value"})
"zb2rhn1C6ZDoX6rdgiqkqsaeK7RPKTBgEi8scchkf3xdsi8Bj"

If the argument is not one of the ones listed above then the function will return "invalid data type".

Cid.cid([])
"invalid data type"

What?

The goal is to allow records to be created offline (e.g: on a mobile device) with the ID generated on the client and when the record is synched/saved to the server (database) it can be verified and the ID does not need to change. Furthermore because the ID is based on a cryptographic hash there is virtually zero chance of collision.

We could have used "Ye olde" UUID (random/nondeterministic) as the IDs and achieve the desired collision-avoidance, but we felt we could do much better and allow us to build a "checksum" mechanism into our app that will verify records created by any device to allow an effortless distributed/decentralised system.

This may all sound "esoteric" if you have not yet built an "offline-first" application. Our hope is that this README.md will clarify our reasoning for "reinventing the wheel".

Context

In a "basic" Relational Database (e.g: PostgreSQL, MySQL, MariaDB, etc.) the default way of identifying records is using an auto-incrementing integer: 1, 2, 3, etc. This is the optimal way of referencing records in a single (database) server setup where the counter for the number of records in a given table only needs to be known by one machine and there is no chance of conflict. e.g. if there are currently 100 blog posts in the database, the server will assign the next blog post the id 101; there is no chance that it will create a new post with id 99 which would conflict with the post you wrote yesterday.

Centralised = Single Point of Failure

As users of software or beginners building apps, we don't often think about failure. The reality however, is that systems fail all-the-time. All the most popular systems/services fail even Google who have thousands of world-class engineers dedicated to keeping the system up.

Most Apps are designed to be centralised because it's much easier to get started and for the most part there won't be any issues.

The problem comes when something in the system (e.g: the database) fails.

centralised-vs-distributed

If the single databases server handling all the requests fails, all users are affected by the outage for as long as it takes the team to revive it.

Having witnessed Distributed Denial of Service ("DDoS") attacks first-hand, we have experienced the pain of having to recover crashed servers with corrupted mission-critical customer data; it's bleak. Yes, there are services like CloudFlare or Imperva which promise to mitigate against DDoS attacks, however the reality is that they are only providing "frontend" protection; if for any reason your single-server database was to crash, your app will still be out-of-action regardless of having CloudFlare.

Why Decentralise?

If you have never had the experience of being offline or the service you are using being interrupted, then you either live in hyper-connected Paolo Alto (with backup/redundant networks and city-wide WiFi) or simply don't use the Internet "enough" to notice the outages.

All the work we do depends on having access to the Internet. We need to systematically reduce that dependency.

Network and hardware fault-tolerance is essential for many apps and enables a whole new "class" of apps to be created.

Specifically applications that are "federated". see: https://en.wikipedia.org/wiki/Federated_architecture

The Apps that we (@dwyl) are creating must be decentralised; there cannot be a single point of failure.

Decentralisation is not just "philosophical" argument, as creative technologists we are directly responsible for the technology we create. The lives of billions of people are at stake if we continue to allow the centralised control of our communication networks.

If you believe in the universal human right to privacy [Article 12] freedom from oppression and the Golden Rule, then logically this is the only thing to do.

Who?

Anyone who is techno-curious about the future of the Internet and wants to understand the way decentralised applications derive the IDs for content.

We feel that most apps can benefit from being decentralised/distributed by default because it means they work "offline" when any element fails and data can easily be "synched" (and verified) when connection is re-established.

If you want to build a mobile/offline-first progressive mobile web app (PWA) that feels native on both Android and iOS, then understanding CIDs is a good place to start.

If you are building apps that will use a single database instance for whatever reason (e.g: they aren't very "complex" or don't need to be distributed or work offline-first) keep enjoying the simplicity and maybe come back to this later when you feel you need this functionality.

What?

In a distributed database, we need a way of creating IDs for the records without any risk of "collision". We also need a consistent way of creating IDs both on the server and on the client (to allow for offline-first distributed apps).

Why Not Use UUIDs?

There are many ways of creating unique IDs, the most popular has historically been UUID (Universally Unique Identifier) https://en.wikipedia.org/wiki/Universally_unique_identifier

A UUID is a 128-bit number usually represented as base16 (hexadecimal) for example:

85594564-5be7-465f-b007-0fada384ed44

(via https://www.uuidgenerator.net )

Consider the following URL (featuring a UUID as the id of a record):

location-app.com/venues/85594564-5be7-465f-b007-0fada384ed44

It doesn't exactly roll off the tongue. 🙄

append-only log.

One of our ... readable/typeable1

123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz

With a 56 characters in the set, we can create IDs of the following lengths:

  • 2: 56^2 = 3,136
  • 3: 56^3 = 175,616
  • 4: 56^4 = 9,834,496
  • 5: 56^5 = 550,731,776 ...
  • 22:

Crucially, we can create a system where the IDs start out at 2 characters and increase gradually as required (similar to how Instagram grew their IDs as they scaled, see below).

Real world usage

cid from a String

A real world use case for wanting a cid based on a String is a URL shortener. For example you have a long URL: https://github.com/dwyl/phoenix-ecto-append-only-log-example the cid of this url is:

require Cid

Cid.cid("https://github.com/dwyl/phoenix-ecto-append-only-log-example") # > "zb2rhjLfmD2trAmpEi3ZJSpFtuNokCrpfFido7QHCQRVnGoZW"

We can then create a URLs table in our URL shortening app/service with the following entry:

inserted_at URL (PK) cid short
1541609554 https://github.com/dwyl/phoenix-ecto-append-only-log-example zb2rhjLfmD2trAmpEi3ZJSpFtuNokCrpfFido7QHCQRVnGoZW zb2

So the "short" url would be dwyl.co/zb2

This is a relatively "boring" but still perfect valid use case. If someone attempts to create a short URL for this (same) long URL, the URL shortening app will simply return dwyl.co/zb2 the same short URL each time.

The reason we can abbreviate the URL to just zb2 is because our SHORT URL service has a centralised Database/store. If we wanted to run a decentralised content addressing system, we would simply link to the full cid: dwyl.co/zb2rhjLfmD2trAmpEi3ZJSpFtuNokCrpfFido7QHCQRVnGoZW

Where the chance of cid collision is less than 1 in "the number of atoms in the Universe". If we generated 1 Billion CIDs per second for the next Trillion years there would still be less than a 0.001% chance of collision.3

Tests

The tests for this module are a combination of doctests, unit tests and property based tests.

To run the property based tests you will need to install IPFS. See https://github.com/dwyl/learn-ipfs#how for details.

The property based tests will run by default. These tests are more comprehensive when compared to the "regular" tests. They will run the Cid.cid function on 100 randomly generated strings and maps, comparing the results of these to the IPFS generated cid, ensuring our function is correct in its implementation.

These tests take a little longer to run that the "regular" tests, so if you wish you can skip them with mix test --exclude ipfs

Research, Background & Relevant Reading




Notes

1 The quality of being "typeable" specifically on a mobile device, means that a human being can type an ID in a reasonable amount of time (e.g: fewer than 7 seconds) and

2 The list of Discontinued Google services continues to grow https://en.wikipedia.org/wiki/Category:Discontinued_Google_services

3 How to calculate collision probability in an ID system?

https://en.wikipedia.org/wiki/Universally_unique_identifier image

With a Base16 character set and 32 character of ID length,

base16-32-chars-probability

FAQs

How can we guarantee the the CID generated will be unique?

We will be using the SHA256 hash which to date has not incurred a single hash collision. 256 bits of data is used by most crypto currencies. Enough "smart people" have done the homework on this for us not to worry about it.

This is a good video on 256 bit hash collision probability: image https://youtu.be/S9JGmA5_unY

This video does not cover the "Birthday Paradox" see: nelsonic/nelsonic.github.io#576. But again, for the purposes of this answer and indeed any project we are likely to work on in our lifetime, when dealing with 256 bit hashes, the chance of a "birthday attack" creating a collision is "ignorable".

If all unique content creates a unique CID, how do we update content in our database?

The update version of content would be linked to the previous version using a prev field the way it happens in IPFS, Etherium and Bitcoin (so it will be familiar to people)

prev: previous_cid address example:

inserted cid(PK)1 name address prev
1541609554 gVSTedHFGBetxy Bruce Wane 1007 Mountain Drive, Gotham null
1541618643 smnELuCmEaX42 Bruce Wane Rua Goncalo Afonso, Vila Madalena, Sao Paulo, 05436-100, Brazil gVSTedHFGBetxy

When a row does not have a prev value then we know it is the first time that content has been inserted into the database. When a prev value is defined in a row we know this is a new version of a previously inserted content and we can "traverse the tree" to see all previous versions.