Use UUIDs for request identifiers

Question

Use UUIDs for request identifiers

hannahhoward opened this issue 3 years ago · 4 comments

What

We could unlock a number of capabilities for Graphsync and the whole transfer stack if we used UUIDs for identifiers.

Right now, we generate a ton of complexity in the data transfer stack simply tracking IDs between layers.

Going forward, we'd like graphsync to simply share an id with higher levels of the stack. Data Transfer and/or Retrieval should be able to pass an id that graphsync uses. Moreover, graphsync itself should be able to match up an incoming request with pre-existing data for higher levels of the stack (see push transfers).

RFC 4122 UUIDs have some interesting properties. They are unique not just on one system, but unique to all systems (with a non-zero but negligible chance of collision). If we used UUIDs, we could share an identifier between all layers of the transfer stack (all the way up to retrieval), and pass those around on the network.

Moreover, if an incoming Graphsync request has a UUID that matches say a previous data transfer push request, we can immediately match these two up and have them share state.

It would also unlock actually making a push mode in graphsync itself, since the id sent from another machine could be used. (this could remove the need for the data transfer libp2p protocol)

How

Potential Downsides

The wire format described will cost approximately 17 bytes on the wire (1 varint length + 16 bytes encoding).

Moreover, protobuf doesn't define a clear UUID encoding (protocolbuffers/protobuf#2224), so this requires the application developer for graphsync to know how to encode the UUID in the same manner we are.

There's a high likelyhood if we need to operate with backward compatibility for some time, there could be an interim phase of increased complexity.

Alternatives

We could put the UUIDs in an extension (no protocol change)

We could simply expand the size of request id to int64 (would at least allow us to accept data transfer channel numbers)

Answer 1 · 2021-11-22T17:39:21.000Z

Maybe a silly question, but I'm not overly familiar with the protocol, can you clarify how uuid's improve tracking here over the request id? I get the increased id space of uuid versus int32, but what is prohibiting us from doing this today with the current id? Is it just collision concerns or something else?

Answer 2 · 2021-11-22T18:25:08.000Z

@jacobheun can you clarify-- are you asking:

Why not just change the id field in the protobuf to a uuid, instead of adding a new field? -OR-
Why specifically use uuid as opposed to just a larger address space int?

Answer 3 · 2021-11-22T18:58:33.000Z

More so, why are we unable to track things today with Request.id and how will adding Request.uuid make that easier?

Going forward, we'd like graphsync to simply share an id with higher levels of the stack. Data Transfer and/or Retrieval should be able to pass an id that graphsync uses

Why can't Request.id do this ^ today, and how will making it/adding a UUID change that?

Moreover, if an incoming Graphsync request has a UUID that matches say a previous data transfer push request, we can immediately match these two up and have them share state.

It would also unlock actually making a push mode in graphsync itself, since the id sent from another machine could be used. (this could remove the need for the data transfer libp2p protocol)

With my limited understanding of the system, it sounds like the biggest gap to doing these ^ things, is that Request.id is not unique enough across numerous peers and graphsync requests. Is that correct?

Answer 4 · 2021-11-22T21:31:28.000Z

@jacobheun yes, essentially a single int32 is not enough to insure uniqueness. The problem isn't simply uniqueness within a single node, but uniqueness across all nodes, and across restarts of the same node. For example, we currently assigned graphsync Request IDs via an incrementing value starting at 1. That gets reset if the node restarts. Over in go-data-transfer, ID is int64 based on current time, to avoid problems across restarts. But that gets you uniqueness only for a single node.

UUID is the only thing that can do uniqueness with small likelyhood of collision across nodes. You can potentially get this uniqueness with other peer.ID + request ID as a single value (i.e. int64 based on time). But at that point we have a machine identifier + a time stamp, most of the building blocks of UUID, but in an extremely non-standard and unproven way. Not only that, but int64's are a mess in javascript. It feels like why should we reinvent something that has been well worked out.