rust-lang/cargo

Sparse registry indexes should be viewable atomically

Eh2406 opened this issue · 18 comments

Problem

With git-based registries, any one commit from the registry represented an atomic picture of what all packages were up to at that point in time. This had a number of useful properties:

  • It was possible to list all available crates.
  • If a publish was visible, then all earlier publishes (in the server's timeline) would also be visible.
  • It was possible to list all crates that had changed between two versions of the index.
  • The commit hash was a unique identifier of that state of the world.
  • It would've been possible to sign the commit to show it came from a crates.io server.

None of these are possible with the current design of the sparse http-based registries.

Proposed Solution

It's possible to regain the ability to list all available crates by adding a names-file at the root of the index that lists all crates alphabetically. For crates.io this file would be large, but it would probably compress well enough to be usable.

If we add to that names-file the hash of each index file, then it uniquely identifies a snapshot of the index! Of course, being mostly pseudorandom bytes, the hashes will compress really badly. At crates.io scale the result will be too big to live in one file. We could split it up into several files, say one per folder in the current index structure. In order to get an atomic picture we would then need a meta-file recording the hashes of the names-files. This works, but would be a lot of overhead for smaller registries. (An RFC for this idea will need to describe if/how the number of hash files is configurable.)

What happens if an index file is updated after the names-file cargo was looking at? Cargo will get an index file whose hash does not match the atomic picture it requested. So there needs to be some kind of cache buster that makes sure the received version of the file matches the requested version. We could put this in the name: 3/f/foo could become 3/f/foo-abcdef123456. However, this would require crates.io to have the same content at both of those locations (assuming stabilization of the current design of #2789), which is just asking for things to come out of sync. We could use a query parameter: 3/f/foo gets you the latest version no matter the hash, and 3/f/foo?hash=abcdef123456 gets you the version with the hash abcdef123456. However, this requires the backing server to be able to look up versions of a file by their hash. Crates.io currently uses S3, which does not provide this functionality; you can look up old versions of a file, but only by the version ID S3 generated. So let's double the size of our names-files by recording, for each crate, a cache-busting string (crates.io can use S3's version ID) and the hash, with the semantics that if you ask for a crate with the cache buster appended, you should always receive a response with that hash.
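To make this concrete, here is a minimal sketch of what a names-file entry and the client-side check could look like. The entry layout, field names, and the `?v=` URL shape are illustrative assumptions, not a finalized format:

```rust
use sha2::{Digest, Sha256};

/// One entry of a hypothetical per-folder names-file.
struct NamesEntry {
    name: String,         // crate name, e.g. "foo"
    cache_buster: String, // opaque token; crates.io could use S3's version ID
    sha256: [u8; 32],     // expected hash of that crate's index file
}

impl NamesEntry {
    /// URL that must always serve the file version matching `self.sha256`.
    fn url(&self, prefix_base: &str) -> String {
        // e.g. "https://index.example/3/f/foo?v=abcdef123456"
        format!("{prefix_base}/{}?v={}", self.name, self.cache_buster)
    }

    /// Check that the bytes cargo received match the snapshot it asked for.
    fn verify(&self, body: &[u8]) -> bool {
        Sha256::digest(body).as_slice() == &self.sha256[..]
    }
}
```

The key property is the `verify` check: if the body's hash doesn't match the names-file, cargo knows it received a file from outside the snapshot it asked for.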

Notes

If this design is so much better than the current sparse registry proposal, why stabilize the current proposal at all?

Fundamentally, this design is going to be more work for servers to implement. The current sparse registry design has exactly the same contents as a git registry, and is therefore very easy for registries to upgrade to. There will be many users for whom the additional complexity is not needed, and for them the current sparse registry proposal is a better fit.

I did want to open this discussion to make sure we have a path to this more atomic design that is compatible with what we are stabilizing in sparse registries. If changes need to be made to sparse registries, now is the time.

Coming from withoutboats/rfcs#7 I was asked how The Update Framework (TUF)'s snapshots feature might help here. For context, TUF is a framework for cryptographically authenticating software updates, and its snapshots feature provides git-like consistency across several files.

TUF has 4 roles (a rough sketch of their metadata in code follows this list):

  • Targets: these are files you'd like authenticated, e.g. the files that make up the sparse index. You can think of them like git objects. The "targets" role builds a manifest of these files, their hashes, and a version counter that is monotonically incremented whenever a given file changes, then signs the manifest. For e.g. crates.io, this is a role that would probably be held by @bors
  • Snapshots: in TUF terminology, consistent snapshots are like git commits. The snapshot role creates manifests of targets and their version numbers which can be seen as a consistent grouping, kind of like how git commits name a manifest of object hashes. For e.g. crates.io, this is another role that would probably be held by @bors
  • Timestamp: the timestamp role identifies the current snapshot. It's separate from the snapshot role because it has a shorter TTL, which forces clients to check back frequently and makes it harder for a MitM to silently block them from software updates. The timestamp metadata must periodically be re-signed, even if the current snapshot hasn't changed, to prove that it is "fresh". This is, once again, a role that would probably be held by @bors
  • Root: this is the root of trust for the system, which can rotate the keys held by the other roles. In the event @bors were to be compromised, the root role can be used to rotate keys. TUF supports a k-of-n threshold signing model, so you can imagine these keys being held by members of the core team and/or HSMs, with signatures from e.g. at least 3 different keys required to rotate the keys for the other roles held by @bors
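For a rough picture of how these roles fit together, here is a sketch of their metadata modeled as Rust types. The field names are simplified; real TUF metadata is canonical JSON with detached signatures, as defined in the TUF specification:

```rust
use std::collections::BTreeMap;

struct Signed<T> {
    body: T,
    signatures: Vec<Vec<u8>>, // over the canonical encoding of `body`
}

struct FileMeta {
    version: u64, // monotonically incremented whenever the file changes
    sha256: [u8; 32],
    length: u64,
}

struct Targets {                       // signed by the targets role
    files: BTreeMap<String, FileMeta>, // e.g. "3/f/foo" -> its metadata
}

struct Snapshot {                         // signed by the snapshot role
    metadata: BTreeMap<String, FileMeta>, // targets metadata files and versions
}

struct Timestamp {       // re-signed frequently, even without changes
    snapshot: FileMeta,  // identifies the current snapshot
    expires: u64,        // short TTL to guarantee freshness
}

struct Root {                                   // root of trust
    role_keys: BTreeMap<String, Vec<[u8; 32]>>, // role -> public keys
    threshold: BTreeMap<String, usize>,         // k-of-n signing thresholds
}
```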

Separately, TUF also supports a notion of delegated targets, where you can specify policies for how 3rd-party keys are allowed to sign specific files. This would be useful for letting end users sign crates, but it isn't necessary for using TUF with the sparse index, and it could be pursued as follow-up work as part of e.g. a Sigstore integration (see https://internals.rust-lang.org/t/pre-rfc-using-sigstore-for-signing-and-verifying-crates/18115/2)

If you're wondering whether it's bad for @bors to hold three roles: not necessarily. TUF splits them up to give you the option of eventually factoring them into separate systems, which reduces the overall attack surface of each system and lets them cross-check each other. But you can assign several roles to @bors to get started, and just having the root role separate gives you survivable compromise of @bors.

Does TUF have any advice/recommendations on how we get from a bunch of files to a "snapshot"? The link suggests that the snapshot should include a list of all the files, each with its hash, but this is impractical at the scale of crates.io. For git indexes we already have a Merkle tree, so the snapshot can just be the commit hash (or the tree hash) as suggested in your RFC. For sparse registries we will probably have to invent our own thing, but it might as well learn lessons from, and take advice from, those who went before us.

You'd probably need to talk to a TUF expert about that. I can point one at this issue if you'd like.

It might be possible to use delegated targets as something akin to git tree objects.

I'm a TUF maintainer and happy to answer any questions here.

Does TUF have any advice/recommendations on how we get from a bunch of files to a "snapshot"? The link suggests that the snapshot should include a list of all the files, each with its hash, but this is impractical at the scale of crates.io. For git indexes we already have a Merkle tree, so the snapshot can just be the commit hash (or the tree hash) as suggested in your RFC. For sparse registries we will probably have to invent our own thing, but it might as well learn lessons from, and take advice from, those who went before us.

This is very similar to a proposal in TUF for this exact problem. The short version is that you can make separate "snapshots" for each file (with the name, version number, and hash), then combine them using a Merkle tree, and sign the root of that Merkle tree with the timestamp role. A sparse Merkle tree would likely make this more efficient while giving the same properties. The proposal is a draft that is mostly pending real-world implementation and validation.
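A minimal sketch of that idea, under the assumption that a leaf is the hash of a (name, version, file-hash) record and that the timestamp role signs the resulting root; the record encoding and domain separators are illustrative:

```rust
use sha2::{Digest, Sha256};

/// Hash one per-file "snapshot" record into a leaf.
fn leaf(name: &str, version: u64, file_hash: &[u8; 32]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update([0x00]); // leaf domain separator
    h.update(name.as_bytes());
    h.update(version.to_le_bytes());
    h.update(file_hash);
    h.finalize().into()
}

/// Hash two child nodes into their parent.
fn node(l: &[u8; 32], r: &[u8; 32]) -> [u8; 32] {
    let mut h = Sha256::new();
    h.update([0x01]); // inner-node domain separator
    h.update(l);
    h.update(r);
    h.finalize().into()
}

/// Fold a level of leaves into the root; an odd leftover is carried up.
fn merkle_root(mut level: Vec<[u8; 32]>) -> [u8; 32] {
    assert!(!level.is_empty());
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|c| if c.len() == 2 { node(&c[0], &c[1]) } else { c[0] })
            .collect();
    }
    level[0] // this root is what the timestamp role would sign
}
```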

That is an extremely interesting read. Thank you!
If I read it correctly, it has the contents of the files (and the files' metadata) put in a Merkle tree, and then one signature on the root of that tree is sufficient to sign the entire thing.
In that way it is similar to the original proposal in this issue.
The proposal uses a binary Merkle tree, whereas this issue suggests a per-folder Merkle trie. I understand that a trie only makes sense for certain kinds of repository data. Are there significant advantages to a binary system over a trie system?
Similarly, I had not heard of sparse Merkle trees before. Can you speak to their benefits in this context?

Sparse Merkle trees work well for hashmap-like key/value data, and many are designed to scale well with constant updates.

Here's a Rust implementation of one:

https://github.com/diem/diem/blob/main/storage/jellyfish-merkle/src/lib.rs

This paper describes a similar idea:

https://eprint.iacr.org/2018/955.pdf

That is an extremely interesting read. Thank you! If I read it correctly, it has the contents of the files (and the files' metadata) put in a Merkle tree, and then one signature on the root of that tree is sufficient to sign the entire thing. In that way it is similar to the original proposal in this issue. The proposal uses a binary Merkle tree, whereas this issue suggests a per-folder Merkle trie. I understand that a trie only makes sense for certain kinds of repository data. Are there significant advantages to a binary system over a trie system? Similarly, I had not heard of sparse Merkle trees before. Can you speak to their benefits in this context?

The tree shape should only affect performance, and not the security properties, so splitting the tree per folder would work as long as all data is included.

As @tarcieri mentioned, the sparse Merkle tree has performance benefits for frequent updates, since fewer nodes have to change when a leaf is updated.
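For comparison, here is a sketch of the per-folder shape proposed in this issue, expressed as a two-level tree: each folder's names-file hashes its index files, and a meta-file hashes the per-folder listings. The encoding (sorted entries, a zero-byte separator) is an assumption for illustration:

```rust
use sha2::{Digest, Sha256};
use std::collections::BTreeMap;

/// Hash a sorted (name -> hash) listing into a single digest.
fn hash_listing(entries: &BTreeMap<String, [u8; 32]>) -> [u8; 32] {
    // BTreeMap iteration is sorted, so the digest is deterministic.
    let mut h = Sha256::new();
    for (name, hash) in entries {
        h.update(name.as_bytes());
        h.update([0x00]); // separator so adjacent names can't run together
        h.update(hash);
    }
    h.finalize().into()
}

/// index: folder (e.g. "ab/cd") -> (file name -> index-file hash)
fn snapshot_root(index: &BTreeMap<String, BTreeMap<String, [u8; 32]>>) -> [u8; 32] {
    let folder_roots: BTreeMap<String, [u8; 32]> = index
        .iter()
        .map(|(folder, files)| (folder.clone(), hash_listing(files)))
        .collect();
    hash_listing(&folder_roots) // this root is what a signature would cover
}
```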

I found some time today to reread the TUF specification. I've been using some terminology wrong: I don't think it's technically the size of snapshot.json but the size of targets.json that is a scaling problem. The specification says

Each key of the TARGETS object is a TARGETPATH.

Which suggests that if TUF were applied directly, targets.json would list every file available from the registry, each with at least one hash, a length, and "custom" data. The custom data could be what's currently stored in the index. Crates.io serves 101k package versions, which would make for a rather large targets.json file. But I must be missing something, because PyPI is at 428k. So I looked at PEP 458, which claims to have solved the scalability problem using succinct hashed bin delegations (TAP 15). But reading the specification, path_hash_prefixes has to do with delegations. My understanding is that delegations describe which roles are allowed to sign the final artifacts. I don't see how path_hash_prefixes gets you out of targets being O(number of package versions). What am I misunderstanding about the specification?

@Eh2406 delegated targets are each stored in their own .json files which can be independently updated (and I believe also support nested delegations which would allow for a delegation hierarchy?)

Though their main use case is to delegate authority via different signing keys, I think it might also be possible to use them with the same signing keys to break up targets.json into many small files?

@Eh2406 TUF does the final artifact signing through targets roles. So targets.json only lists the files that this particular role is signing, and delegations to other roles. Hashed bin delegations can help with scalability when one role has a lot of files that it is signing.
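A sketch of the routing step behind hashed bin delegations, assuming an illustrative bin count and file naming. The point is that targets metadata is split into many small, separately signed bin files, and a client only fetches the one bin its crate hashes into:

```rust
use sha2::{Digest, Sha256};

const BIN_BITS: u32 = 12; // 4096 bins; tuned so each bin file stays small

/// Route a target path to the bin file that signs it.
fn bin_for(target_path: &str) -> String {
    let digest = Sha256::digest(target_path.as_bytes());
    // Take the first BIN_BITS bits of the hash as the bin index.
    let prefix = u32::from_be_bytes([digest[0], digest[1], digest[2], digest[3]])
        >> (32 - BIN_BITS);
    format!("bin-{prefix:03x}.json") // cargo only fetches this one bin file
}
```

So targets.json itself only delegates, by hash prefix, to the 4096 bin roles; no single metadata file has to be O(number of package versions).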

None of these are possible with the current design of the sparse http-based registries.

Thanks for opening the discussion! To start with, could someone please point me to documentation on these sparse HTTP-based registries, so that I can at least think about the problem more concretely and advise on how to use TUF here?

A couple of counter points:

  1. The git version of the index exists. Users who want to obtain the complete atomic snapshot of the index can still do so via the git protocol. Cargo will support git forever. I think crates.io also plans to keep the git version indefinitely.

  2. For the case of publishing and correctly resolving a set of dependent crates, a globally consistent snapshot is not necessary. Correct dependency resolution only requires preserving a relative ordering between dependencies. This is a much smaller problem, with solutions suggested in the sparse registry RFC (in short, stale files can be detected and cache-busted).

  3. Even when using git registries, Cargo doesn't always end up with a "consistent" set of dependencies based on the same git index commit. When a user adds new dependencies or bumps version requirements, and there's an existing Cargo.lock, Cargo only adds/updates a minimal set of dependencies, keeping other crates locked to a previous version when possible. This makes the set of dependencies a mix of resolutions done at different times, from different git commits of the index. An updated Cargo.lock may differ from what you'd get if you deleted Cargo.lock and resolved again from scratch. So in Cargo, global consistency never existed, and that's a feature, not a bug.

Because git had an easy way to represent a snapshot, signing the commit hash was an easy way to prove the authenticity of all the data in the commit. However, I don't think the logic necessarily goes the other way: to prove authenticity of all the data, you don't have to have a single snapshot.

For example, every individual file could be signed separately. This is still a guarantee that the file came from crates.io at some point, so it still prevents the CDN or some other network MITM from tampering with the data and publishing something that has never been published.

Files can come with a signed version or signed last-modified timestamp, so rollback attacks (using terminology from TUF) on them can still be prevented.
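A sketch of what that per-file check could look like; the types and the signature check are placeholders, not a real crates.io format:

```rust
/// Hypothetical per-file signed wrapper for one index file.
struct SignedIndexFile {
    version: u64,       // monotonically increasing per file
    body: Vec<u8>,      // the index file contents
    signature: Vec<u8>, // over (version, body), by the registry's key
}

/// Accept a freshly fetched file only if it's authentic and not older
/// than what we've already seen.
fn accept(cached_version: Option<u64>, fresh: &SignedIndexFile) -> bool {
    let signature_ok = verify_registry_signature(fresh); // placeholder
    // A rollback attack serves a genuinely signed but older file; the
    // version comparison is what defeats it, not the signature alone.
    signature_ok && cached_version.map_or(true, |v| fresh.version >= v)
}

fn verify_registry_signature(_f: &SignedIndexFile) -> bool {
    unimplemented!("check against the registry's published public key")
}
```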

A global snapshot has the advantage that a freeze attack requires the attacker to freeze everything, rather than individual files, so such an attack is easier to spot. However, minimum version requirements in Cargo.toml still limit how much an attacker can withhold. If you bump a vulnerable dependency requirement to a non-vulnerable version, that dependency has to exist and can't be withheld without breaking resolution. If this attack is important to protect against, the registry could let Cargo know about latest-known-to-exist versions of dependencies in the metadata, separately from minimum required versions, to force Cargo to refresh other files anyway. The proposed incremental changelog extension of the sparse registry protocol would also make a freeze attack harder to pull off.

Your counterpoints are, of course, all correct. Dependency resolution, which is the primary job of the index, does not require any of the properties asked for in this issue. And all of the properties requested in this issue can be obtained by falling back on the git index.

Nonetheless, some of these properties are needed for other desired functionality in cargo, and under circumstances that make "just pull the git index" too slow. Specifically: suggesting alternative names when a package is not found.

The proposed incremental changelog extension of the sparse registry protocol would also make a freeze attack harder to pull off.

Yes, at some level this is an alternative API for a changelog. For performance, any changelog system needs a "snapshot" mechanism for compressing the history into a description of the current state. At the extreme, the registry could publish only snapshots, and the changes can be reconstructed by taking diffs, which is the proposal being suggested here. Thank you for reminding us to think about whether a more explicit method for tracking changes leads to a better API.

I have collected some relevant data.
The plain names file for the current crates.io index is 1.5MB, but only 0.5MB when gzipped. We thought this would compress well, and it did.
Adding a SHA-256 hash as hex balloons the file to 8.1MB, and 4.6MB when gzipped. As expected, it did not compress particularly well.
I have not yet looked at binary formats; let me know on Zulip if there's something you want me to try.
Also for reference, I looked at how many prefixes (ab/cd) there are: about 27K. The original proposal here has a little more than one file per prefix, so downloading the entire tree will involve a lot of requests.

This also got me thinking about one of the primary use cases: "suggesting alternative names when a package is not found". If the user typed in the package abcdefghij, we can look in the file ab/cd/abcdefghij to see if there is a package that normalizes to the same name. If we use a Merkle tree based on folders, we can look at the node ab/cd to get all the names that share that prefix. However, if we want to search for packages within a small Levenshtein distance of abcdefghij, we end up having to download almost all the files. If small >= 4, then I think it literally is all files. It feels like 27K is too many requests to be making while a user waits for a response on the command line.

I was recommended to try https://en.wikipedia.org/wiki/Incremental_encoding on the plain names file. With a naive implementation (which is only useful for size comparisons) it comes to 0.65MB, and 0.35MB after gzip. That is a nontrivial size savings. Of course, defining our own pre-compression step will require careful documentation and testing.
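For reference, a naive sketch of incremental (front) encoding over the sorted names list, in the spirit of the experiment above; the in-memory framing is illustrative, and a real format would need a defined on-disk encoding:

```rust
/// Encode each name as (bytes shared with the previous name, differing suffix).
fn encode(names: &[String]) -> Vec<(usize, String)> {
    let mut prev: &str = "";
    names
        .iter()
        .map(|name| {
            // Crate names are ASCII, so byte indexing is safe here.
            let shared = prev
                .bytes()
                .zip(name.bytes())
                .take_while(|(a, b)| a == b)
                .count();
            let entry = (shared, name[shared..].to_string());
            prev = name.as_str();
            entry
        })
        .collect()
}

/// Rebuild the full names by reusing the prefix of the previous entry.
fn decode(entries: &[(usize, String)]) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(entries.len());
    for (shared, suffix) in entries {
        let mut name = out.last().map_or("", |s| &s[..*shared]).to_string();
        name.push_str(suffix);
        out.push(name);
    }
    out
}
```

decode(&encode(&names)) reproduces the original list; sorting isn't required for correctness, but it is what makes the shared prefixes long enough to save space.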

I believe that trying to correct typos on the client side would amplify the dangers of typosquatting. The index lacks information about popularity/reputation/relevance, so it can't pick a better crate from among candidates; it can only pick based on names alone, without knowing which ones are spam, and use of mere edit distance opens up the risk of picking easy-to-register close names (e.g. instead of correcting git2-sys to libgit2-sys, it could pick git1-sys).

Because typo handling is not a high-traffic service critical for Cargo's uptime, and crates.io has all of the necessary information available server-side to make good and safe suggestions, I think this particular case should be handled with a server-side API instead. cargo search already set a precedent for using a server-side API instead of scanning the index.