bazelbuild/bazel-central-registry

Add additional module metadata.json fields

Opened this issue · 8 comments

Right now the data contained in the metadata.json fields is focused almost entirely on purely technical package management functionality.

Especially for discoverability, it would also be nice to have some additional metadata:

All sounds good, but for license, we probably want to coordinate with @aiuto, who is maintaining rules_license. We had a discussion about whether we should declare the license in the metadata.json or in the BUILD file with rules_license. I think rules_license was preferred since it can also be integrated with other module extensions that pull in third-party dependencies.

fmeum commented

There is even another appealing option the rules_license-based approach could offer: defining a module-level default license by using a module tag provided by rules_license.

aiuto commented

We actually need both. BCR should have a place to hang canonical package name, version, copyright, license, and a few other things. Repository rules and bzlmod should be able to use that to build license() declarations into BUILD files. I'm working on adding the metadata to rules_license currently.

fmeum commented

I think that we should make the source repository required information: In order to prevent obviously malicious updates, we need to ensure that the download links for a new module version match the original repo.
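As a rough illustration of the check proposed here, the sketch below compares the owner and repository name in a download URL against a source repository recorded for the module. The `"owner/name"` format of the `repository` argument and the function name are assumptions made for this example, not existing BCR behaviour, and the check only looks at the shape of the URL.

```python
from urllib.parse import urlparse


def url_matches_repository(download_url: str, repository: str) -> bool:
    """Return True if download_url points into the given GitHub repository.

    `repository` is a hypothetical "owner/name" string taken from the
    module's metadata; the format is an assumption for this sketch.
    """
    parsed = urlparse(download_url)
    if parsed.scheme != "https" or parsed.hostname != "github.com":
        return False
    # The path looks like /owner/name/archive/...; keep non-empty segments.
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) < 3:
        return False
    # GitHub owner and repository names are matched case-insensitively here.
    return [s.lower() for s in segments[:2]] == repository.lower().split("/")
```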

aiuto commented

Another thing which might be interesting for detecting malicious updates is to splice in the uploader's IP address and its reverse DNS. That might be a PII problem though.

aaliddell commented

Note that trusting based on GitHub URL prefix gives absolutely no guarantees about source provenance, due to all forks of a repo sharing the same internal git state.

For example, this commit is on my fork of the Bazel repo but has not been merged to any branch on the upstream repo: aaliddell/bazel@bd1097d

Using this commit hash I can create a download URL on my fork: https://github.com/aaliddell/bazel/archive/bd1097d166c5ed20456602b4b236f1410b66ab90.zip

However, modifying the URL to use the upstream bazelbuild organisation repo also works: https://github.com/bazelbuild/bazel/archive/bd1097d166c5ed20456602b4b236f1410b66ab90.zip

Using the above method, any arbitrary code committed to a fork may be given a download URL that would pass a basic prefix check. To truly validate that the code has come from the repo in question, you would have to use the GitHub API to check that the commit hash referenced is merged into the master branch on the specified repo. This is just a quirk of how GitHub stores their data to minimise disk usage and fork times.
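One way the API check described above might look is sketched below. It assumes GitHub's REST "compare two commits" endpoint (`/repos/{owner}/{repo}/compare/{base}...{head}`) and its `status` field, whose values are, to my understanding, `ahead`, `behind`, `identical`, or `diverged`; this should be verified against the GitHub documentation before relying on it. The function names are hypothetical, and unauthenticated requests are rate-limited.

```python
import json
import urllib.request


def is_ancestor_status(status: str) -> bool:
    """Interpret the `status` of a compare response for base...head.

    "behind" or "identical" means head has no commits that base lacks,
    i.e. the commit is already contained in the base branch.
    """
    return status in ("behind", "identical")


def commit_is_on_branch(owner: str, repo: str, branch: str, sha: str) -> bool:
    """Ask the GitHub API whether `sha` is reachable from `branch`."""
    url = f"https://api.github.com/repos/{owner}/{repo}/compare/{branch}...{sha}"
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return is_ancestor_status(payload["status"])
```

For the example above, this would be called as `commit_is_on_branch("bazelbuild", "bazel", "master", "bd1097d166c5ed20456602b4b236f1410b66ab90")`.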

fmeum commented

> Another thing which might be interesting for detecting malicious updates is to splice in the uploader's IP address and its reverse DNS. That might be a PII problem though.

A more reliable identifier for this would be the GitHub user ID. Modules could maintain a list of user IDs permitted to update the module.
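A per-module allowlist of this kind could be checked roughly as follows. The `allowed_updater_ids` field is a hypothetical name invented for this sketch and is not part of metadata.json as discussed in this thread.

```python
def may_update_module(metadata: dict, github_user_id: int) -> bool:
    """Return True if the numeric GitHub user ID is on the module's allowlist.

    `allowed_updater_ids` is a hypothetical metadata field; a missing
    field is treated as "nobody is allowed".
    """
    allowed = metadata.get("allowed_updater_ids", [])
    return github_user_id in allowed
```

Numeric user IDs are used rather than login names because, as far as I know, a login can be renamed while the numeric ID stays the same.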

Before we do anything, we should research other registries and their approaches to security. Doing this right requires a proper top-down design.

fmeum commented

> Note that trusting based on GitHub URL prefix gives absolutely no guarantees about source provenance, due to all forks of a repo sharing the same internal git state.
>
> For example, this commit is on my fork of the Bazel repo but has not been merged to any branch on the upstream repo: aaliddell/bazel@bd1097d
>
> Using this commit hash I can create a download URL on my fork: https://github.com/aaliddell/bazel/archive/bd1097d166c5ed20456602b4b236f1410b66ab90.zip
>
> However, modifying the URL to use the upstream bazelbuild organisation repo also works: https://github.com/bazelbuild/bazel/archive/bd1097d166c5ed20456602b4b236f1410b66ab90.zip
>
> Using the above method, any arbitrary code committed to a fork may be given a download URL that would pass a basic prefix check. To truly validate that the code has come from the repo in question, you would have to use the GitHub API to check that the commit hash referenced is merged into the master branch on the specified repo. This is just a quirk of how GitHub stores their data to minimise disk usage and fork times.

AFAIK (but this should be checked) this is only a problem for /archive/<commit hash>.zip style URLs. The BCR already rejects such URLs in favor of /archive/refs/tags/<tag>.zip, which should only reflect tags created on the repository itself, not a fork.
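The distinction between the two URL styles can be sketched with a pair of regular expressions. This is only an illustration, not the BCR's actual validation code: the patterns, the function name, and the acceptance of `.tar.gz` alongside `.zip` are assumptions, and treating any 7 to 40 character hex string as a commit hash is a heuristic.

```python
import re

# /archive/<commit hash>.zip style: a bare hex string directly under /archive/.
_COMMIT_ARCHIVE = re.compile(
    r"^https://github\.com/[^/]+/[^/]+/archive/[0-9a-fA-F]{7,40}\.(zip|tar\.gz)$"
)
# /archive/refs/tags/<tag>.zip style: explicitly names a tag ref.
_TAG_ARCHIVE = re.compile(
    r"^https://github\.com/[^/]+/[^/]+/archive/refs/tags/.+\.(zip|tar\.gz)$"
)


def classify_archive_url(url: str) -> str:
    """Classify a GitHub archive URL as "tag", "commit", or "other"."""
    if _TAG_ARCHIVE.match(url):
        return "tag"
    if _COMMIT_ARCHIVE.match(url):
        return "commit"
    return "other"
```

Under these assumptions, a registry check would accept only URLs classified as `"tag"`.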

As I said above, all of this is pretty subtle and should probably go through a proper security design and review process.