Add additional module metadata.json fields
Right now, the data in metadata.json focuses almost entirely on technical package management functionality.
Especially for discoverability, it would also be nice to have some additional metadata:
- Source repository (#118)
- Documentation URL - Optional URL to a hosted rendered version of the documentation
- License - SPDX license expression for the license governing the module source
- Keywords - A set of up to 5 free-text keywords that don't fit the categories and can be indexed for search. See https://doc.rust-lang.org/cargo/reference/manifest.html#the-keywords-field
- Categories - A set of up to 5 categories that can be chosen out of a set of fixed categories with assigned semantics (we would have to maintain that list). This would help us e.g. in separating "proper rulesets" from small modules that provide a wrapper for building a C library. See https://doc.rust-lang.org/cargo/reference/manifest.html#the-categories-field
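To make the proposed constraints concrete, here is a minimal Python sketch of how a registry check might validate these optional fields. The field names (`keywords`, `categories`, `license`, `documentation_url`) and the category list are assumptions for illustration only; nothing in this proposal fixes them yet.

```python
# Hypothetical fixed category list; the proposal only says such a list
# would have to be maintained, not what it contains.
ALLOWED_CATEGORIES = {"ruleset", "library-wrapper", "toolchain"}
MAX_ENTRIES = 5  # "up to 5" keywords and categories, as proposed above


def validate_extra_metadata(metadata: dict) -> list:
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    keywords = metadata.get("keywords", [])
    categories = metadata.get("categories", [])

    if len(keywords) > MAX_ENTRIES:
        problems.append(f"too many keywords: {len(keywords)} > {MAX_ENTRIES}")
    if len(categories) > MAX_ENTRIES:
        problems.append(f"too many categories: {len(categories)} > {MAX_ENTRIES}")

    # Categories must come from the fixed list; keywords are free text.
    unknown = sorted(set(categories) - ALLOWED_CATEGORIES)
    if unknown:
        problems.append(f"unknown categories: {unknown}")

    doc_url = metadata.get("documentation_url")
    if doc_url is not None and not doc_url.startswith(("https://", "http://")):
        problems.append("documentation_url must be an http(s) URL")

    # Only checks presence; parsing the SPDX expression is out of scope here.
    license_expr = metadata.get("license")
    if license_expr is not None and not license_expr.strip():
        problems.append("license must be a non-empty SPDX expression")

    return problems
```

All fields are treated as optional here, so an existing metadata.json without them would still pass.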
All sounds good, but for the license, we probably want to coordinate with @aiuto, who is maintaining rules_license. We had a discussion about whether we should declare the license in metadata.json or in the BUILD file with rules_license. I think rules_license was preferred since it can also be integrated with other module extensions that pull third-party dependencies.
There is even another appealing option the rules_license-based approach could offer: defining a module-level default license by using a module tag provided by rules_license.
We actually need both. BCR should have a place to hang canonical package name, version, copyright, license, and a few other things. Repository rules and bzlmod should be able to use that to build license() declarations into BUILD files. I'm working on adding the metadata to rules_license currently.
I think that we should make the source repository required information: In order to prevent obviously malicious updates, we need to ensure that the download links for a new module version match the original repo.
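A minimal sketch of such a match, assuming the source repository is recorded as a plain HTTPS URL; the function name and the matching rule are illustrative, not an existing BCR check. As later comments in this thread point out, passing a check like this does not by itself prove where the code came from.

```python
from urllib.parse import urlparse


def url_belongs_to_repo(download_url: str, source_repo_url: str) -> bool:
    """Check that download_url lives under the declared source repository.

    Hypothetical helper: compares host and path prefix only.
    """
    download = urlparse(download_url)
    repo = urlparse(source_repo_url)

    repo_path = repo.path.rstrip("/")
    if repo_path.endswith(".git"):
        repo_path = repo_path[: -len(".git")]

    return (
        download.scheme == "https"
        and download.netloc.lower() == repo.netloc.lower()
        # Require a "/" after the repo path so "bazel" does not match "bazel-evil".
        and download.path.startswith(repo_path + "/")
    )
```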
Another thing that might be interesting for detecting malicious updates is to record the uploader's IP address and its reverse DNS. That might be a PII problem though.
Note that trusting based on GitHub URL prefix gives absolutely no guarantees about source provenance, due to all forks of a repo sharing the same internal git state.
For example, this commit is on my fork of the Bazel repo but has not been merged to any branch on the upstream repo: aaliddell/bazel@bd1097d
Using this commit hash I can create a download URL on my fork: https://github.com/aaliddell/bazel/archive/bd1097d166c5ed20456602b4b236f1410b66ab90.zip
However, modifying the URL to use the upstream bazelbuild organisation repo also works: https://github.com/bazelbuild/bazel/archive/bd1097d166c5ed20456602b4b236f1410b66ab90.zip
Using the above method, any arbitrary code committed to a fork may be given a download URL that would pass a basic prefix check. To truly validate that the code has come from the repo in question, you would have to use the GitHub API to check that the commit hash referenced is merged into the master branch on the specified repo. This is just a quirk of how GitHub stores their data to minimise disk usage and fork times.
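One way to sketch the API check described above is with the GitHub REST "compare two commits" endpoint, which reports whether the head is `ahead`, `behind`, `identical`, or `diverged` relative to the base. With the branch as base and the commit as head, `behind` or `identical` should mean the commit is reachable from the branch. How the endpoint responds to a fork-only commit requested through the upstream repo is an assumption I have not verified, so anything other than those two statuses (including a 404) is treated as "not merged". The helper names are hypothetical.

```python
import json
import urllib.error
import urllib.request

API = "https://api.github.com"


def compare_url(owner: str, repo: str, base: str, head: str) -> str:
    """Build the URL for GitHub's compare endpoint (base...head)."""
    return f"{API}/repos/{owner}/{repo}/compare/{base}...{head}"


def is_contained_in_base(compare_status: str) -> bool:
    """Interpret the 'status' field when base=branch and head=commit.

    'behind' or 'identical' means the head adds no commits beyond the base,
    i.e. the commit is already part of the branch history.
    """
    return compare_status in ("behind", "identical")


def commit_is_merged(owner: str, repo: str, commit: str, branch: str = "master") -> bool:
    """Ask GitHub whether `commit` is reachable from `branch` (network call)."""
    request = urllib.request.Request(
        compare_url(owner, repo, branch, commit),
        headers={"Accept": "application/vnd.github+json"},
    )
    try:
        with urllib.request.urlopen(request) as response:
            status = json.load(response)["status"]
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # commit or branch not found in this repo
        raise
    return is_contained_in_base(status)
```

Usage would look like `commit_is_merged("bazelbuild", "bazel", "bd1097d166c5ed20456602b4b236f1410b66ab90")`; unauthenticated requests are rate limited, so a real check would likely need a token.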
> Another thing that might be interesting for detecting malicious updates is to record the uploader's IP address and its reverse DNS. That might be a PII problem though.
A more reliable identifier for this would be the GitHub user ID. Modules could maintain a list of user IDs permitted to update the module.
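A rough sketch of what such an allowlist check could look like, assuming each maintainer entry in metadata.json carried a numeric `github_user_id` field. That field name and the helper are hypothetical. Numeric IDs have the advantage that they stay the same when an account is renamed, unlike login names.

```python
def uploader_is_permitted(metadata: dict, uploader_id: int) -> bool:
    """Return True if uploader_id appears in the module's maintainer list.

    Assumes a hypothetical layout:
    {"maintainers": [{"name": ..., "github_user_id": 1234}, ...]}
    """
    permitted = {
        maintainer["github_user_id"]
        for maintainer in metadata.get("maintainers", [])
        if "github_user_id" in maintainer  # entries without an ID grant nothing
    }
    return uploader_id in permitted
```

A module with no listed IDs rejects every uploader under this sketch; whether that is the right default is a policy question for the design discussion.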
Before we do anything, we should research other registries and their approaches to security. Doing this right requires a proper top-down design.
> Note that trusting based on GitHub URL prefix gives absolutely no guarantees about source provenance, due to all forks of a repo sharing the same internal git state.
> For example, this commit is on my fork of the Bazel repo but has not been merged to any branch on the upstream repo: aaliddell/bazel@bd1097d
> Using this commit hash I can create a download URL on my fork: https://github.com/aaliddell/bazel/archive/bd1097d166c5ed20456602b4b236f1410b66ab90.zip
> However, modifying the URL to use the upstream bazelbuild organisation repo also works: https://github.com/bazelbuild/bazel/archive/bd1097d166c5ed20456602b4b236f1410b66ab90.zip
> Using the above method, any arbitrary code committed to a fork may be given a download URL that would pass a basic prefix check. To truly validate that the code has come from the repo in question, you would have to use the GitHub API to check that the commit hash referenced is merged into the master branch on the specified repo. This is just a quirk of how GitHub stores their data to minimise disk usage and fork times.
AFAIK (but this should be checked) this is only a problem for /archive/<commit hash>.zip style URLs. The BCR already rejects such URLs in favor of /archive/refs/tags/<tag>.zip, which should only reflect tags created on the repository itself, not a fork.
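For illustration, the distinction between the two URL styles can be sketched with a small classifier. This is not the BCR's actual validation code; the patterns and names below are assumptions, and the commit pattern is a heuristic (a tag made up only of hex digits and referenced without `refs/tags/` would also be classified as a commit).

```python
import re
from urllib.parse import urlparse

# /<owner>/<repo>/archive/<7-40 hex chars>.zip or .tar.gz
COMMIT_ARCHIVE = re.compile(r"^/[^/]+/[^/]+/archive/[0-9a-fA-F]{7,40}\.(zip|tar\.gz)$")
# /<owner>/<repo>/archive/refs/tags/<tag>.zip or .tar.gz
TAG_ARCHIVE = re.compile(r"^/[^/]+/[^/]+/archive/refs/tags/.+\.(zip|tar\.gz)$")


def classify_archive_url(url: str) -> str:
    """Return 'tag', 'commit', or 'other' for a GitHub archive URL."""
    parsed = urlparse(url)
    if parsed.netloc.lower() != "github.com":
        return "other"
    if TAG_ARCHIVE.match(parsed.path):
        return "tag"
    if COMMIT_ARCHIVE.match(parsed.path):
        return "commit"
    return "other"
```

Applied to the URLs from this thread, the `bd1097d...` archive link is classified as `commit` for both the fork and the upstream repo, while a `refs/tags/` link is classified as `tag`.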
As I said above, all of this is pretty subtle and should probably go through a proper security design and review process.