rust-marker/marker

[Feat]: Introduce a lock file for lint crates

Opened this issue ยท 12 comments

Context

Today Cargo.lock file for the internal cargo workspace created for building the lint crates isn't exposed to the users of marker.
This means that every time cargo marker would run on a fresh CI environment the latest compatible versions of lint crates and their dependencies will be built.

Even though, ideally one would think it should not cause any problem, but the real world is cruel. People may release a breaking change by mistake, or some crate could rely on the behaviour not included in its dependency semver guarantees. For example, I remember some crates that don't bump their major version if they increase the MSRV, or there are crates that simply don't care about the MSRV. There are so many reasons one could fuck up the semver compatibility..

It's also a security concern. You don't want to always use the latest version of your dependencies. That increases the risk that any of the dependencies in your tree introduces a malicious build script in the new patch version that would leak CI credentials.

Summary

This all led to the creation of Cargo.lock concept. Marker needs to continue exposing that to the users for its lint crates.

I don't have a final design for this yet. What I see right now is the following.

Marker will generate a Marker.lock which is going to be just the copy of the Cargo.lock that marker uses internally. Marker will copy that lock file into the internal cargo workspace each time it does a build of the lint crates or better create a hardlink or a symlink.

We need to provide the tools for updating the lock file cargo update, generating it cargo generate-lockfile, reviewing the dependency tree cargo tree, auditing it cargo deny, etc.

Basically, we need to expose the users to managing the lock file for the internal marker's workspace.
I wouldn't like to have to write all the possible tools for managing the lock file in cargo-marker by hand. Besides the 3-rd party cargo plugins like cargo-deny or cargo-audit and any other ones that may arise in the future should also be supported.

So I think there should be a dynamic cargo command wrapper in cargo-marker to run them inside the internal marker cargo workspace and update the Marker.lock. For example a new cargo marker cargo subcommand that allows running cargo inside of the internal workspace and it just forwards whatever is passed after cargo marker cargo to cargo inside of that workspace.

cargo marker cargo update --package foo
cargo marker cargo tree
cargo marker cargo deny

Naming could probably be improved here, but this is my first thought


Another area to explore is how to include the lint crates in the regular Cargo.lock present in the linted workspace. I don't see good options for that yet. The only approach I see with this would burden the users with creating a dummy crate for building the lints themselves.

Big thumps up, this is something I haven't considered thus far!

I like the idea of providing a dynamic cargo command wrapper. For the naming, we could use something like lints, lint-crates, cargo-lints?

Marker will generate a Marker.lock which is going to be just the copy of the Cargo.lock that marker uses internally.

This sounds like a reasonable and easy solution. ๐Ÿ‘ If we go this route, I would like the file name and location top be configurable under workspace.metadata.marker in Cargo.toml.


This is a suggestion in parallel to this and related to #250. Do you think it's interesting, to add an option to use a Cargo.lock file provided by the lint-crates themselves? I imagine something like a locked = true field, which can be part of the lint declaration. In which case, we make sure that the lint crate is compiled with the Cargo.lock file of the lint.

Yeah, it may make sense to have --locked just like cargo install does. Although it will probably require that we run multiple cargo builds for each crate separately because we want to use each of their lock files

Yeah, that's sadly the trade off. Until #250, I didn't even consider the other option, of using a single command. We could try setting up a workspace with the crates in it. That might allow us to merge the lock files, while having a single build command?

I'll try to look into this more a bit later. I don't think it's reasonably possible to merge all lock files, because all crates in a workspace need to agree on a single lock file and non-conflicting versions of dependencies unless each crate in the workspace pins specific version of them with =x.y.z syntax in dependencies requirements. And even then... it's not always possible to do that. I know there are some really mind-boggling caveats when you have several versions of the same dependency in your tree. Cargo may refuse to compile that with the error "couldn't select package that meets the version requirements A and B from packages foo and bar" or smth like that.

I think a good place to strat is to study the behaviour of cargo install when multiple packages are requested

One way we could approach this, is to have a single build command for all crates by default with the Marker.lock file and only switch to single crate compilation, for lint crates with locked = true?

This sounds like the most flexible solution, but would also means that we might have to maintain two solutions. If I had to select one, I'd probably prefer having a single Marker.lock file.

Investigating cargo install first sounds good!

The more I think about this the more I think that marker should follow a similar behavior that cargo install does. This is because lint crates aren't actually library crates. They are the finished pieces of software like binaries. You should compile them once and reuse their DLLs across all your projects on your working station. For example, cargo install puts binaries into a shared directory ~/.cargo/bin.

By default cargo install compiles the binaries in a temporary directory, and moves the resulting binary into ~/.cargo/bin. It also doesn't respect the Cargo.lock file in the compiled package unless --locked flag is specified. So cargo install doesn't require users to maintain a separate lock file in their workspace, it gives you an option to either use the lock file that of the original binary, or ignore any lockfiles whatsoever (in which case you may shoot yourself in the foot, but you picked your poison!).


Based on that knowledge, and some personal thinking I am thinking of a model where marker has its own ~/.marker/bin directory where it maintains the compiled lint crates.

We may mimic the behaviour of cargo install such that by default no lock files are honoured (a single cargo build is used, more efficient and flexible, but risky approach), and with --locked we use the lock file of the original lint crate (less efficient, inflexible, but safer approach of reproducible builds).

We may also want to support downloading pre-compiled lint crates, right? I'm not yet confident on what the design for that should be, but having a shared ~/.marker directory for that will probably be required. For pre-compiled lint crates it's also worth looking at cargo-binstall.

Here it would be good to have some references, how a real project uses Marker. I imagined Marker to work on the project level, meaning that most lints would either be specific to one workspace or dependencies used in the workspace. Clippy and rustc, both have custom lints, which are included in the same binary (but cfged). They are very specific and not really usable for external projects.

Do you have the feeling that lint crates will be generally applicable? Or does this POV mostly come from the fact that "lint crates aren't actually library crates"? ๐Ÿค”

There were also two other factors, which I no longer believe to be important, but it's still work listing them:

  1. I wanted to support features for lint crates, to cfg unused lints or features.
    • I think supporting features can still be nice, but they won't really be important. And for expensive lints, it's better to move them into a separate lint crate instead.
  2. Storing lint crates on the project level, felt safer than storing them in the home directory.
    • Here I might just be to careful

This is not to say, I'm against it, just that I want to understand the reasoning behind it :)

We may also want to support downloading pre-compiled lint crates, right?

Yes, that would be really cool. It's not a must, but a nice to have

I'm thinking that lints will be shared by the library owners. Yes, they will still be project specific, but for open source libraries the owner of the library may maintain a separate lint crate that would check the usage of this library.

For example in tracing there could be a really useful lint that catches the bug of holding a span guard across an await point; in serde_json there could be lint for serialising a map with non string keys (that would lead to a panic); in rand a lint that suggests saving thread_rng in a variable if it's used multiple times in the same scope.

For closed source projects the same will be true, because not all closed source projects are in one repo. Many of them have multiple repos with shared libraries that need shared lints.

So in every workspace where you need the same lints for the same crate you would share them all from the marker's home.

The problem that I see with this approach is for the lints that are used only in a single cargo workspace, and they aren't shared, because that workspace is for a project with an end binary. In such case it probably makes sense to use the same cargo lock of that project and it's target dir just like cargo install --path does.

It does sound reasonable.

I would think that we should still restrict Marker to only load the lint crates, that are listed in Cargo.toml. I believe with time, there will be lints with different complexities. Some lint crates will be performance intensive and probably only intended to run in the CI. Here I'm thinking of global analysis and architecture lints. Is this in line, with how you imagined this?

Yes, lints need to be specified in Cargo.toml of the project. This doesn't need to change.

Another factoid, is that when having lints in a shared directory for reuse purposes they all need to be separated by their version, source (registry or git), etc. cargo-marker then would need to know to recompile the lint crate when the new compatible version of that lint crate is released. And.. still we should have the lock file for the version of the lint crate itself.

For example, if I specify

[workspace.metadata.marker.lints]
marker_lints = "0.2"

then cargo marker needs to select some specific latest patch version and put that into a lock file, so that next time someone in this repo runs cargo marker they also use the same version.

Although that Marker.lock could mention only the lint crates. It shouldn't include all their dependencies, because their dependencies are specified in Cargo.lock files on the per-lint crate basis.

There is quite a lot of complexity here. Having everything under a project-specific target dir should simplify the design and implementation at least initially. Having a shared ~/.marker dir will be more of an optimisation that we could have much later.

I know I'm switching ideas back-and-forth, but everything will settle when I work on code for this.

Comments from #283:

I added the docs --cfg=marker="lint_crate". Although, while writing them and thinking about this I found a potential problem with the --lints CLI parameter in light of #251. If we allow specifying lints in the CLI then how do these lints get into the Marker.lock? Maybe there won't be a problem here if we use the lock file of the lint itself and not have a Marker.lock. That is a good argument against Marker.lock

~ @Veetaha


For Marker.lock, you're right that this could be a problem. I have two ideas, which both feel hacky, but they might be a good inspiration:

  1. Compile each lint crate individually. And combine them all somehow, for example in a directory or by concatenating the Cargo.lock files in a Marker.lock file. This would support command line lints as well. However, it would require us to manually separate the lock file part for every compilation and will make it difficult to run simple cargo comments on the dummy fetch crate.
  2. Store the Marker.lock file for command line lints separately, maybe under a global ~/.marker folder?

~ @xFrednet


Maybe a lesser hack could be having a specialized feature for lint "profiles". This is similar to cargo build profiles or cargo-nextest profiles.

Users could define several profiles that would extend the default list of lints. They will then be able to select additional lints configured in Cargo.toml metadata using a --profile {profile_name} argument. This solution will then deprecate the --lints CLI parameter to avoid problems with Marker.lock as well.

Having everything specified in a config file allows Marker to know which lint crates the project may ever expect to be run with. So when marker is run with the default profile that doesn't include heavy lints Marker will still know that there are other lint crates that could participate in linting, and it could provide noop lint implementations for lints from those crates.

One problem with this is that Marker would need to compile all lint crates specified in the config including lint crates from other profiles to just list the lints that they have. Maybe if lint crates also had some metadata that could list their lints without compiling them this could solve this problem.

~ @Veetaha


Having lint profiles sounds like a good and flexible idea ๐Ÿ‘
~ @xFrednet

One more problem that I've found with the approach of maintaining a separate lock file is in the following setup.

  • Cargo.toml
  • .cargo/
    • config.toml contains a definition of a private crates registry e.g. registries.registry-name.index = "https://dl.cloudsmith.io/basic/org/repo-name/cargo/index.git"
  • lint-crate/
    • Cargo.toml contains a dependency on a crate from the private registry

Running cargo marker -l 'lint-crate.path = "./lint-crate"' doesn't work because we build the lint crate outside of its original workspace, and thus the repository's .cargo/config.toml file isn't used. The build fails to find the definition of the private registry.

This makes me more inclined to have the behavior of running multiple cargo builds and using the original package's lock files.

The discussions in this issue mention a bunch of edge cases and tradeoffs, and there are lot of them. I'll try to summarize everything in one rfc document to describe the design.