rust-lang/rust-roadmap-2017

Rust should integrate easily into large build systems

aturon opened this issue · 87 comments

Overview

When working with larger organizations interested in using Rust, one of the first hurdles we tend to run into is fitting into an existing build system. We've been exploring a number of different approaches, each of which ends up using Cargo (and sometimes rustc) in different ways, with different stories about how to incorporate crates from the broader crates.io ecosystem. Part of the issue seems to be a perceived overlap between functionality in Cargo (and its notion of compilation unit) and in ambient build systems, but we have yet to truly get to the bottom of the issues, and it may be that the problem is one of communication, rather than of some technical gap.

By the end of 2017, this kind of integration should be easy: as a community, we should have a strong understanding of best practices, and potentially build tooling in support of those practices. And of course, we want to approach this goal with Rust's values in mind, ensuring that first-class access to the crates.io ecosystem is a cornerstone of our eventual story.

Projects

At this point, we are still trying to assess the problems people face in this area. If you have experience here, please leave a comment with your thoughts!

luser commented

I started a repo not long ago to collect anecdotes about people integrating Rust into existing projects. I only have a few examples in there (and I keep seeing more crop up all the time). It'd be great to at least do a survey of all the examples we can find of people attacking this problem to see what the major issues were.

From the Firefox perspective, we have a pretty custom build frontend that generates Makefiles in the backend, so we would have had to write the custom integration bits regardless. We did find that life got a lot better once we started invoking cargo instead of rustc directly. I'm sure there are projects that would get value out of having individual Rust source files in their codebase, but it feels like a lot of the value in the Rust ecosystem comes from using cargo and leveraging crates.io (I doubt this is contentious).

I've worked on build systems to compile Rust components for Erlang applications, so I'm in a good position to talk about a few issues. An overview of the building that takes place can be found in this README. Erlang can make use of bins, dylibs, and cdylibs.

  1. dylibs on OSX have special link requirements ("-- --codegen link-args='-flat_namespace -undefined suppress'") which creates a cascade of fussy work. Firstly, to provide that flag I need to "cargo rustc" instead of "cargo build", and to do that I need to detect all the binary/lib targets and build them one at a time. I really wish I could just "cargo build" and have cargo sort out the details for me. Maybe a "--extension-lib" flag for "cargo build" to apply this behavior? I understand this linking scenario is not unique to Erlang.

  2. Discovering and locating the output files is tricky. I have to "cargo read-manifest" to find the targets, form a platform-specific name from those results, then parse the build flags for any "--target" options to form a path for these files. I would love a flag of the form "--print-artifacts=[bin|lib|dylib|cdylib]" for "cargo build" to print the full output path and name to stdout.
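To make the artifact-discovery problem concrete, here is a sketch of the kind of glue code this currently requires, assuming a trimmed-down, hypothetical shape for the "cargo read-manifest" JSON (the real output has many more fields) and an illustrative, non-exhaustive naming table:

```python
import json

# Hypothetical, trimmed-down shape of `cargo read-manifest` output;
# the real JSON carries many more fields per target.
MANIFEST_JSON = """
{
  "name": "mylib",
  "targets": [
    {"name": "mylib", "kind": ["cdylib"]},
    {"name": "mytool", "kind": ["bin"]}
  ]
}
"""

# Platform-specific prefix/suffix per artifact kind (illustrative, not
# exhaustive: e.g. Windows import libraries are omitted).
NAMING = {
    ("cdylib", "linux"):   ("lib", ".so"),
    ("cdylib", "macos"):   ("lib", ".dylib"),
    ("cdylib", "windows"): ("",    ".dll"),
    ("bin", "linux"):      ("",    ""),
    ("bin", "macos"):      ("",    ""),
    ("bin", "windows"):    ("",    ".exe"),
}

def artifact_names(manifest_json, platform):
    """Map each target in the manifest to its on-disk artifact name."""
    manifest = json.loads(manifest_json)
    names = {}
    for target in manifest["targets"]:
        for kind in target["kind"]:
            prefix, suffix = NAMING[(kind, platform)]
            # Cargo replaces '-' with '_' in library file names.
            base = (target["name"].replace("-", "_")
                    if kind != "bin" else target["name"])
            names[target["name"]] = prefix + base + suffix
    return names
```

The proposed "--print-artifacts" flag would let Cargo answer this question itself instead of every integrator re-deriving these naming rules.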

This is great. Integrating Rust/Cargo with Bazel would be the first major hurdle to us using more Rust in our codebase at work, which is a pretty large mass of C++ code with some Python sprinkled throughout.

Since both Bazel and Cargo are fairly opinionated about how builds and package management should work, and since I am an expert in neither, it's not immediately clear to me which build system should be doing what or if we should just try to integrate rustc into Bazel without Cargo at all. Using strictly Cargo is (unfortunately) probably out of the question since most of the C++ and Python packages are dependency tracked with Bazel BUILD files, and any serious integration with the C++ code would require our Rust libraries and binaries to be dependency tracked by Bazel as well.

Bazel has support for local sources, which is to say it does not support crates.io. There's an issue open for it, but the project is relatively inactive bazelbuild/rules_rust#2. One would need to keep a local repository clone of a crate and all its dependencies to make bazel happy without modifications.

Bazel's rust targets also can't be depended on by c/c++ at the moment.

The kythe project has wrappers for bazel to call cargo, but it's not the most robust approach to integration. For reference, kythe:tools/build_rules/rust.

disclaimer: I don't know too much about bazel, happened to learn the above looking into it last week.

Bazel's been mentioned several times when this has come up. I believe Dropbox needs to integrate its Rust code into a Bazel build process, and I think I might have been told Facebook uses it as well (though it may be that Facebook has an internal tool that is similar to Bazel). It seems like a promising tool to look into for this issue.

Bazel Rust Rules author here. Apologies for the inactivity on the rules_rust project; I have been busy with other projects on my plate. I am planning to implement the workspace rules for pulling crates from Cargo (bazelbuild/rules_rust#2) by the end of Q1 with a stretch goal of also implementing tooling for automatically generating Bazel BUILD files from Cargo.toml (bazelbuild/rules_rust#3), though the latter will likely extend into Q2.

The Kythe project has rules that shell out to Cargo directly as a quick stop-gap measure, but those rules are meant to be used internally in Kythe for now (since they are not hermetic, for instance), and the plan is to replace them with the rules in rules_rust once features such as pulling from Cargo are supported.

Of course, if anyone is interested in helping with improving the Bazel rules for Rust, contributions are certainly welcome. :)

+cc @damienmg

Kythe contributor here. We'd definitely like to see better integration with Cargo from the upstream Bazel Rust rules. Our extant integration was very much a hack to allow our intern to make progress on the Rust indexer itself, rather than getting bogged down with Bazel integration.

Facebook uses Buck, and there is some early Rust support:

https://buckbuild.com/rule/rust_binary.html

Facebook vendors their dependencies in-tree.

jsgf commented

Facebook uses Buck, and there is some early rust support:

I've spent quite a bit of time on that over the last few months, and it's getting pretty solid now. It's well integrated with the overall build/test system and (most recently) can also interop with cxx rules.

I'm the main author of the Meson build system, which is used by GStreamer and a bunch of other projects. We also have Rust support, which is a bit rudimentary but can be used to build things like a Python extension module that uses C, C++, Rust and Fortran in a single target. We aim to improve Rust support. This is especially important for mixed-language projects, since Cargo is nice for plain Rust projects but I'm fairly certain that the Cargo developers do not want to add first-class multiplatform C/C++ build support to Cargo.

A key part of doing this well will involve handling updates to the dependency graph in a deliberate and piecemeal way, particularly in scenarios where upstream "master" has moved to a new version (e.g., as in rust-lang/cargo#2649).

We (@nox, @SimonSapin, and I) have had many conversations with @alexcrichton and @wycats on this front, and I believe the leading contender from their point of view is some set of extensions to fix up paths and avoid the abuse of [replace], which runs immediately into version-related roadblocks for any non-trivial project.

Nix (which has a similarly pure view of build systems to Bazel) uses a trick that involves cloning a well-known version of the crates index git repository, see https://github.com/NixOS/nixpkgs/tree/master/pkgs/build-support/rust

Sadly the nixpkgs crates index is not hermetically sealed, which gave us many problems. So we implemented a crate index nixifier, which reads crates.io-index and spits out a nixified crates index. This allows one to use nix to completely manage transitive crate dependencies of a project without needing cargo. The repo: https://github.com/fractalide/nixcrates

I've taken a look at teaching bazel how to digest Cargo's toml files to pull down third party crates.io dependencies automatically, and there don't seem to be very many sticking points.

A couple of rough patches I've seen though:

  • Cargo exports some environment variables that projects can sometimes come to depend on. This happened once with clap. This is super minor though.
  • target.*.dependencies rules can be hard to deal with since they (seem?) to use rust inlined into the toml. My current approach is to just include them all, and hack around any platform specific deps that don't play nice on my platform.
  • build.rs files seem antithetical to bazel's "hermetic ethos", since they can do pretty much anything. I think this will become less of an issue in the very near future since common use cases such as serde are being resolved with the stabilization of macros 1.1.

Just commenting here because I believe there is a lot of opportunity to appeal to the JVM ecosystem, incrementally replacing Java/Scala/etc. code with Rust, if we are able to make it trivial to incorporate Rust into Ant/Maven/etc. builds. It would also help to go the other way: add a jar to a Cargo.toml or build.rs and have Rust bindings generated automatically, maybe using a combination of https://github.com/kud1ing/rucaja/ and https://github.com/kenpratt/jvm-assembler as starting points. This could all happen outside core, obviously, but Rust adoption by the very large, as-yet untapped ecosystem of JVM code and developers would be quite beneficial.

I've proposed haskell/cabal#3882 for Cabal. The same thing can work for Cargo, and would solve this problem for everyone.

luser commented

@raphlinus and I were discussing some related issues on IRC not long ago. One of the ideas that he floated was a way to make cargo simply output the commands that it would run to do the build, so that we could leave parsing Cargo.toml to cargo, but allow other build systems like bazel to run the build like they expect.

Thanks for the comments everyone! I and a few others have thought a lot about this in the past as well, and I wanted to jot down some notes and conclusions that we've reached historically.

First and foremost we've historically concluded that build system integration is not finished until you've got access to crates.io crates. The standard library is purposefully conservative and small in size with the explicit intent of having rich functionality in the ecosystem on crates.io. If an integration doesn't allow easy access to crates.io, then there's more work yet to be done!

Today, of course, Cargo is the primary gateway into the crates.io ecosystem. Cargo is also the primary build tool for Rust, but there are perennial questions about how to integrate Cargo into existing build tools. Many issues have been solved over time in this vein, such as vendoring dependencies, workspaces, etc. Cargo also has the benefit of being friendly and familiar to existing Rust programmers, with a shared workflow across the ecosystem.

Something else that we've concluded, however, is that preserving Cargo workflows should not necessarily be a hard constraint for build system integration. Existing projects already have a workflow associated with them, and Rust code should integrate into it as appropriate instead of imposing restrictions on how it works. Of course, preserving a Cargo-based workflow for the Rust-specific portions is a nice-to-have!

And finally, one thing we've talked about is compilation units. For example, C/C++ has files as compilation units, and that's typically what build systems for C/C++ are architected around. Rust, however, doesn't have this granularity of compilation unit. Fundamentally, the compiler supports crates as the compilation unit. Moving up the stack to Cargo, it ends up generally being the case that the entire crate graph is Cargo's compilation unit (one command outputs the entire crate graph). The question is then: how does this integrate into an existing build system? Is the crate graph sufficient? Does the granularity need to be finer, such as individual crates? (I'm not sure!)

One part to consider about compilation units is that they typically heavily affect caching in build systems. For example, distributed caches may cache entire compilation units but nothing more granular. This means that DAG-as-a-unit would probably be too coarse for caching. On the other hand, it is not clear how crate-as-a-unit could integrate between build systems and Cargo today.

So with all that I think we're faced with a few problems that may be thorny to solve:

  • If we go with DAG-as-a-unit, is this sufficient? Can Cargo hook into existing caching infrastructure adequately? I believe this is how projects like Gecko work today where the whole Rust DAG is a unit and cargo is used to build it. This may have problems, however, if there are multiple Rust projects to link together (e.g. stylo and spidermonkey in Gecko both independently having Rust code)

  • If we go with crate-as-a-unit, how can we get Cargo and build systems to cooperate? Does Cargo need finer-grained operation? Should Cargo generate build files for each target build system? My assumption is that supporting features like build scripts may be very difficult in foreign build systems, but can we get by without build scripts in some systems?
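To make the granularity question above concrete, here is a toy sketch (crate names invented) contrasting the two unit choices as an external build system would see them:

```python
# Toy crate graph: crate -> direct dependencies. Names are made up.
CRATES = {
    "app":   ["serde", "log"],
    "serde": [],
    "log":   [],
}

def dag_as_a_unit(graph):
    """One opaque build step for the whole graph, roughly what invoking
    `cargo build` gives an external build system today: a single
    cacheable unit, however large the crate graph is."""
    return [sorted(graph)]  # one step containing every crate

def crate_as_a_unit(graph):
    """One build step per crate, in dependency (topological) order, so
    an external build system could schedule and cache each crate
    individually."""
    order, seen = [], set()
    def visit(crate):
        if crate in seen:
            return
        seen.add(crate)
        for dep in graph[crate]:
            visit(dep)
        order.append(crate)
    for crate in sorted(graph):
        visit(crate)
    return [[c] for c in order]
```

The caching trade-off described above falls directly out of this: the first function yields one cache key for everything, while the second yields a key per crate, which is what tools like Buck and Bazel expect.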

Unfortunately I don't have a whole lot of solutions just yet; I'm personally still at least trying to grapple with the problem space. @luser does the above sound accurate for Gecko's Rust integration, at least at a high level? @jsgf could you detail some of the work you've done at a high level for Buck's Rust support?

I've found that each build system tends to have its own unique set of constraints for integration, but the more we know the easier we can accommodate everyone!

Oh, one point I should also mention is that I personally think it's at least relatively important to try to lean on Cargo as much as possible for build system integration. Cargo is the bread and butter of building Rust, and avoiding Cargo leaves build systems with a massive number of features to reimplement. I'd much rather pursue avenues to add features and/or make Cargo more flexible to interoperate with existing build systems. For example, I could imagine Cargo generating build files or working in a much more granular fashion assuming another process manages inputs/outputs.

jsgf commented

The things that make Cargo awkward to use in our environment are:

  • It downloads things at build time; we need to be able to nail down external dependencies and be explicit when they're updated.
  • It's awkward to have dependencies on C++ code. We have tons of existing C++ code, and abandoning it is a non-option. Restricting Rust to pure code would make it unsuitable for many potential users - as a result, I've been spending some time to make integration with C++ as straightforward as possible, which also means being able to take C++ as a dependency using its native build system
  • Cargo does too much. We have an extensive distributed caching system for build artifacts which tries to use the cache to avoid as much build work as possible. If we were to use Cargo to build it would be all or nothing - either it would not be invoked if nothing needs to be rebuilt, or run to build everything once
  • Cargo doesn't do enough. Each Cargo.toml has its own set of dependencies, and those dependent crates get rebuilt for each Cargo.toml that depends on them. Using workspaces allows those deps to be shared, but only with a very specific arrangement of directories. It doesn't scale to a large single source base containing thousands of distinct projects, some of which may be rust, with an organizational scheme that's something other than their shared dependencies.
  • build.rs doesn't really work - building an executable then running it in the build infra is pretty awkward, and not well regarded. Making it work well in a semi-cross build environment is also tricky (not cross-arch, but different library versions in the build execution env vs the prod env).

What I think you're saying is that going directly to rustc is too low-level for your taste; you'd prefer to have a higher-level tool that's actually coordinating builds. But on the other hand, cargo is too high-level for our purposes. As a standalone tool I think it's excellent, but it tries to impose too many opinions to interact well with other build environments.

Perhaps there's some scope for a mid-level tool that provides cargo's mechanisms, but not the UI, and a higher-level tool that presents a nicer UI/user experience when it's used as the primary build mechanism. Or perhaps rustc itself is that interface, and it just needs to be designed accordingly?

Right now I'm handling all this by using cargo to download crates.io crates and manage all their dependencies, then prebuilding them and keeping all the build artifacts. All our internal builds are built with buck using its dependency management, and ultimately linked with the prebuilt crates.io code. That way cargo is a one-time operation rather than something that's involved with every build, while still taking advantage of it to manage all the code that's intended to be built with it.

@alexcrichton DAG-as-a-unit will always be too coarse-grained. That is basically what we have now, and it is not good enough. For crate-as-a-unit though, I'd think build.rs would not be a problem, because any dynamism from build.rs only affects the current crate, right? The dependency graph with crate granularity is still static.

So yeah, what needs to be done at a minimum is adding two new modes of operation to Cargo:

  1. Make a complete plan (a DAG with crate nodes); this is far more than a lockfile. Do impure things here, like downloading crates.io indices and crates.io package sources.

  2. Build one crate/node from the pre-made plan. Assume every node gets its own $out directory, and paths to all (transitive) dependencies' $out directories will be passed to Cargo in this mode.
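A minimal sketch of what those two modes could look like, with an entirely invented plan schema and an invented "cargo build-one" subcommand (neither exists today):

```python
import json

def make_plan(graph):
    """Mode 1 (sketch): emit a complete, static build plan as JSON.
    Any impure steps (index updates, source downloads) would happen
    here, once, before the plan is handed to the driving build tool."""
    nodes = [{"crate": c, "deps": sorted(d), "out": f"/out/{c}"}
             for c, d in sorted(graph.items())]
    return json.dumps({"nodes": nodes})

def build_node(plan_json, crate):
    """Mode 2 (sketch): build a single node from a pre-made plan.
    Returns the command an external driver (Buck, Bazel, Nix) would
    run, with each dependency's $out directory passed in explicitly.
    `cargo build-one` is hypothetical."""
    plan = {n["crate"]: n for n in json.loads(plan_json)["nodes"]}
    node = plan[crate]
    dep_flags = [f"--extern {d}={plan[d]['out']}" for d in node["deps"]]
    return " ".join(["cargo", "build-one", crate,
                     "--out", node["out"]] + dep_flags)
```

The key property is that mode 2 is a pure function of the plan plus the dependencies' outputs, which is exactly the shape memoizing build systems want.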


A cool follow-up to this would be writing a dependency-management library that allows either serializing plans so that external tools can drive the build, or executing the plan in the current process. This would avoid any code duplication in Cargo and, I'd guess, be useful for rustbuild, too.

See haskell/cabal#4174 for analogous effort with Cabal and Shake (though, unfortunately, Shake does not allow exporting static dependency graphs like this).

I also echo @jsgf's point about build.rs not being ideal. build.rs is used by Rust projects to compile code in other languages, such as C++ libraries using gcc-rs, but this should really be handled by the build system itself.

For example, the Bazel rust rules support interop with C/C++ code, meaning that you can have the following:

cc_library(
    name = "foo",
    srcs = ["foo.cc"],
    hdrs = ["foo.h"],
)

rust_library(
    name = "bar",
    srcs = ["src/lib.rs"],
    deps = [":foo"],
)

Also, of note, while one mode of using Bazel is to vendor all dependencies (which is the practice followed by Google internally), Bazel also supports fetching external dependencies (which happens prior to build time) and provides a simple API for writing repository rules. As mentioned above, a rule that fetches crates from crates.io is in the pipeline for users who prefer not to vendor all dependencies.

Thanks for typing that up @jsgf!

It downloads things at build time

To clarify, I'm under the impression that this is a solved problem today. With multiple vendoring options available, that was at least the intention! Did you find, though, that the vendoring support wasn't suitable for Buck's use case?

It's awkward to have dependencies on C++ code.

To clarify, this is from a build system perspective, not a language perspective, right? If possible I'd like to focus this thread at least on just the build system aspect and we can perhaps continue the language discussion over at #14 :)

It definitely makes sense to me that it's difficult to depend on C++ code in a build-system sense. Some of this I think is the granularity of builds today (DAG vs crate), but in general I think it's just flat-out unergonomic and difficult to plug preexisting artifacts into a Cargo build.

We have an extensive distributed caching system for build artifacts which tries to use the cache to avoid as much build work as possible.

Definitely makes sense! I don't think it's out of the question, though, for Cargo to support custom caching. In fact, with sccache we may get exactly this!

In general I'd like to keep an open mind to Cargo's current implementation today, and we can basically extend it in any way we see fit. For example Cargo's already got enough information to create a unique hash key for a crate and we could restructure it with custom caching to pull in artifacts on demand (or assume they're at a predetermined location) or something like that.

Not saying this is a silver bullet of course, but our options are still open!

Using workspaces allows those deps to be shared, but only with a very specific arrangement of directories. It doesn't scale to a large single source base containing thousands of distinct projects, some of which may be rust, with an organizational scheme that's something other than their shared dependencies.

I'm not sure I quite understand this constraint, so I wonder if we could dig in a bit? I definitely agree that a workspace may not scale to thousands of crates and projects, but the idea of a Cargo.toml certainly should, right?

I guess I'm not fully understanding what's not scaling here. Are you thinking this is a fundamental compiler limitation? Or just something that needs working around in Cargo today? As with above, I'd like to keep in mind the possibility of changes to Cargo to make it more amenable to situations like this rather than assuming the functionality of today is impossible to change!

build.rs doesn't really work - building an executable then running it in the build infra is pretty awkward, and not well regarded

Yeah, I can definitely understand how this may be nonstandard. I don't think this is something that can be sidestepped for too long, though, as a concept. Custom derive (macros 1.1) was stabilized in Rust 1.15, and that requires compiling a plugin at build time to then run inside the compiler. I would describe such a practice as very similar to build.rs (in principle at least), and I'd expect that ergonomically using Rust will basically require using Serde in the near future (especially for communicating services).

In that sense, is it literally the build.rs with inputs/outputs itself? Or is it the concept of running code at build time that may cause problems? I'd definitely argue that macros 1.1 is more principled than build.rs (a defined set of inputs/outputs), but naively, to me at least, they don't seem fundamentally different.

Perhaps there's some scope for a mid-level tool that provides cargo's mechanisms, but not the UI, and a higher-level tool that presents a nicer UI/user experience when it's used as the primary build mechanism.

I definitely agree! I do think that Cargo's too high level for Buck's use case today, and I personally feel that rustc will almost always be too "low level" to get real benefit. As we continue to add features to Rust, the compiler, and Cargo, the idiomatic and most ergonomic way to consume these features will be through Cargo. For example macros 1.1 might be an absolute nightmare if you had to manage all the builds yourself, especially when cross compiling.

I personally think that rustc sits at the right level of abstraction for Cargo to be calling, so we wouldn't want to soup it up too much. Taking Cargo down to be a bit lower level though I think is where there's a lot of benefit to be had. With that in place we'd then design new language features with such a tool in mind to ensure the experience is smooth for everyone, Cargo users and "this lower level Cargo" users alike.


@davidzchen thanks for the input about Bazel! I'm curious on your thoughts about my comments above related to build scripts as well. Do you agree that compiler plugins (e.g. macros 1.1) are along the same vein as build scripts, or is one much easier to support in Bazel than the other? Or are they both very difficult to support?

@alexcrichton Regarding compiler plugins and build scripts, the way I see it, the key difference between the two is that people write compiler plugins when they have a need to reuse functionality in the Rust compiler whereas people write build.rs to do anything they cannot do via the Cargo build system itself. As a result, compiler plugins are, in practice, used for much more niche use cases than build.rs.

One analogy that I can draw is that build.rs is similar to Bazel's genrule, which allows you to run any arbitrary shell command.

In any case, neither of these are difficult to support in Bazel:

  • For compiler plugins, Bazel has precedent for a feature like this: the java_plugin rule, which is used for running Java compiler plugins.
  • For build.rs, a quick and dirty solution would be to build it with a rust_binary rule and then run it in a genrule (and perhaps wrap these two in a rust_buildscript macro for ergonomics).
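A rough sketch of that quick-and-dirty approach, using the real rust_binary and genrule rules but with made-up target names, and assuming the build script writes its generated code to stdout:

```
# Compile the build script as an ordinary binary target.
rust_binary(
    name = "bar_buildscript",
    srcs = ["build.rs"],
)

# Run it in a genrule; Bazel tracks the declared output explicitly,
# unlike cargo's build.rs, which may touch arbitrary files.
genrule(
    name = "bar_generated",
    outs = ["generated.rs"],
    cmd = "$(location :bar_buildscript) > $@",
    tools = [":bar_buildscript"],
)
```

A rust_buildscript macro, as suggested above, would simply expand to these two targets.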

One main concern that I have with relying too heavily on build.rs is that what it runs is arbitrary and (often) potentially non-hermetic. As a result, for projects that have a mix of different languages with interop between Rust and other languages, it would be better to rely more on the build system for this, have more fine-grained build targets, and limit the use of build.rs as much as possible.

jsgf commented

@alexcrichton:

To clarify, I'm under the impression that this is a solved problem today

Mechanically it's solved because --frozen will prevent any attempts to download, and in practice I handle the whole problem by prebuilding all the parts of crates.io that our code needs. Is there an option to reverse this, so that all downloads are prohibited by default, and only allowed if there's an explicit option or command? Might be useful if not.

To clarify, this is from a build system perspective, not a language perspective, right?

Yes. We have C++ libraries with their own complex dependency graphs managed by Buck that I'd like to add Rust FFI bindings for, and make sure that everything gets rebuilt properly. I'd also like to be able to expose Rust libraries to C/C++ code (mostly for things like Python extensions), and again, make sure the build system knows all the deps. Trying to manage dependencies across build systems seems like it could be awkward.

I don't think it's out of the question, though, for Cargo to support custom caching

Do you mean Cargo might be able to make use of Buck's cache? That poses lots of problems, not least because it's unclear how Cargo would be able to compute the correct key. Buck's cache is indexed not only by the immediate dependency (the source file contents), but also by the keys of all its dependencies, with the goal of being able to skip as much of the dependency graph as possible. Cargo wouldn't have access to the information needed to either look up or insert blobs into the cache.

Effectively, Buck treats the compiler as a pure function of inputs -> output and memoizes the result. If the build tool is more complex than that, then Buck can't cache its state well, and it complicates the interface to the build tool if it's doing its own caching/memoization.
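The recursive keying described here can be sketched in a few lines; this is an illustration of the idea (crate names and sources invented), not Buck's actual key computation:

```python
import hashlib

def cache_key(crate, sources, dep_keys):
    """Buck-style recursive cache key (sketch): hash the crate's own
    sources together with the cache keys of all its dependencies, so a
    change anywhere below a node changes that node's key too."""
    h = hashlib.sha256()
    h.update(crate.encode())
    for src in sorted(sources):
        h.update(src.encode())
    for key in sorted(dep_keys):
        h.update(key.encode())
    return h.hexdigest()

# Example: changing log's sources changes app's key (app depends on
# log), but leaves serde's key untouched.
log_key_v1 = cache_key("log", ["fn log() {}"], [])
log_key_v2 = cache_key("log", ["fn log2() {}"], [])
serde_key = cache_key("serde", ["fn ser() {}"], [])
app_v1 = cache_key("app", ["fn main() {}"], [log_key_v1, serde_key])
app_v2 = cache_key("app", ["fn main() {}"], [log_key_v2, serde_key])
```

Because the key of every node folds in the keys of its dependencies, looking up a node in the cache implicitly validates its entire subgraph, which is why the build tool, not the compiler, must own the key computation.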

I haven't looked at sccache in detail, but this is quite different from how something like ccache works: ccache caches the results of individual compiles, but doesn't take the dependency graph into account.

I'm not sure I quite understand this constraint, so I wonder if we could dig in a bit? I definitely agree that a workspace may not scale to thousands of crates and projects, but the idea of a Cargo.toml certainly should, right?

Yeah, I was being pretty unclear.

Without workspaces, every binary cargo package has a dependency graph on other packages with library crates, and building that executable builds them all as needed. If multiple binaries share some or all of the same crates, then they all get rebuilt regardless.

You can use workspaces to effectively share the dependencies between multiple binary crates so that the library crates only get built once. But to achieve this, all the crates - binary and library - must be in a single workspace.

There's a few issues with this:

  • workspaces are strictly hierarchical, so effectively you need to configure a workspace to encompass the entire source tree
  • but the hierarchy is a single level, so all the packages have to be peers (at least logically)
  • the Cargo.lock file for the entire workspace might end up being massive if there are lots of dependencies, and it ends up being a flat dependency graph (I'm not actually sure if this is a problem)
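For reference, the flat single-workspace arrangement being described looks roughly like this in the root Cargo.toml (member paths are invented):

```toml
# One workspace at the root of the entire source tree: every package,
# binary or library, is a direct member, shares one target/ directory,
# and is pinned by one shared Cargo.lock.
[workspace]
members = [
    "services/frontend",   # binary
    "services/backend",    # binary
    "libs/common",         # library both binaries depend on
]
```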

I know there's been some discussion about loosening the constraints on workspaces to either allow nesting or have other relationships (esp. with path dependencies), but it's not clear to me that they're the right way of modelling a complex dependency DAG in a way that minimizes building.

(Also I haven't really looked at workspaces in a while, so perhaps I'm completely out of date here, or just wrong.)

Yeah, I can definitely understand how this may be nonstandard. I don't think this is something that can be sidestepped for too long, though, as a concept. Custom derive (macros 1.1) was stabilized in Rust 1.15, and that requires compiling a plugin at build time to then run inside the compiler

I haven't looked at macros 1.1 yet, but are you saying that every crate that uses - say - serde will need to also build a compiler component, or is it just when serde itself is built? If it's the latter then I can handle that when I pre-build the crates.io crates, and it's basically no more difficult a constraint than object files being compiler-version dependent.

If they need to be rebuilt for every user, then yeah, that's trickier.

The more general problem with build.rs is building an executable then running it as part of the build process. It's awkward to manage because it has unconstrained inputs and outputs (it can read and write arbitrary files) which means that it's opaque to the build system/dependency management.

Of the use cases listed in the docs, "Building a bundled C library", "Finding a C library on the host system" and "Performing any platform-specific configuration needed for the crate" are all pretty horrifying from a build integrity/reproducibility perspective - they are strong antipatterns. The only one that makes any sense is "Generating a Rust module from a specification" (ie, generated source), but that could be done with a much more constrained interface (and perhaps macros 1.1 is that interface).

There's also the general security problem of just running random binaries on a build host that can do arbitrary things. It can be managed, but the less it happens the better.

I personally feel that rustc will almost always be too "low level" to get real benefit

rustc is about the right level for Buck, since it's similar to gcc/javac/etc; certainly integrating at the rustc level (while not trivial) was more conceptually straightforward than trying to work out a conceptual mapping between Buck and Cargo.

What might be useful is:

  • "low-level cargo" provides an interface to a set of build recipes like "build me a .rlib", "build me an executable", "build me a compiler plugin", "tell me your dependencies/input sources"
  • a standardized format for dependency interchange, so that buck could emit - say - a blob of JSON describing what it knows so that the cargo-like tool can make use of it, and perhaps vice versa (this would also be useful for other tools like the RLS, rather than relying on Cargo.toml/cargo directly)
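As a strawman, such an interchange blob might look something like the following; every field name here is invented for illustration and is not an existing Cargo or Buck output format:

```json
{
  "crate": "mylib",
  "crate_type": "rlib",
  "sources": ["src/lib.rs", "src/util.rs"],
  "deps": [
    { "name": "libc", "version": "0.2.18", "rlib": "path/to/liblibc.rlib" }
  ],
  "features": ["default"]
}
```

The point is only that the information each side needs (sources, dependency edges, prebuilt artifacts) is small and easily serialized.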

Buck and Bazel are extremely similar in a lot of ways, so I expect that a solution that works for one will likely help with the other. As @davidzchen mentioned, the concept of a compiler plugin is very important for Java, so Buck can also deal with it (since Java/Android is one of its primary use-cases); extending the concept to Rust is reasonably straightforward.

This thread is a delight to my eyes.
All the stuff about Cargo's scalability and integration is what I wanted to talk about, but I didn't have enough factual evidence and practical experience. @jsgf @davidzchen thanks a lot for the details!
Even if rustc+stdlib themselves are not an especially large project, these issues already show up at the rustbuild level, which favors "leverage Cargo and Rust as much as possible" over build-system best practices. I hope this discussion will benefit it as well in the end.

luser commented

If we go with DAG-as-a-unit, is this sufficient? Can Cargo hook into existing caching infrastructure adequately? I believe this is how projects like Gecko work today, where the whole Rust DAG is a unit and cargo is used to build it. This may have problems, however, if there are multiple Rust projects to link together (e.g. stylo and spidermonkey in Gecko both independently having Rust code).

In Gecko we are effectively limiting things to a single crate per output binary in the build system. We haven't crossed the "Spidermonkey requires Rust" bridge yet, but when we do we will probably just have it behind the existing "building the JS engine standalone" flag, and otherwise have that code pulled in via the crate that gets linked into libxul.

We've discussed this in other forums when we first ran into this issue, I know. The core problem was that when outputting something other than an rlib rustc includes support code such as jemalloc, and you can only link one copy of that into a binary.

luser commented

Mechanically it's solved because --frozen will prevent any attempts to download, and in practice I handle the whole problem by prebuilding all the parts of crates.io that our code needs. Is there an option to reverse this, so that all downloads are prohibited by default and only allowed with an explicit option or command? It might be useful if not.

We've discussed this before, but no, there's not. I have a cargo issue open for making our Gecko use case nicer. For Gecko we currently vendor all our crates.io dependencies with cargo vendor, and use a cargo config file to enable source replacement.
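For reference, the source-replacement setup that pairs with `cargo vendor` looks roughly like this in `.cargo/config` (the vendor directory path is whatever you chose when vendoring):

```toml
# Redirect all crates.io lookups to a vendored directory; combined with
# --frozen, no network access is needed or attempted during the build.
[source.crates-io]
replace-with = "vendored-sources"

[source.vendored-sources]
directory = "third_party/rust"
```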

I haven't looked at sccache in detail, but this is quite different from how something like ccache works; ccache caches the results of individual compiles, but doesn't take the dependency graph into account.

This isn't implemented yet (I'm working on it this quarter), but we're planning on making sccache able to cache rust compilations at the crate level. There's a good writeup of the plan here. This is different from ccache, which operates at the object file level, but Rust compilation is fundamentally different from C compilation.

jsgf commented

This is different from ccache, which operates at the object file level, but Rust compilation is fundamentally different from C compilation

Yes and no - you can roughly model a Rust compilation as a single crate == a single object file, where lib.rs effectively #includes the rest of the sources (this breaks down when talking about dependencies on other crates, and probably incremental compilation).

eddyb commented

@jsgf That is, however, the intention: crates are compilation units, from the user's point of view.

Internally we already have "codegen units" which are multiple translation units per (crate) compilation, and it's plausible we might have a "fusion" mode where the translation units are all triggered by one compilation unit (the "final executable" in an app, for example), instead of by each dependency.

So the object file analogy is imperfect if you take .o files literally, because how those are split can be up to the compiler (based on heuristics that you can't arrive at in the build system), but crates are still the compilation units, and the #include analogy is fine if you use the --emit=dep-info make-style dependency output to know what the sources are and you pass --extern for dependency crates, so those are precise too.
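Concretely (crate names and paths invented for illustration), the crate-as-compilation-unit model corresponds to rustc invocations like:

```sh
# Build a dependency crate once, as an rlib (one crate == one compilation unit).
rustc --crate-type rlib --crate-name util util/lib.rs -o libutil.rlib

# Build the downstream crate against that exact artifact. --extern makes the
# dependency edge precise; --emit=dep-info additionally writes a make-style
# .d file listing every source file the crate pulled in.
rustc --crate-type bin --crate-name app src/main.rs \
    --extern util=libutil.rlib --emit=dep-info,link
```

This is roughly the level at which Buck-style integrations drive the compiler directly.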

@luser You might want to talk to @nikomatsakis and/or @michaelwoerister (if you haven't already) about incremental recompilation, there's potentially a scenario where sccache can track the internal rustc incremental state, which would let you piggyback on its object file reuse, but I'm not sure it's worth doing.

@eddyb If incremental recompilation can serialize the exact codegen-unit-level dependencies it needs for a codegen unit, and then be resumed against just those, incremental recompilation will work beautifully with (Nix's) import from derivation.

This is icing on the cake of everything I mentioned before.

The approach for integration taken in cargo-bitbake and used in meta-rust to handle including cargo packages within the bitbake/OpenEmbedded/Yocto build-system might be relevant to folks in this thread. I believe @cardoe also has written an integration for cargo packages in gentoo's portage build system.

Not quite the same as other build systems mentioned as these are both intended for generating pieces of linux distributions, but some of the goals (avoiding network access at build time & ensuring a stable build) are similar.

Yes I have. In Gentoo we use cargo-build which behaves the same way that cargo-bitbake does.

We need the ability to fetch all the data necessary to perform the build before cargo executes since the package managers for both Yocto and Gentoo are responsible for verifying the integrity of the downloads and handling the downloads. The build process is done using dropped privileges where there is no network.

I'm not sure if there is any interest in this or not, but I've been working on Gradle plugins that use Cargo under the hood to build Rust code. The Rust plugins are largely inspired by Python plugins that I and others wrote and open sourced last year (see the engineering post and repository if you're interested in the design). TL;DR: Gradle rocks in the enterprise, and makes plugging in new languages easy. We can do a good job using Gradle to just "orchestrate" the build, leaving all of the details to Cargo and Rust, thereby preserving idiomatic Rust development while integrating with large [Gradle] build systems.

If there is interest in seeing these Rust plugins, I can work on cleaning them up and open sourcing them. To date, they're mostly used for a few pet projects of mine.

We need the ability to fetch all the data necessary to perform the build before cargo executes since the package managers for both Yocto and Gentoo are responsible for verifying the integrity of the downloads and handling the downloads. The build process is done using dropped privileges where there is no network.

^ This is a particularly clear distillation of the requirements for integration into build systems with existing caches and reproducibility guarantees (including the one used at Twitter).

@davidzchen

ok cool, thanks for the info! I agree that totally arbitrary build.rs probably can't be supported, but I could imagine that for "actual build system integration" that (reasonable) restrictions are imposed on build scripts pulled in, perhaps described by metadata in Cargo.toml or something like that.

Good to know though that nothing is fundamentally unsupported!


@jsgf

Is there an option to reverse this, so that all downloads are prohibited by default, and only allowed if there's an explicit option or command? Might be useful if not.

Unfortunately not right now, but we could add a .cargo/config option to do so perhaps.

Effectively Buck treats the compiler as a pure function of inputs -> output, and memoizes the result.

Makes sense to me! Remember though that rustc/cargo are the same thing :). All the tools here are effectively pure functions, so what I'd like to figure out is the best way for rustc/cargo to fit in this model because it should mesh well!

That's a good point though about hashes taking dependencies into account. Cargo could do all it needs to do for Rust code, but it wouldn't be able to understand hashes on external C/C++ libraries used by Rust.

Also yeah, the intent of sccache is that it will look at all crates on the command line and use those as input for hashing.

If multiple binaries share some or all of the same crates, then they all get rebuilt regardless.

Indeed! I think this is a very surmountable problem, though. The first solution would be a workspace, but I agree that thousands of members probably isn't great for a workspace. Other solutions could include a shared output directory or simply a better caching solution. For example, the intention with sccache is that it's a drop-in replacement for rustc and will automatically pull in caches for everything.

In that sense I don't think there's a fundamental blocker here, but it sounds like integrating into existing caching solutions is a high priority. That way as long as our dependency resolution/granularity is accurate then we'll get cached builds for free (no matter where you are in a tree).

If they need to be rebuilt for every user, then yeah, that's trickier.

Oh no, for serde at least what will happen is that a compiler plugin is compiled once, and then that same plugin is used for all crates. You wouldn't have to recompile the plugin for all downstream crates.
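This matches how macros 1.1 derive crates are declared: the plugin lives in its own crate marked as a proc-macro in its Cargo.toml, so only that crate is built as a compiler plugin, and downstream crates just load the built artifact:

```toml
# Cargo.toml of the derive crate itself (e.g. serde_derive).
# Downstream users depend on this crate normally; nothing else
# in the graph needs to be compiled as a plugin.
[lib]
proc-macro = true
```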

I also agree that completely unconstrained build scripts pose problems, but in practice full generality is rarely needed. I could imagine that a vendoring policy would require only "well behaved" build scripts to be added to the repo, for whatever definition is desired. This is sort of along the lines of what I mentioned with @davidzchen above.

I'm personally trying to push on including build.rs as much as possible because in many places it's quite integral to assembling crates. Losing out completely on build.rs I feel is untenable in terms of "acceptably being able to leverage crates.io", but imposing reasonable restrictions on build scripts (e.g. don't download things) seems totally plausible.

I agree that lots of the native library-geared build scripts probably don't need to be run, but they should all have proper escape hatches to use what's already on the system. Crates like openssl-sys need to run to detect what version of OpenSSL they're compiling against, and ignoring all crates on crates.io that use openssl-sys transitively would be a bummer!

rustc is about the right level for Buck, since its similar to gcc/javac/etc; certainly integrating at the rustc level (while not trivial) was more conceptually straightforward than trying to work out a conceptual mapping between buck and cargo.

Perhaps yeah, but my point is that rustc is way too low level for Rust and crates.io. I definitely agree that it's more difficult to map between buck and cargo, but I feel that the benefits are definitely worth it.

More generally, I originally commented:

First and foremost we've historically concluded that build system integration is not finished until you've got access to crates.io crates

Do you agree with this? Or do you think that the integration you've got today is suitable in terms of leveraging the existing Rust ecosystem?


@luser

The core problem was that when outputting something other than an rlib rustc includes support code such as jemalloc, and you can only link one copy of that into a binary.

I might rephrase the core problem as being libstd, but in general yeah it's because a staticlib includes all dependencies, and then dependencies duplicated across crate graphs will be included twice if two staticlibs are linked.


@cardoe

We need the ability to fetch all the data necessary to perform the build before cargo executes since the package managers for both Yocto and Gentoo are responsible for verifying the integrity of the downloads and handling the downloads.

Does this not work today? I'm under the impression that Cargo has enough features for this, but I'd just want to confirm that's the case!


@stuhood

This is a particularly clear distillation of the requirements for integration into build systems with existing caches and reproducibility guarantees

To reiterate some points from earlier:

  • Cargo is also built from the ground up for reproducible builds
  • Cargo now has support for vendoring. In other words it has a well documented format for pre-downloading dependencies to some location on the filesystem.

To that end, are there active points of further integration needed?

jsgf commented

@alexcrichton -

Thanks for the detailed response!

Makes sense to me! Remember though that rustc/cargo are the same thing :). All the tools here are effectively pure functions, so what I'd like to figure out is the best way for rustc/cargo to fit in this model because it should mesh well!

If cargo were simply a build tool that transforms inputs -> output, then I could simply invoke it instead of rustc. But it isn't simply that - it's also doing dependency management, making its own decisions when to build things, resolving dependencies in Cargo.lock (even creating Cargo.lock!).

The traditional failures of build systems can mostly be summarized as "no single source of truth". For example, "make" works fine on simple projects, but as soon as you have significant amounts of procedural code in a build system, you end up with flaky builds because the Makefiles no longer fully describe the project's dependencies. Even without explicit procedural code, nested Makefile/make invocations have the same problem, because the dependency information is scattered over multiple Makefiles.

Likewise, without build.rs, cargo is in a fairly good state as a stand-alone tool, but with build.rs it loses a large amount of control, as evidenced by its very conservative/heavy-handed treatment of build.rs (rebuilding eagerly, or requiring a periodic cargo clean to make sure the build.rs does get rerun).

Build systems like Buck (and Blaze/Bazel, though I haven't used them) attempt to resolve these problems by constructing a single-source-of-truth dependency graph which includes everything, and by avoiding procedural build steps as much as possible (unless they can be modelled in the dependency graph).

So if Cargo is going to continue doing things as it does today, then embedding it in another build system is inherently tricky - it amounts to having two sources of truth for dependency information, with the associated risks of getting them out of sync. Either Buck has to (somehow) delegate parts of its dependency management to Cargo, or it has to maintain its own independent model of what Cargo is doing internally. That might be possible if the Cargo portion is on the edge of the graph, but it becomes much harder if it's embedded in the middle (ie, Cargo has dependencies on Buck-managed targets).

First and foremost we've historically concluded that build system integration is not finished until you've got access to crates.io crates

Do you agree with this? Or do you think that the integration you've got today is suitable in terms of leveraging the existing Rust ecosystem?

Having access to crates.io is essential to being productive in Rust, and the better integrated it is - esp having low friction in using new crates - the more productive you can be.

But more broadly, that's also true of all the other open source in other language ecosystems. Their typical deployment is "unpack tar, run ./configure, build", which is completely incompatible with the Buck way of doing things. As a result, the options are 1) rewrite their build systems in Buck, or 2) special-case them by prebuilding with their preferred build system, then use that prebuilt artifact in Buck. 1) is impractical, so 2) is the answer.

Cargo - esp with the presence of build.rs - is effectively the same as "unpack and run configure", and so I'm using the same solution: run cargo in its own isolated environment with all that crate's dependencies vendored, then use the prebuilt .rlib (and soon, .so) in Buck. Unfortunately this process introduces quite a bit of friction, but it does have its upsides (if an upstream package like openssl gets updated, then others can automatically rebuild my Rust code to use it without me being involved).

The things that @davidzchen mentioned about having Bazel directly parse Cargo toml/lock files to download deps sound very interesting, but I can't see it working so long as a build script is involved; if the package depends on build.rs, it has to be special-cased.

It's not really practical to manually audit build.rs files to see if they're "well behaved", esp since it would have to be redone every version, and they may depend on arbitrary complexity like invoking cmake/autoconf/etc. The only thing I can see working is if cargo has some way to sandbox build scripts so their actions can be well-defined in advance, and/or run them in a way that shows exactly what their effects would be without making any changes - but I think that's out of scope for cargo.

<handwave>
What I could imagine being possible is if build.rs stops being general purpose code, and instead becomes something that's higher-level and more declarative. Rather than having code which invokes autoconf or probes around for a library, it should simply emit an event/request along the lines of "need external dependency X". For standalone use, that would be paired with something that simply uses the copy of X that's packaged with the crate and just build it in place, as happens now. But if cargo is integrated with something else, then that can be turned into something like "emit buck dependency on third-party package X" (and ideally this could be done without compiling and running arbitrary code).

Perhaps this could be done as an extension of the Cargo.toml [dependencies] section, so you can express dependencies on non-Rust code there, and express a locally packaged version via a workspace.
</handwave>
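As a strawman for that extension, the declarative form might look like this in Cargo.toml. To be clear, this `[external-dependencies]` table is purely hypothetical and not something Cargo supports; it only illustrates "need external dependency X" expressed as data rather than as probing code:

```toml
# Hypothetical: declare non-Rust dependencies declaratively instead of
# probing for them in build.rs.
[external-dependencies]
openssl = { version = ">=1.0", libs = ["ssl", "crypto"] }
```

A standalone cargo build could satisfy this from a bundled copy (the current mode of operation); Buck/Bazel/Nix integrations would instead map it to their own third-party openssl target, without compiling or running any crate-supplied code.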

I'm personally trying to push on including build.rs as much as possible because in many places it's quite integral to assembling crates. Losing out completely on build.rs I feel is untenable in terms of "acceptably being able to leverage crates.io", but imposing reasonable restrictions on build scripts (e.g. don't download things) seems totally plausible.

As I've mentioned before, I think most of the uses of build.rs are incompatible with a large build system. The limitations need to be more like:

  • don't download things
  • don't probe for dependencies
  • don't configure anything
  • don't build anything

Really the only acceptable thing a build script can do of its current set of roles is generate sources from well-defined inputs, and that could be done with a much narrower interface.

I think another way of phrasing my handwave above is that a cargo package can request dependencies, but it must use whatever it's given as a result of that request. A possible implementation of that request might be to probe and build a local version (ie, the current mode of operation), but other implementations must be possible.

In that sense I don't think there's a fundamental blocker here, but it sounds like integrating into existing caching solutions is a high priority. That way as long as our dependency resolution/granularity is accurate then we'll get cached builds for free (no matter where you are in a tree).

I'm curious to know your thoughts about what this interface would look like in a bit more detail.

jsgf commented

<more handwave>
More generally, you could consider splitting cargo into several distinct parts:

  • crate dependency management
  • a build planner
  • a build execution engine

The planner would build up a graph of actions ("I need to build X from A, B, C because Y needs it"). In the normal (current) mode of operation the build execution engine would walk the graph and perform each action, possibly relying on cached state.

However, that action graph could also be turned into a set of rules for another build system (Buck, Bazel, etc) to perform the execution, including managing its own cache.

Crate dependency management spans both to some extent - if you have a dependency on a crate, then you can take an action like download it or check vendored sources, then embed that crate's action graph into this one (sharing any common subgraphs).

Dependencies on non-Rust code could also be handled in the action graph, where the execution engine is responsible for resolving things like "I need openssl". In the current standalone cargo mode, this would still be "invoke autoconf from build.rs", but it could be implemented as "depend on the standard 3rd party openssl".

(There's a ton of missing detail here.)
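One sketch of what the planner's exported action graph could look like, as a JSON blob another build system consumes and executes; all field names here are invented for illustration:

```json
{
  "actions": [
    {
      "id": 0,
      "program": "rustc",
      "args": ["--crate-type", "rlib", "--crate-name", "util", "util/lib.rs"],
      "inputs": ["util/lib.rs"],
      "outputs": ["libutil.rlib"],
      "deps": []
    },
    {
      "id": 1,
      "program": "rustc",
      "args": ["--crate-name", "app", "src/main.rs",
               "--extern", "util=libutil.rlib"],
      "inputs": ["src/main.rs", "libutil.rlib"],
      "outputs": ["app"],
      "deps": [0]
    }
  ]
}
```

Given fully enumerated inputs, outputs, and edges like this, an external engine can schedule, cache, and sandbox each action itself.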

How does this system work with other build systems like Maven, Gradle, CMake, NPM, etc?

Note it's not necessary for Cargo to learn to spit out the graph/plan in every format. Just give us something that Cargo will rigorously follow (no leaky state), and we'll happily convert it to buck/bazel/nix/whatever.

Crate dependency management spans both to some extent

Eh, it's not that bad. As long as Cargo spits out what needs to be downloaded, and can be made to fail rather than redownload if it's not fed what it wants, we're good.

Dependencies on non-Rust code could also be handled in the action graph

There was a proposal for declarative pkg-config I think? It would be easy to just forward this stuff into the plan and other tools to do what they will with it.

(There's a ton of missing detail here.)

I don't think so actually. It's a lot of implementation work, but pretty straightforward conceptually.

@sfackler Well if all the language-specific ones could spit out a plan and stick to it, it would be very easy to splice them together for any "dumb graph executor". Unfortunately, none of them do ATM.

jsgf commented

Note it's not necessary for Cargo to learn to spit out the graph/plan in every format. Just give us something that Cargo will rigorously follow (no leaky state), and we'll happily convert it to buck/bazel/nix/whatever.

Yes, exactly - cargo can already be convinced to emit a fair amount of metadata as json blobs; I was thinking of this as a continuation of that practice.

Eh, it's not that bad. As long as Cargo spits out what needs to be downloaded, and can be made to fail rather than redownload if its not fed what it wants, we're good.

So that means it's the execution engine that's composing the action graphs, rather than relying on cargo to do it?

@jsgf sure, I'm perfectly happy to take on the burden of transforming the plan - much better than doing a partial reimplementation of the whole tool, as us Nixers have done for Haskell's cabal-install!

@jsgf has some really good signal

So if Cargo is going to continue doing the things as it does today, then embedding it in another build system is inherently tricky - it amounts to having two sources of truth for dependency information, with the associated risks of getting them out of sync.

This is the reason why we dropped cargo and built nixcrates. Nix is declarative and is able to track sources beyond just crates down to the OS level, and hence has an adequate single system view or state snapshot. That said, every time we see crates with funky bash scripts, or Rust calling executables, those crates definitely fail.

In my opinion, the right mid-level is just a simple program that, given one or more crate names as input, generates the list of transitive deps along with the download URL, the checksum, etc. Other tools would be able to leverage that very well.

@sjmackenzie Well, build-plan solving is crucial. Us Nixers tend to lock things down anyway, but it's great to be able to automatically bump the plan when we want something new rather than fiddle with exact versions manually.

@Ericson2314 oh sure mate (you're preaching to the choir). Since we moved to nixcrates most of our problems went away (but were replaced with new problems :-) ). Now the challenge is to whittle down and massage those crates that don't build properly for whatever reason. From this line downwards https://github.com/fractalide/nixcrates/blob/master/default.nix#L112 everything in buildInputs isn't building, for various reasons. This list was compiled by manually building the top 2500 most-downloaded crates. These are first-order failures, so many more dependent crates don't build due to these crates failing.

jsgf commented

@sfackler -

How does this system work with other build systems like Maven, Gradle, CMake, NPM, etc?

CMake is the only one of those I have any experience in, and it's also a "rule generator" like I'm proposing rather than something that actually executes builds. I suppose you could come up with some way of passing the cargo action graph through to the underlying build engine, but I think a preferred course of action is to burn cmake to the ground, bury the remains in a pit and cover it all in concrete and holy water.

@jsgf

More generally, you could consider splitting cargo into several distinct parts:

  • crate dependency management
  • a build planner
  • a build execution engine

I think you've cracked the nut!

I was talking with @alexcrichton about build system integration this morning, and we came to the exact same conclusion. The important role of Cargo is the first two bullets; that's the way it shapes the Rust ecosystem. Actually executing the build, OTOH, is something that should be easily delegatable to other systems.

I agree wholeheartedly with your proposed strategy of using Cargo for build planning, and then exporting the plan for consumption by other tools. I also think the idea of representing things like C dependencies in a more "first-class" way (rather than encoding them through build scripts) will likely make everyone's life better.

As you say, there are a lot of details to be fleshed out here, but I think this is a very promising avenue for tackling this roadmap item!

I did want to raise one other question, though. While we clearly want and need to use Cargo for incorporating the crates.io ecosystem, does it have a role to play when working on internal projects? With your breakdown, we can ask more specifically whether there is useful dependency management and build planning work to be done for internal projects.

I know a lot of organizations use mono-repos internally, and have everyone working with a single version of all libraries -- internal or external. In that world, resolving Cargo.toml files into a dependency graph is perhaps not so useful, since there's only one possible version to use. (It might still be more convenient than writing the Buck rules directly, though). But there is at least one major downside to this approach: for the crates.io ecosystem, it means manually finding a subset of the ecosystem containing all the dependencies you want, while agreeing across the board on a single version of each. It's quite plausible that such a solution doesn't even exist!

OTOH, if you wanted to (say) allow multiple major versions of crates coming from crates.io, then there's a pretty strong reason to use Cargo even for internal projects as a dependency resolver and build planner. The Buck/whatever rules could then be auto-generated.

Using Cargo everywhere as the build planner for Rust gives you a Rust experience that's consistent with what you see in the docs and across the wider ecosystem. OTOH, people coming to Rust code in your organization who are used to writing Buck files directly may find it annoying to have to learn another tool to generate those files.

So in short: given the above, with the rough plan you sketched, do you imagine using Cargo for internal projects and exporting to Buck? How do you see the tradeoffs?

jsgf commented

Using Cargo everywhere as the build planner for Rust gives you a Rust experience that's consistent with what you see in the docs and across the wider ecosystem. OTOH, people coming to Rust code in your organization who are used to writing Buck files directly may find it annoying to have to learn another tool to generate those files.
So in short: given the above, with the rough plan you sketched, do you imagine using Cargo for internal projects and exporting to Buck? How do you see the tradeoffs?

TBH, I don't think that's likely at all. I'm envisaging using cargo's version resolution engine, but no Cargo.toml files committed to the main source base - instead, dependencies on crates.io (or cargo-managed in general) packages would be encoded as Buck rules. Cargo.{toml,lock} files would only be manifest as user-visible artifacts when doing things like exporting to open source projects.

Version management is tricky, and how to unify the Cargo model with a mono-version monorepo is not completely clear. Right now I'm prebuilding all the "supported" crates.io packages, and doing a single cargo version resolution over all of them to try to end up with a minimal number of versions for each package. However, only an explicitly enumerated subset of those packages is exported to the internal codebase, and there's only ever a single version of those. In general I use "*" as a version specifier unless a package has some breaking API change that's too hard to fix up on the fly.

With tighter integration between buck and cargo, I'm imagining a rust_cargo_library rule, but I'm not clear exactly how it would work yet (ie, how to specify versions, where the resolution from a version spec to a specific version happens, where does downloading happen, etc).

I speak for nixcrates, which represents the "lowest"-level solution other than compiling directly with rustc; it's low level in that we removed cargo completely and rely on rustc CLI args via nix, so this information might be interesting.

There should be enough information in the crates.io-index alone to successfully build crates without parsing Cargo.toml files. Once this minimum bar is met - first via convention, evolving into compile-time enforcement and explicit information exposed in the crates.io-index - I have a strong feeling the exact most beneficial abstraction level to expose will become apparent. This would have the added benefit of simplifying Cargo.toml files.

One can compose a beautiful build plan with a deep understanding of crate dependencies, but your build execution engine will be faced with issues like this:

Please note the above Rust error codes are an approximation; much still needs to be massaged into nixcrates!

I know a lot of organizations use mono-repos internally, and have everyone working with a single version of all libraries -- internal or external. In that world, resolving Cargo.toml files into a dependency graph is perhaps not so useful, since there's only one possible version to use. (It might still be more convenient than writing the Buck rules directly, though). But there is at least one major downside to this approach: for the crates.io ecosystem, it means manually finding a subset of the ecosystem containing all the dependencies you want, while agreeing across the board on a single version of each. It's quite plausible that such a solution doesn't even exist!

Haskell and PureScript make use of the concept you describe, which I believe is called a package set. Haskell people will need to comment on the efficacy of package sets, but I'm dubious, because the only thing that matters is that stable public contracts are not changed. That means signatures, arity, etc. This calls for some kind of lookup table to check the evolution of the API over time, ensuring it remains immutable. See section 2.6 of https://rfc.zeromq.org/spec:42/C4/

Regarding cargo publishing to crates.io. I have no strong opinions on the subject, except that cargo or the upstream server should ensure there is no change to public contracts before merging the new/updated crate into the index.

Ideally we drop semantic versioning completely. Semver itself is under semantic versioning; if you major-bump semantic versioning itself, you completely destroy everything. This is insanity. If you have the time, please read: The End of Software Versions

@jsgf

Right now I'm prebuilding all the "supported" crates.io packages, and doing a single cargo version resolution over all of them to try and end up with a minimal number of versions for each package. However, only an explicitly enumerated subset of those packages are exported to the internal codebase, and there's only ever a single version of those.

Ahha, that's a nice way to thread the needle here, and that means the Buck-facing side of things works in the mono-version style.

So, to be 100% clear: the possibility we're discussing of using Cargo purely for "build planning", in your case, would apply only to the way you incorporate crates.io packages; all internal packages would purely use Buck. In particular, we could generate Buck rules corresponding to all the "exported" crates.io packages and their dependencies, including bridging to C dependencies via the first-class replacement to build.rs for that use case.

Does that sound right?

To give a counterpoint, I would use Cargo.toml files internally. For the Haskell work I do, internal libraries could well be open sourced, and external libraries are frequently forked, so a slick path moving libraries in both directions is key.

There should be enough information on crates.io-index alone to successfully build crates without parsing Cargo.toml files.

I'd disagree. While there's nothing wrong with that, strictly speaking all that is needed in the index is enough information to select the packages/versions from which the final plan can be generated.

I prefer to think in two phases of planning: first the "closed world" step of resolving packages and versions (i.e. the lockfile), and then the second, open-world step of resolving the actual fully elaborated plan (every call to every executable, in principle). The first step cannot usefully be transparently cached as a pure function, since crates.io is always changing and, a priori, there is no way to rule out part of the index as irrelevant (I call it closed-world because crates.io is the world). This is why we use a lockfile and do explicit invalidation. The second step, however, is a pure function from the lockfile and the Cargo.toml files it mentions; it is open-world in that we know packages not mentioned in the lockfile don't matter.

Insofar as it's useful to think of these two phases, and putting all Cargo.tomls in the index would make it a lot bigger, I think it's easier to keep the index just for the information needed for the first phase, and use Cargo.tomls for the second phase.


So while Buck and Bazel may be higher-profile potential consumers of Rust, I'd encourage the tools team to disproportionately focus on our tiny Nix community :) for a purely technical reason. The reason is that while those first two tools only support static dependency graphs, Nix supports fully dynamic dependency graphs, in that nodes in the graph may produce more graph. In the end, avoiding dynamism is almost always good style, and Bazel/Buck/Make/etc should absolutely be supported too, but the dynamism is a useful stepping stone that allows build system integration to be implemented more gracefully. Turning direct-style imperative building (as Cargo internally does today) into a dynamic plan is basically a dumb CPS transformation, if you squint. Rearchitecting Cargo to make a single complete up-front plan, by contrast, is a rather all-or-nothing manual rewrite.

jsgf commented

So, to be 100% clear: the possibility we're discussing of using Cargo purely for "build planning", in your case, would apply only to the way you incorporate crates.io packages; all internal packages would purely use Buck. In particular, we could generate Buck rules corresponding to all the "exported" crates.io packages and their dependencies, including bridging to C dependencies via the first-class replacement to build.rs for that use case.

That actually pretty much describes what I'm doing now, in a sense. The side-effect of the prebuilding is a set of buck rules which describes those prebuilt libraries and their internal dependencies, so our internal code buck rules can use them in a straightforward way.

I think one of @alexcrichton's other concerns is that buck currently directly invokes rustc, but he'd prefer the point of contact between a builder and the toolchain be cargo. In this "new cargo" world, I'm not sure what that means. Does that mean that buck should invoke cargo to perform certain build actions? Or should it keep using rustc as it currently does?

I think part of the concern is not exposing implementation details like exactly how compiler plugins are constructed or incremental compilation, rather than plain old builds.

@Ericson2314 I agree with your two-phase, closed-world/open-world assessment. Though the lockfile isn't committed to a crate repo, so it is not a reliable pure-function input. And no, I do not want to generate a lockfile, because my environment has the http_proxy deliberately mangled during the build phase. Hence we use something that is guaranteed to be around: a nixified crates.io-index serves as the pure-function input.

What would help here is a small executable that purely parses the TOML file for things such as the lib.rs entry point, the crate name, etc. This binary could be executed during the build phase to correctly name things.

Though you missed my point: when I said "build the crates" I meant that they successfully build. I'm able to derive the exact versions needed, but the crates themselves won't build because of the errors mentioned above (missing lib.rs, etc.). I'm suggesting that things like entry-point names should be made explicit in the index iff the entry point is not made into a compile-time error.

In other words, by building with only the index and rustc (no Cargo.toml/.lock), the tooling team will know exactly what information needs to be made explicit in the index or made into a compile error. This in turn makes it easier to build other tooling infrastructure.

I recall a talk online where someone said anyone can build a competing build tool. At the moment the roles of rustc and cargo are starting to conflate, and this is a bad thing for rustc. Please don't make cargo, or even the Cargo.toml file, a hard dependency. Rustc will be able to move into far superior build environments than what cargo has to offer in the future.

IMHO the focus of the tooling team should be entirely on making a very nice index and a very nice rustc.
The software used to test the gaps can be turned into a regression tool on the server side to check that changes to cargo don't cause a regression that'll break other build tools.

It's this boundary information that is blurred between the open/closed phases.

@sjmackenzie

I wholeheartedly agree with Cargo depending on Rust and not vice versa. But crates.io is definitely a Cargo resource, not a Rust resource. Granted, all libraries being there does create some Cargo lock-in, but I don't see a true solution for that, sorry.

My ideal for an index would actually just be to have all of crates.io be a lazily-downloadable merkle dag. The index would just be a map pointing (with hashes) to parsed and normalized Cargo.tomls which in turn point to the actual source. Any tool would just download as deep as it needs.
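To illustrate the shape of that idea (a hypothetical sketch, not any actual crates.io format), content-addressing the layers of the DAG with hashes might look like:

```python
import hashlib
import json

def address(obj) -> str:
    """Content address: SHA-256 of a canonical JSON encoding."""
    blob = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# Leaf of the DAG: the crate's source (stood in here by raw bytes).
source_hash = hashlib.sha256(b"crate source tarball").hexdigest()

# Middle node: a parsed, normalized manifest pointing at the source
# (and at dependencies) by hash.
manifest = {
    "name": "demo",
    "version": "1.0.0",
    "deps": [],
    "source": source_hash,
}

# Root: the index itself is just a map from name/version to the
# manifest's hash; a tool downloads only as deep as it needs.
index = {"demo/1.0.0": address(manifest)}
print(index)
```

The lazily-downloadable property falls out of the structure: resolving versions needs only the root map, planning needs the manifests, and only an actual build needs the source leaves.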

@jsgf

That actually pretty much describes what I'm doing now, in a sense. The side-effect of the prebuilding is a set of buck rules which describes those prebuilt libraries and their internal dependencies, so our internal code buck rules can use them in a straightforward way.

To further clarify: if we had some back-end that spat out rules which could be transformed into Buck rules (or other build system rules), would that have saved you work when doing the integration? Would you want to move to such a system? Did you write your own scripts to pull information out of e.g. lockfiles?

I think one of @alexcrichton's other concerns is that buck currently directly invokes rustc, but he'd prefer the point of contact between a builder and the toolchain be cargo. In this "new cargo" world, I'm not sure what that means. Does that mean that buck should invoke cargo to perform certain build actions? Or should it keep using rustc as it currently does?

I think it could work either way.

  • Since Cargo is emitting the Buck rules -- at least for crates.io -- it could instruct Buck precisely how to invoke rustc. So Cargo is still "in control" in that sense.

  • However, for internal projects where you intend to use Buck directly, either you have to encode direct knowledge of rustc invocation, or will need some way of operating Cargo while getting dependencies from Buck. I think the latter is totally feasible.

I think part of the concern is not exposing implementation details like exactly how compiler plugins are constructed or incremental compilation, rather than plain old builds.

Agreed. I feel like a key question here is how much of a role Cargo can play for purely internal projects. You've made clear that you want to write Buck rules rather than any Cargo.toml -- at least for dependency management. But there may be other aspects of Cargo that are still useful and could be fit into a "Buck drives" model for internal projects.

BTW: I wonder if at some point it'd be helpful to have a call to hash out a bit more of these details at high bandwidth?

or will need some way of operating Cargo while getting dependencies from Buck. I think the latter is totally feasible.

Again, dynamic dependency graphs are the simplest (only?) solution for integrating automatically a Cargo project "in the middle" of the dependency graph (i.e. Nix/Bazel/Buck/whatever both provide deps (e.g. C libs) for the Rust node upstream, and consume it downstream).

I think there might be a misunderstanding here. I was just saying that we could provide a way of running Cargo purely for invoking rustc on the local crate, where the dependencies have already been compiled.

More generally, the direction we've been going with @jsgf is, as I understand it, somewhat in opposition to this "dynamic" point of view. In particular, I'm taking as a goal, for the moment, allowing Cargo to work smoothly with build systems that want a totally static set of build rules. That seems to be the expectation with build systems like Buck, and I think we should explore how smooth we can make things with the division of responsibility @jsgf proposed.

@aturon Hey I'm not in an opposing camp :). To be clear, I've in fact been advocating that division of responsibility the entire time. My first post in this thread (repeating myself from rust-lang/cargo#1997 (comment)) links to haskell/cabal#3882 which in turn links to http://lists.science.uu.nl/pipermail/nix-dev/2016-September/021765.html which begins with

After much pondering, I've decided the best way to work with language-specific build managers is for them to come up with the end-to-end build plan, and us [Nix] to build each package in that build plan.

As I've mentioned, I always want to use Cargo.toml for internal and external packages to facilitate moving them across that boundary (forking public projects or opening internal ones). Let me assume that's also a use-case you are interested in even if @jsgf isn't. That means the pipeline looks something like this:

  1. Cargo solves versions (inputs: index, workspace root, local Cargo.tomls; output Cargo.lock)
  2. Cargo generates the complete build plan, i.e. all rustc invocations (inputs: Cargo.lock, Cargo.tomls; outputs: something for Buck/Bazel/Nix)
  3. Buck/Bazel/Nix executes that build plan

With Buck/Bazel, you have to run Cargo "out of band" to get those rules, and then commit them, being careful to keep them in sync with the Cargo.toml. That's a chore, and that's probably why @jsgf doesn't want to use Cargo.tomls internally. The alternative is that Buck/Bazel calls Cargo directly, but that arguably sacrifices the crucial "single source of truth" @jsgf emphasizes. One would really want to make sure Cargo isn't doing its own caching, or anything else too clever, in that alternative case, and even then the dependency graph isn't as fine-grained as it would be were Buck/Bazel invoking rustc directly.

With Nix, you can make that pipeline [or, in practice, the 2nd step; I wouldn't bother with the 1st and would just commit the lockfile instead] also part of the build plan. This is the best of both worlds, where (1) you have Cargo.tomls to play nice with the outside world, (2) there are no committed extra build rules to keep in sync, and (3) Cargo doesn't actually touch any build artifacts/rustc but just comes up with plans, so there's definitely one source of truth.

jsgf commented

@aturon:

To further clarify: if we had some back-end that spat out rules which could be transformed into Buck rules (or other build system rules), would that have saved you work when doing the integration? Would you want to move to such a system? Did you write your own scripts to pull information out of e.g. lockfiles?

Yeah, initially I extracted info from lockfiles and from cargo metadata, but then prodded @alexcrichton to add more detail to the output of cargo build --message-format=json to get details of the precise filenames of artifacts generated from the build. And that was an entirely cargo-managed build - all that was just to determine what cargo did after the fact.
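For context, `cargo build --message-format=json` emits one JSON object per line; extracting artifact filenames after the fact looks roughly like this (the sample lines are abridged, since real output carries more fields):

```python
import json

# Abridged sample of cargo's line-delimited JSON output.
sample = """\
{"reason":"compiler-artifact","package_id":"demo 0.1.0","target":{"name":"demo","kind":["lib"]},"filenames":["target/debug/libdemo.rlib"]}
{"reason":"build-script-executed","package_id":"demo 0.1.0"}
"""

def artifacts(stream: str) -> dict:
    """Map each compiled target to the precise files it produced."""
    out = {}
    for line in stream.splitlines():
        msg = json.loads(line)
        if msg.get("reason") == "compiler-artifact":
            out[msg["target"]["name"]] = msg["filenames"]
    return out

print(artifacts(sample))  # {'demo': ['target/debug/libdemo.rlib']}
```

Note this is exactly the "after the fact" flavor of integration: the build has already happened, and the tool is reverse-engineering what Cargo did.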

Since Cargo is emitting the Buck rules -- at least for crates.io -- it could instruct Buck precisely how to invoke rustc. So Cargo is still "in control" in that sense.

I think there's still going to be some division of labour here - I can imagine cargo driving all the Rust-specific flags, but we'll still want to do things like override the linker and linker args.

Agreed. I feel like a key question here is how much of a role Cargo can play for purely internal projects. You've made clear that you want to write Buck rules rather than any Cargo.toml -- at least for dependency management. But there may be other aspects of Cargo that are still useful and could be fit into a "Buck drives" model for internal projects.

I think the most useful thing here would be a common interchange format for dependencies, metadata, etc., so that:

  • buck (other build tool) can consume cargo dep info
  • buck can produce it, and cargo can use it to generate Cargo.toml/Cargo.lock files
  • other tools (RLS, etc) can use it to discover projects, etc

More generally, the direction we've been going with @jsgf is, as I understand it, somewhat in opposition to this "dynamic" point of view. In particular, I'm taking as a goal, for the moment, allowing Cargo to work smoothly with build systems that want a totally static set of build rules.

It's not really a question of "dynamic" vs "static". It's just that buck needs complete knowledge of all the dependencies, inputs and outputs. If that's derived from a composition of different dependency graphs then that's fine.

BTW: I wonder if at some point it'd be helpful to have a call to hash out a bit more of these details at high bandwidth?

Yep, that would be good.

With the current state of affairs, deriving and fixing the transitive closure of build dependencies is harder than it should be. To get hermetic builds, it is absolutely necessary to retrieve all dependencies beforehand and let cargo (or another build system) do its job in a strictly controlled environment. Nix's Rust support tries to achieve this, but please look at the code... it is full of ugly hacks and doesn't even work in all cases.

I think that an obvious step toward straightening out integration issues would be to provide easy steps to get all dependencies in place and run cargo without network access. The Source Replacement page is not really clear about that. Even better would be a recipe to do so with only shell utilities, avoiding the chicken-and-egg problem of needing working Rust builds in order to get Rust builds going.
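Concretely, the source-replacement mechanism that page describes comes down to a short `.cargo/config` fragment; for example, redirecting crates.io to a pre-fetched local directory (the name `vendored-sources` and the `vendor` path are arbitrary choices here):

```toml
# .cargo/config: build offline against pre-fetched sources
[source.crates-io]
replace-with = "vendored-sources"

[source.vendored-sources]
directory = "vendor"
```

With every dependency present under `vendor/` (including the checksum metadata that tools like cargo-vendor generate), cargo should resolve entirely from disk without touching the network.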

Getting all dependencies beforehand is critical for hermetic builds, but when your project is broken up into 100s or 1000s of components it gets nasty (unusable and super slow). This means you have to download the same dependencies each time. Why, you might ask? Because in a pure build environment you won't have access to $HOME/.cargo at all.

It's a mistake to approach this problem from a narrow cargo viewpoint. Cargo is a source of divergent disk state, w.r.t. configuration management. One should approach a solution from a congruent configuration management point of view.

Cargo just cannot factor in all the dependencies at the operating-system level, and it shouldn't even try. So it would be just perfect if 1) the tooling team recognized this and, rather than being so cargo-centric, saw that targeting a pure deterministic build environment such as Nix solves the problem for everyone; 2) they aimed beyond big names like Bazel, which is low-hanging fruit and still cargo-centric; and 3) Eric's point above, that Nix supports full dynamic dependency graphs, holds true.

A really good starting point is using nixcrates; much of the legwork of translating crates.io-index into Nix has already been done (plus it's implemented in Rust!). Next, a small Cargo.toml-reading utility would be called in the build phase to return important things: toml-reader ./Cargo.toml cratename returns the name of the crate, toml-reader ./Cargo.toml librs returns the entry point, etc.

Improving nixcrates to the point of not needing the above toml-reader program will highlight the flaws in crates.io-index, rustc, and cargo. Steady progress toward removing toml-reader levels the playing field so that any build tool operates without advantage.

Once toml-reader is removed it will be very easy to understand what the build plans are; all stray variables are factored in and ameliorated. Indeed, at that stage you'll probably be able to emit a single simple build plan that any tool will be able to parse to build all the deps. The build tool could inject dependency paths to third-party tools like OpenSSL, etc.

This then becomes your regression test suite. If changes to cargo or rustc porcelain cause a crate build to fail, it's a regression. nixcrates can be integrated with Hydra, a Nix CI solution, which would run each time a change to cargo/rustc happens, build all the crates, and show very clearly which crates are failing.

@jsgf and I had a call this morning, and I wanted to try to summarize how I'm seeing things right now.

I want to emphasize that at the moment, I'm focusing squarely on the case of interop with a build system like Buck, in the context of a monorepo-based organization. While that's not a universal case by any stretch, it's representative of a big and important class of cases. I also think it imposes more constraints than, say, Nix, because Buck is comparatively limited. So I'd like to focus on it for the time being, so that we can dive into more concrete detail/problem statements, and once we've got that we can step back and see where we are.

Here's how I'd characterize this kind of use-case:

  • There's a single, organization-wide build system.
  • All internally-written code lives in a monorepo and is built using that build system.
  • The build system ultimately executes a pre-determined build plan, which cannot be influenced by the build.
    • Systems like Nix give greater flexibility here, but are therefore less constrained and should be easier to integrate with.
  • Libraries available to internal code generally only provide a single, fixed version.
    • This is a tradeoff: it requires global coordination, but means that integration across projects is easy, as is rolling out patches globally.
    • It also implies that semver/version resolution, for internal projects, is not relevant. (It may be very relevant to external code, e.g. the crates.io ecosystem)

A good bit of this is just summarizing previous discussion, but I think it's helpful to get it all in one place, as crisply as possible.


So for this use case, here are the goals.

Hard constraints:

  • Provide easy access to (at least a large majority of) crates.io.
    • Implication: the integration must handle plugins like Serde
    • Many build scripts, on the other hand, might be able to migrate to more declarative specifications (see below)
  • All system dependencies must be provided/managed through Buck, even when using crates.io packages (single global source of truth).
  • Buck must ultimately manage the build process, i.e. tracking/caching/providing artifacts, determining when to rebuild, etc.
    • Note: this does not preclude Buck from invoking Cargo during a build.
  • The singly-versioned mono-repo scheme for internal projects is here to stay; even if we wanted to push toward poly-repos and semver, Rust simply lacks the leverage to do so, and the goal of this roadmap item is to spur, not obstruct, adoption.
    • By contrast, semver/poly-repos is critical for the crates.io ecosystem, and changing that is not on the table.
    • This creates an impedance mismatch that must be solved.
  • Internal Rust projects should specify their dependencies via the ambient build system, rather than via Cargo.toml
    • Part of the benefit of the single global build system for the organization is having a single shared workflow and source of truth. Again, Rust lacks the leverage to change this, even if we wanted to.
  • The ambient build system needs to have very fine-grained control over details like optimization level, linkage, etc, to allow integration with e.g. internal C++ libraries and compilation profiles.
  • There must be a reasonable way to know all of the inputs and outputs for compiling a given crate. Moreover, compilation must be deterministic.

Nice-to-haves:

  • We should avoid having to duplicate logic or functionality that Cargo provides as much as possible.
    • This is referring to adding new functionality to the ambient build system, not functionality that is already part of its remit like caching, executing a build plan, etc.
  • We should try to solve the problem in a way that will work with multiple build systems that resemble this use-case.
  • Ideally, it would be possible to "publish" an internal package as a standard
    crates.io package, somehow producing a Cargo.toml from the ambient build system specification.
  • Likewise, ideally internal packages would interoperate with tools like the RLS
    or rustfmt that expect a Cargo.toml.
  • It could be beneficial to allow the crates exported from crates.io to be at least partly implicitly determined by the stated dependencies from internal projects, rather than via some up-front whitelist.
    • This approach has downsides, though, since there may be some manual effort needed to produce a coherent subgraph of the crates.io ecosystem, and of course versions must be globally coordinated as per the hard constraint around internal project versioning.

How can we meet all the hard constraints and maximize the soft constraints?

A core insight, which @jsgf presented in a somewhat different form much earlier, is that Cargo encompasses two distinct stages:

  • Build planning
    • Specified via Cargo.toml files
    • Version/dependency resolution
  • Build execution
    • Caching
    • Change detection
    • Build plan traversal
    • rustc invocation

A cross-cutting aspect is source artifact management.

A build system like Buck, by contrast, is much more focused on build execution, and gives you a generic way of spelling out the build plan itself.

Crates.io

The most important aspect of build system integration is allowing easy access to (the majority of) crates.io. Putting all of the above together, that boils down to allowing Cargo to operate as build planner, while the ambient build system operates as build executor.

There are at least two aspects for making this work:

  • Providing a mode in which Cargo produces a build plan, but rather than executing it, it exports it in a way that can be consumed by other tools. We can then convert that build plan into a set of Buck rules. (That's in contrast to @jsgf's current approach, which involves actually doing a Cargo build and then extracting various metadata; we'd like something more direct and first-class).

  • However, crates.io isn't made up purely of Rust code; there are also system dependencies, which are usually hooked in via build scripts. Again, as previously proposed, we could benefit by designing a more first-class way of expressing these dependencies directly in Cargo.toml, which would then be emitted in the build plan. Cargo would execute that plan via something akin to build scripts, whereas Buck would interpret it as Buck dependencies.
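As a sketch of what consuming such an export could look like, assuming a purely hypothetical JSON build-plan format (the `invocations` layout below is invented for illustration, as is the converter and the exact Buck rule shape):

```python
import json

# Invented example of an exported build plan: one planned rustc
# invocation per crate, with its dependencies listed by name.
plan = json.loads("""
{"invocations": [
  {"package_name": "libc", "deps": [],
   "program": "rustc", "args": ["--crate-name", "libc", "src/lib.rs"]},
  {"package_name": "demo", "deps": ["libc"],
   "program": "rustc", "args": ["--crate-name", "demo", "src/lib.rs"]}
]}
""")

def to_buck(plan: dict) -> str:
    """Emit one rust_library rule per planned rustc invocation."""
    rules = []
    for inv in plan["invocations"]:
        deps = ", ".join("':{}'".format(d) for d in inv["deps"])
        rules.append(
            "rust_library(\n"
            "  name = '{}',\n"
            "  deps = [{}],\n"
            ")".format(inv["package_name"], deps)
        )
    return "\n\n".join(rules)

print(to_buck(plan))
```

The point is that Cargo remains the planner (it decided the invocations and their ordering), while the generated rules let Buck remain the executor.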

There's also a question of how to determine what crates, and what versions thereof, to include. In @jsgf's current setup, you explicitly say what crates should be "exported" to Buck, and then try to resolve the resulting dependency graph. That's currently done from scratch every time the export list changes, which can require manual intervention, especially around minimizing the number of versions of transitive dependencies. It's plausible that Cargo can provide more tooling to help (or perhaps there's a way to use its existing tooling better). There's also the question of whether we can gather up the export list from Buck dependencies of internal projects -- that's a bit more speculative right now.

On the flip side, we'd like the ability to generate Cargo.toml files, or something equivalent, from the Buck rules of an internal Rust project. In other words, when importing from crates.io, Cargo.toml is the source of truth, and we ultimately generate Buck files; when exporting to crates.io, it goes the other way around. One possible way forward would be for the "build plan" to act as a kind of interchange format that can work in both directions, and in particular could either generate a Cargo.toml or be used directly by tools like the RLS. There are a lot of questions about how and whether that could work.

Internal projects

The main question for internal projects is: what role, if any, does Cargo have to play? After all, given the hard constraints, we are (1) not interested in semver/resolution, (2) want Buck to be the source of truth, (3) need very fine-grained control over low-level build details like linkage, (4) don't need Cargo's build execution.

On the other hand, we do want to support at least using plugins, and we want to avoid duplicating Cargo's logic or exposing unnecessary details about its interaction with rustc.

My sense from talking to @jsgf is that, given all this, (3) is a key point. In particular, we'd need to expose a lot of what rustc provides at the Cargo.toml level, which would really be a roundabout way of driving rustc the way we want to. Rather than expose all those details by duplicating them in Cargo, it seems better to call into rustc directly, and duplicate the smaller set of logic from Cargo (such as: how to hook in plugins).

However, it's probably a good idea to enumerate the full details on both sides to make a more precise comparison.

The compiler

A hard constraint is that the build process for a single crate must be deterministic, with clear inputs and outputs. That means at least that the compiler must be deterministic, and it must be easy to determine all of its inputs and outputs.

Currently @jsgf provides himself some assurance about the inputs by running the compiler in a sort of "sandboxed" setup, where only the files Buck thinks the compiler should need are available. But perhaps there are ways to augment the compiler itself to provide greater clarity here.

Plugins (e.g. custom derive) and build scripts

Finally, supporting plugins (which include custom derive) is a hard constraint. Plugins are potentially problematic because they are arbitrary code executed during the build, which introduces another point at which non-determinism and hidden inputs/outputs can be introduced.

However, while plugins are commonly used, they are not so commonly defined. @jsgf believes it's perfectly reasonable to include particular plugins as part of the crates.io whitelist while manually ensuring determinism and clear inputs/outputs.

Build scripts, on the other hand, are:

  • Defined much more often in crates.io than plugins are.
  • Vastly likelier to do I/O, requiring more work to determine the right Buck rules. And in particular, the generation of those rules cannot be automated.
    • In particular, build scripts are often, well, script quality, requiring a closer look and often modification to work in new environments. Plugins, in contrast, tend to be higher quality and more portable.
    • Build scripts that don't do I/O, e.g. purely generate code, are basically the same as plugins.

One of the most common uses of build scripts is discovering and configuring system dependencies. Currently when integrating with Buck, this requires a fair bit of roundabout setup, so that the build script will "discover" the artifacts that Buck will ultimately provide. A nicer approach would be to eschew build scripts for this purpose, in favor of some higher level/declarative specification of system dependencies which could be automatically integrated into systems like Buck. To be successful, this specification form would have to have strong uptake in the crates.io ecosystem.
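Such a declarative form might look something like this hypothetical Cargo.toml section (no `[system-dependencies]` key exists; it is invented here purely to illustrate the shape):

```toml
# Hypothetical: declare system libraries instead of probing for
# them imperatively in build.rs.
[system-dependencies]
openssl = { pkg-config = "openssl", version = ">= 1.0" }
zlib = { pkg-config = "zlib" }
```

Cargo could lower each entry to a pkg-config probe when building standalone, while a system like Buck or Nix could instead map it directly onto an existing dependency node it already manages.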

It's not clear to me how amenable system dependencies are to such a specification language -- or, put differently, how complicated such specifications would have to be. But I think that it could be a net win for the crates.io ecosystem and for Rust programming in general.

What now?

The above is my attempt to capture the current constraints (for this use case) and a high-level approach to addressing them. In other words: stating the problem crisply, and a solution plausibly. If there's some consensus that the constraints are worth trying to solve, we can start to dig into the technical details of a solution. And again, I see this as a first round of design for one particular use case, after which we can circle back with lessons learned and see if there are other constraint sets we want to tackle.

jsgf commented

Thanks very much for the detailed write-up! I don't really have anything to add right now.

brson commented

@aturon Good writeup. It reflects my understanding of the problem, though the desire to impose the dependencies from outside of cargo is new to me.

W/r/t the dynamism of build scripts, my understanding is that cargo can feasibly generate build plans ahead of time, one per configuration unfortunately, without actually running the compile. The biggest problem is that cargo doesn't capture the side effects of the build script - even the names of the output files. I could naively imagine an extension to build scripts that allows those side-effects to be specified accurately, and a way to enforce that transitive deps use the new scheme.

You've been using the term "internal project" and I'm not clear on what that means and why its requirements are so special. Even internal projects are going to have non-internal deps, so I might hope that solving the problems affecting non-internal deps on these systems would allow internal projects to be handled similarly.

It sounds to me like you are suggesting that for "internal projects" cargo needs to be runnable without actually having a Cargo.toml file - that effectively all of Cargo.toml needs to be specifiable from without via some other mechanism. Is that accurate?

@aturon I generally prefer to consider the problem up front in its complete/hardest form, so going for a full static plan is OK with me. We in the Nix community certainly like to avoid dynamism as much as we cleanly can too. [For example, we'd always want e.g. declarative pkgconfig dependencies in Cargo.toml.]

But I do share the concerns mentioned by @brson: on one hand "internal projects" are a somewhat nebulous concept, and on the other, all the information we'll need from them in the end is probably no less than everything in the Cargo.tomls that would have been written.

I say nebulous because while I certainly get the human / software engineering concept, I'd imagine there are different idioms for structuring large projects with Buck/Blaze (just as there are with Nix) due to those tools' judicious use of a few low-level primitives. Trying to infer a Cargo.toml from these---and the goal is working from them, right?---reminds me of decompilation: inherently a brittle, partial, and hacky process. It's a difficult, yet uninteresting chore.

So, while again I'm fine with "focusing squarely on the case of interop with a build system like Buck, in the context of a monorepo-based organization", I'd strongly encourage dropping the constraint of no local Cargo.tomls, at least for a first version of this. It should be fairly easy to whip up a Cargo workspace containing Cargo.tomls equivalent to the local code, use that for the purposes of Cargo's planning, and simply drop (even manually prune) any part of the plan relating to building the local code. In fact, I think users will want to control the exact rustc invocations for crates.io code just as tightly as they do for their own---if for no other reason than consistency---so they may end up transforming the plan for foreign code, not just keeping it as is once the local parts are discarded.

More broadly then, in either direction (Cargo.toml / crates.io <-> exact customized ambient build system / local code), I think it's hopeless to try to get Cargo to anticipate every last way one would want to customize plans and organize projects --- one Cargo vs an unbounded number of subtly varying development methodologies. It's most useful to just let Cargo continue to be opinionated, but also allow us to extract as much static planning information as we can. Being able to get information out (plans) is much more important than being picky about how we put information in (decompiling the ambient build system vs Cargo.toml). I hope @jsgf and @sjmackenzie would agree.

Ok I had some discussion with @aturon the other day and we felt that the discussion here has led to at least two concrete courses of action to start tackling this issue. To that end I've opened up two new issues on the Cargo issue tracker:

Our thinking is that specific technical discussion about those two issues should move towards those issues specifically. Otherwise more general discussion should of course continue here!

One concern I have is that the "build plan", as described, sounds like a "concrete build plan" - that is, one in which things like Cargo features have been bound, rather than remaining free.

This is not compatible with some distros. In particular, on Exherbo, it's desirable to surface the Cargo features of crates as Exheres options on the corresponding packages. As a result, generating Exheres from crates requires something closer to Haskell's GenericPackageDescription (where flags are free) than its PackageDescription, in which they are bound.

This corresponds to a somewhat different set of phases - all but one of these appeared in the prior listing:

  • crate dependency management
  • a build specializer (the new phase)
  • a build planner
  • a build execution engine

The new phase is performed by the system package manager, and this results in a very different workflow. The system PM thus asks Cargo to extract a data structure that has some number of free parameters that, when all are bound, can be queried for a build plan.

In this system, Cargo is used to extract this data structure. This data structure is then converted to a system package, which then has those free parameters bound by the system package manager (based on end user configuration). The system package manager then constructs a download plan (of crate sources, though the system PM would much prefer if there were a stable ABI, so that it could construct the inter-crate build plan entirely by itself), and then invokes Cargo to plan and execute the build. The download plan and the Cargo build plan being mismatched is an error.

As a result, unlike the above buildsystem use cases (which basically call in to Cargo), Exherbo's use case basically wants to bookend its own model with exporting information from Cargo, and then reinjecting it to Cargo.

Big fan of official active and continued support for rust in bazel! This would be very timely for my company.

I'm working on making the switch from sbt to bazel for one of our most active repos and at the same time introducing curious engineers to rust. For engineers being onboarded to a new language within an organization, it helps to use a familiar toolchain.

That said, cargo is a fantastic tool and does many things better for certain tasks. This is probably mentioned above, but cargo is more than a build tool; it's also a package manager. Bazel is just a build tool, and it really excels at just doing that. Losing the package manager aspect is a real drag after having used it. It would be great if cargo could split out the dependency management aspect into a standalone tool which the bazel rules for rust could use.

I'm not sure if there is any interest in this or not, but I've been working on Gradle plugins that use Cargo under the hood to build Rust code.

@sholsapp, there certainly is, would love to try it out. It will be useful when building apps that have modules written in various languages (e.g., Java + Rust communicating over JNI or IPC).

Stephen, could you please let me know if you still plan to open-source your Gradle plugin?

Hi there. I would like to comment on this, as I have been working to integrate Rust into the existing 389 Directory Server project. We use gnu autotools as our build system, and subsequently have a strong requirement to operate with rpm and across a number of other platforms.

I think the summary of my position is that cargo is too "opinionated" about builds and projects. Cargo wants to control your source code, output targets and dependencies, and that conflicts with most build systems.

Cargo will often attempt to download online (even with --frozen - this ruined some live demos for me actually). It also has strict ideas about output of artifacts, especially .so or binaries. A simple task like "cargo, build this bin, and output to this location and name" is difficult to achieve.
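For what it's worth, one partial workaround for the "build this bin and put it here" problem is to ask cargo where the artifact landed and copy it yourself: `cargo build --message-format=json` emits one JSON object per line, and `compiler-artifact` messages carry a `filenames` array. A sketch using a canned message instead of a live cargo run (the path shown is made up):

```shell
# Sketch: recovering the artifact path from cargo's JSON output so the
# parent build system can install it wherever it likes. In a real
# integration, $msg lines would be piped from cargo itself.
msg='{"reason":"compiler-artifact","filenames":["/tmp/target/debug/libfoo.so"]}'
artifact=$(printf '%s\n' "$msg" | sed -n 's/.*"filenames":\["\([^"]*\)".*/\1/p')
echo "$artifact"
# then e.g.: cp "$artifact" "$DESTDIR/libfoo.so"
```

This keeps cargo in charge of naming while letting the outer build system control the final location, though it is clearly a workaround rather than first-class support.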

Cargo is a great tool, and awesome for greenfield applications, but its suitability for integration with other tools is just not there. To make cargo work as a child process of autotools or others would require stripping cargo down significantly, or adding so many switches and extra complexities to its build process that it may in turn become too fragile or hard to use.

For example, a simple requirement to "cargo install" an object or dylib to a location is not possible today, because cargo's focus is 100% on rlibs or the final resultant executable. For example, we build a number of dylibs that are then installed and dlopened at run time. We can't use cargo to build these .so as a result, so we produce .a and .o, then use autotools to finish them and install. This process is already pretty fiddly, and often causes problems. For example, when you edit a source file, autotools doesn't know to re-make it (so you have to mark all targets as PHONY).

Another subtle issue is that when we make a release build we need debug symbols to remain so that rpm can extract them for debuginfo tools, but cargo builds are either release or debug. Again, this is because cargo has opinions about how it should be used.

As another example, cargo does duplicate work. Consider I have 3 plugins:

plugin_a
    \- cargo.toml
plugin_b
    \- cargo.toml
plugin_c
    \- cargo.toml

Were I to build all of these in my application, they would all pull their rlib deps individually, and would not share the output or work. So my other option is a cargo manifest at the root, with various specifications of libs and build targets (which comes with its own complexity). This then goes back to cargo's "opinion": it wants to control the whole build, rather than allowing smaller translation units.
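For reference, the root-level setup alluded to here can be as small as a workspace manifest, which makes the three plugins share one target directory and one copy of each compiled dependency. A sketch (names taken from the layout above, everything else hypothetical):

```shell
# Sketch: a Cargo workspace so plugin_a/b/c share compiled rlib deps.
workdir=$(mktemp -d)
cat > "$workdir/Cargo.toml" <<'EOF'
[workspace]
members = ["plugin_a", "plugin_b", "plugin_c"]
EOF
cat "$workdir/Cargo.toml"
# `cargo build` at the workspace root then compiles shared deps once,
# under a single target/ directory
```

This addresses the duplicated work, though it still cedes control of the whole build to cargo, which is the underlying objection here.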

In the end, I have chosen to stop using cargo, and likely will not utilise it in projects.

I have solved this through the use of git submodules to check out crate dependencies, and careful use of rustc. You can see my example of this below:

RUSTC_FLAGS = @asan_rust_defs@ @debug_rust_defs@ -L $(abs_builddir)/.rlibs/

rlibs:
	-mkdir -p $(abs_builddir)/.rlibs/

.rs.a :
	rustc $(RUSTC_FLAGS) --crate-type staticlib --emit link -o $@ $<

.rs.o :
	rustc $(RUSTC_FLAGS) --crate-type cdylib --emit obj -o $@ $<

libdylib.rlib: rlibs
	rustc $(RUSTC_FLAGS) --crate-type rlib --crate-name dylib --cfg 'feature="libc"' -o $(abs_builddir)/.rlibs/$@ external/dylib/src/lib.rs

liblibrary.rlib: libdylib.rlib rlibs
	rustc $(RUSTC_FLAGS) --crate-type rlib --crate-name library -o $(abs_builddir)/.rlibs/$@ library/src/lib.rs

...

lib_LTLIBRARIES = 	liblibrary.la \
					libplugin_r.la

liblibrary_la_LIBADD = library/src/lib.a
am_liblibrary_la_OBJECTS = library/src/lib.o
liblibrary_la_SOURCES = ""

libplugin_r_la_LIBADD = plugin_r/src/lib.a
am_libplugin_r_la_OBJECTS = plugin_r/src/lib.o
libplugin_r_la_SOURCES = ""
libplugin_r_la_DEPENDENCIES = liblibrary.rlib

I still have the issue with make not knowing when to recompile, but this makes a much clearer and cleaner build for the project. With minimal effort I can now build many .so, they can share rlibs, and I do not have to define a make target per object. As well, I gain the ability that autotools has to install shared objects to defined locations.

I hope this helps a bit, but I think that rather than investing in cargo to build rust ubiquitously, why not make cargo the build system for greenfield projects, and "improve" rustc to work for integration with build systems? This would give you the ability to balance the two, where you can have a feature-rich system in Cargo, but the required integration components in rustc that external build tools need.

Thanks,

@Firstyear I think some of your issues can already be solved (most of this is in Cargo's docs):

  • Use cargo-vendor to pull and save all remote dependencies to disk.
  • To produce a .so file, add this under your Cargo.toml's [lib] section:
    crate-type = ["cdylib"]
  • Set this in your Cargo.toml to add debug symbols to a release optimized binary/lib:
    [profile.release]
    debug = true
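To expand on the first bullet: `cargo vendor` copies all remote sources into a local directory, and a source-replacement stanza then points cargo at that copy, so `--frozen` builds need no network. A sketch of the stanza (the `vendor` directory name is the usual convention; adjust to taste):

```shell
# Sketch: source replacement config used alongside `cargo vendor`, so
# builds read crates.io dependencies from the local vendor/ directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/.cargo"
cat > "$workdir/.cargo/config" <<'EOF'
[source.crates-io]
replace-with = "vendored-sources"

[source.vendored-sources]
directory = "vendor"
EOF
cat "$workdir/.cargo/config"
```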

I'll answer these points

  • cargo-vendor is new to me, so I will investigate it to see if it helps resolve the issues we face in distribution of sources.
  • While cargo can produce a .so, we are using autotools - and autotools cannot handle an external .so being created. It requires .a and .o files, which limits cargo and makes the invocation quite complex to use. To make it worse, let's say we did use cargo to make the .so - there is no way for cargo to install that .so to a meaningful location (make install style), or to do so cleanly and properly.
  • that is a helpful flag for the release profile to know about.

My points about the opinionated nature of cargo still stand, however; it feels like I'm fighting cargo to make it do something it doesn't want to do - integration with rustc standalone was much easier for us with our existing sources.

Let me put the response a different way: why do we insist on building rust with cargo? Why not make rustc a tier 1 build tool in its own right, and help to improve the rustc build story alongside cargo?

Thanks,

@Firstyear:

autotools can not handle an external .so being created.

Hm? How so? My first instinct would be to put the following in Makefile.am:

LIBRARIES := libfoo.so
PROGRAMS := hello

libfoo.so: rust-foo/src/*.rs rust-foo/Cargo.toml rust-foo/Cargo.lock rust-foo/build.rs
    cd rust-foo && cargo build
    cp -a SO_LOCATION libfoo.so

hello_LDADD := libfoo.so
hello_DEPENDENCIES := libfoo.so

As noted in the manual:

As far as rules are concerned, a user-defined rule overrides any automake-defined rule for the same target.

As a result, you have the full power of make available to you for producing libfoo.so, and being declared in LIBRARIES allows Automake to take care of the install rules, etc.

My 2c, having spent the last very long time fighting with what you're fighting with @Firstyear, but for a different build tool....

It actually gets worse -- many crates are super coupled to Cargo. Once you solve the vendoring problem, and the Makefile code generation problem for normal crates, you also get the fun of handling:

  • Cargo env vars (CARGO_blah, OUT_DIR, and more)
  • build.rs files (as a thing)
  • build.rs helper crates that expect the build.rs binary to be in a specific place, with source files in specific places
  • build.rs binaries emitting special cargo-only strings that cargo uses for build planning (cargo:rustc-link-lib=static=foo)

I consider the openssl crate my final boss, as it uses all of these features to tremendous effect. Take a look at that and see how you might try compiling it without Cargo.

Here are my thoughts:

  • Cargo needs to be able to produce .a and .o files.
  • cargo build should be split into two steps that can be run separately:
    1. Fetch all artifacts from crates.io. Store them to a user-specified location.

      This is the โ€œpackage managerโ€ part of Cargo that @Firstyear identified.

    2. Deterministically compile these artifacts. This can be done either by cargo, or by rustc itself.

Cargo can already create .a and .o; the issue is that for autotools projects we need .la and .lo. Without these, we get a lot of complaints.

As well, because autotools doesn't know "what changed", it doesn't know when to trigger the build (arguably, you can make the target .PHONY and let cargo handle it). Cargo-vendor may alleviate the dist pain issues though.

@eternaleye That looks like a nice solution but it doesn't work: autotools doesn't like the raw .so, and doesn't know how to feed them to the linker correctly, so it fails to build :( We still need to produce .a and .o, sadly.

One of the biggest challenges is getting cargo to know when to rebuild source. autotools only knows that there is a ".a" or ".o", but not what constitutes them. As a result, you have to make clean to trigger cargo to actually rebuild.

You can't add the target to .PHONY because then you end up building the targets multiple times pointlessly (I saw 8 builds in a complex environment here).
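One pattern that may avoid both failure modes (stale artifacts and the repeated-build cascade) is to keep the artifact as a real file target but give it an always-run prerequisite: cargo is then invoked every time, its own freshness check makes the no-op case cheap, and downstream make rules still compare against the real file's timestamp instead of treating the target as perpetually out of date, as .PHONY does. A sketch, with the target name illustrative and the cargo invocation merely echoed:

```shell
# Sketch: a FORCE prerequisite instead of .PHONY, delegating freshness
# checking to cargo. The recipe is echoed rather than actually run.
workdir=$(mktemp -d)
{
  printf 'libfoo.a: FORCE\n'
  printf '\t@echo "cargo decides: cargo build --release"\n'
  printf 'FORCE:\n'
} > "$workdir/Makefile"
make -C "$workdir" libfoo.a
```

Whether this fully solves the cascade depends on cargo leaving the artifact's timestamp alone when nothing changed, so treat it as a starting point rather than a guarantee.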

A subtle gripe is the output:

warning: due to multiple output types requested, the explicitly specified output file name will be adapted for each output type

This appears no matter what options or changes I make (even just a single -o target); I can never manage to stop it. Similarly, when you use staticlib, you get a very verbose warning about links and no apparent way to quiet it.

There really should only be one build system that invokes compilers such as rustc directly. Having a build tool call into a different build tool (Cargo in this case) is flaky. Trying to force every build system in the world to go through Cargo rather than rustc directly is not going to work, sadly.

You're free to re-implement cargo in your own build system, of course. I tried that for a bit, but decided it was too much work. But those are the two choices really, calling cargo or re-implementing it, unless you want to maintain a custom build description for all of your rust-language dependencies.

In Firefox we use the cargo vendor extension and then invoke cargo build --frozen to separate the package download step from the typical build. Updating is a bit fiddly but no worse than other package systems I've used.

A no-op cargo build takes a bit longer than make but it's still faster than compiling anything.

warning: due to multiple output types requested, the explicitly specified output file name will be adapted for each output type

I believe you can get rid of this warning by specifying all the output filenames directly. E.g. drop the -o option, pass --crate-name, and rely on standard output name generation. Or specify names explicitly like --emit=dep-info=foo.d,obj=foo.o,link=libfoo.a.

You can parse the dependent library output for a staticlib and feed it back into your eventual link line. This needs to be dynamic since it can be affected by conditional compilation. It would be nice if the compiler could write that to a file instead.
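That note can be surfaced with `rustc --print native-static-libs` on newer compilers and scraped from stderr. A sketch with the note canned rather than produced by a live rustc run (the library list is illustrative):

```shell
# Sketch: extracting the native link line for a staticlib so it can be
# appended to the final autotools/make link command. A real run would be
# something like:
#   rustc --crate-type staticlib --print native-static-libs lib.rs 2>err
note='note: native-static-libs: -lpthread -ldl -lc -lm'
libs=$(printf '%s\n' "$note" | sed -n 's/^note: native-static-libs: //p')
echo "$libs"
```

The scraped `$libs` would then be fed into the eventual link line, which keeps the dynamism (feature-dependent native deps) out of the hand-written build rules.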

Being quite new to rust I may be missing many nuances in this discussion; but coming from python it strikes me that it is conda-style (binary) package management that is sorely lacking here. Note that conda is not python-specific but language-agnostic, and is also used for quite a few C++ projects nowadays. The main 'problem', as it strikes me, is that it has quite a bit of overlap with the functionality of cargo; but is there a way for these two tools to put their egos aside and combine elegantly in some manner?

Is it just the distribution of binary packages (as well as source, like cargo does) that you're interested in, e.g. to save waiting on build times? Or is there something else conda offers which cargo doesn't? I've never used conda, so it would help me understand what you mean if you could offer more details.

It would certainly be possible to add a binary caching layer to cargo or crates.io, possibly by adding some kind of sccache integration to cargo. I think it hasn't been a priority so far because every rust release has an ABI change, but as applications written in rust get more complex and popular, being able to install without the compile step gets more interesting. I believe there are also a few more steps to get to reproducible builds, which are important for verifying distributed binaries.

Build times are one thing; the last python project I worked on had 2gb of binary dependencies; building all of that from source would have been no fun whatsoever; perhaps bordering on practically infeasible.

But it's mostly the fact that I end up getting dragged into more internal compilation details than I was used to with conda; for instance, the need to have a fortran compiler installed and configured and whatnot. I wasn't planning on getting involved in any of that; I just wanted to use netlib as a dependency. That kind of 'leaking' of concerns across package boundaries isn't going to scale to large and complex dependency graphs as well as conda did.

So it's not so much the 'distribution' of the binaries that I think is the crucial part (though a considerable convenience); rather, it is pushing the encapsulation of nontrivial concerns as far as possible which I think makes a large difference for the scalability of an ecosystem.

I see. It's more about needing a way for crates with build.rs components to package the additional environment components they need for each platform. That's a good point.

An eRFC on this topic is now up!

Okay, since then I want to refine this: I have had some success with .so for binaries, but trying to link two libraries is an issue.

We have a pluggable system which generates .so files which we dlopen, and those libraries link to other libraries. For example:

libuiduniq.so -> libslapd.so.

Now the issue I face is that we have a library for datastructures, librsds.so - the library I have a rust PoC in. The issue is that linking libuiduniq.la to this cannot be done. It's quite challenging to make this configuration play nicely.

Second - .so versioning. If we use cargo to make the .so, we lose the ability to provide versions in our .so files, so having cargo output resources that autotools can use for .la, and allowing libtool to create the .so, would really ease the process of integration.