Implement an equivalent to `CCACHE_BASEDIR`
luser opened this issue · 28 comments
`CCACHE_BASEDIR` allows ccache to get cache hits for the same source files stored at different paths, which is useful for developers building from different source trees:
https://ccache.samba.org/manual.html#_configuration_settings
We want this so that we can get cross-branch cache hits on our buildbot builds, which have the branch name in their build directory, and so that local developers' builds can get cache hits from the S3 cache.
Can't we always strip the cwd by default and have this as an override? Maybe yet another env var.
I am looking at this seriously, and after doing some cleanup of my newbie Rust, it should be ready for a pull request tonight.
Do out-of-source object directories present a problem, given the way Firefox is currently built? Say I have `SCCACHE_BASEDIR=/home/jwatt` and the following directory structure:
```
home/
  jwatt/
    src/
      mozsrc1/
      mozsrc2/
    obj/
      mozsrc1-debug/
      mozsrc2-debug/
```
When compiling `/home/jwatt/obj/mozsrc1-debug/layout/svg/SVGImageContext.cpp` (or rather, the unified file that includes SVGImageContext.cpp), the compiler is invoked from `/home/jwatt/obj/mozsrc1-debug/layout/svg` with the include path `-I/home/jwatt/src/mozsrc1/layout/svg`. If that path is rewritten to the relative path `-I../../../../src/mozsrc1/layout/svg`, we are no better off, since the `mozsrc1` in the path will prevent a cache hit when the same file is built in `mozsrc2`.

To avoid that, it would seem like sccache would need to actually `cd` to `SCCACHE_BASEDIR`, rewrite the paths to be relative to that directory, and invoke the compiler from there (if that's possible without causing knock-on issues).
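For illustration, here is a minimal sketch of the kind of rewrite being discussed, assuming a single basedir; the helper name is hypothetical and this is not how sccache currently behaves:

```rust
use std::path::{Path, PathBuf};

// Hypothetical helper: rewrite a path argument relative to the configured
// basedir before it goes into the hash, leaving other paths untouched.
fn rewrite_for_hash(arg: &Path, basedir: &Path) -> PathBuf {
    match arg.strip_prefix(basedir) {
        // /home/jwatt/src/mozsrc1/layout/svg -> src/mozsrc1/layout/svg
        Ok(relative) => relative.to_path_buf(),
        // Not under the basedir: keep the argument as-is.
        Err(_) => arg.to_path_buf(),
    }
}
```

Note that, as described above, the `mozsrc1` component survives such a rewrite, so a plain basedir rewrite alone still does not give cross-tree cache hits.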
In #104 (comment), @glandium said:

> Furthermore, due to how sccache is used for Firefox builds, it would be better to have several base directories (an arbitrary number of them).
That seems incompatible with @jwatt's approach in #35 (comment).
@glandium, can you explain what you were thinking, and comment on @jwatt's approach?
Regarding @ncalexan's comment above:
I'm involved in a project that builds both locally for developers and using automated builders and we would like to share the cache among all users.
On Windows we currently pass full include paths via the compiler command line. Automated builders download Visual Studio and the Windows SDK as an archive and extract it to a specific directory. Individual developers often have Visual Studio and the Windows SDK installed in the default system locations. Our source code is then downloaded to a completely separate directory.
For example, a value like this on the automated builders:

`"-imsvcc:\\users\\User\\cache\\vs-professional-15.5.2-15063.468\\vc\\tools\\msvc\\14.12.25827\\atlmfc\\include"`

would be equivalent to this on a developer machine:

`"-imsvcC:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\Professional\\VC\\Tools\\MSVC\\14.12.25827\\atlmfc\\include"`
Being able to specify multiple basedir values would allow us to share the cache without requiring everyone to install files in the same locations.
On Linux you could solve that quite easily by keeping your build tools (including sccache) in a Docker image and mounting the sources into the container at the same location on all your machines.
To add an additional use case to this discussion: I am currently experimenting with building our internal libraries and tools with conan and just realized that effectively using sccache in this environment would need support for something like multiple base directories or something like path ignore patterns.
In short, when building a package `foo/1.2.3@user/channel` with conan, the source is located in `$CONAN_HOME/data/foo/1.2.3/user/channel/source` and the build directory is `$CONAN_HOME/data/foo/1.2.3/user/channel/build`. Given that these paths also affect the compiler flags (i.e., include paths), I therefore only get cache hits when I build exactly the same package and version in the same namespace. Using `$CONAN_HOME/data/<name>/<version>/<user>/<channel>` as base directory would enable cache hits across different versions of the package.
More than one base directory/ignore pattern is needed in the common case where the package additionally depends on other conan packages. For example, when `foo/1.2.3@user/channel` depends on `bar/4.5.6@otheruser/otherchannel`, additional include and link paths to `$CONAN_HOME/data/bar/4.5.6/otheruser/otherchannel/package/<someSHA>/` are provided to the compiler. Again, these paths have to be ignored in order to enable sccache hits.
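A rough sketch of the kind of normalization this would need, assuming a hypothetical list of configured prefixes that are replaced with stable placeholders before hashing (nothing like this exists in sccache today):

```rust
// Hypothetical: replace any of several configured prefixes with a stable
// placeholder so that equivalent paths under different roots (different
// Conan package versions, different toolchain install locations, ...)
// produce the same hash input.
fn normalize_for_hash(arg: &str, prefixes: &[(&str, &str)]) -> String {
    let mut normalized = arg.to_string();
    for (prefix, placeholder) in prefixes {
        normalized = normalized.replace(prefix, placeholder);
    }
    normalized
}

fn main() {
    let prefixes = [
        ("/home/user/.conan/data/foo/1.2.3/user/channel", "<PKG:foo>"),
        ("/home/user/.conan/data/foo/2.0.0/user/channel", "<PKG:foo>"),
    ];
    // Both arguments normalize to "-I<PKG:foo>/source/include".
    for arg in [
        "-I/home/user/.conan/data/foo/1.2.3/user/channel/source/include",
        "-I/home/user/.conan/data/foo/2.0.0/user/channel/source/include",
    ] {
        println!("{}", normalize_for_hash(arg, &prefixes));
    }
}
```

Collision handling (two different real paths mapping to the same normalized form) is the obvious design question with this kind of approach.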
Question: Was it ever considered to use sha256 sums for the files instead of their full paths? I know this is a different approach than what's taken in the ccache project, but I think it's much more reliable. The idea is that in the same way that every file path is mapped to an object in the cache, its sha256 sum could be used instead.
The primary problem is that we hash commandline arguments to the compiler, and often full source paths wind up there. We did take a change in #208 to fix one issue around this where full paths wound up in the C compiler's preprocessor output. I suspect we could fix this fairly easily for most cases nowadays--it'd mostly just mean filtering commandline arguments that have full paths.
So does it mean that when sccache is wrapping GCC, the path arguments it is given are never absolute and when it wraps rustc it may or may not be given absolute paths?
My suggestion is that, whether the paths are absolute or not, we read the input files (assuming we can determine the caller's working directory in case the paths are relative) and compute a sha256 sum of all of them. Then we map this sha256 hash to the cached output instead of mapping the command-line arguments to the output. This way, whenever there's a match in the hash, the cache could be used.
I'm no Rust developer and I'm not familiar at all with the internals of this project so maybe my idea will be hard to implement but I hope my explanation is good enough.
BTW, implementing this idea will mean that existing caches of users will no longer be valid so in a certain sense it'll be backwards incompatible, but not in usage.
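As a very rough sketch of this idea, assuming the `sha2` crate (sccache's real hashing is more involved, as the next comment points out), a content-based key might look like:

```rust
use sha2::{Digest, Sha256};
use std::{fs, io, path::Path};

// Sketch only: derive a cache key from the *contents* of the input files,
// so that identical sources stored at different paths hash identically.
fn content_key(inputs: &[&Path]) -> io::Result<String> {
    let mut hasher = Sha256::new();
    for path in inputs {
        hasher.update(fs::read(path)?);
    }
    // Render the digest as lowercase hex.
    Ok(hasher
        .finalize()
        .iter()
        .map(|byte| format!("{:02x}", byte))
        .collect())
}
```

The hard part in practice is knowing the full input set (headers pulled in by the preprocessor, etc.), which is one reason the existing implementation hashes the preprocessed output for C/C++ rather than a fixed list of files.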
@doronbehar It's a bit more complicated than that and I'd suggest you read the existing implementation first. We are already using the input files as part of the hash. For C/C++ compilation the `hash_key` function is the most interesting part (lines 482 to 511 in 5855673).
For Rust compilation the `generate_hash_key` function is what you want (lines 926 to 1129 in 5855673).
@luser thanks for pointing to the `hash_key` function in `c.rs`, as I'm also looking into this.

From a limited test run that I've performed, the contents of the `arguments` variable passed to `hash_key` don't have paths in them (such as `-I`, or the path to the input file). Instead, I can see that at the call site (lines 270 to 277 in 5855673) only "common args" are passed. When debugging my case, I can see that these are flags such as warning flags, the optimization level, the `-std` C++ standard flag, and so on. None of these contain paths in them, although obviously that could just be the project I'm compiling.
Digging a bit deeper, I can see that the flags with paths in them wind up as preprocessor flags, and the input file itself is also handled differently. Am I correct to assume the following:

- The input file itself is hashed, rather than its location, so its location is not relevant for the purposes of hashing.
- The flags used for preprocessing are not hashed, but rather the preprocessed output is. In which case, I suspect things like `__FILE__` macros expanding can end up resulting in a different hash when the same sources are in different places?
Hi, I am wondering what the status of `SCCACHE_BASEDIR` is. I have two machines compiling the same project under different paths and was wondering how I can avoid cache misses.
One way to avoid the problem altogether is to pass flags to your compiler that normalize file paths, e.g. `-fdebug-prefix-map` with GCC or `--remap-path-prefix` with rustc.
But aren't the command lines hashed in sccache? The paths would still be different.
You mean like this? `sccache clang -c test.cpp --remap-path-prefix...`
Is there any update to this?
Another user wanting to use sccache with Conan reporting in, thanks @niosHD for doing some groundwork!
I guess a working (easy to activate?) Conan integration for sccache could be a very useful feature for many users. As Conan uses often-changing absolute build paths, but always structured in the same manner, it should be possible to implement it in a standard way.
Conan is also getting traction, so a feature like that could be a win-win for both tools.
> But aren't the command lines hashed in sccache? The paths would still be different.
Presumably you'd also have to add special handling in sccache for those arguments so that, e.g., for `-fdebug-prefix-map=OLD=NEW`, you'd effectively wind up hashing `-fdebug-prefix-map=NEW`.
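A small sketch of that special handling (hypothetical helpers, not current sccache behaviour): parse the `OLD=NEW` mapping, hash only the `NEW` side of the flag itself, and run other path arguments through the same mapping before hashing.

```rust
// Hypothetical: turn a -fdebug-prefix-map=OLD=NEW flag into the form that
// should actually be fed into the hash (the OLD side differs per checkout).
fn hashable_flag(flag: &str) -> String {
    if let Some(mapping) = flag.strip_prefix("-fdebug-prefix-map=") {
        if let Some((_old, new)) = mapping.split_once('=') {
            return format!("-fdebug-prefix-map={}", new);
        }
    }
    flag.to_string()
}

// Hypothetical: normalize any other path argument through the same mapping
// before it is hashed.
fn hashable_path(path: &str, old: &str, new: &str) -> String {
    path.replacen(old, new, 1)
}
```

The same shape would apply to rustc's `--remap-path-prefix OLD=NEW`.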
> Conan is also getting traction, so a feature like that could be a win-win for both tools.
Mozilla currently has very little interest in a feature like this, so Mozilla is unlikely to do development on this particular issue.
I gave `--remap-path-prefix` a try and it causes everything to be a cache miss, likely because of this code: line 984 in 385f738.
I think one possible way to solve this issue would be to implement path prefix support for sccache:
- Require the path prefix mapping to be reversible (two mappings shouldn't collide) for caching
- Include the prefix mapping as needed in metadata for build reproducibility, but exclude the mapping from any hashes.
- Any paths included in hashes should be filtered through the prefix mapping.
- The prefix mapping would need to apply to outputs as well as inputs; this may require support/changes in rustc.
FWIW, the distributed compilation code already uses `--remap-path-prefix` internally to make sure that paths in the compiler outputs match what would have been produced if the compilation had been run locally: line 1628 in 385f738.
> On Linux you could solve that quite easily by keeping your build tools (including sccache) in a Docker image and mounting the sources into the container at the same location on all your machines.
... or use symlinks
> things like `__FILE__` macros expanding can end up resulting in a different hash when the same sources are in different places?

`__FILE__` expands to the input path given to the compiler.
> sccache would need to actually `cd` to `SCCACHE_BASEDIR`, rewrite the paths to be relative to that directory, and invoke the compiler from there (if that's possible without causing knock-on issues).

That would produce prettier paths than `../../../`.
Note: when `-o path/to/output` is NOT set, gcc writes to the current workdir, so `-o` should always be set explicitly (we must parse compiler options anyway ...).
> I am currently experimenting with building our internal libraries and tools with conan and just realized that effectively using sccache in this environment would need support for something like multiple base directories or something like path ignore patterns.

Looks good on paper. How do we handle collisions?
For comparison, nix solves this with a build sandbox (`/tmp/nix-build-name-version-drv-0` appears as `/build`) and by storing compiled dependencies in `/nix/store`. But this is overhead for large projects and many small iterations.
> Question: Was it ever considered to use sha256 sums for the files instead of their full paths?

The compiler needs the actual path (relative or absolute), since `#include "path.h"` is relative to the source file. But then we could add `$(dirname $sourceFile)` to the include paths ...
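A tiny sketch of that idea (hypothetical helper): if the compiler is no longer handed the original source location, quote-includes that normally resolve relative to the source file can still be found by passing the source file's directory as an explicit include path.

```rust
use std::ffi::OsString;
use std::path::Path;

// Hypothetical: build a `-I<dirname of the source file>` argument so that
// `#include "header.h"` next to the source still resolves.
fn source_dir_include(source: &Path) -> Option<OsString> {
    source.parent().map(|dir| {
        let mut arg = OsString::from("-I");
        arg.push(dir);
        arg
    })
}
```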
We use `--remap-path-prefix` in all of our projects so we always produce reproducible builds. It's unfortunate that this prevents sccache from being useful.
Could anyone share how to use `--remap-path-prefix` to avoid the absolute paths problem when using sccache with local disk storage? Thanks!
Sccache would be really helpful for local builds of Rust projects, especially when you have a great many with common dependencies, and especially in a team environment.
Unfortunately the lack of this feature makes that quite painful in practice. So I'd love to better understand its status. In particular, is it stalled because:
- There's just nobody with time to implement it who is also invested in this use case?
- There are difficult design questions that don't yet have clear answers?
- There's some kind of technical or philosophical blocker to it being accepted by the primary maintainers?
I'm just having trouble figuring out which one (or N) it is.
Thanks!
> Rust projects, especially when you have a great many with common dependencies, and especially in a team environment.

While sccache is awesome, there are unfortunately a few caveats with Rust. I think it is good to be aware of those before hoping it solves all the problems (sorry).
In a nutshell, and unless you take good care of aligning things, sccache will likely not work as you'd expect if:
- members of your team use different OSes
- members of your team have different user accounts (I bet it is the case...)
- even using the same OS, it will likely NOT work unless you take special measures, on macOS for instance (i.e. creating a hard link and letting Rust use that location for all members)
- you use a mix of rustc versions (sccache will not help across versions)
That's a lot of caveats, and in the meantime the Rust compiler has improved a lot. Nowadays, unless you have a very specific environment that ticks all the boxes above, you will likely be better off not using sccache.
I am saying this based on tests I made in 2021, testing locally on macOS with Redis, minio and memcached.
That may explain the low activity.
In the end, if sccache works for you, great; if not, you can also check cargo remote, which can also help for teams.
> In a nutshell, and unless you take good care of aligning things, sccache will likely not work as you'd expect if:
In my case all these things are in fact aligned except for user accounts and the other reasons for paths to differ. Is there a problem with having different user accounts other than it making paths different?
As for differing paths, I had thought that is what this whole issue is about. Perhaps I misunderstood?
I'm thinking of making my own rustc wrapper that takes care of setting up an environment for sccache (files in a consistent location, maybe even doing a chroot or something if necessary) that ensures a cache hit whenever it should be possible. It might end up being the simplest path forward, as it sidesteps the questions about what is the correct way to handle all the messy details of path remapping.
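For what it's worth, here's a minimal sketch of such a wrapper, assuming cargo's `RUSTC_WRAPPER` convention (the real `rustc` path arrives as the first argument); the `/remapped` target is just an example value:

```rust
use std::{env, process::{exit, Command}};

// Minimal RUSTC_WRAPPER sketch: remap the current workspace path so absolute
// paths never reach rustc unmapped, then delegate to the real compiler.
fn main() {
    let mut args = env::args_os().skip(1); // skip this wrapper's own name
    let rustc = args.next().expect("cargo passes the rustc path first");
    let cwd = env::current_dir().expect("no current directory");

    let status = Command::new(rustc)
        .args(args)
        .arg(format!("--remap-path-prefix={}=/remapped", cwd.display()))
        .status()
        .expect("failed to run rustc");
    exit(status.code().unwrap_or(1));
}
```

In practice this would chain through sccache rather than call rustc directly, and as noted earlier in the thread, sccache currently hashes the remap flag itself, so the per-machine `OLD` side would still defeat cache sharing unless that flag gets special handling.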
> In my case all these things are in fact aligned except for user accounts and the other reasons for paths to differ. Is there a problem with having different user accounts other than it making paths different?
Yes, this is why I mentioned it :)
The cache won't hit: e.g. Alice has her projects under `$HOME/foo` and Bob under `$HOME/foo`, but their `$HOME` directories differ, so the absolute paths differ.

On Linux, it is rather simple (technically) if you can align all users to keep their projects in something like `/path/to/projects` and not `$HOME/some/path`.

On macOS it is less trivial, since macOS does NOT let users create hard links, especially not at the root, unless you do some trickery that can be dangerous (i.e. temporarily brick your disk, ask me how I know...).
> I'm thinking of making my own rustc wrapper ...
That sounds interesting and I would love to hear about the journey. I definitely do NOT want to discourage you, just to set expectations so you know that a few things that sound trivial... will not be. I am not saying you cannot get it to work, if indeed your team is not too "wildly spread" regarding how the environments are set up.
> the simplest path forward, ...
One of the main issues you will face is that Alice and Bob will not use (all the time) the same Rust version. They may even both be on nightly, but Alice updates in the morning and Bob in the evening. That's however still an easy and good case: Alice will build up the cache and Bob will benefit.
Issues arise if, for whatever reason, those users decide to use different versions, and it is hard to force a team (presumably working on several projects...) to use a unified and synchronized version of Rust. Not impossible, and some tools can help, but it is often not the case by default.
> That sounds interesting and I would love to hear about the journey.
Well, I guess if I have any success you're bound to hear about it. (But yes, I would report back here as well.) I have some experience building vaguely related sorts of tools, so I expect complications but am also confident I could find ways around them. It's more a question of getting around to it... (very limited time right now)
> Rust version
In my case this is simple: pretty much everything is in a monorepo with a single pinned version that gets bumped by a bot (via a pull request) when a new stable Rust release comes out. We used to have some exceptions, but I'm pretty sure we deleted the last one recently. In this regard at least, you could say that we are playing on easy mode due to the rigid homogeneity of our dev environments.