mozilla/sccache

Implement an equivalent to `CCACHE_BASEDIR`

luser opened this issue · 28 comments

luser commented

CCACHE_BASEDIR allows ccache to get cache hits for the same source files stored at different paths, which is useful for developers building from different source trees:
https://ccache.samba.org/manual.html#_configuration_settings

We want this so that we can get cross-branch cache hits on our buildbot builds, which have the branch name in their build directory, and we also want this so we can get cache hits from the S3 cache for local developers' builds.

Can't we always strip the cwd by default and have an override for this? Maybe yet another env var.
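
For illustration, here is a minimal Rust sketch of what a `CCACHE_BASEDIR`-style rewrite does before hashing (the function name `rewrite_for_hash` and the example paths are hypothetical, not sccache's actual API): paths under the base directory are hashed in relative form, everything else is left untouched.

```rust
use std::path::{Path, PathBuf};

// Hypothetical sketch: any absolute path under `basedir` is replaced by
// its form relative to `basedir` before being fed into the hash, so the
// same source tree checked out under different base paths hashes alike.
fn rewrite_for_hash(basedir: &Path, arg_path: &Path) -> PathBuf {
    match arg_path.strip_prefix(basedir) {
        Ok(rel) => rel.to_path_buf(),     // under basedir: hash the relative form
        Err(_) => arg_path.to_path_buf(), // elsewhere: leave untouched
    }
}

fn main() {
    let base = Path::new("/builds/branch-a");
    assert_eq!(
        rewrite_for_hash(base, Path::new("/builds/branch-a/src/main.c")),
        PathBuf::from("src/main.c")
    );
    // A path outside the base directory is left alone.
    assert_eq!(
        rewrite_for_hash(base, Path::new("/usr/include")),
        PathBuf::from("/usr/include")
    );
}
```

Note this alone does not help when the differing component (e.g., a branch name) sits *below* the base directory, which is exactly the problem discussed next.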

gumpt commented

I am looking at this seriously, and after doing some cleanup of my newbie Rust, it should be ready for a pull request tonight.

jwatt commented

Do out-of-source object directories present a problem given the way the Firefox source is currently built? Say I have `SCCACHE_BASEDIR=/home/jwatt` and the following directory structure:

home
  jwatt
    src
      mozsrc1
      mozsrc2
    obj
      mozsrc1-debug
      mozsrc2-debug

When compiling /home/jwatt/obj/mozsrc1-debug/layout/svg/SVGImageContext.cpp (or rather, the unified file that includes SVGImageContext.cpp), the compiler is invoked from /home/jwatt/obj/mozsrc1-debug/layout/svg with the include path -I/home/jwatt/src/mozsrc1/layout/svg. If that path is rewritten to the relative path -I../../../../mozsrc1/layout/svg we are no better off since the mozsrc1 in the path will prevent a cache hit when the same file is built in mozsrc2.

To avoid that it would seem like sccache would need to actually cd to SCCACHE_BASEDIR, rewrite the paths to be relative to that directory, and invoke the compiler from there (if that's possible without causing knock-on issues).

In #104 (comment), @glandium said,

Furthermore, due to how sccache is used for Firefox builds, it would be better to have several base directories (an arbitrary number of them).

That seems incompatible with @jwatt's approach in #35 (comment).

@glandium, can you explain what you were thinking, and comment on @jwatt's approach?

In regards to @ncalexan's comment above:

I'm involved in a project that builds both locally for developers and using automated builders and we would like to share the cache among all users.

On Windows we currently pass full include paths via the compiler command line. Automated builders download Visual Studio and the Windows SDK as an archive and extract them to a specific directory. Individual developers often have Visual Studio and the Windows SDK installed in the default system locations. Our source code is then downloaded in a completely separate directory.

For example, a value like this on the automated builders:

"-imsvcc:\\users\\User\\cache\\vs-professional-15.5.2-15063.468\\vc\\tools\\msvc\\14.12.25827\\atlmfc\\include"

Would be equivalent to this on a developer machine:

"-imsvcC:\\Program Files (x86)\\Microsoft Visual Studio\\2017\\Professional\\VC\\Tools\\MSVC\\14.12.25827\\atlmfc\\include"

Being able to specify multiple basedir values would allow us to share the cache without requiring everyone to install files in the same locations.
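
The multiple-basedir idea above could be sketched as mapping each configured prefix to a stable placeholder before hashing. This is a hypothetical sketch, not sccache's implementation: the function name `normalize` and the `<msvc>` placeholder are mine, and a real Windows version would additionally need case-insensitive comparison and backslash handling, which are skipped here.

```rust
// Hypothetical sketch of "multiple base directories": each configured
// prefix is rewritten to a shared placeholder before hashing, so the
// same toolchain installed at different locations hashes identically.
fn normalize(prefixes: &[(&str, &str)], path: &str) -> String {
    for &(prefix, placeholder) in prefixes {
        if let Some(rest) = path.strip_prefix(prefix) {
            return format!("{}{}", placeholder, rest);
        }
    }
    path.to_string() // no configured prefix matched: hash as-is
}

fn main() {
    let maps = [
        ("c:/users/user/cache/vs/", "<msvc>/"),
        ("c:/program files (x86)/microsoft visual studio/2017/professional/vc/", "<msvc>/"),
    ];
    // Builder and developer install paths collapse to the same hash input.
    assert_eq!(normalize(&maps, "c:/users/user/cache/vs/tools/include"), "<msvc>/tools/include");
    assert_eq!(
        normalize(&maps, "c:/program files (x86)/microsoft visual studio/2017/professional/vc/tools/include"),
        "<msvc>/tools/include"
    );
}
```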

On Linux you could solve that quite easily by keeping your build tools in a docker image that includes sccache, and mounting the sources into the container at the same location on all your machines.

To add an additional use case to this discussion: I am currently experimenting with building our internal libraries and tools with conan, and just realized that effectively using sccache in this environment would need support for multiple base directories or something like path ignore patterns.

In short, when building a package foo/1.2.3@user/channel with conan, the source is located in $CONAN_HOME/data/foo/1.2.3/user/channel/source and the build directory is $CONAN_HOME/data/foo/1.2.3/user/channel/build. Given that these paths also affect the compiler flags (i.e., include paths), I only get cache hits when I build exactly the same package and version in the same namespace. Using $CONAN_HOME/data/<name>/<version>/<user>/<channel> as a base directory would enable cache hits across different versions of the package.

More than one base directory/ignore pattern is needed in the common case where the package additionally depends on other conan packages. For example, when foo/1.2.3@user/channel depends on bar/4.5.6@otheruser/otherchannel, additional include and link paths to $CONAN_HOME/data/bar/4.5.6/otheruser/otherchannel/package/<someSHA>/ are provided to the compiler. Again, these paths have to be ignored in order to enable sccache hits.
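
One possible shape for such an "ignore pattern" on the conan layout, as a hypothetical Rust sketch (the function name `mask_conan_path` and the `<conan>`/`*` placeholders are mine): the four `<name>/<version>/<user>/<channel>` components under the conan data root are collapsed to wildcards before hashing.

```rust
use std::path::{Component, Path, PathBuf};

// Hypothetical sketch: mask the name/version/user/channel components of a
// conan cache path so foo/1.2.3 and foo/1.2.4 yield the same hash input.
fn mask_conan_path(conan_data: &Path, path: &Path) -> PathBuf {
    match path.strip_prefix(conan_data) {
        Ok(rest) => {
            let mut out = PathBuf::from("<conan>");
            for (i, comp) in rest.components().enumerate() {
                if i < 4 {
                    out.push("*"); // name/version/user/channel
                } else if let Component::Normal(c) = comp {
                    out.push(c); // keep the rest (source/build/... subpaths)
                }
            }
            out
        }
        Err(_) => path.to_path_buf(), // not under the conan cache: leave as-is
    }
}

fn main() {
    let masked = mask_conan_path(
        Path::new("/conan/data"),
        Path::new("/conan/data/foo/1.2.3/user/channel/source/a.h"),
    );
    assert_eq!(masked, PathBuf::from("<conan>/*/*/*/*/source/a.h"));
}
```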

Question: Was it ever considered to use sha256 sums for the files instead of their full paths? I know this is a different approach than what's taken in the ccache project, but I think it's much more reliable. The idea is that, the same way every file path is mapped to an object in the cache, its sha256 sum could be used instead.

luser commented

The primary problem is that we hash commandline arguments to the compiler, and often full source paths wind up there. We did take a change in #208 to fix one issue around this where full paths wound up in the C compiler's preprocessor output. I suspect we could fix this fairly easily for most cases nowadays--it'd mostly just mean filtering commandline arguments that have full paths.

So does it mean that when sccache is wrapping GCC, the path arguments it is given are never absolute, and when it wraps rustc it may or may not be given absolute paths?

My suggestion is that, whether the paths are absolute or not, we can read the files in the input (assuming we can know the caller's working directory in case the paths are relative) and compute a sha256 sum of all of them. Then we can map this sha256 hash to the cached output, instead of mapping the command-line arguments to the output. This way, whenever there's a match in the hash, the cache could be used.

I'm no Rust developer and I'm not familiar at all with the internals of this project so maybe my idea will be hard to implement but I hope my explanation is good enough.

BTW, implementing this idea would mean that users' existing caches will no longer be valid, so in a certain sense it'll be backwards incompatible, but not in usage.
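
The content-keying proposal can be sketched in a few lines. This is a toy illustration, not sccache's code: `content_key` is a hypothetical name, and std's `DefaultHasher` stands in for sha256, which the Rust standard library does not provide.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::{Hash, Hasher};
use std::path::Path;

// Sketch of the proposal: derive the cache key from file *contents*,
// not from the paths on the command line.
fn content_key(inputs: &[&Path]) -> std::io::Result<u64> {
    let mut h = DefaultHasher::new();
    for p in inputs {
        fs::read(p)?.hash(&mut h); // hash the bytes, ignore where they live
    }
    Ok(h.finish())
}

fn main() -> std::io::Result<()> {
    // Two copies of the same source in different locations...
    let dir = std::env::temp_dir();
    let (a, b) = (dir.join("tree_a_main.c"), dir.join("tree_b_main.c"));
    fs::write(&a, b"int main(){return 0;}")?;
    fs::write(&b, b"int main(){return 0;}")?;
    // ...produce the same key, so the cache would hit across trees.
    assert_eq!(content_key(&[a.as_path()])?, content_key(&[b.as_path()])?);
    Ok(())
}
```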

luser commented

@doronbehar It's a bit more complicated than that and I'd suggest you read the existing implementation first. We are already using the input files as part of the hash. For C/C++ compilation the hash_key function is the most interesting part:

sccache/src/compiler/c.rs

Lines 482 to 511 in 5855673

/// Compute the hash key of `compiler` compiling `preprocessor_output` with `args`.
pub fn hash_key(compiler_digest: &str,
                language: Language,
                arguments: &[OsString],
                extra_hashes: &[String],
                env_vars: &[(OsString, OsString)],
                preprocessor_output: &[u8]) -> String
{
    // If you change any of the inputs to the hash, you should change `CACHE_VERSION`.
    let mut m = Digest::new();
    m.update(compiler_digest.as_bytes());
    m.update(CACHE_VERSION);
    m.update(language.as_str().as_bytes());
    for arg in arguments {
        arg.hash(&mut HashToDigest { digest: &mut m });
    }
    for hash in extra_hashes {
        m.update(hash.as_bytes());
    }
    for &(ref var, ref val) in env_vars.iter() {
        if CACHED_ENV_VARS.contains(var.as_os_str()) {
            var.hash(&mut HashToDigest { digest: &mut m });
            m.update(&b"="[..]);
            val.hash(&mut HashToDigest { digest: &mut m });
        }
    }
    m.update(preprocessor_output);
    m.finish()
}

For Rust compilation the generate_hash_key function is what you want:

sccache/src/compiler/rust.rs

Lines 926 to 1129 in 5855673

fn generate_hash_key(self: Box<Self>,
                     creator: &T,
                     cwd: PathBuf,
                     env_vars: Vec<(OsString, OsString)>,
                     _may_dist: bool,
                     pool: &CpuPool)
                     -> SFuture<HashResult>
{
    let me = *self;
    let RustHasher {
        executable,
        host,
        sysroot,
        compiler_shlibs_digests,
        #[cfg(feature = "dist-client")]
        rlib_dep_reader,
        parsed_args:
            ParsedArguments {
                arguments,
                output_dir,
                externs,
                crate_link_paths,
                staticlibs,
                crate_name,
                crate_types,
                dep_info,
                emit,
                color_mode: _,
            },
    } = me;
    trace!("[{}]: generate_hash_key", crate_name);
    // TODO: this doesn't produce correct arguments if they should be concatenated - should use iter_os_strings
    let os_string_arguments: Vec<(OsString, Option<OsString>)> = arguments.iter()
        .map(|arg| (arg.to_os_string(), arg.get_data().cloned().map(IntoArg::into_arg_os_string)))
        .collect();
    // `filtered_arguments` omits --emit and --out-dir arguments.
    // It's used for invoking rustc with `--emit=dep-info` to get the list of
    // source files for this crate.
    let filtered_arguments = os_string_arguments.iter()
        .filter_map(|&(ref arg, ref val)| {
            if arg == "--emit" || arg == "--out-dir" {
                None
            } else {
                Some((arg, val))
            }
        })
        .flat_map(|(arg, val)| Some(arg).into_iter().chain(val))
        .map(|a| a.clone())
        .collect::<Vec<_>>();
    // Find all the source files and hash them
    let source_hashes_pool = pool.clone();
    let source_files = get_source_files(creator, &crate_name, &executable, &filtered_arguments, &cwd, &env_vars, pool);
    let source_files_and_hashes = source_files
        .and_then(move |source_files| {
            hash_all(&source_files, &source_hashes_pool).map(|source_hashes| (source_files, source_hashes))
        });
    // Hash the contents of the externs listed on the commandline.
    trace!("[{}]: hashing {} externs", crate_name, externs.len());
    let abs_externs = externs.iter().map(|e| cwd.join(e)).collect::<Vec<_>>();
    let extern_hashes = hash_all(&abs_externs, pool);
    // Hash the contents of the staticlibs listed on the commandline.
    trace!("[{}]: hashing {} staticlibs", crate_name, staticlibs.len());
    let abs_staticlibs = staticlibs.iter().map(|s| cwd.join(s)).collect::<Vec<_>>();
    let staticlib_hashes = hash_all(&abs_staticlibs, pool);
    let creator = creator.clone();
    let hashes = source_files_and_hashes.join3(extern_hashes, staticlib_hashes);
    Box::new(hashes.and_then(move |((source_files, source_hashes), extern_hashes, staticlib_hashes)|
                             -> SFuture<_> {
        // If you change any of the inputs to the hash, you should change `CACHE_VERSION`.
        let mut m = Digest::new();
        // Hash inputs:
        // 1. A version
        m.update(CACHE_VERSION);
        // 2. compiler_shlibs_digests
        for d in compiler_shlibs_digests {
            m.update(d.as_bytes());
        }
        let weak_toolchain_key = m.clone().finish();
        // 3. The full commandline (self.arguments)
        // TODO: there will be full paths here, it would be nice to
        // normalize them so we can get cross-machine cache hits.
        // A few argument types are not passed in a deterministic order
        // by cargo: --extern, -L, --cfg. We'll filter those out, sort them,
        // and append them to the rest of the arguments.
        let args = {
            let (mut sortables, rest): (Vec<_>, Vec<_>) = os_string_arguments.iter()
                // We exclude a few arguments from the hash:
                //   -L, --extern, --out-dir
                // These contain paths which aren't relevant to the output, and the compiler inputs
                // in those paths (rlibs and static libs used in the compilation) are used as hash
                // inputs below.
                .filter(|&&(ref arg, _)| {
                    !(arg == "--extern" || arg == "-L" || arg == "--out-dir")
                })
                // A few argument types were not passed in a deterministic order
                // by older versions of cargo: --extern, -L, --cfg. We'll filter the rest of those
                // out, sort them, and append them to the rest of the arguments.
                .partition(|&&(ref arg, _)| arg == "--cfg");
            sortables.sort();
            rest.into_iter()
                .chain(sortables)
                .flat_map(|&(ref arg, ref val)| {
                    iter::once(arg).chain(val.as_ref())
                })
                .fold(OsString::new(), |mut a, b| {
                    a.push(b);
                    a
                })
        };
        args.hash(&mut HashToDigest { digest: &mut m });
        // 4. The digest of all source files (this includes src file from cmdline).
        // 5. The digest of all files listed on the commandline (self.externs).
        // 6. The digest of all static libraries listed on the commandline (self.staticlibs).
        for h in source_hashes.into_iter().chain(extern_hashes).chain(staticlib_hashes) {
            m.update(h.as_bytes());
        }
        // 7. Environment variables. Ideally we'd use anything referenced
        // via env! in the program, but we don't have a way to determine that
        // currently, and hashing all environment variables is too much, so
        // we'll just hash the CARGO_ env vars and hope that's sufficient.
        // Upstream Rust issue tracking getting information about env! usage:
        // https://github.com/rust-lang/rust/issues/40364
        let mut env_vars: Vec<_> = env_vars.iter()
            // Filter out RUSTC_COLOR since we control color usage with command line flags.
            // rustc reports an error when both are present.
            .filter(|(ref k, _)| k != "RUSTC_COLOR")
            .cloned()
            .collect();
        env_vars.sort();
        for &(ref var, ref val) in env_vars.iter() {
            // CARGO_MAKEFLAGS will have jobserver info which is extremely non-cacheable.
            if var.starts_with("CARGO_") && var != "CARGO_MAKEFLAGS" {
                var.hash(&mut HashToDigest { digest: &mut m });
                m.update(b"=");
                val.hash(&mut HashToDigest { digest: &mut m });
            }
        }
        // 8. The cwd of the compile. This will wind up in the rlib.
        cwd.hash(&mut HashToDigest { digest: &mut m });
        // Turn arguments into a simple Vec<OsString> to calculate outputs.
        let flat_os_string_arguments: Vec<OsString> = os_string_arguments.into_iter()
            .flat_map(|(arg, val)| iter::once(arg).into_iter().chain(val))
            .collect();
        Box::new(get_compiler_outputs(&creator, &executable, &flat_os_string_arguments, &cwd, &env_vars).map(move |mut outputs| {
            if emit.contains("metadata") {
                // rustc currently does not report rmeta outputs with --print file-names
                // --emit metadata the rlib is printed, and with --emit metadata,link
                // only the rlib is printed.
                let rlibs: HashSet<_> = outputs.iter().cloned().filter(|p| {
                    p.ends_with(".rlib")
                }).collect();
                for lib in rlibs {
                    let rmeta = lib.replacen(".rlib", ".rmeta", 1);
                    // Do this defensively for future versions of rustc that may
                    // be fixed.
                    if !outputs.contains(&rmeta) {
                        outputs.push(rmeta);
                    }
                    if !emit.contains("link") {
                        outputs.retain(|p| *p != lib);
                    }
                }
            }
            let output_dir = PathBuf::from(output_dir);
            // Convert output files into a map of basename -> full path.
            let mut outputs = outputs.into_iter()
                .map(|o| {
                    let p = output_dir.join(&o);
                    (o, p)
                })
                .collect::<HashMap<_, _>>();
            let dep_info = if let Some(dep_info) = dep_info {
                let p = output_dir.join(&dep_info);
                outputs.insert(dep_info.to_string_lossy().into_owned(), p.clone());
                Some(p)
            } else {
                None
            };
            let mut arguments = arguments;
            // Always request color output, the client will strip colors if needed.
            arguments.push(Argument::WithValue("--color", ArgData::Color("always".into()), ArgDisposition::Separated));
            let inputs = source_files.into_iter().chain(abs_externs).chain(abs_staticlibs).collect();
            HashResult {
                key: m.finish(),
                compilation: Box::new(RustCompilation {
                    executable: executable,
                    host,
                    sysroot: sysroot,
                    arguments: arguments,
                    inputs: inputs,
                    outputs: outputs,
                    crate_link_paths,
                    crate_name,
                    crate_types,
                    dep_info,
                    cwd,
                    env_vars,
                    #[cfg(feature = "dist-client")]
                    rlib_dep_reader,
                }),
                weak_toolchain_key,
            }
        }))
    }))
}

@luser thanks for pointing to the hash_key function in c.rs, as I'm also looking into this.

From a limited test run that I've performed, the contents of the arguments variable passed to hash_key don't have paths in them (such as -I, or the path to the input file).

Instead, I can see that on the function call:

sccache/src/compiler/c.rs

Lines 270 to 277 in 5855673

let key = {
    hash_key(&executable_digest,
             parsed_args.language,
             &parsed_args.common_args,
             &extra_hashes,
             &env_vars,
             &preprocessor_result.stdout)
};

only "common args" are passed. When debugging my case, I can see that these are flags such as warning flags, the optimization level, the -std C++ standard flag and so on. None of these contain paths, although obviously that could depend on the project I'm compiling.

Digging a bit deeper, I can see that the flags with paths in them wind up as preprocessor flags, and the input file itself is also handled differently. Am I correct to assume the following:

    1. The input file itself is hashed, rather than its location, so its location is not relevant for the purposes of hashing.
    2. The flags used for preprocessing are not hashed, but rather the preprocessed output is. In which case, I suspect things like __FILE__ macro expansion can end up resulting in a different hash when the same sources are in different places?
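
The second point can be illustrated with a toy sketch (function name `digest` and the example strings are hypothetical; std's `DefaultHasher` stands in for sccache's digest): since the preprocessed output embeds the expansion of `__FILE__`, identical sources at different paths can produce different hash inputs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy stand-in for hashing a preprocessor's output.
fn digest(preprocessed: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    preprocessed.hash(&mut h);
    h.finish()
}

fn main() {
    // The same translation unit, but __FILE__ expanded to different paths.
    let out_a = b"const char *f = \"/home/a/src/x.c\"; int x = 1;";
    let out_b = b"const char *f = \"/home/b/src/x.c\"; int x = 1;";
    // Different hash input, hence a cache miss despite identical code.
    assert_ne!(digest(out_a), digest(out_b));
}
```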

Hi, I am wondering what the status of SCCACHE_BASEDIR is. I have two machines compiling the same project under different paths and was wondering how I can avoid cache misses.

One way to avoid the problem altogether is to pass flags to your compiler that normalize file paths. e.g. -fdebug-prefix-map with GCC or --remap-path-prefix with rustc.

But aren't the command lines hashed in sccache? The paths would be still different

You mean like this?
sccache clang -c test.cpp --remap-path-prefix...

Is there any update to this?

Another user wanting to use sccache with Conan reporting in, thanks @niosHD for doing some groundwork!

I guess a working (easy to activate?) Conan integration for sccache could be a very useful feature for many users. As Conan uses often-changing absolute build paths, but always in the same manner, it should be possible to implement it in a standard way.

Conan is also getting traction, so a feature like that could be a win-win for both tools.

But aren't the command lines hashed in sccache? The paths would be still different

Presumably you'd also have to add special handling in sccache for those arguments so that, e.g., for -fdebug-prefix-map=OLD=NEW, you'd effectively wind up hashing -fdebug-prefix-map=NEW.
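
That special handling could look something like the following hypothetical sketch (the function name `hashable_form` is mine, not sccache's API): for `-fdebug-prefix-map=OLD=NEW`, only the NEW half contributes to the hash, so the argument hashes identically regardless of the local OLD prefix.

```rust
// Hypothetical sketch: strip the machine-specific OLD half of
// -fdebug-prefix-map=OLD=NEW before hashing the argument.
fn hashable_form(arg: &str) -> String {
    const FLAG: &str = "-fdebug-prefix-map=";
    if let Some(mapping) = arg.strip_prefix(FLAG) {
        if let Some((_old, new)) = mapping.split_once('=') {
            return format!("{}{}", FLAG, new);
        }
    }
    arg.to_string() // every other argument is hashed unchanged
}

fn main() {
    // Two machines with different checkouts hash the same bytes.
    assert_eq!(
        hashable_form("-fdebug-prefix-map=/home/alice/proj=/src"),
        hashable_form("-fdebug-prefix-map=/home/bob/work=/src")
    );
    assert_eq!(hashable_form("-O2"), "-O2"); // other flags untouched
}
```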

Conan is also getting traction, so a feature like that could be a win-win for both tools.

Mozilla currently has very little interest in a feature like this, so Mozilla is unlikely to do development on this particular issue.

I gave --remap-path-prefix a try and it causes everything to be a cache miss likely because of this code:

take_arg!("--remap-path-prefix", OsString, CanBeSeparated('='), TooHard),

I think one possible way to solve this issue would be to implement path prefix support for sccache:

  1. Require the path prefix mapping to be reversible (two mappings shouldn't collide) for caching
  2. Include the prefix mapping as needed in metadata for build reproducibility, but exclude the mapping from any hashes.
  3. Any paths included in hashes should be filtered through the prefix mapping.
  4. The prefix mapping would need to apply to outputs as well as inputs, this may require support / changes in rustc.
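
Point 1 above, the reversibility requirement, could be checked roughly like this (hypothetical sketch; `is_reversible` is my name for it): a set of OLD→NEW pairs is only reversible if no two pairs rewrite onto the same NEW prefix, otherwise a path stored in the cache could not be mapped back to a unique local path.

```rust
use std::collections::HashSet;

// Hypothetical sketch: a prefix map collides (is not reversible) if two
// different local prefixes are rewritten to the same placeholder.
fn is_reversible(mappings: &[(&str, &str)]) -> bool {
    let mut seen = HashSet::new();
    mappings.iter().all(|&(_old, new)| seen.insert(new))
}

fn main() {
    assert!(is_reversible(&[("/home/alice", "/src"), ("/opt/tc", "/toolchain")]));
    // Two local prefixes mapped onto the same placeholder collide.
    assert!(!is_reversible(&[("/home/alice", "/src"), ("/home/bob", "/src")]));
}
```
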
luser commented

FWIW, the distributed compilation code already uses --remap-path-prefix internally to make sure that paths in the compiler outputs match what would have been produced if the compilation had been run locally:

dist_arguments.push(format!("--remap-path-prefix={}={}", &dist_path, local_path));

On Linux you could solve that quite easily by keeping your build tools in a docker image including sccache and mount the sources into the container an all your machines at the same location.

... or use symlinks

things like __FILE__ macros expanding can end up resulting in a different hash when same sources are in different places?

__FILE__ expands to the input path given to the compiler

sccache would need to actually cd to SCCACHE_BASEDIR, rewrite the paths to be relative to that directory, and invoke the compiler from there (if that's possible without causing knock-on issues).

would produce prettier paths than ../../../.
note: when -o path/to/output is NOT set, gcc writes to the current workdir, so -o should always be set explicitly (we must parse compiler options anyway ...)

I am currently experimenting with building our internal libraries and tools with conan and just realized that effectively using sccache in this environment would need support for something like multiple base directories or something like path ignore patterns.

looks good on paper. how do we handle collisions?

to compare, nix solves this with a build sandbox (/tmp/nix-build-name-version-drv-0 appears as /build)
and by storing compiled dependencies in /nix/store.
but this is overhead for large projects and many small iterations

Question: Was it ever considered to use sha256 sums for the files instead of their full paths?

the compiler needs the actual path (relative or absolute), since #include "path.h" is relative to the source file.
but then we could add $(dirname $sourceFile) to the include paths ...
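
That workaround amounts to deriving an extra include flag from each source file's directory, e.g. (hypothetical sketch, `extra_include_flag` is my name):

```rust
use std::path::Path;

// Hypothetical sketch: if inputs were identified by content rather than
// path, the directory of each source file would still need to be added
// as an include path so `#include "local.h"` resolves.
fn extra_include_flag(source: &Path) -> Option<String> {
    source.parent().map(|d| format!("-I{}", d.display()))
}

fn main() {
    assert_eq!(
        extra_include_flag(Path::new("/home/me/src/main.c")),
        Some("-I/home/me/src".to_string())
    );
}
```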

We use --remap-path-prefix in all of our projects so we always produce reproducible builds. It's unfortunate that this prevents sccache from being useful.

khssnv commented

Could anyone share how to use --remap-path-prefix to avoid the absolute paths problem when using sccache with local disk storage? Thanks!

Sccache would be really helpful for local builds of Rust projects, especially when you have a great many with common dependencies (🙋‍♂️) and especially in a team environment (🙋‍♂️).

Unfortunately the lack of this feature makes that quite painful in practice. So I'd love to better understand its status. In particular, is it stalled because:

  • There's just nobody with time to implement it who is also invested in this use case?
  • There are difficult design questions that don't yet have clear answers?
  • There's some kind of technical or philosophical blocker to it being accepted by the primary maintainers?

I'm just having trouble figuring out which one (or N) it is.

Thanks! 🙇‍♂️

Rust projects, especially when you have a great many with common dependencies (🙋‍♂️) and especially in a team environment (🙋‍♂️).

While sccache is awesome, there are a few caveats with Rust and it is not all 🦄 unfortunately.
I think it is good to be aware of those before hoping it solves all the problems (sorry 🤗).

In a nutshell and unless you take good care of aligning things, sccache will likely not work as you'd expect if:

  • members of your team use different OS
  • members of your team have different user accounts (I bet it is the case...)
  • even using the same OS, it will likely NOT work unless you take special measures, on macOS for instance (i.e. creating a hard link and letting Rust use that location for all members)
  • If using a mix of Rustc versions, sccache will not help

That's a lot of caveats, and in the meantime the Rust compiler has improved a lot. Nowadays, unless you have a very specific env that ticks all the boxes above, you will likely be better off not using sccache.
I am saying this based on tests I made in 2021, testing locally on macOS with Redis, minio and memcached.

That may explain the low activity.

In the end, if sccache works for you, great, if not, you can also check cargo remote, which can also help for teams.

In a nutshell and unless you take good care of aligning things, sccache will likely not work as you'd expect if:

In my case all these things are in fact aligned except for user accounts and the other reasons for paths to differ. Is there a problem with having different user accounts other than it making paths different?

As for differing paths, I had thought that is what this whole issue is about. Perhaps I misunderstood?

I'm thinking of making my own rustc wrapper that takes care of setting up an environment for sccache (files in a consistent location, maybe even doing a chroot or something if necessary) that ensures a cache hit whenever it should be possible. It might end up being the simplest path forward, as it sidesteps the questions about what is the correct way to handle all the messy details of path remapping.

In my case all these things are in fact aligned except for user accounts and the other reasons for paths to differ. Is there a problem with having different user accounts other than it making paths different?

Yes, this is why I mentioned it :)
The cache won't hit if, e.g., Alice has projects under her $HOME/foo and Bob under his $HOME/foo.
On Linux, it is rather simple (technically) if you can align all users on having their projects in something like /path/to/projects and not $HOME/some/path.

On macOS it is less trivial, since macOS does NOT let users create hard links, especially not at the root, unless you do some trickery that can be dangerous (i.e. temporarily brick your disk, ask me how I know...).

I'm thinking of making my own rustc wrapper ...

That sounds interesting and I would love to hear about the journey. I definitely do NOT mean to discourage you, just to set expectations so you know that a few things that sound trivial... will not be. I am not saying you cannot get it to work if indeed your team is not too "wildly spread" regarding how the envs are set up.

the simplest path forward,...

One of the main issues you will face is that Alice and Bob will not use (all the time) the same Rust version. They may even both be on nightly, but Alice updates in the morning and Bob in the evening. That's however still an easy and good case: Alice will build up the cache and Bob will benefit.

Issues arise if, for whatever reason, those users decide to use different versions, and it is hard to force a team (presumably working on several projects...) to use a unified and synchronized version of Rust. Not impossible, and some tools can help, but it's often not the case by default.

That sounds interesting and I would love to hear about the journey.

Well, I guess if I have any success you're bound to hear about it. 😀 (But yes, I would report back here as well.) I have some experience building vaguely related sorts of tools, so I expect complications but am also confident I could find ways around them. It's more a question of getting around to it... (very limited time right now)

Rust version

In my case this is simple โ€” pretty much everything is in a monorepo with a single pinned version that gets bumped by a bot (via a pull request) when a new stable Rust release come out. We used to have some exceptions, but I'm pretty sure we deleted the last one recently. In this regard at least, you could say that we are playing on easy mode due to the rigid homogeneity of our dev environments.