leanprover/lean4

Nix builds have empty githash

Opened this issue · 14 comments

Prerequisites

  • Put an X between the brackets on this line if you have done all of the following:
    • Check that your issue is not already filed.
    • Reduce the issue to a minimal, self-contained, reproducible test case. Avoid dependencies to mathlib4 or std4.

Description

When installing lean4 via a flake, it's version string does not contain a git sha1 hash. When hashing lean files (for downloading mathlib's cache), these hashes depend on the git commit hash. Thus using the mathlib cache fails when using a lean/lake through nix.

Proposed Solution

I wanted to open a PR immediately but did not do so since the guidelines request an issue is opened first. I have a patch here that adds the git hash into CMakeLists.txt before building. There is some precedence for this.

Context

Here is a Zulip discussion and an investigation of the issue.

Steps to Reproduce

When installing lean4 via its nix flake, it's version string does not contain a git sha1 hash. For example, entering a devshell with this flake.nix file

{
  description = "...";
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixpkgs-unstable";
  inputs.flake-utils.url = "github:numtide/flake-utils";

  inputs.lean4.url = "github:leanprover/lean4/v4.7.0";

  outputs = { self, nixpkgs, flake-utils, lean4 }:
    flake-utils.lib.eachDefaultSystem (system:
      let
        pkgs = nixpkgs.legacyPackages.${system};
        lean4-pkgs = lean4.packages.${system};
      in {
        devShells.default = pkgs.mkShell {
          packages = [ lean4-pkgs.lean-all lean4-pkgs.vscode ];
        };
      });
}

and running lean --version produces:

Lean (version 4.7.0, x86_64-unknown-linux-gnu, Release)

Notice there is no commit in the version.

Expected Behavior

A properly built lean binary outputs the following:

Lean (version 4.8.0, x86_64-unknown-linux-gnu, commit 23a038925adf, Release)

Impact

Add 👍 to issues you consider important. If others are impacted by this issue, please ask them to add 👍 to it.

Just adding the git hash here would defeat most of benefits of using nix, such as a cache hit if you build identical sources.

I'd propose to implement it differently: a nix'ed lean binary ought to look in its own nix store path, and use the nix hash therein as the “githash”. This should have the uniqueness guarantees lake wants, without causing unnecessary rebuilds.

Happy to review a PR

Just adding the git hash here would defeat most of benefits of using nix, such as a cache hit if you build identical sources.

I think this is not an option because Lean.githash is a value available to code compiled using the generated lean compiler. Not using the right githash causes these programs to behave incorrectly, and the githash is also embedded in .olean files and lean will not load files with the wrong githash. (I think this design is too strict, but) this is what we have, and it means that you can't really produce a correct lean build if it doesn't have the right githash. Maybe you can get away with this for dev builds but definitely not releases.

What is “this” you are referring to? You only quote the first sentence, but that’s not what I am proposing.

What is the correctness requirement for the githash value? Is the following sound: “If the sources that were used to built the lean executable differe, then the githash value differs” (I assumed it was).

So if the lean binary itself knows “a hash of all sources used to build it it” (which the nix store path provides), and the lean binary uses that hash as the “githash” everywhere (in oleans, as the value of Lean.githash, in lean --version), then all should be well, shouldn’t it? Or what could break?

Just adding the git hash here would defeat most of benefits of using nix, such as a cache hit if you build identical sources.

How would you have identical sources if the git hash is different? I'm probably misunderstanding this sentence.

What is the correctness requirement for the githash value?

The mathlib cached .olean files have a hash that is derived from the git hash of lean that was used to build them. If we were to use some nix hash as the "git hash", users of lean installed with nix would never be able to use the mathlib cache. This nix-hash lean would request files with a hash that do not exist in the remote cache.

Additionally, flakes expose the git hash used to build them through self.rev.

How would you have identical sources if the git hash is different? I'm probably misunderstanding this sentence.

git commit --allow-empty -m "New commit"

or some form of rebasing.

If we were to use some nix hash as the "git hash", users of lean installed with nix would never be able to use the mathlib cache.

Yes, that’s what I would expect: It’s a different build of lean and it seems actively dangerous to use the same build artifacts. So sounds like a feature to me, not a bug.

Additionally, flakes expose the git hash used to build them through self.rev.

Right, but that’s dangerous. Imagine someone using such a flake, but then applies a patch using .override. If we use the git hash rather than something derived from the nix derivation hash they will get cache hits when they (probably) shouldn’t.

Fair enough, I think this is enough rationale not to pursue adding the git hash like I proposed. Since nix installs are not technically supported, nix users can patch in the git hash if they wish.

Ah, I missed that the primary motivation was compatibility with mathlib’s cache (as opposed to, say, lake build rebuilding when necessary), sorry for that.

For the reasons above I am hesitant to make nix builds out of the box pretend that they equivalent to some particular release build.

Does setting LEAN_GITHASH on your end help? It is taken into account by lake, not sure if cache also uses it. It still is a bit tedious to find out the release’s hash and set the environment variable, but at least it would provide a way out for those who understand what they are doing.

Is LEAN_GITHASH read during runtime? Setting it to a garbage value and doing lake update still downloads the cache successfully (making me think it's not being used in the way I expect).

It is a lake concept, and a new one at it. Judging from the cache source it is currently not heeded. One could argue that maybe it should.

Yeah that would simplify using the cache through a nix install. The devshell could then export LEAN_GITHASH locked to the lean4 revision that was used as an input to the flake.

I think we just need to change this line to have Lean.gitHash be overridden by an env var if present.

Exactly. (Maybe a bit more robust to use a wrapper around the lake binary to set the environment variable – you wouldn’t want it to apply just because you happen to be in one project’s devshell but you run some lean from somewhere else.)

Yeah but I think the mathlib cache code runs lake (and lean) through elan, instead of letting the current PATH find those binaries.

It is important that cache uses Lean.githash i.e. the compile-time-known hash, rather than one produced from runtime data, because lean itself also checks the githash (it is stored in olean files) and will produce the "invalid header" message otherwise. So even if you hack cache to not check this value, it will result in cache downloading files that lean refuses to use.

Doesn't lean adhere to LEAN_GITHASH? The original use case for this is for developers to use a different version of lean without rebuilding everything (which is of course risky, but also very useful when debugging), so I'd assume it would take precedence over the value compiled in at build time. Away from the computer right now, so can't check easily.