coreinfrastructure/best-practices-badge

Give guidance on reproducible builds

david-a-wheeler opened this issue · 12 comments

Some projects have raised concerns about challenges meeting the build_reproducible gold criterion. The purpose of this criterion is to counter malicious builds, as happened in SolarWinds' Orion, by enabling verifiable reproducible builds. We still want to counter that attack, but we may be able to relax the requirement slightly without losing that protection:

  • Many projects don't release built software at all (e.g., the Linux kernel, Apache Software Foundation projects). In those cases this criterion can be marked as N/A. Such a project may still explain how to do verified reproducible builds (and get credit for it), but it should be allowed to say N/A.
  • Many projects struggle only because timestamps differ when there's a rebuild. Those produce differences in bit-for-bit comparisons, but I don't see how such date/timestamp differences by themselves lead to subverted software (there would have to be something else to act as a trigger). Timestamp differences are one of the most common causes of non-reproducibility, and since those differences by themselves aren't security-relevant, it seems overly harsh to demand their elimination. Eliminating them is still worth doing because it makes later comparison easier, but requiring it is probably overly harsh. (The sketch after this list shows how a timestamp alone breaks a bit-for-bit comparison.)
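To make that concrete, here is a minimal Python sketch (not part of the proposed criterion text; the archive contents are invented for illustration) showing that an embedded timestamp alone breaks bit-for-bit comparison, and that pinning it restores determinism:

    # Minimal sketch: the same payload zipped twice differs bit-for-bit
    # when entry timestamps float, and matches when they are pinned.
    import hashlib
    import io
    import time
    import zipfile

    def build(pinned: bool) -> bytes:
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            # A fixed date_time makes the entry metadata deterministic;
            # the current local time does not.
            stamp = (2023, 1, 1, 0, 0, 0) if pinned else time.localtime()[:6]
            zf.writestr(zipfile.ZipInfo("hello.txt", date_time=stamp), "hello\n")
        return buf.getvalue()

    sha = lambda b: hashlib.sha256(b).hexdigest()
    print(sha(build(True)) == sha(build(True)))   # True: reproducible
    a = build(False)
    time.sleep(2)            # ZIP timestamps have two-second resolution
    b = build(False)
    print(sha(a) == sha(b))  # False: only the embedded timestamp differs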

So under the `build_reproducible` gold criterion, modify:

        description: >-
          The project MUST have a <a href="https://reproducible-builds.org/">reproducible
          build</a>. If no building occurs (e.g., scripting languages
          where the source code is used directly instead of being
          compiled), select "not applicable" (N/A).

Change the second sentence to read:

If the project does not release built results (such as an executable, package, or container), but instead only releases unbuilt source code, the project MAY select "not applicable" (N/A).

        details: >-
          A reproducible build means that multiple parties can
          independently redo the process of generating information
          from source files and get exactly the same bit-for-bit
          result.  In some cases, this can be resolved by forcing
          some sort order. JavaScript developers may consider
          using npm shrinkwrap and webpack OccurenceOrderPlugin.
          GCC and clang users may find the -frandom-seed option
          useful. The build environment (including the toolset)
          can often be defined for external parties by specifying
          the cryptographic hash of a specific container or virtual
          machine that they can use for rebuilding. The <a href="https://reproducible-builds.org/docs/">reproducible
          builds project has documentation on how to do this</a>.

Change "result" to "built result", and replace the final period with:

, for example, by adding sorts to enforce deterministic input ordering and setting date/timestamp values. For purposes of this badge, a project MAY consider a result a reproducible build if it produces the same bit-for-bit results except for timestamps. That is because it can be difficult to force consistent timestamps in some build environments and such differences typically cannot be used for attack (without some other additional subversion).

I think it would be a bad idea to water down the reproducibility criterion by permitting certain classes of differences. To put it glibly: either a checksum matches or it doesn't.¹

Furthermore, eliminating timestamp pollution from the artifacts of a complex build process is often the absolute lowest-hanging fruit on the way to a stable reproducible environment. There are many more tedious aspects to getting all the bits lined up just right in a way that can be replicated perfectly by others. So explicitly excluding timestamps from the matching criteria wouldn't do much to help projects claim build reproducibility.

I do sympathize with projects that don't themselves publish binary artifacts, for various reasons². However, I think this could be addressed in other ways, such as by allowing community-driven reproducibility projects to confer gold status through some sort of trusted consensus mechanism. Free GitHub Actions minutes for OSS projects would go a long way towards providing ready-made infrastructure for collaborative "build verification" services for various platforms.

Build reproducibility is becoming a cornerstone of security (see the recent US DoD Securing the Software Supply Chain: Recommended Practices for Developers). I think it should remain part of the 🥇 gold standard of this project, or else be bumped up to some new 💎 "diamond standard".

Also, I do think that a watered-down goal of "an attempt at reproducibility", with some exceptions, might make a good addition to the 🥈 silver criteria.

Footnotes

  1. RE: “bit-for-bit results except for timestamps”: a seemingly-random array of timestamp values in an executable binary could also be crafted as a series of operations that trigger a buffer overflow.

  2. Although Apache generally does, at least for all the Java projects I've contributed to there over the years, such as these downloads that include binary artifacts and published checksums.

Many projects don't release built software at all (e.g., the Linux kernel, Apache Software Foundation projects). In those cases this criterion can be marked as N/A. Such a project may still explain how to do verified reproducible builds (and get credit for it), but it should be allowed to say N/A.

I think we should differentiate between reproducible builds for software that only distributes source code and reproducible builds for software that also distributes binaries. If the project distributes binaries (for example, attached to a GitHub release, but also Docker images), it should also publish documentation on how to build that exact binary, bit-for-bit identical, from source. Projects like i-probably-didnt-backdoor-this, Tails, and bitcoin-core are doing it successfully, but there are very few standard tools available for this (there's lots of talk around SBOMs, for example, but very few tools to set up a build environment based on an SBOM).

For projects that only distribute source code, there is no binary that can be reproduced, but the build should still be required to provide stable, deterministic output to be considered build-reproducible. A downstream Linux distribution can then take care that the binary it builds and ships can be reproduced. The Linux kernel specifically is currently very non-trivial to build reproducibly; Arch Linux and Debian are both struggling with it, so I don't think it should be rated build_reproducible or N/A, for example.

I don't think upstream projects that are known to produce binaries/artifacts that are difficult to secure further down the supply-chain should be allowed in Gold Tier.

Many projects struggle only because timestamps differ when there's a rebuild. Those produce differences in bit-for-bit comparisons, but I don't see how such date/timestamp differences by themselves lead to subverted software (there would have to be something else to act as a trigger). Timestamp differences are one of the most common causes of non-reproducibility, and since those differences by themselves aren't security-relevant, it seems overly harsh to demand their elimination. Eliminating them is still worth doing because it makes later comparison easier, but requiring it is probably overly harsh.

Normalizing timestamps in a build is a fairly trivial issue; it's much easier to just fix the differences there than to write programs that try to tell benign differences and underhanded backdoors apart with 100% reliability. Needing manual intervention to inspect diffs should be the exception, not the norm, for software with the build_reproducible gold criterion.
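(As an illustration of how small the fix usually is, here is a rough Python sketch of a packaging step that honors the SOURCE_DATE_EPOCH convention from reproducible-builds.org; the output path and input files are invented:)

    # Sketch of a packaging step honoring SOURCE_DATE_EPOCH
    # (https://reproducible-builds.org/docs/source-date-epoch/).
    import os
    import time
    import zipfile

    # Use the externally supplied epoch; fall back to the wall clock only
    # when none is given (i.e., a non-reproducible developer build).
    epoch = int(os.environ.get("SOURCE_DATE_EPOCH", time.time()))
    date_time = time.gmtime(epoch)[:6]

    with zipfile.ZipFile("dist/example.zip", "w") as zf:
        for name in ["src/a.py", "src/b.py"]:   # invented input files
            info = zipfile.ZipInfo(name, date_time=date_time)
            with open(name, "rb") as f:
                zf.writestr(info, f.read())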

@marcprux hit a lot of important points, thanks!

Doing something "bit-for-bit identical except for ..." presents pragmatic challenges:

  • a more complicated verification process becomes a software development project unto itself
  • it is more error-prone, which could lead to false positives, false negatives, etc.
  • it might actually be easier to fix projects than to develop the tooling to extract the bits you want to verify

In short, it is trivial to compare two artifacts in full; comparing only parts of two artifacts presents a whole world of difficulties.
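(A quick sketch of that asymmetry, with paths invented for illustration:)

    # Comparing two whole artifacts is one hash check.
    import hashlib

    def digest(path: str) -> str:
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    print(digest("upstream/release.tar.gz") == digest("rebuild/release.tar.gz"))

    # Comparing "everything except timestamps", by contrast, needs a
    # format-aware parser for every artifact type (tar, zip, ELF, ...)
    # just to locate and mask each timestamp field before hashing.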

I would strongly caution against using "reproducible builds" in any way other than https://reproducible-builds.org/docs/definition/ which really comes down to bit-for-bit reproducible without exception.

The projects that I use from the Apache Software Foundation, such as Apache Maven and Apache NetBeans, publish a "convenience binary" along with the source release, all under the Apache name. Apache NetBeans even publishes a Snap package binary. The Maven build is reproducible, but the NetBeans build has quite some way to go before being reproducible.

I have found it surprisingly, and frustratingly, difficult to get changes related to reproducible builds accepted by upstream projects. Holding the Apache Software Foundation to a different standard than other open-source projects just makes that even more difficult. It would remove my incentive to make the changes and one of their incentives to accept them.

I would prefer the meaning of reproducible builds to remain bit-for-bit identical, including timestamps. Even for organizations that truly publish only a source release, one could argue that they should have the gold badge only if that source can be built in a reproducible manner.

  • a more complicated verification process becomes a software development project unto itself

I will second this, @vagrantc. My own App Fair process creates verifiably reproducible iOS apps, but it is a constant struggle against ever-changing versions of the tools generating indeterminate output in insidious new ways (e.g., due to changes in Xcode's compiler parallelization). My main motivation for spending all these hours on tedious devops debugging is the stretch goal that these apps eventually achieve gold certification, and thereby serve as paragons of trust to the mobile community.

As a strong supporter of bit-for-bit integrity without compromises: when developers report challenges producing software that adheres to some evaluation criterion (be it testability, security, performance, or something else), it often helps to introduce tooling improvements so that flaws (and opportunities) can be detected earlier in the assembly process (for example, the "shift security left" mantra, and similarly continuous integration in general).

I have a sense that the challenges many developers experience stem from the fact that we generally have to inspect the output of builds (diffs, sometimes binary) to identify where non-reproducible elements have appeared, and then perform sometimes mentally challenging detective work to theorize about and evaluate what could have caused those artifacts to appear.

Hermetic build environments that can detect changes as soon as they're introduced during assembly could, I think, be an area where improved tooling might help stem the introduction of non-reproducible elements early during development, in a way that could be largely ecosystem-agnostic, and help win developer mindshare.

There could be practical challenges in implementing fail-fast hermetic builds (are filesystem reads/writes the unit of integrity? is language-level and/or IDE-level support required? how would ephemeral files and tempfiles be handled?) but I think they're manageable. And similar to test-driven development: not everyone will want to adopt early detection, since it would add development friction, but for those who understand its value as an investment, the benefits should be clear.
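(A rough, ecosystem-agnostic sketch of the simplest form of early detection, building twice and failing fast on any difference; the build command and output directories are invented, and real tools such as reprotest also vary time, locale, and build paths between runs:)

    # Build twice and fail fast on any nondeterminism.
    import hashlib
    import pathlib
    import subprocess

    def build_and_hash(outdir: str) -> dict:
        subprocess.run(["make", "dist", f"DESTDIR={outdir}"], check=True)
        hashes = {}
        for p in sorted(pathlib.Path(outdir).rglob("*")):
            if p.is_file():
                rel = str(p.relative_to(outdir))
                hashes[rel] = hashlib.sha256(p.read_bytes()).hexdigest()
        return hashes

    first, second = build_and_hash("out1"), build_and_hash("out2")
    bad = {k for k in first.keys() | second.keys() if first.get(k) != second.get(k)}
    if bad:
        raise SystemExit(f"non-reproducible outputs: {sorted(bad)}")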

(Sorry for sidetracking a bit. Again, I would reiterate that there shouldn't be exceptions for timestamps: it's not clear what even counts as "simply a timestamp" at the binary level; it would open the door to the very risks that reproducible builds are intended to solve; and there are unanswered questions about how integrity verification could be performed on content that fundamentally differs. Others have alluded to all of these points. However, I want to state both my support and my suggestion that there may be solutions to address these concerns.)

Bubu commented

Many projects struggle only because timestamps differ when there's a rebuild. Those produce differences in bit-for-bit comparisons, but I don't see how such date/timestamp differences by themselves lead to subverted software (there would have to be something else to act as a trigger). Timestamp differences are one of the most common causes of non-reproducibility, and since those differences by themselves aren't security-relevant, it seems overly harsh to demand their elimination.

I think this isn't a good idea, because it can lead to the dangerous and false assumption that a rebuild with only a different embedded timestamp can be considered identical in behaviour. But any binary could change its behaviour when a specific timestamp is embedded. Yes, that would be visible in the source, but it might be intentionally hidden as well.

So while it's easy to go the last step once we reach the "only the timestamp is different" level of almost-reproducibility (easiest: replace the timestamp in the binary you just built with the one from the original), this step is just as crucially important as all the others, so it can't get any special treatment.
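(A sketch of that last step, with a hypothetical timestamp location; in practice the offset must come from parsing the actual binary format, which is itself the hard part:)

    # Copy the original's embedded timestamp into the rebuild, then
    # demand full bit-for-bit equality. TS_OFFSET and TS_LEN are
    # hypothetical; a real tool must derive them from the binary format.
    TS_OFFSET, TS_LEN = 0x88, 4

    original = bytearray(open("original.bin", "rb").read())
    rebuilt = bytearray(open("rebuild.bin", "rb").read())
    rebuilt[TS_OFFSET:TS_OFFSET + TS_LEN] = original[TS_OFFSET:TS_OFFSET + TS_LEN]
    assert bytes(rebuilt) == bytes(original), "differences beyond the timestamp"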

(The Android app world has a related problem with its embedded signatures, which you can never reproduce except by copying the signature from the original binary into your rebuild as a last step. But without this step, and with a different signature, the app is expected to behave differently in many scenarios.)

pabs3 commented

While I understand the desire for reproducible builds, in cases such as Java, where timestamps are introduced into the zip-file archives (.jar, .war, and .ear files, for those not familiar with Java), blasting the timestamps so they are all set deterministically can actually cause useful information to be lost.

Case in point: soon after Oracle acquired Sun Microsystems, pretty much every one of Oracle's patch release notes for Java (including some of their corresponding CVE descriptions) was of the intentionally vague form "multiple unspecified vulnerabilities were patched", or some such BS. My management would ask me, "Would you please analyze the patches and tell us if there's anything we urgently need to patch?" (This was way before SCA tools, BTW.) So I would extract all the .class files from (typically) the rt.jar and look at the modification timestamps to see which had been updated since the last patch release we were using. Then I'd decompile those .class files, do the same for the .class files from the corresponding jar of the previous patch release, and finally diff the two versions to see what Oracle had actually fixed. (Don't miss that work at all!) However, had those timestamps all been identical because of deterministic reproducible builds, that task would have taken a hundred times or so longer.
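(A rough sketch of that detective work in Python, with jar paths invented for illustration:)

    # List jar entries whose embedded timestamps changed between two
    # releases; these are the candidates for decompiling and diffing.
    import zipfile

    def entry_dates(path: str) -> dict:
        with zipfile.ZipFile(path) as zf:
            return {i.filename: i.date_time for i in zf.infolist()}

    old = entry_dates("rt-previous.jar")
    new = entry_dates("rt-patched.jar")
    changed = sorted(n for n, d in new.items() if old.get(n) != d)
    print("\n".join(changed))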

So while there are times when deterministic, reproducible builds might be useful (they never will be unless people decide to verify all of that in their CI/CD pipelines, which I think most companies will be reluctant to do because of the build-time resource commitment involved), IMO, for most cases, they bring very little added value.

Just my $.02.

... blasting the timestamps so they are all set deterministically can actually cause useful information to be lost.

Having reproducible builds does not preclude incremental updates to Java archives. It's just that the dates of the old and new class files would be meaningful, such as their separate release dates. OpenJDK builds don't use such incremental updates anymore, but they could, and they could do so in a reproducible manner, allowing your detective work to go on as before.

Reproducible builds are about blasting away all the useless, meaningless differences: the timestamps of files created during the build, the unsorted order of files in their directories, or the random build paths used in a transient container. When the useless differences are removed, the meaningful differences can be found.
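(Unsorted file order is a concrete example: directory iteration order is filesystem-dependent, so a reproducible archiver must impose its own ordering. A rough Python sketch, with paths invented for illustration:)

    # Force deterministic archive order and strip builder-specific
    # metadata; mtime could instead come from SOURCE_DATE_EPOCH.
    import os
    import tarfile

    def normalize(ti: tarfile.TarInfo) -> tarfile.TarInfo:
        ti.mtime = 0
        ti.uid = ti.gid = 0
        ti.uname = ti.gname = ""
        return ti

    with tarfile.open("dist/source.tar", "w") as tf:
        for root, dirs, files in os.walk("src"):
            dirs.sort()                 # deterministic descent order
            for name in sorted(files):  # deterministic entry order
                tf.add(os.path.join(root, name), filter=normalize)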

... IMO, for most cases, they bring very little added value.

Oh, but its value to OpenJDK is already apparent, even though its build has been reproducible only since May. For just one example, this old Javadoc bug, only tangentially related to reproducible builds, would have been impossible to find, and its fix impossible to verify, without the easy ability to create bit-for-bit identical builds.

While I understand the desire for reproducible builds, in cases such as Java, where timestamps are introduced into the zip-file archives (.jar, .war, and .ear files, for those not familiar with Java), blasting the timestamps so they are all set deterministically can actually cause useful information to be lost.

Case in point: soon after Oracle acquired Sun Microsystems, pretty much every one of Oracle's patch release notes for Java (including some of their corresponding CVE descriptions) was of the intentionally vague form "multiple unspecified vulnerabilities were patched", or some such BS. My management would ask me, "Would you please analyze the patches and tell us if there's anything we urgently need to patch?" (This was way before SCA tools, BTW.) So I would extract all the .class files from (typically) the rt.jar and look at the modification timestamps to see which had been updated since the last patch release we were using. Then I'd decompile those .class files, do the same for the .class files from the corresponding jar of the previous patch release, and finally diff the two versions to see what Oracle had actually fixed. (Don't miss that work at all!) However, had those timestamps all been identical because of deterministic reproducible builds, that task would have taken a hundred times or so longer.

If the timestamps are not deterministic, they could very well be entirely arbitrary: you might end up with the timestamps of whatever checkout the build of those class files happened to be performed on, whatever timestamp the developer happened to have at the time, or whatever wonky clock was used, which would actually prevent you from comparing the timestamps in the way you described...

Clamping the timestamps to the last source change, or some other meaningful timestamp, will more reliably get you the feature you described, presuming the other files actually retain meaningful timestamps (last modification in the VCS, for example, rather than whatever happened to be the on-disk time), and it prevents files generated during the build from needlessly differing. And if they don't preserve meaningful timestamps, then you're no worse off than you were.

No need to blindly reset them if the process otherwise maintains meaningful timestamps; embedding the current clock time will nearly always require a maximally detailed process of comparison.
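(A minimal sketch of that clamping, exporting SOURCE_DATE_EPOCH from the last commit time so that SOURCE_DATE_EPOCH-aware tools embed a meaningful timestamp instead of the wall clock; the "make dist" build command is invented for illustration:)

    # Clamp build timestamps to the last source change.
    import os
    import subprocess

    commit_time = subprocess.run(
        ["git", "log", "-1", "--pretty=%ct"],   # committer date, unix epoch
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    env = dict(os.environ, SOURCE_DATE_EPOCH=commit_time)
    subprocess.run(["make", "dist"], check=True, env=env)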

So while there are times when deterministic, reproducible builds might be useful (they never will be unless people decide to verify all of that in their CI/CD pipelines, which I think most companies will be reluctant to do because of the build-time resource commitment involved), IMO, for most cases, they bring very little added value.

Nothing is useful unless people actually try to do it, true.

If we are talking about a best-practices gold standard, well, let us not set the sights too low either. Some things are harder and take more effort; the difference between "this project follows all known best practices", "this project follows many best practices", and "this project follows some best practices" should be reflected in the levels.

dkg commented

Just chiming in here to discourage any relaxation of the gold standard. The gold standard should be clear: bit-for-bit identical reproducibility. Please do not carve out subtle exceptions for variable timestamps.

For a project that distributes only source code artifacts, I still think it's worth asking during the review whether the generated artifacts used by end users can be built reproducibly. Obviously, we don't want to require source-only software projects to distribute binaries, but presumably the developers do actually have some practice building some user-facing artifacts. Such a project should be able to concisely describe a particular toolchain and set of compilation/configuration options and dependencies that are known to provide a reproducible build covering a substantial portion of the codebase.