OWASP/Software-Component-Verification-Standard

lvl 2 and lvl 3 are impossible due to requiring both reproducibility and non-reproducibility of SBOMs

Opened this issue · 11 comments

2.2 "SBOM creation is automated and reproducible" means the SBOM must be reproducible, which is a good requirement for lvl2 and lvl3.
2.7 "SBOM is timestamped" requires a timestamp at every level.

Timestamps are the bane of reproducibility.

Timestamps make the biggest source of reproducibility issues. Many build tools record the current date and time. The filesystem does, and most archive formats will happily record modification times on top of their own timestamps. It is also customary to record the date of the build in the software itself…

Timestamps are best avoided

https://reproducible-builds.org/docs/timestamps/

I can understand the desire for a timestamp, but if it's included there need to be details around the idea of SOURCE_DATE_EPOCH (a timestamp based on the last modification to any of the source, or some other fixed timestamp). It needs to be clearly explained that this is allowed for the timestamp, and is in fact required once you require reproducibility for lvl2+.

https://reproducible-builds.org/docs/source-date-epoch/
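To illustrate the SOURCE_DATE_EPOCH idea, here is a minimal Python sketch of how an SBOM generator could pick its timestamp. The function name `sbom_timestamp` and the fallback to the current time are assumptions made for illustration, not something SCVS or any particular SBOM tool prescribes.

```python
import os
from datetime import datetime, timezone


def sbom_timestamp(environ=None):
    """Return an ISO 8601 UTC timestamp for the SBOM (hypothetical helper).

    If SOURCE_DATE_EPOCH is set, it is used, so repeated runs over the same
    source yield the same value. Otherwise the current time is used, which
    is not reproducible.
    """
    environ = os.environ if environ is None else environ
    epoch = environ.get("SOURCE_DATE_EPOCH")
    if epoch is not None:
        # SOURCE_DATE_EPOCH is a count of seconds since the Unix epoch.
        moment = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    else:
        moment = datetime.now(tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


# Two runs with the same SOURCE_DATE_EPOCH produce the same timestamp.
fixed = {"SOURCE_DATE_EPOCH": "86400"}
print(sbom_timestamp(fixed), sbom_timestamp(fixed))
```

With this approach the SBOM still carries a timestamp, but the value is derived from the source rather than from the wall clock at generation time.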

Tangentially related to #9 request for more explanation

SCVS does not reference the reproducible builds project as that project is counter to some of the requirements in SCVS.

It is not the intent of 2.2 to lead readers to assume reproducible builds. The word reproducible should likely be changed to repeatable to avoid confusion. The word repeatable is used elsewhere in the standard as well.

Properly created SBOMs are not compatible with reproducible builds because of the use of timestamps, detailed provenance information (which is nearly impossible to be reproducible), and, depending on the format, serial numbers.

As stated recently by NTIA:

Timestamp: Record of the date and time of the SBOM data assembly

and

Timestamp records when the data is assembled -- the point of the SBOM creation. These further
support the origin of the data, and help identify updated versions of the SBOM. These data fields
provide context to the SBOM data source, and can potentially be used to make trust
determinations.

https://www.ntia.gov/files/ntia/publications/sbom_minimum_elements_report.pdf

I think 2.7 can likely be more descriptive and incorporate Record of the date and time of the SBOM data assembly from the NTIA document so that 2.7 is less vague in meaning.

Thanks for the reply.

I find that repeatable here is just another word for reproducible: it's reproducible, excluding the timestamp. If that's the position, I think just saying so is the best option.

I'd be interested to hear some details on how the timestamp supports the origin of the data.

I don't quite understand "help identify updated versions of the SBOM". I might be misunderstanding the goal of that statement but is there not a schema version that helps with this?
Or is this to account for cases where the SBOM is generated from a version range such as "^1.0.4", which is not actually pinned and allows "Minor releases: 1 or 1.x or ^1.0.4"? E.g. you generate the SBOM for the same source code, but at time X the latest version of a dependency was 1.0.4, and now 1.1.8 is available and within the target range?
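The range-drift scenario can be sketched in a few lines of Python. This is a simplified caret matcher written for illustration only: it assumes plain `major.minor.patch` versions with major >= 1 and ignores pre-release tags and the special rules for 0.x versions. The helper names are hypothetical.

```python
def parse(version):
    """Split 'major.minor.patch' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_caret(version, base):
    """True if version is within ^base (simplified: major >= 1 only)."""
    v, b = parse(version), parse(base)
    return v[0] == b[0] and v >= b


def resolve(available, base):
    """Pick the highest available version allowed by ^base."""
    candidates = [v for v in available if satisfies_caret(v, base)]
    return max(candidates, key=parse)


# Same source code and same "^1.0.4" range, resolved at two points in time.
at_time_x = resolve(["1.0.3", "1.0.4"], "1.0.4")
later = resolve(["1.0.3", "1.0.4", "1.1.8", "2.0.0"], "1.0.4")
print(at_time_x, later)
```

The two resolutions differ even though the source is unchanged, which is one case where the generation time of the SBOM carries real information.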

I'll try to make a proper writeup around timestamps at some point.

If we have to include timestamps in the SBOM itself, and not in some sort of attestation where the SBOM is reference material, it will be hard to content-address and validate SBOMs by hash when we are trying to check things like reproducible builds.
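A rough Python sketch of that separation: the SBOM body is hashed on its own, and the timestamp lives in a separate attestation that refers to the SBOM by digest. The field names (`subject_sha256`, `generated_at`) and the canonicalization (sorted-key compact JSON) are assumptions chosen for illustration, not taken from any SBOM or attestation format.

```python
import hashlib
import json


def content_digest(sbom):
    """SHA-256 over a canonical JSON encoding of the SBOM body."""
    canonical = json.dumps(sbom, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attest(sbom, timestamp):
    """Wrap the SBOM digest and the generation time in a separate record."""
    return {"subject_sha256": content_digest(sbom), "generated_at": timestamp}


# Hypothetical SBOM body with no timestamp inside it.
sbom = {"components": [{"name": "example-lib", "version": "1.0.4"}]}

first = attest(sbom, "2021-08-01T00:00:00Z")
second = attest(sbom, "2021-09-01T00:00:00Z")

# The attestations differ, but both point at the same content address.
print(first["subject_sha256"] == second["subject_sha256"])

# Embedding the timestamp in the body changes the digest on every run.
embedded = dict(sbom, timestamp="2021-08-01T00:00:00Z")
print(content_digest(embedded) == content_digest(sbom))
```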

Properly created SBOMs are not compatible with reproducible builds because of the use of timestamps, detailed provenance information (which is nearly impossible to be reproducible), and, depending on the format, serial numbers.

I am genuinely interested how provenance information is nearly impossible to reproduce. Do you have an example?

@loewenstein

I am genuinely interested how provenance information is nearly impossible to reproduce. Do you have an example?

One example may be DNS resolution to a repository that supplies open source components. There's no guarantee you'll be fetching components from the same mirror every time a build happens. An adversary may simply use a mirror they control for recon while still providing artifacts that have the same checksums. Reproducible? Yes. But provenance has changed outside of the build.

An adversary may simply use a mirror they control for recon while still providing artifacts that have the same checksums. Reproducible? Yes. But provenance has changed outside of the build.

Mmh. Assuming a secure hash was used as a checksum, is the exact server really relevant or important as provenance information?

Mmh. Assuming a secure hash was used as a checksum, is the exact server really relevant or important as provenance information?

Provenance = origin. Relevant, yes. How important it is may be a point of discussion for every organization. There are many security, legal, and regulatory use cases where provenance matters.

I guess I have too weak of an understanding of provenance for some contexts, especially legal, but I’d rather see “this was provided by a project of the Apache Foundation” as provenance rather than “this was delivered by a server with ip x.x.x.x via routers a, b and c”.

I.e. I’d see the artifact’s hash as the way to guarantee the known origin, rather than the download location. Actually, I guess a signature would be even better.

Do you happen to have a link to a definition of provenance that could improve my understanding?

Thanks a lot for your time and patience.

Happy to help @loewenstein

The OWASP, MITRE, and NIST definitions of provenance are in alignment and can be decomposed to the English language definition as well.

The OWASP definition of provenance can be found at: https://owasp.org/www-community/Component_Analysis which currently reads:

A component’s provenance refers to the traceability of all authorship, build, release, packaging, and distribution across the entire supply chain. In physical supply chains this is referred to as the chain of custody. Provenance may include individual and community authorship of software components, manufacturers, suppliers, software repositories, and country of origin. For high assurance applications, provenance plays an important role in determining Foreign Ownership, Control, or Influence (FOCI).

NOTE: If you're coming from the SLSA world, their definition of provenance is not aligned to the rest of the industry, so may be confusing at first.

So is OWASP/CycloneDX's approach to treat SBOM and provenance as the same thing, so that your SBOMs will record the IP addresses queried during generation?
Because if not, you've got a separate SBOM and provenance, which still means the SBOM can be completely reproducible (since this issue is about SBOM reproducibility).
Also note you can still collect provenance around the generation of your SBOM as you would with any other artifact.

You can have a BOM that says you want a screw that meets X requirements, where any manufacturer works, or that you want specific screw Y from a manufacturer, where it doesn't matter where you get it. Then, via policy and by gathering detailed provenance for that screw, you could accept screws from company A, or even specifically from their factory B, but not from company C.

For the vast majority of use cases people only want to validate "this came from GH" or "this came from npm" by validating HTTPS certs. The threat of "do I know this specific IP from their CDN" is pretty extreme and I'd be interested in hearing:

  1. who actually needs this & why
  2. have they actually solved all the higher-risk problems already, or are they just procrastinating/bikeshedding when they can make an improvement already

The difference between software and a screw is you may trust company A or have even tested their screws hold up to spec & quality control. But for company C you'd have to check again & establish some trust in their quality control etc.
Whereas with software if the checksum is identical you know the "screw" is (for our current purposes) "identical".
You don't need to consider "could this screw not hold up to spec and break", "could it have fractures", "could it be made with cheaper materials", etc., because you can't really "checksum" real products, and the physical tests you'd do to try to get equivalent confidence are non-trivial, time-consuming, and expensive, unlike a quick checksum.
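The checksum argument, and the counterpoint that the download location is separate information, can both be shown in a short Python sketch. The URLs, the artifact bytes, and the `fetch_record` structure are made up for illustration; only `hashlib` and `hmac` from the standard library are used.

```python
import hashlib
import hmac


def sha256_hex(data):
    """Hex-encoded SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data, expected_sha256):
    """True if the artifact matches the pinned checksum."""
    return hmac.compare_digest(sha256_hex(data), expected_sha256.lower())


def fetch_record(data, source_url):
    """Hypothetical record of what was fetched and from where."""
    return {"sha256": sha256_hex(data), "fetched_from": source_url}


artifact = b"example package contents"
pinned = sha256_hex(artifact)

# The same bytes obtained from two different (made-up) locations.
from_public = fetch_record(artifact, "https://registry.example/pkg-1.0.4.tgz")
from_proxy = fetch_record(artifact, "https://proxy.internal.example/pkg-1.0.4.tgz")

# Identical content: both pass the checksum test.
print(verify_artifact(artifact, pinned))
print(from_public["sha256"] == from_proxy["sha256"])

# Different download location: the records still differ.
print(from_public == from_proxy)
```

The digest answers "is this the same artifact", while the `fetched_from` value answers "where did it come from". Whether the second question belongs inside the SBOM or in separate provenance data is the point under discussion.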

I could ramble a bunch more of my initial thoughts, but without an end to end example/scenario it's very hard to discuss what the threats actually are & how we'd solve those issues

The word "mirror" is pretty loaded, so it's important to clarify what is meant there.

Also is there a better forum for these kinds of discussions? 🙂

With CycloneDX specifically, for provenance it supports:

  • Author
  • Publisher
  • Supplier
  • Purl (has location information)
  • Build system
  • Distribution
  • SWID (tagCreator and softwareCreator in the event they are different)
  • VCS

These are supported for every component and service identified in the BOM, as well as for the component that the BOM describes.
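As a rough illustration of how several of these fields could appear on a single component, here is a Python dictionary shaped like a CycloneDX JSON component. The property names (`author`, `publisher`, `supplier`, `purl`, and `externalReferences` with types `vcs`, `build-system`, and `distribution`) reflect my reading of the CycloneDX JSON schema and should be checked against the specification version in use. All values are made up, and `provenance_summary` is a hypothetical helper.

```python
import json

# Hypothetical component; every name, URL, and version here is invented.
component = {
    "type": "library",
    "name": "example-lib",
    "version": "1.0.4",
    "author": "Example Author",
    "publisher": "Example Publisher",
    "supplier": {
        "name": "Example Supplier Inc.",
        "url": ["https://supplier.example"],
    },
    "purl": "pkg:npm/example-lib@1.0.4",
    "externalReferences": [
        {"type": "vcs", "url": "https://git.example/example-lib.git"},
        {"type": "build-system", "url": "https://ci.example/example-lib/1"},
        {"type": "distribution", "url": "https://registry.example/example-lib"},
    ],
}


def provenance_summary(comp):
    """Collect the provenance-related fields of one component."""
    refs = {r["type"]: r["url"] for r in comp.get("externalReferences", [])}
    return {
        "author": comp.get("author"),
        "publisher": comp.get("publisher"),
        "supplier": comp.get("supplier", {}).get("name"),
        "purl": comp.get("purl"),
        "vcs": refs.get("vcs"),
        "build_system": refs.get("build-system"),
        "distribution": refs.get("distribution"),
    }


summary = provenance_summary(component)
print(json.dumps(summary, indent=2))
```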

DNS was just one example. A more common example would be the use of npm.js versus the use of an internal proxy of npm.js, such as Sonatype Nexus Repository, JFrog Artifactory, or AWS CodeArtifact.

Where something was obtained from is a requirement for most high-assurance environments. The project does have a Slack channel if you'd prefer that. https://owasp.org/slack/invite Once in, navigate to #project-scvs.

Going back to reproducibility of SBOMs, the NTIA minimum elements require a timestamp for when the SBOM was generated.