IMA: Extend filelists.xml to include a hash attribute inside the `file` elements.
aplanas opened this issue · 17 comments
Recently we have been looking for ways to securely deliver the file hashes of the installed RPMs, without downloading the RPMs and inspecting them locally.
The repository metadata, in repomd.xml, lists the different files that compose the repository, and this file can currently be signed and validated. From there, we can also validate the contents of the "primary", "filelists" and "other" XML files, calculating the SHA256 and comparing it with the prefix of the file name.
This makes it a perfect channel to deliver validated information about the content of the RPMs, without requiring the direct download, validation and inspection of each and every one of the RPMs in the repository. This is currently done for the file list, the changelog, and the requirements or recommendations of the packages.
It could be very interesting to extend "filelists.xml" so that each `file` element, which currently enumerates the different files that belong to an RPM, carries one new attribute. This attribute would be `hash`, and would contain the user-selected hash (maybe sha256) of each file.
For example, currently a `file` element looks like:
<file>/usr/bin/bash</file>
I propose to be extended as:
<file hash="7d301d1f90ac6e56f171d5de459b887eb6077c6e0ea340c04fe23855b1c6b9a3">/usr/bin/bash</file>
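As an illustration only (nothing like this exists in createrepo yet, and the `hash` attribute name is just the proposal above), generating the extended element is straightforward, e.g. in Python:

```python
import hashlib
import xml.etree.ElementTree as ET

def extended_file_element(path: str, content: bytes) -> str:
    """Build the proposed <file hash="...">path</file> element.

    The 'hash' attribute name is hypothetical (it is the proposal above);
    here the digest is the SHA-256 of the file content."""
    elem = ET.Element("file", hash=hashlib.sha256(content).hexdigest())
    elem.text = path
    return ET.tostring(elem, encoding="unicode")

# In-memory stand-in for a real file's content:
print(extended_file_element("/usr/bin/bash", b"#!/bin/sh\n"))
```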
The hash function can be selected by the user with a parameter during the `createrepo` call.
createrepo --hash sha256
Because the RPM header already contains a hash for each file, it could make sense to use that value directly, making the hash function parameter unnecessary.
createrepo --hash
One use of this new feature is remote attestation. In remote attestation (for example, using services like Keylime), we can enable IMA logging in the kernel of a monitored system, which will routinely send the IMA hashes of all opened or executed files. The remote node will compare those IMA hashes registered by the kernel against a white list. One problem is how to generate this list without accessing the monitored system. With this proposed feature, the white list can be generated knowing only the list of repositories used during the installation.
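To make the use case concrete, here is a hypothetical sketch of the verifier side: it builds an allowlist directly from an extended filelists document. The format is the one proposed above (XML namespaces omitted for brevity), not anything that exists today:

```python
import xml.etree.ElementTree as ET

def build_allowlist(filelists_xml: str) -> dict:
    """Map file path -> expected digest from a (hypothetical) extended
    filelists.xml; entries without the new attribute are skipped."""
    allowlist = {}
    for elem in ET.fromstring(filelists_xml).iter("file"):
        digest = elem.get("hash")
        if digest is not None:
            allowlist[elem.text] = digest
    return allowlist

sample = """<filelists>
  <package name="bash">
    <file hash="7d301d1f90ac6e56f171d5de459b887eb6077c6e0ea340c04fe23855b1c6b9a3">/usr/bin/bash</file>
    <file>/usr/share/doc/bash/README</file>
  </package>
</filelists>"""
print(build_allowlist(sample))
# {'/usr/bin/bash': '7d301d1f90ac6e56f171d5de459b887eb6077c6e0ea340c04fe23855b1c6b9a3'}
```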
This does sound useful for some cases, but it seems like a high price to pay for 99.99% of users. RPM metadata already gets quite large, and adding a sizable quantity of difficult-to-compress checksum data will cause it to grow significantly larger.
I can ballpark how much. RHEL 7's filelists.xml is currently 690 megabytes decompressed and 50 megabytes compressed. It contains 32,571 packages and ~7.3 million files listed for those packages.
[dalley@thinkpad repos]$ rg -c "<package" rhel7/repodata/filelists.xml
32571
[dalley@thinkpad repos]$ rg -c "<file>" rhel7/repodata/filelists.xml
7294832
7294832 * 64 bytes (hexdigest length of sha256 hash) = ~470 megabytes. So approaching double the size... although the compression ratio for the checksums is probably much worse than for the filenames (which are mostly similar to each other)
So in the worst case it could be adding hundreds of megabytes to every "dnf update". The download situation would probably be a bit better on CentOS or Fedora due to having the repos be split between release and update, plus zchunk, but it's not possible to avoid the on-disk metadata size bloat.
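The back-of-the-envelope estimate above is easy to check, and is if anything optimistic, since it ignores the attribute syntax each entry would also carry:

```python
files = 7_294_832           # <file> entries counted with rg above
digest_hex = 64             # characters in a sha256 hexdigest
raw = files * digest_hex    # digest characters alone
overhead = files * len(' hash=""')  # per-entry attribute syntax
print(raw, raw + overhead)  # ~467 MB of digests, ~525 MB with syntax
```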
If I can make an alternative suggestion. If your goal is to pre-index the list of allowed file checksums, I believe the RPM header stores digests for all of the files in the header, and the metadata does include the start and end byte of the RPM header within the package file. It might not be so difficult to make range requests which only download the header portion of the package, and use librpm to process that.
But I've never tried it, and I'm not sure if the checksums stored there are of the same type you would need when checking against the IMA hashes, though. Maybe @Conan-Kudo knows.
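A rough sketch of that range-request idea, with heavy caveats: the byte offsets would come from the `<rpm:header-range>` element in primary.xml, the offsets used below are made up, and feeding the resulting bytes to librpm is not shown, since I have not verified which parsing entry point accepts them:

```python
import urllib.request

def header_range(start: int, end: int) -> dict:
    """HTTP Range header covering the rpm:header-range byte span
    published in primary.xml."""
    return {"Range": f"bytes={start}-{end}"}

def fetch_rpm_header(url: str, start: int, end: int) -> bytes:
    """Download only the header region of a package.

    Servers may ignore Range and return the whole file, so a real
    caller should check for HTTP status 206 (Partial Content)."""
    req = urllib.request.Request(url, headers=header_range(start, end))
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# header_range(280, 8731) -> {'Range': 'bytes=280-8731'}
```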
> 7294832 * 64 bytes (hexdigest length of sha256 hash) = ~470 megabytes. So approaching double the size... although the compression ratio for the checksums is probably much worse than for the filenames (which are mostly similar)
>
> So in the worst case it could be adding hundreds of megabytes to every "dnf update".
Yes, you are correct. Just for the sake of the argument I simulated the growth in the oss repo from openSUSE Tumbleweed, and these are the numbers:
| filelists | uncompressed | compressed |
|---|---|---|
| original | 770368681 (735M) | 49764993 (48M) |
| extended | 1349033546 (1.3G) | 377537054 (361M) |
| ratio | 1.75 | 7.58 |
It was generated with the following command (note that, as this is only a size simulation, the hash is computed over the line text itself, not the actual file contents):
cat original.xml | awk '/<file>/{c=sprintf("echo \"%s\" | sha256sum | cut -d\" \" -f1", $0);c | getline h;close(c);k=sprintf("<file hash=\"%s\">", h);gsub(/<file>/, k)};{print}' > extended.xml
IMHO this makes it a very bad idea to include the hashes directly in `filelists.xml`. But I still think that they can live in a separate 3rd file, `filelists-hash.xml` or `filelists-ext.xml`, that is optionally generated by createrepo_c.
> If I can make an alternative suggestion. If your goal is to pre-index the list of allowed file checksums, I believe the RPM header stores digests for all of the files in the header, and the metadata does include the start and end byte of the RPM header within the package file. It might not be so difficult to make range requests which only download the header portion of the package, and use librpm to process that.
Yes, we thought about that. Let me write down all the options that we put on the table, with some evaluation:
1. As a single signed RPM that contains a database (CSV) of all the hashes.
2. As metadata inside the published repository.
3. As a web service with a REST API where an application can query the hashes of different files, depending on the release / architecture.
4. As a tool that downloads only the RPM headers from the repositories, and creates the hashes database locally.
(1) is about generating an RPM that will contain all the hashes. This is not a very good idea, because the generation needs to be done at the end, and typically a build system does not make room for this kind of RPM, generated once the repository is present. Also, to be used it needs to be installed in the system (or extracted with `unrpm`).
(2) is a good candidate. On one side it provides a clear delivery channel for the hashes, one that can be validated. Inside this option there are two possibilities: (2.a) extend `createrepo_c` to add the hashes in `filelists.xml` or in a new file `filelists-ext.xml`, or (2.b) extend one of the post-processing brp scripts from the build system, to add a new metadata file with the hashes and register it in `repomd.xml`. We are already doing the latter for the appdata.
Here (2.a) is a bit better, as it can be reused for all distributions. For example, Fedora is also looking at remote attestation (https://fedoraproject.org/wiki/Changes/DIGLIM) using Keylime, and would need something like this for collecting the hashes from a validated source.
(3) makes sense, but only when the database is created on demand, and not updated in batch. The downside is that there is no clear / standard way to reuse this for different distributions, as it will be very dependent on the build system used.
(4) is the one also proposed in your comment, but it can be very resource intensive for the repository server. Every time, you need to download the list of packages from `primary.xml`, then download all the headers and parse them locally. With a nice cache system it can pay off, but basically we are proposing to scrape the server as a mechanism for delivering the hashes, and this can become expensive very soon.
If `filelists-ext.xml` (2.a) is not a good proposal, I think that I will try the variant (2.b) before the local application (4), as I am worried that the usage of `Range` requests can become an invitation to overload the mirrors. But yes, it is also a very valid proposal!
Potential option 2c: add it to the filelists.sqlite metadata which createrepo_c is already capable of creating, but without making an equivalent XML change.
Not sure if that is acceptable. I've never quite understood the purpose of the sqlite metadata or why it's in 3 separate files rather than a unified database.
> Potential option 2c: add it to the filelists.sqlite metadata which createrepo_c is already capable of creating, but without making an equivalent XML change.
Ah, I do not know; it is kind of asymmetrical.
> Not sure if that is acceptable. I've never quite understood the purpose of the sqlite metadata or why it's in 3 separate files rather than a unified database.
Indeed. We do not publish any sqlite version in our repos.
Inserting it into `filelists.xml` doesn't seem like a good idea, and I think we won't be adding any new metadata at this point, so I believe 2.b is a good option for now.
However, if there is bigger interest in this in the future, something like `filelists-ext.xml` seems possible to me.
> Inserting it into `filelists.xml` doesn't seem like a good idea and I think we won't be adding any new metadata at this point so I believe 2.b is a good option for now.
Yes, after the measurement in the other comment, the inclusion in `filelists.xml` should be discarded.
> However if there is a bigger interest for this in the future something like `filelists-ext.xml` seems possible to me.
Right. Remote attestation is increasing in interest, and most of the current questions are about how one is expected to collect the correct hash values, preferably from outside the monitored servers.
I do not trust my C, so I am prototyping it in Python + librpm, with the idea of getting an intuition of the performance cost of generating this list. I guess that it will be about the same for `filelists.xml`, but let's see.
Prototype in Python using librpm here: https://github.com/aplanas/imadata
Currently working on a PR for `createrepo_c`.
I would consider implementing this functionality on the package manager side, e.g. in the form of a plugin that gathers file checksums from downloaded RPM files prior to executing the transaction. This way one would work only with RPMs that are really going to be installed on the system, and no new metadata would be necessary.
> I would consider implementing this functionality on the package manager side
There is no need for that. The rpmdb already contains the header information, and this includes the hashes of the files. Later rpm can verify this matching locally.
The goal is to have this information from outside of the node, so an independent authority (for example, the Keylime verifier) can build a database of good hashes for remote attestation, without accessing the local database of the node (nor the files, of course).
If this information is published in the repodata, which is signed, there is now a trust chain between the publisher (the creator of the repo) and the user of this data (the remote attestation tool, like Keylime) that does not include the monitored node.
Revised suggestion, depending on how this works out: Instead of adding it to filelists.sqlite, just make a brand new consolidated sqlite database containing all the repo metadata, including these file checksums.
That would make it easier to validate individual packages without parsing lots of XML.
The downside: currently the sqlite metadata is kind of useless, and giving it a new purpose would mean it can't eventually be deprecated.
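For what it's worth, the shape of such a consolidated database could be very small; this is only a sketch with an invented table layout, not a proposal for the actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE file_digest (
    pkgid  TEXT NOT NULL,   -- package checksum, as used in primary.xml
    path   TEXT NOT NULL,
    digest TEXT NOT NULL,   -- per-file hash, e.g. sha256
    PRIMARY KEY (pkgid, path))""")
con.execute(
    "INSERT INTO file_digest VALUES (?, ?, ?)",
    ("dummy-pkgid", "/usr/bin/bash",
     "7d301d1f90ac6e56f171d5de459b887eb6077c6e0ea340c04fe23855b1c6b9a3"))
row = con.execute(
    "SELECT digest FROM file_digest WHERE path = ?",
    ("/usr/bin/bash",)).fetchone()
print(row[0][:12])  # 7d301d1f90ac
```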
I'd like to eventually remove the ability for createrepo to produce SQLite-based metadata, so I'd rather not do things to that format.
Possibly it could be done with a separate tool and not part of the official metadata generation process per se. Purely for auditing purposes.
edit: I guess that doesn't really help with distributing the hashes to clients
> Possibly it could be done with a separate tool and not part of the official metadata generation process per se. Purely for auditing purposes.
That was my first approach: https://github.com/aplanas/imadata
The drawback is that with this we iterate twice over all the RPMs, and that we miss the opportunity to standardize the delivery of the hashes.
> edit: I guess that doesn't really help with distributing the hashes to clients
For local clients, no; eventually the hashes will be there in the rpmdb. But for remote clients this is how it is useful.
The goal is that if we are using remote attestation, we do not need local access to the different clients to know the expected hashes.
Since XML supports document merging, why aren't we just making overlay documents instead of duplicate documents?
> Since XML supports document merging, why aren't we just making overlay documents instead of duplicate documents?
How is this supported? I can only find references to tools that can merge XML under certain conditions, outside any standard.
IMHO this would complicate the user-space tools a lot. The idea is to curl the file alone (after signature validation of the root document) and ignore the rest.