rpm-software-management/createrepo_c

Brainstorm ways to shrink RPM metadata

dralley opened this issue · 5 comments

#395 and other recent PRs have brought up the topic of shrinking RPM metadata once again.

I'm not thrilled with such approaches (I can live with them, but it's yak-shaving over just a few percent).

Therefore I'd like to have a discussion about potentially more meaningful approaches.

This ancient wiki page suggests excluding icon and documentation entries (e.g. /usr/share/doc, /usr/share/icons) from filelists.xml, given that they make up a huge proportion of the entries there and in practice should likely never be used as dependencies.

The data is compelling (but from 2010, so recomputing it would be useful):

2.4 million files total in pkgs in rawhide
2.3 million of those are in /usr
1.8 million of those are /usr/share
Top 3 dirs by file count under /usr/share:
533046 /usr/share/doc
120555 /usr/share/javadoc
105591 /usr/share/icons
45 file-requires requiring something in /usr/share
none of those file-requires are in the top 3 /usr/share dirs
- most of them are fonts.
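
Recomputing these numbers could look roughly like the sketch below. It assumes an already decompressed filelists.xml using the usual `http://linux.duke.edu/metadata/filelists` namespace; the function name `count_files_by_dir` and the sample data are hypothetical and only for illustration.

```python
import posixpath
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO

# Namespace used by filelists.xml elements (assumed; check your repodata).
FILELISTS_NS = "{http://linux.duke.edu/metadata/filelists}"


def count_files_by_dir(source, depth=3):
    """Count <file> entries per parent directory, truncated to `depth` components."""
    counts = Counter()
    # iterparse streams the document, so large filelists are not built up front.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == FILELISTS_NS + "file":
            directory = posixpath.dirname(elem.text or "")
            parts = directory.split("/")[: depth + 1]
            counts["/".join(parts) or "/"] += 1
        elif elem.tag == FILELISTS_NS + "package":
            # Drop the children of a finished package to limit memory use.
            elem.clear()
    return counts


# Tiny made-up sample, not real repository data.
SAMPLE = b"""<filelists xmlns="http://linux.duke.edu/metadata/filelists" packages="1">
<package pkgid="abc" name="demo" arch="noarch">
<version epoch="0" ver="1" rel="1"/>
<file>/usr/bin/demo</file>
<file>/usr/share/doc/demo/README</file>
<file>/usr/share/doc/demo/html/index.html</file>
<file type="dir">/usr/share/doc/demo</file>
</package>
</filelists>"""

if __name__ == "__main__":
    for directory, n in count_files_by_dir(BytesIO(SAMPLE)).most_common(3):
        print(n, directory)
```

Running this against a real repository would mean passing an open file object for the decompressed filelists.xml instead of `SAMPLE`.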

This 6-year-old discussion brings up the same point:

AIUI @james-antill did some analysis versus Debian and he concluded that the "file dependencies" were a major part of the wire size. And yes holy cow, I just looked at a filelists.xml. I think my vote there would be to only do file entries for "entrypoints" like /usr/bin - there's really no sane scenario where an RPM package should Require: /usr/share/doc/GeographicLib-doc/html/C/annotated.html or whatever.

It also makes a second suggestion:

One idea I had is to "presolve" - a lot of this data is completely redundant dependencies. Take this chunk from the very first package I looked at, 0ad:

<rpm:requires>
  <rpm:entry name="libstdc++.so.6()(64bit)"/>
  <rpm:entry name="libstdc++.so.6(CXXABI_1.3)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(CXXABI_1.3.5)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(CXXABI_1.3.8)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(CXXABI_1.3.9)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.11)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.14)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.15)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.18)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.19)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.20)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.21)(64bit)"/>
  <rpm:entry name="libstdc++.so.6(GLIBCXX_3.4.9)(64bit)"/>

...

But those are all provides of the libstdc++ package - and I don't think we're ever going to have different symbol versions provided by separate packages.

So doing a pass where we just drop redundant requires would probably make a notable difference.
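
To get a feel for how much such a pass might save, one could first measure how many requires share a soname. The sketch below only measures; it does not decide which entries are safe to drop, since versioned symbol requires may encode a minimum library version. The regex and the function names `group_by_soname` and `collapsible_entries` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Matches names such as "libstdc++.so.6(GLIBCXX_3.4.9)(64bit)".
# Requiring ".so" in the name keeps "rpmlib(...)" style entries out.
SONAME_RE = re.compile(
    r"^(?P<soname>[^()]+\.so[^()]*)\((?P<symver>[^()]*)\)(?P<arch>\(64bit\))?$"
)


def group_by_soname(require_names):
    """Group soname-style requires by soname plus arch marker."""
    groups = defaultdict(list)
    for name in require_names:
        match = SONAME_RE.match(name)
        if match:
            key = match.group("soname") + (match.group("arch") or "")
            groups[key].append(match.group("symver"))
    return dict(groups)


def collapsible_entries(require_names):
    """Entries beyond the first for each soname, i.e. an upper bound on savings."""
    return sum(len(v) - 1 for v in group_by_soname(require_names).values())


if __name__ == "__main__":
    requires = [
        "libstdc++.so.6()(64bit)",
        "libstdc++.so.6(CXXABI_1.3)(64bit)",
        "libstdc++.so.6(GLIBCXX_3.4.21)(64bit)",
        "libc.so.6()(64bit)",
        "rpmlib(CompressedFileNames)",
        "/bin/sh",
    ]
    print(group_by_soname(requires))
    print(collapsible_entries(requires))
```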

@Conan-Kudo I know you had strong feelings on this a few years ago, what are your thoughts?

I know that lazy filelists downloading makes the subject less relevant for Fedora 40+, but if there's an obvious win here we should still take it.

According to the Fedora packaging guidelines, files outside of /usr/bin and /etc should not be used as requirements anyway, and files from /usr/bin and /etc are already part of the primary metadata. I'd be really happy if depsolving did not need filelists. Never. Actually, there are currently very few packages (in Fedora, not sure how the situation is in third-party repos) that depend on such files, and lately issues have been filed for them to drop such dependencies.
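
As a rough illustration of that split, the sketch below flags file requires that would not be resolvable from primary metadata alone. The prefix list simply mirrors the two directories named above; the rule createrepo_c actually applies when choosing files for primary may differ, and `needs_filelists` is a hypothetical helper.

```python
# Assumption: only these prefixes are carried in primary metadata.
PRIMARY_PREFIXES = ("/usr/bin/", "/etc/")


def is_file_require(name):
    """A requirement that starts with '/' is treated as a file dependency."""
    return name.startswith("/")


def needs_filelists(name, primary_prefixes=PRIMARY_PREFIXES):
    """True if a file require falls outside the assumed primary prefixes."""
    return is_file_require(name) and not name.startswith(primary_prefixes)


if __name__ == "__main__":
    for req in (
        "/usr/bin/python3",
        "/etc/passwd",
        "/usr/share/doc/GeographicLib-doc/html/C/annotated.html",
        "libstdc++.so.6()(64bit)",
    ):
        print(req, needs_filelists(req))
```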

My only occasional use-case for filelists is "Which package provides this file?" (dnf provides /this/file/i/need), and for this reason I would prefer filelists.xml contained all the files.

Third-party repos tend to rely on file dependencies more because RPM distributions do not agree on packaging conventions. Fedora packaging guidelines should be ignored from an upstream RPM stack perspective (createrepo_c, dnf, etc.).

I know that lazy filelists downloading makes the subject less relevant for Fedora 40+, but if there's an obvious win here we should still take it.

Nobody has yet implemented lazy downloading. I've asked for this to be considered and provided a conceptual path to doing so, but nobody has responded to my comments about it.

I've summarized information about implementation of lazy loading of filelists in rpm-software-management/dnf5#1053.