rpm-software-management/createrepo_c

createrepo_c does not warn or error when building a repository where the same package is present more than once

dralley opened this issue · 5 comments

My understanding is that "pkgid" and package NEVRA are meant to be unique in a repo. createrepo_c (the command line tool, at least) ought to provide a warning or error when asked to build a repo with the same package present twice.

This can occur when:

  • The user thinks they can just rename the RPM to make it a different package (the package metadata is in the header so obviously this doesn't work)
  • The user puts the same RPM in multiple sub-directories

This is frequently seen in the wild, even in commonly used community repos (centos7-opstools) and professional repos (prometheus, gitlab, perf-sonar, RHEL6, Microsoft Azure RHEL additions).

This seems somewhat architecturally challenging.

It looks like the main thread produces a list of filenames, creates a thread pool, and starts creating tasks to read individual RPMs and dump them into the XML as fast as possible. The problem is there's no interim spot where all the NEVRAs or package IDs are known in order to compare them, prior to dumping the data into the XML. The best thing you could do is create a shared hash set of NEVRAs, protect it with concurrency primitives, and let the first package that inserts a given NEVRA "win" and all others lose, or just fail immediately I suppose. Not the most satisfying solution.
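To make the shared-set idea concrete, here is a minimal sketch of "first insert wins" under a lock. It assumes plain pthreads and a hand-rolled hash set rather than whatever primitives createrepo_c uses internally; `NevraSet`, `nevra_set_claim`, and the `had_errors` field are hypothetical names for illustration, not createrepo_c API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NEVRA_BUCKETS 1024

/* One entry in a bucket's singly linked chain. */
typedef struct NevraNode {
    char *nevra;
    struct NevraNode *next;
} NevraNode;

/* Hypothetical shared set of NEVRAs seen so far, guarded by one mutex. */
typedef struct {
    pthread_mutex_t lock;
    NevraNode *buckets[NEVRA_BUCKETS];
    bool had_errors; /* set once any duplicate (or allocation failure) is seen */
} NevraSet;

/* djb2 string hash; any reasonable string hash would do here. */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;
    for (; *s; s++)
        h = h * 33 + (unsigned char)*s;
    return h;
}

void nevra_set_init(NevraSet *set)
{
    pthread_mutex_init(&set->lock, NULL);
    memset(set->buckets, 0, sizeof set->buckets);
    set->had_errors = false;
}

/* Returns true if this caller inserted the NEVRA first ("wins").
 * Returns false if it was already present or memory ran out; in both
 * cases had_errors is set so the caller can report failure later. */
bool nevra_set_claim(NevraSet *set, const char *nevra)
{
    assert(set != NULL && nevra != NULL);
    unsigned long idx = hash_str(nevra) % NEVRA_BUCKETS;
    bool inserted = false;

    pthread_mutex_lock(&set->lock);
    NevraNode *n = set->buckets[idx];
    while (n != NULL && strcmp(n->nevra, nevra) != 0)
        n = n->next;

    if (n == NULL) {
        NevraNode *node = malloc(sizeof *node);
        char *copy = malloc(strlen(nevra) + 1);
        if (node != NULL && copy != NULL) {
            strcpy(copy, nevra);
            node->nevra = copy;
            node->next = set->buckets[idx];
            set->buckets[idx] = node;
            inserted = true;
        } else {
            free(node);
            free(copy);
            set->had_errors = true;
        }
    } else {
        set->had_errors = true; /* duplicate: a previous task already won */
    }
    pthread_mutex_unlock(&set->lock);
    return inserted;
}

void nevra_set_destroy(NevraSet *set)
{
    for (size_t i = 0; i < NEVRA_BUCKETS; i++) {
        NevraNode *n = set->buckets[i];
        while (n != NULL) {
            NevraNode *next = n->next;
            free(n->nevra);
            free(n);
            n = next;
        }
    }
    pthread_mutex_destroy(&set->lock);
}
```

Each worker task would call `nevra_set_claim` before writing its package to the XML and skip (or warn about) the package when it returns false. Which duplicate "wins" then depends on thread scheduling, which is part of why this is not a very satisfying solution.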

@kontura Is that analysis accurate? Do you have any other ideas?

I think your analysis is spot on, and I don't really see any other way to make this happen if we want to keep processing the packages with multiple threads.

In practice I believe it should look something like this:

I would also make it an error: if there are duplicates, set udata->had_errors. That way the repodata will still be created but createrepo_c will exit with 2.

I just briefly went through the description of the issue and I am not sure whether the change is a good idea. It is completely true that in a standard distribution we do not have the same NEVRA multiple times, but in a Copr repository it is a different story. For example, in our nightly repository we have multiple packages with the same NEVRA and pkgid. That means such a change will break a lot of things. Also, I can have one package signed with multiple keys. The copies have the same NEVRA but different pkgids.

@j-mracek It's not a performance improvement or anything like that. There are correctness issues with having multiple packages with the same pkgid and/or NEVRA in the same repo. If you run createrepo_c --update on one of them, it will just delete all the duplicated package entries. If every package has duplicates, it produces a completely empty repo. If only some of them do, you will probably end up with broken dependencies.

That is nonetheless probably one of the better ways to handle the situation, because it's easy to parse the metadata incorrectly using the createrepo_c APIs. That issue still exists; it's just that the new streaming / PackageIterator API can help get around it.

I don't understand why having the same package present multiple times, perhaps signed with different keys, would ever be a desirable thing to do. DNF will (to my knowledge) just pick the one with the latest build timestamp and ignore the rest, but you'd be hard pressed to figure out which package that will be from just looking at the repo. Is there a reason why COPR repos must act this way, or is it just an implementation detail that can be changed? It feels like leaning on an informal heuristic that DNF uses currently but doesn't necessarily promise.

We want Pulp users to be able to know what packages they're getting, and every additional edge case makes that more challenging. We've also had some correctness issues with these repos entirely separate from the ones listed above.