cassava/repoctl

Repoctl is backing up files without any good reason

Closed this issue · 39 comments

Hi @cassava,

I just lost the entire repository twice today; the files were moved to the backup directory.

The only commands used were repoctl update and repoctl add <some kernels>.
I'm using the 0.21 release.

This time repoctl --debug status -mca doesn't show anything wrong 😢

So sorry! I'm looking into it.

Luckily I logged the moment it backed up everything this second time:

Copying and adding to repository: linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst{,.sig}
Adding package to database: /srv/http/chaotic-aur/x86_64/linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst
error: read package /srv/http/chaotic-aur/x86_64/linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst: invalid input: magic number mismatch.

Could you post the output of repoctl version?

repoctl version

repoctl version 0.21 (30 August, 2020)
Copyright 2016-2020, Ben Morgan <cassava@iexu.de>

You may find repoctl on the Internet at
    https://github.com/cassava/repoctl
Please report any bugs you may encounter.

The source code of repoctl is licensed under the MIT license.

Current configuration:
    columnate = false
    color = "auto"
    quiet = false

    current_profile = "default"
    default_profile = "default"

    [profiles.default]
        repo = "/srv/http/chaotic-aur/x86_64/chaotic-aur.db.tar.zst"
        add_params = []
        rm_params = []
        ignore_aur = []
        require_signature = false
        backup = true
        backup_dir = "/srv/http/chaotic-aur/archive/"
        interactive = false
        pre_action = ""
        post_action = ""

That looks like the output of the bug that already got fixed... hmm. I wonder if repoctl-git version is different from the one I packaged.

😅 this was with aur.archlinux.org/packages/repoctl

Oh boy...

Ah, you might consider using repoctl-0.21-3.

I may have missed updating it 😅 I'll have to wait for the recompilation cycle to reach the letter R.

My guess is that might be it. I might have messed up the PKGBUILD for the go module migration, which could result in local Go modules being used instead of the vendored ones... This was before I followed the updated Arch Go packaging guidelines for modules.

And the error you describe here is one that got fixed in one of the dependencies of repoctl. So there might be that mismatch.

Also, I downloaded the package that caused the trouble and followed your procedure locally and didn't have any trouble, so at least I can't reproduce it with 0.21-2 and 0.21-3.

One thing I noticed: when the magic number mismatch happens with repoctl update, no files are backed up and it just fails. When it happens with repoctl add, it's catastrophic.
And another thing: What happens when the file didn't finish writing? I have some async tasks and they may be trying to add files before they finished writing...

EDIT: I've updated to 0.21-3, I'll keep you posted...

Oh interesting, I should look into that. I think I'm starting to understand the backup behavior. Has to do with how repoctl reads all data first and then tries to act on it.
I think I need to add a separate use-case for "new file exists and I can't read it".
Because I did not consider this originally. This would solve both problems at once.

Also, a partially-written archive would pose some problems, because repoctl only reads as much of a package as it needs to; currently I'm relying on repo-add to handle the case of an incomplete file.

Yeah, repo-add failing and consequently repoctl exiting with a failure code too is enough. That's how it has been working for the past year, and how it would work with pure repo-add too. The server will reattempt it later on failures...

So one thing I could definitely do is have repoctl add verify the packages before copying them to the repository. Currently it just copies them over and trusts repo-add.

Alright then that change is now on master with 4822d1f.

Luckily I logged the moment it backed up everything this second time:

Copying and adding to repository: linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst{,.sig}
Adding package to database: /srv/http/chaotic-aur/x86_64/linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst
error: read package /srv/http/chaotic-aur/x86_64/linux-tkg-pds-bobcat-5.8.7-12-x86_64.pkg.tar.zst: invalid input: magic number mismatch.

And you're saying that adding this package to database actually caused all the packages in the repository to be backed up?

Alright then that change is now on master with 4822d1f.

Moved to it 👍

By the end of the recompilation cycles, I'll let you know if something goes wrong.
Since changing to -3, no error has happened yet!

And you're saying that adding this package to database actually caused all the packages in the repository to be backed up?

Yeah, the full logs have a bunch of "Backing up..." lines after this.

So bizarre that I can't get that backup-behavior reproduced at all... :-(

I have been using 0.21 since #57 was closed, and only now did it happen (and then again, but after building 800 packages in a short period).
I think that sounds like a race condition issue.

Do you run repoctl update while building other packages?

I do. The server has 40 vCPUs and I don't like to leave any of them idle 😊, so it's a chaos of down, update, and, when files come from a third cluster, add.

Ok... that explains a lot. 😄

This would have been very useful to know earlier. So far I haven't considered the ramifications of parallel updates and adds. This is a tricky one.

Actually I'd also be interested in hearing any pain-points you might have in building that many packages.

For example: I've always found the situation difficult where you need to build newer dependencies that then need to be installable for the next makepkg -s command.

I think it would be better to create a new issue specifically for the use-case "Support parallel execution of repoctl".

😅 Somehow I managed that: my first infra has a "batch" command, and I execute it like this:
chaotic-batchbuild somepackage anotherpackage -- apackagethatdepends
I also wrapped the "add" command: it waits for a lock to be deleted before running a secondary repoctl update.

(And the second one has a db-bump command that does almost the same)
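For readers who want to try something similar, here is a minimal sketch of such a lock wrapper. It is not the wrapper described above: it swaps the "wait for a lock file to be deleted" approach for an advisory `flock` lock, and the names `repo_lock` and `run_locked` as well as the lock path are hypothetical. It assumes a Unix-like system and that every process touching the repository goes through the same wrapper.

```python
import fcntl
import subprocess
from contextlib import contextmanager


@contextmanager
def repo_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as lock_file:
        # Blocks until every other holder of the lock has released it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def run_locked(lock_path, command):
    """Run command (a list of arguments) while holding the lock; return its exit code."""
    with repo_lock(lock_path):
        return subprocess.run(command).returncode
```

A call such as `run_locked("/path/to/repo.lock", ["repoctl", "update"])` would then serialize with any other command run through the same lock path. Advisory locks only help if the copy step that places packages into the repository takes the same lock.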

Sometimes I still get:

error: read package /srv/http/chaotic-aur/x86_64/hamsket-git-r1222.fe82ff7-1-x86_64.pkg.tar.zst: invalid input: magic number mismatch.
Adding package to database: /srv/http/chaotic-aur/x86_64/gstreamer0.10-base-0.10.36-13-x86_64.pkg.tar.zst
Adding package to database: /srv/http/chaotic-aur/x86_64/gstreamer0.10-base-plugins-0.10.36-13-x86_64.pkg.tar.zst

Sometimes it's uglier:

error: read package /srv/http/chaotic-aur/x86_64/gnome-shell-extension-xrdesktop-git-0.14.0.29.9c5c0c3-1-any.pkg.tar.zst: cannot find file ".PKGINFO".
error: read package /srv/http/chaotic-aur/x86_64/mkinitcpio-openswap-0.1.0-3-any.pkg.tar.zst: cannot find file ".PKGINFO".
error: read package /srv/http/chaotic-aur/x86_64/pango-anydesk-1:1.43.0-3-x86_64.pkg.tar.zst: invalid input: magic number mismatch.
error: read package /srv/http/chaotic-aur/x86_64/perl-authen-simple-0.5-9-any.pkg.tar.zst: cannot find file ".PKGINFO".
error: read package /srv/http/chaotic-aur/x86_64/qomui-git-0.8.2.r22.23650ab-1-x86_64.pkg.tar.zst: invalid input: magic number mismatch.
error: read package /srv/http/chaotic-aur/x86_64/ripcord-arch-libs-0.4.26-1-x86_64.pkg.tar.zst: invalid input: magic number mismatch.
error: read package /srv/http/chaotic-aur/x86_64/woeusb-ng-0.2.5-3-any.pkg.tar.zst: invalid input: magic number mismatch.
Adding package to database: /srv/http/chaotic-aur/x86_64/tpmmanager-0.8.1-8-x86_64.pkg.tar.zst

But these packages are still being added to the repo...

As the catastrophic event seems to have ceased, I'm closing this issue.

Hey @PedroHLC, the errors you are seeing there are to be expected when repoctl reads tar.zst files that are still being written.

cannot find file ".PKGINFO" happens when the Zst decompression succeeds far enough that the TAR reader can start processing the archive, but it doesn't find the .PKGINFO file that is supposed to be in the TAR.

invalid input: magic number mismatch happens when the Zst decompression fails because not enough of the file has been written.

If repoctl encounters these files, it should just ignore them.
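As an illustration of what a "magic number" check looks at, the sketch below tests whether a file begins with the four-byte Zstandard frame magic number (0xFD2FB528, stored little-endian). This is not how repoctl validates packages; the function name `starts_with_zstd_magic` is hypothetical, and the check is necessary but far from sufficient, since a file that was cut off after its header still passes.

```python
# Zstandard frame magic number 0xFD2FB528, as it appears on disk (little-endian).
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def starts_with_zstd_magic(path):
    """Return True if the file at path begins with the Zstandard frame magic number."""
    with open(path, "rb") as f:
        return f.read(4) == ZSTD_MAGIC
```

A wrapper script could use a check like this to skip obviously incomplete files (for example, empty ones) before handing them to repoctl add, while still relying on repo-add to reject archives that are truncated later in the stream.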

Further final thoughts from me:

  • Since this is hard to replicate, one way to reproduce this might be to truncate files at a certain number of bytes.
  • Optimally, files that are in the process of being written or copied should be given an extension that repoctl ignores.
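The truncation idea from the first point could be sketched like this: copy only the first N bytes of a known-good package to simulate a file that is still being written. The function name `write_truncated_copy` and its parameters are hypothetical, and the sketch reads the prefix into memory, so it is meant for small cut-off sizes.

```python
def write_truncated_copy(src, dst, size):
    """Write at most the first `size` bytes of src to dst; return the bytes written."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        return fout.write(fin.read(size))
```

Running repoctl add against copies cut at different sizes (a few bytes, a few kilobytes, most of the file) in a throwaway repository might show which cut-off points trigger which error.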

Quick question: Do you run repo add and repo update in parallel?

Do you run repo add and repo update in parallel?

I observed it today, and repoctl is not running in parallel. My lock wrapper is working and probably has been for the entire past year. I just don't avoid partially written files. But I'm considering using the same lock file for the copying operations...

Sadly it happened once more, with repoctl add (and not running in parallel).

Good to know that it can also happen by itself! Debugging data-race issues is really, really hard, because a lot of behavior is just undefined, which can mean basically anything. But if it happens without any other instance running in parallel, then I might just have a chance to observe it myself.

If you ever manage to reproduce it reliably, that is of course the absolute best, but from the sound of it that doesn't happen.

Do you know if anything else was running at the same time, e.g. Pacman? I opted not to use libalpm, the Pacman library, because it was always annoying to have to recompile a tool like cower every time I updated pacman. But that means that I had to come up with the database reading myself, which isn't as battle-tested as that from Pacman.

Sadly it took me 40 hours to notice the packages were gone 😅
Thankfully one of the mirrors isn't syncing package deletions and I've been using it as a backup.

I had one entry showing as mixxx_beta-git: updated( -> r6814-1) in repoctl status.

This package wasn't even appearing in my dump with tar -tv --zstd -f chaotic-aur.db.tar.zst | awk '/^d/{print $6}'. And it was built near the time things went crazy.

I've added it with repo-add and now it's in the database and doesn't show in repoctl status anymore...

Do you know if anything else was running at the same time, e.g. Pacman?

It shouldn't be running, except inside some containers...

AladW commented

Over the years I've also noticed (and had reports of) local repository databases suddenly becoming empty. Never found out the cause either. In my case, aur-build does not seem to have data races either (all built packages are written to a random, private directory before being mv'd to the local repository, and repo-add has its own locking mechanism which I presume (?) to be functional).
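The "write elsewhere, then move into place" pattern mentioned here can be sketched as follows. This is not aur-build's code: the function name `publish_atomically` and the `.incoming-` / `.part` naming are hypothetical, and whether repoctl skips such temporary names is an assumption to verify. The sketch places the temporary file in the repository directory itself so that the final rename stays on one filesystem, where `os.replace` is atomic on POSIX systems.

```python
import os
import shutil
import tempfile


def publish_atomically(src, repo_dir):
    """Copy src into repo_dir under a temporary name, then rename it into place."""
    final_path = os.path.join(repo_dir, os.path.basename(src))
    # Hypothetical naming: the temporary file does not end in .pkg.tar.zst.
    fd, tmp_path = tempfile.mkstemp(dir=repo_dir, prefix=".incoming-", suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src, "rb") as fin:
            shutil.copyfileobj(fin, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk before renaming
        # Readers see either no file or the complete file, never a partial one.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return final_path
```

Note that `tempfile.mkstemp` creates the file readable only by its owner, so a repository served over HTTP would likely need an `os.chmod` on the final path as well.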