coreos/rpm-ostree

intermittent failure in `ext.rpm-ostree.destructive.container-image`

[2023-08-30T15:00:16.554Z] Aug 30 14:59:54 qemu0 kola-runext-container-image[1134]: + rpm-ostree rebase ostree-unverified-image:containers-storage:localhost/fcos-derived
[2023-08-30T15:00:16.554Z] Aug 30 14:59:54 qemu0 kola-runext-container-image[1957]: Pulling manifest: ostree-unverified-image:containers-storage:localhost/fcos-derived
[2023-08-30T15:00:16.554Z] Aug 30 14:59:54 qemu0 kola-runext-container-image[1957]: Importing: ostree-unverified-image:containers-storage:localhost/fcos-derived (digest: sha256:db32a3de020c5f7c74191e50690ad520c2158ccae3b99734cd59f15e0d9b73da)
[2023-08-30T15:00:16.554Z] Aug 30 14:59:54 qemu0 kola-runext-container-image[1957]: ostree chunk layers needed: 1 (1.5 GB)
[2023-08-30T15:00:16.554Z] Aug 30 14:59:54 qemu0 kola-runext-container-image[1957]: custom layers needed: 1 (24.7 MB)
[2023-08-30T15:00:16.554Z] Aug 30 15:00:15 qemu0 kola-runext-container-image[1957]: error: Importing: Parsing layer blob sha256:00623c39da63781bdd3fb00fedb36f8b9ec95e42cdb4d389f692457f24c67144: Failed to invoke skopeo proxy method FinishPipe: remote error: write |1: broken pipe
[2023-08-30T15:00:16.554Z] Aug 30 15:00:15 qemu0 systemd[1]: kola-runext.service: Main process exited, code=exited, status=1/FAILURE
[2023-08-30T15:00:16.554Z] Aug 30 15:00:15 qemu0 systemd[1]: kola-runext.service: Failed with result 'exit-code'.
[2023-08-30T15:00:16.554Z] Aug 30 15:00:15 qemu0 systemd[1]: kola-runext.service: Consumed 36.730s CPU time.

This one is concerning because it seems to be happening more frequently lately. I also think we may be running into something related to https://github.com/ostreedev/ostree-rs-ext/blob/bd77743c21280b0089c7146668e4c72f4d588143/lib/src/container/unencapsulate.rs#L143, which is masking the real error.
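
To illustrate the kind of masking I mean, here's a generic sketch (not the actual ostree-rs-ext code) of how joining two concurrent futures can hide the more useful error: `tokio::try_join!` reports whichever error resolves first and drops the other future along with its error. The function names, messages, and timings below are made up purely for illustration.

```rust
// Generic sketch of error masking when joining concurrent futures.
// `drive_proxy` / `fetch_layer` and their failures are hypothetical;
// only tokio::try_join! and tokio::time::sleep are real APIs.
use std::time::Duration;

// Stand-in for the proxy driver: fails quickly once its pipe is torn down.
async fn drive_proxy() -> Result<(), String> {
    tokio::time::sleep(Duration::from_millis(10)).await;
    Err("Failed to invoke skopeo proxy method FinishPipe: broken pipe".into())
}

// Stand-in for the layer import: the more informative error, but it
// resolves slightly later.
async fn fetch_layer() -> Result<(), String> {
    tokio::time::sleep(Duration::from_millis(20)).await;
    Err("actual root-cause error from the import path".into())
}

#[tokio::main]
async fn main() {
    // try_join! returns the first error to complete and drops the other
    // future, so the root-cause error above is never surfaced.
    if let Err(e) = tokio::try_join!(drive_proxy(), fetch_layer()) {
        eprintln!("error: {e}");
    }
}
```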

Discoveries so far:

  • ostreedev/ostree-rs-ext#527 does not help (but doesn't hurt either)
  • This test ends up pulling a 1.5 GB layer
  • Retrying does work, so we're apparently writing the layer correctly

More generally, this is definitely a race condition; I can sometimes reproduce it by running `ostree refs --delete ostree/container` and then re-running the rebase.

Also of note: kola defaults to a uniprocessor VM, which I think makes this race more likely to surface.

I'm quite certain it has something to do with the relative scheduling of us closing our end of the pipe versus calling FinishPipe; a rough sketch of that kind of race is below.
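
For reference, here's a minimal sketch of the close-vs-final-write race I have in mind, using `tokio::io::duplex` as a stand-in for the fd pipe shared with the skopeo proxy. None of this is the real rpm-ostree/ostree-rs-ext/containers-image-proxy code; the names, buffer sizes, and byte counts are arbitrary.

```rust
// Minimal sketch of a close-vs-final-write race on a pipe-like channel.
// tokio::io::duplex stands in for the fd pipe shared with the skopeo
// proxy; everything else here is hypothetical.
use tokio::io::{AsyncReadExt, AsyncWriteExt};

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let (mut writer, mut reader) = tokio::io::duplex(64);

    // "Proxy" side: streams the layer, then performs one last write
    // (think: the tail end of the blob before FinishPipe completes).
    let producer = tokio::spawn(async move {
        writer.write_all(&[0u8; 1024]).await?;
        // If the consumer has already dropped its end by now, this write
        // fails with BrokenPipe, analogous to the error in the test log.
        writer.write_all(b"trailer").await?;
        writer.shutdown().await
    });

    // "Consumer" side: reads what it believes is the whole payload and
    // then drops its end without synchronizing with the producer.
    let mut buf = vec![0u8; 1024];
    reader.read_exact(&mut buf).await?;
    drop(reader); // the racy early close

    // Depending on scheduling (and more likely on a single CPU), the
    // producer's final write lands before or after the drop above.
    match producer.await.unwrap() {
        Ok(()) => println!("clean finish"),
        Err(e) => println!("producer failed: {e}"), // sometimes BrokenPipe
    }
    Ok(())
}
```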