cirruslabs/cirrus-cli

[worker] `podman system service` process accumulation

duckinator opened this issue · 11 comments

When running cirrus worker run with the container isolation on Debian 12, with the Podman backend, each task spawns a podman system service process that never exits.

At one point, the system had accumulated 158 of these processes, but hadn't run any tasks for over 3 hours.

WORKAROUND: Periodically stop and re-start cirrus worker run.

This line is what runs the process that doesn't exit:

cmd := exec.Command("podman", "system", "service", "-t", "0", socketURI)

However, my understanding is that this function should eventually be called to kill it:

func (backend *Podman) Close() error {
	doneChan := make(chan error)
	go func() {
		doneChan <- backend.cmd.Wait()
	}()

	var interruptSent, killSent bool
	for {
		select {
		case <-time.After(time.Second):
			// Escalate to SIGKILL if the process is still around.
			if !killSent {
				if err := backend.cmd.Process.Kill(); err != nil {
					return err
				}
				killSent = true
			}
		case err := <-doneChan:
			return err
		default:
			// Ask the process to terminate gracefully first.
			if !interruptSent {
				if err := backend.cmd.Process.Signal(os.Interrupt); err != nil {
					return err
				}
				interruptSent = true
			}
		}
	}
}

I'm unsure whether this function isn't being called, or if it's just not working for some reason.

I changed "-t", "0" to "-t", "1", and with that change the spawned processes eventually turn into zombie processes.

This seems to confirm that, for whatever reason, the podman system service processes are neither being waited on nor killed.

I've tried reproducing your issue on Cirrus CLI 0.122.0 and Podman 3.4.4 on a clean ghcr.io/cirruslabs/ubuntu:latest instance with the following configuration to no avail:

container:
  image: debian:latest

task:
  script: uname -a

#767 might help, though.

I'll test that PR later today and let you know. 👍

For future reference on my part, are you running ghcr.io/cirruslabs/ubuntu:latest via vetu, or something else?

I'm running it via Tart on macOS, but that probably doesn't matter much for this reproduction attempt.

@duckinator what does your .cirrus.yml look like? Which Cirrus CLI and Podman versions do you run, and on which Linux distribution?

We'll probably need to reproduce this somehow first in order to devise a fix (if it's needed at all).

Looking at the code, it should clean up the instance or report an error:

defer func() {
	if instanceCloseErr := task.Instance.Close(ctx); instanceCloseErr != nil {
		e.logger.Warnf("failed to cleanup task %s's instance: %v",
			task.String(), instanceCloseErr)
		if err == nil {
			err = instanceCloseErr
		}
	}
}()

Versions and such:

  • distro: Debian 12 (bookworm)
  • cirrus: 0.122.0-95e9f68
  • podman: 4.3.1

The config I'm using on the worker is:

token: "[...]"

security:
  allowed-isolations:
    container: {}

resources:
  cpu: 2

log:
  level: debug

The .cirrus.yml that triggers the worker is: https://github.com/duckinator/bork/blob/184e2c646d521bdfe8adef40c94082787e090944/.cirrus.yml (note that macOS_task, FreeBSD_task, and Windows_task are currently marked as skipped).

Please check out the new 0.122.1 release that will be available shortly, it should fix the issue you're encountering 🙌

Unfortunately with cirrus-cli 0.122.1-8ae0752, it's actually worse: the problem is still there, but now the podman processes linger even after I stop cirrus worker run. Previously, stopping the worker made them exit.

Indeed, I was testing the fix using cirrus run; however, the Persistent Worker has a slightly different code path.

This will be fixed in #769.

Sorry for the inconvenience!

Confirmed that 0.122.2-6faa293 works. Thank you for fixing this so quickly, it's very appreciated!