cirruslabs/cirrus-cli

[worker] `podman system service` process accumulation

duckinator opened this issue · 11 comments

When running cirrus worker run with the container isolation on Debian 12, with the Podman backend, each task spawns a podman system service process that never exits.

At one point, the system had accumulated 158 of these processes, but hadn't run any tasks for over 3 hours.

WORKAROUND: Periodically stop and re-start cirrus worker run.

This line is what runs the process that doesn't exit:

cmd := exec.Command("podman", "system", "service", "-t", "0", socketURI)

However, my understanding is that this function should eventually be called to kill it:

func (backend *Podman) Close() error {
	doneChan := make(chan error)
	go func() {
		doneChan <- backend.cmd.Wait()
	}()

	var interruptSent, killSent bool
	for {
		select {
		case <-time.After(time.Second):
			// Escalate to SIGKILL if the process is still around.
			if !killSent {
				if err := backend.cmd.Process.Kill(); err != nil {
					return err
				}
				killSent = true
			}
		case err := <-doneChan:
			return err
		default:
			// Ask the process to terminate gracefully first.
			if !interruptSent {
				if err := backend.cmd.Process.Signal(os.Interrupt); err != nil {
					return err
				}
				interruptSent = true
			}
		}
	}
}

I'm unsure whether this function isn't being called, or if it's just not working for some reason.

I changed "-t", "0" to "-t", "1", and with that change the spawned processes eventually turn into zombie processes.

This seems to confirm that, for whatever reason, the podman system service processes are neither being waited on nor killed.

I've tried reproducing your issue on Cirrus CLI 0.122.0 and Podman 3.4.4 on a clean ghcr.io/cirruslabs/ubuntu:latest instance with the following configuration to no avail:

container:
  image: debian:latest

task:
  script: uname -a

#767 might help, though.

I'll test that PR later today and let you know. 👍

For future reference on my part, are you running ghcr.io/cirruslabs/ubuntu:latest via vetu, or something else?

I'm running it via Tart on macOS, but that probably doesn't matter much for this reproduction attempt.

@duckinator what does your .cirrus.yml look like? Which Cirrus CLI and Podman versions do you run, and on which Linux distribution?

We'll probably need to reproduce this somehow first in order to devise a fix (if it's needed at all).

Looking at the code, it should clean up the instance or report an error:

defer func() {
	if instanceCloseErr := task.Instance.Close(ctx); instanceCloseErr != nil {
		e.logger.Warnf("failed to cleanup task %s's instance: %v",
			task.String(), instanceCloseErr)
		if err == nil {
			err = instanceCloseErr
		}
	}
}()

Versions and such:

  • distro: Debian 12 (bookworm)
  • cirrus: 0.122.0-95e9f68
  • podman: 4.3.1

The config I'm using on the worker is:

token: "[...]"

security:
  allowed-isolations:
    container: {}

resources:
  cpu: 2

log:
  level: debug

The .cirrus.yml that triggers the worker is: https://github.com/duckinator/bork/blob/184e2c646d521bdfe8adef40c94082787e090944/.cirrus.yml (note that macOS_task, FreeBSD_task, and Windows_task are currently marked as skipped).

Please check out the new 0.122.1 release that will be available shortly, it should fix the issue you're encountering 🙌

Unfortunately with cirrus-cli 0.122.1-8ae0752, it's actually worse: the problem is still there, but now the podman processes linger even after I stop cirrus worker run. Previously, stopping the worker made them exit.

Indeed, I was testing the fix using cirrus run; however, the Persistent Worker has a slightly different code path.

This will be fixed in #769.

Sorry for the inconvenience!

Confirmed that 0.122.2-6faa293 works. Thank you for fixing this so quickly, it's very appreciated!