ansible/receptor

receptor: Error locating unit

anxstj opened this issue · 8 comments

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Bug Summary

My receptor services on my execution nodes show the following errors:

ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ
ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ

The errors seem to show up whenever a job finishes. The jobs themselves succeed, though, and AWX doesn't show any additional error messages.

What could cause this? And how can I debug it?
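As a starting point for debugging, the work unit IDs can be pulled out of the receptor log and matched against the jobs that finished at the same time. The following is a minimal sketch that only understands the two message shapes quoted above; the function name `unknown_units` is made up for illustration and is not part of receptor or AWX.

```python
import re
from collections import Counter

# Matches only the two error shapes quoted in this issue:
#   ERROR <date> <time> Error locating unit: <id>
#   ERROR <date> <time> unknown work unit <id>
# Other receptor log formats are not covered (assumption).
UNIT_RE = re.compile(
    r"^ERROR (\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?:Error locating unit: |unknown work unit )(\S+)$"
)


def unknown_units(lines):
    """Return a Counter mapping work unit ID to the number of error lines."""
    counts = Counter()
    for line in lines:
        match = UNIT_RE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


sample = [
    "ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ",
    "ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ",
]
print(unknown_units(sample))
```

Feeding it the lines from the receptor service log gives one entry per affected work unit, which makes it easier to see whether every finished job produces the error or only some of them.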

I'm running AWX 21.5.0 and receptor 1.2.0+g72a97e5

Receptor is installed with the AWX image:

Dockerfile:

COPY --from={{ receptor_image }} /usr/bin/receptor /usr/bin/receptor

Makefile:

RECEPTOR_IMAGE ?= quay.io/ansible/receptor:devel

AWX version

21.5.0

Select the relevant components

  • UI
  • API
  • Docs
  • Collection
  • CLI
  • Other

Installation method

docker development environment

Modifications

no

Ansible version

2.12.2

Operating system

Debian 11

Web browser

Firefox

Steps to reproduce

Create a setup with two controller nodes and two execution nodes. Then execute a job on one of the execution nodes. The job should succeed, but receptor will log an error message similar to the one above when the job ends.

Expected results

No error message.

Actual results

ERROR 2022/09/27 16:07:41 Error locating unit: SLpl8dHZ
ERROR 2022/09/27 16:07:41 unknown work unit SLpl8dHZ

Additional information

No response

@anxstj thanks for opening the ticket!
My hunch is that AWX is trying to release or cancel old receptor work units somewhere (i.e. reaper code). This needs some investigation.

I just found out that old podman instances are not cleaned up successfully. They stay as zombies on the system:

ps faux
...
1000        7586  0.9  0.3 807216 57784 ?        Ssl  Sep27 270:41  \_ receptor --config /etc/receptor/receptor.conf
1000        8004  0.0  0.0      0     0 ?        Z    Sep27   0:00      \_ [podman] <defunct>
1000        8009  0.0  0.0   1088     0 ?        S    Sep27   0:00      \_ catatonit -P
1000        8669  0.0  0.0      0     0 ?        Z    Sep27   0:00      \_ [slirp4netns] <defunct>
1000        8691  0.0  0.0      0     0 ?        Zs   Sep27   0:09      \_ [fuse-overlayfs] <defunct>
1000        8699  0.0  0.0      0     0 ?        Zs   Sep27   0:00      \_ [conmon] <defunct>

In the long run, this will cause trouble, e.g. the systemd MaxTasks limit will be reached:

cgroup: fork rejected by pids controller in /system.slice/...
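To keep an eye on this before the limit is hit, the defunct children of the receptor process can be counted from `ps` output. This is a rough sketch, assuming the output format of `ps -eo pid,ppid,stat,comm` on Linux; the helper names are made up, and the PPID column in the sample is an assumption based on the process tree above, not captured output.

```python
import subprocess


def parse_zombies(ps_output):
    """Parse `ps -eo pid,ppid,stat,comm` output and return zombie rows.

    Each row is (pid, ppid, command). A process counts as a zombie when
    its STAT column starts with "Z".
    """
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        if stat.startswith("Z"):
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def zombies_under(ps_output, parent_pid):
    """Return only the zombies whose parent is parent_pid."""
    return [z for z in parse_zombies(ps_output) if z[1] == parent_pid]


def live_ps_output():
    """Run ps on the local machine (not called by the example below)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative sample modelled on the tree above; PPID values are assumed.
SAMPLE = "\n".join([
    "    PID    PPID STAT COMMAND",
    "   7586       1 Ssl  receptor",
    "   8004    7586 Z    podman <defunct>",
    "   8009    7586 S    catatonit",
    "   8669    7586 Z    slirp4netns <defunct>",
    "   8691    7586 Zs   fuse-overlayfs <defunct>",
    "   8699    7586 Zs   conmon <defunct>",
])

for pid, ppid, comm in zombies_under(SAMPLE, 7586):
    print(pid, comm)
```

On a real execution node, `zombies_under(live_ps_output(), <receptor pid>)` would give the current list. A count that grows after every job points at processes that are never reaped.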

Could this be related to #439 ? (Just an uneducated guess)

FTR: the receptor container had the wrong entrypoint, which prevented the container from being cleaned up.

I am running awx-operator:2.6.0 and facing the same issue while setting up executors on a VM.
Is there any workaround for it?
@ALL, please help.

Any update on this?

I was able to fix this.
My executor was running behind a firewall, and podman was not able to fetch the image from the quay.io registry.
Either launch your container from an image that is available in your environment, or make sure your executor can reach the quay.io repositories.
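A quick way to check the second option from the executor is a plain TCP connection test against the registry. This is only a sketch: the helper name `can_connect` is made up, port 443 is assumed because the registry is served over HTTPS, and a successful connection does not prove that `podman pull` will work (proxies, TLS interception, or authentication can still get in the way).

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    DNS failures, refused connections, and timeouts all raise OSError
    subclasses, so they are reported as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example, to be run on the execution node:
#   can_connect("quay.io")
```

If this returns False on the executor but True on a machine outside the firewall, the image pull is the likely culprit.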
This issue can be closed.

Any update on this?

Yes, this is a feature of AWX.
This issue can be closed.