`wait_compose` module doesn't exit when compose finishes
sallyom opened this issue · 2 comments
Builder roles fail by timing out while waiting for the compose to finish, even though the compose had already finished several minutes earlier. The builder roles are running in an ec2 rhel9.2 instance.
JSON output from the VM shows the compose as finished:
{
  "method": "GET",
  "path": "/compose/finished",
  "status": 200,
  "body": {
    "finished": [
      {
        "blueprint": "rhde",
        "compose_type": "edge-container",
        "id": "01d2e66b-96bc-4477-8978-4d27e16e417f",
        "image_size": 0,
        "job_created": 1692152909.3148224,
        "job_finished": 1692153570.499627,
        "job_started": 1692152909.3239973,
        "queue_status": "FINISHED",
        "version": "0.0.1"
      }
    ]
  }
},
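For reference, here is a minimal sketch of how a finished compose could be detected from a response shaped like the one above. `is_finished` is a hypothetical helper written for illustration, not the code used by the `wait_compose` module, and the sample response is trimmed to the fields it reads.

```python
import json

# Trimmed copy of the /compose/finished response shown above.
SAMPLE = json.loads("""
{
  "path": "/compose/finished",
  "status": 200,
  "body": {
    "finished": [
      {
        "id": "01d2e66b-96bc-4477-8978-4d27e16e417f",
        "compose_type": "edge-container",
        "queue_status": "FINISHED"
      }
    ]
  }
}
""")


def is_finished(response: dict, compose_id: str) -> bool:
    """Return True if compose_id is listed as FINISHED in the response body."""
    finished = response.get("body", {}).get("finished", [])
    return any(
        entry.get("id") == compose_id and entry.get("queue_status") == "FINISHED"
        for entry in finished
    )


if __name__ == "__main__":
    print(is_finished(SAMPLE, "01d2e66b-96bc-4477-8978-4d27e16e417f"))
```

By this check the compose above counts as finished, so the hang is presumably somewhere after the API reports the result.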
The run never progresses past the `wait_compose.py` / `Wait for compose to finish` task.
TASK [infra.osbuild.builder : Wait for compose to finish] **********************
task path: /runner/requirements_collections/ansible_collections/infra/osbuild/roles/builder/tasks/main.yml:121
--- no useful info ---
Hey @sallyom I spun up an ec2 instance and wasn't able to reproduce this issue. I successfully built an edge-container and edge-commit with no issues.
Are you still experiencing this issue?
@matoval the issue happens when I'm running the multi-stage `edge-installer` compose_type.
I'm running AAP in OpenShift, and I have a rhel9.2 builder VM in ec2 configured as the remote host.
The first stage, `edge-commit`, completes in the VM successfully. So I know the playbook/inventory/connection is a-ok - and several weldr API calls also succeed (the blueprint push, the start compose, etc). The playbook running from AAP never proceeds past this first `edge-commit` stage because the result reporting that the `edge-commit` compose is finished never gets through, so the wait_compose task fails due to timeout (it hangs - there is no other error).
Here's the weird thing. I can watch the weldr socket API calls in the rhel9 vm - I see that wait_compose checks every 20s (the default recheck frequency). The instant the compose finishes, wait_compose goes silent - it no longer checks in every 20s. So something has registered that the compose finished, but then there is silence - and the eventual timeout.
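To make the expected behaviour concrete, here is a rough sketch of a poll-until-finished loop with a timeout. This is not the collection's actual `wait_compose` code. The function name, the `get_status` callback, the 3600s timeout and the `FAILED` status are assumptions for illustration; only the 20s interval and the `FINISHED` status come from what I observed above.

```python
import time
from typing import Callable

# Statuses that end the wait. FINISHED appears in the API output above;
# FAILED is an assumed second terminal state.
TERMINAL_STATUSES = ("FINISHED", "FAILED")


def wait_for_compose(
    get_status: Callable[[], str],
    interval: float = 20.0,    # recheck frequency observed on the socket
    timeout: float = 3600.0,   # hypothetical overall limit
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status() until it reports a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            # A healthy loop returns here, right after the last poll.
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("compose did not finish before the timeout")
        sleep(interval)


if __name__ == "__main__":
    # Simulated status sequence; no real weldr socket is contacted.
    statuses = iter(["RUNNING", "RUNNING", "FINISHED"])
    print(wait_for_compose(lambda: next(statuses), interval=0.01, timeout=5))
```

In a loop like this the polling stops as soon as `FINISHED` is seen and the function returns immediately, which matches the silence on the socket. That suggests the result is being lost somewhere between the module returning and the controller receiving it, rather than in the polling itself.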
Here's the weirder thing. I can run the exact same playbook with the exact same vars to completion if I instead ssh into the rhel9.2 ec2 instance and configure a localhost inventory. When I run it directly on the host I see the multi-stage composes complete. First the `edge-commit` compose completes and the commit is served as expected, then an empty blueprint is created, then the `edge-installer` compose completes and I have the ISO image.