Investigate celery errors and task queue not working on first start
geerlingguy opened this issue · 7 comments
For example:
2019-11-08 22:34:58,030 ERROR celery.beat Removing corrupted schedule file '/var/lib/awx/beat.db': error(11, 'Resource temporarily unavailable')
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/celery/beat.py", line 485, in setup_schedule
self._store = self._open_schedule()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/celery/beat.py", line 475, in _open_schedule
return self.persistence.open(self.schedule_filename, writeback=True)
File "/usr/lib64/python3.6/shelve.py", line 243, in open
return DbfilenameShelf(filename, flag, protocol, writeback)
File "/usr/lib64/python3.6/shelve.py", line 227, in __init__
Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)
File "/usr/lib64/python3.6/dbm/__init__.py", line 94, in open
return mod.open(file, flag, mode)
_gdbm.error: [Errno 11] Resource temporarily unavailable
Not sure if this is a big deal or not, but wanted to post it here and track down any other celery-related issues. Note that I don't think I'm running any instance of celery independent of the RabbitMQ deployment or the AWX web deployment...
It looks like the official Kubernetes installer has a separate container running in the main web pod for celery: https://github.com/ansible/awx/blob/devel/installer/roles/kubernetes/templates/deployment.yml.j2#L233-L278
Follow-up to #1.
What I believe is happening is that the task pod comes up at the same time as the main tower web pod, and neither works yet because the database is not populated.
But the web pod handles this gracefully, since it just shows a status page "database is upgrading" until the database is populated.
The task pod goes into a loop where supervisor keeps respawning awx.main.dispatch:
django.db.utils.ProgrammingError: column main_instancegroup.credential_id does not exist
LINE 1: ..."modified", "main_instancegroup"."controller_id", "main_inst...
^
2019-11-20 16:40:46,326 INFO exited: dispatcher (exit status 1; not expected)
2019-11-20 16:40:47,329 INFO spawned: 'dispatcher' with pid 397
2019-11-20 16:40:48,333 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2019-11-20 16:40:50,780 WARNING awx.main.dispatch scaling up worker pid:405
2019-11-20 16:40:50,807 ERROR celery.beat Removing corrupted schedule file '/var/lib/awx/beat.db': error(11, 'Resource temporarily unavailable')
Eventually, once the database is populated, things seem to start working:
2019-11-20 16:40:51,288 DEBUG awx.main.dispatch listening on ['awx_private_queue [direct]', 'example-tower-tower-task-f8845c46b-xl287 [direct]', 'tower_broadcast_all_example-tower-tower-task-f8845c46b-xl287 [fanout]']
RESULT 2
OKREADY
But for some reason it takes another restart of that container before things actually work.
What I might consider doing is using an initContainer with the task container to run awx-manage migrate --noinput ... that way it would always be run before the task container comes up. Plus I could drop the manual "Migrate the database if the K8s resources were updated." task in the tower role, because the migration would be part of the Kubernetes resource itself. (I'm not quite sure why it's not done that way in the official OpenShift installer...?)
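For reference, roughly what I have in mind (a sketch only; the container name, image tag, and settings volume are placeholders, not necessarily what's in the actual task deployment template):

```yaml
# Hypothetical initContainer for the task deployment template (sketch; names/image are assumptions).
spec:
  template:
    spec:
      initContainers:
        - name: migrate-database
          image: 'ansible/awx_task:latest'     # assumption: same image the task container uses
          command: ['awx-manage', 'migrate', '--noinput']
          volumeMounts:
            - name: tower-settings             # assumption: same settings volume the task container mounts
              mountPath: /etc/tower
      containers:
        - name: task
          # ...existing task container definition...
```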
When trying to use an initContainer for the task, I got:
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/base/base.py", line 195, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: could not translate host name "example-tower-postgres.example-tower.svc.cluster.local" to address: Name or service not known
Trying another route: delete the task Pod any time a migration is run.
Deleting it causes a number of weird issues, and tons of annoying log entries in both the Ansible playbook output and the operator container. So trying another route: kubectl scale to scale down to 0 replicas, then back to 1 replica.
Yay, so all I had to do was:
- name: Scale the tower_task deployment to 0 replicas after migration.
k8s:
definition: "{{ lookup('template', 'tower_task.yaml.j2') | from_yaml_all | list }}"
vars:
tower_task_replicas: "0"
when: migrate_result and migrate_result.changed
- name: Scale the tower_task deployment back to 1 replica after migration.
k8s:
definition: "{{ lookup('template', 'tower_task.yaml.j2') | from_yaml_all | list }}"
vars:
tower_task_replicas: "1"
when: migrate_result and migrate_result.changed
Basically, scale it down and up very quickly, and the Pod is replaced. The replacement Pod is very happy to start accepting and running jobs.
I still see the celery error, but that's not a big deal as it seems it gracefully recovers from it...? I noticed this in the 3.3.0 release notes 'known issues': https://docs.ansible.com/ansible-tower/3.3.0/html/release-notes/known_issues.html#celery-beat-database-on-disk-becomes-corrupted
Worst case, if it does become an issue in the future, we'll have to figure out how to delete that file on container start, maybe using a postStart lifecycle handler? https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/
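If it does turn into a real problem, a postStart hook on the task container along these lines might do it (a sketch only; the file path comes from the error above, and note that postStart runs asynchronously with the container's entrypoint, so it isn't guaranteed to remove the file before celery beat opens it):

```yaml
# Hypothetical postStart hook on the task container; removes the stale celery beat schedule file.
lifecycle:
  postStart:
    exec:
      command: ['/bin/sh', '-c', 'rm -f /var/lib/awx/beat.db']
```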
Pushed version 0.1.1. First working version for AWX and Tower out-of-the-box!