geerlingguy/tower-operator

Investigate celery errors and task queue not working on first start

geerlingguy opened this issue · 7 comments

For example:

2019-11-08 22:34:58,030 ERROR    celery.beat Removing corrupted schedule file '/var/lib/awx/beat.db': error(11, 'Resource temporarily unavailable')
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/celery/beat.py", line 485, in setup_schedule
    self._store = self._open_schedule()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/celery/beat.py", line 475, in _open_schedule
    return self.persistence.open(self.schedule_filename, writeback=True)
  File "/usr/lib64/python3.6/shelve.py", line 243, in open
    return DbfilenameShelf(filename, flag, protocol, writeback)
  File "/usr/lib64/python3.6/shelve.py", line 227, in __init__
    Shelf.__init__(self, dbm.open(filename, flag), protocol, writeback)
  File "/usr/lib64/python3.6/dbm/__init__.py", line 94, in open
    return mod.open(file, flag, mode)
_gdbm.error: [Errno 11] Resource temporarily unavailable

Not sure if this is a big deal or not, but wanted to post it here and track down any other celery-related issues. Note that I don't think I'm running any instance of celery independent of the RabbitMQ deployment or the AWX web deployment...

It looks like the official Kubernetes installer has a separate container running in the main web pod for celery: https://github.com/ansible/awx/blob/devel/installer/roles/kubernetes/templates/deployment.yml.j2#L233-L278
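For reference, the pattern there is a single web Pod with a dedicated celery/task container running alongside the web container. A minimal sketch of that shape (container names, images, and the port below are illustrative placeholders, not copied from the linked template):

# Illustrative sketch only: one Deployment, two containers in the same Pod.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: awx-web-deploy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: awx-web-deploy
  template:
    metadata:
      labels:
        app: awx-web-deploy
    spec:
      containers:
        # Web/API container
        - name: awx-web
          image: ansible/awx_web:latest
          ports:
            - containerPort: 8052
        # Celery/task worker container sharing the same Pod
        - name: awx-celery
          image: ansible/awx_task:latest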

Follow-up to #1.

What I believe is happening is that the task pod comes up at the same time as the main tower web pod, and neither works yet because the database is not populated.

But the web pod handles this gracefully, since it just shows a "database is upgrading" status page until the database is populated.

The task pod goes into a loop where supervisor keeps respawning awx.main.dispatch:

django.db.utils.ProgrammingError: column main_instancegroup.credential_id does not exist
LINE 1: ..."modified", "main_instancegroup"."controller_id", "main_inst...
                                                             ^

2019-11-20 16:40:46,326 INFO exited: dispatcher (exit status 1; not expected)
2019-11-20 16:40:47,329 INFO spawned: 'dispatcher' with pid 397
2019-11-20 16:40:48,333 INFO success: dispatcher entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2019-11-20 16:40:50,780 WARNING  awx.main.dispatch scaling up worker pid:405
2019-11-20 16:40:50,807 ERROR    celery.beat Removing corrupted schedule file '/var/lib/awx/beat.db': error(11, 'Resource temporarily unavailable')

Eventually, once the database is populated, things seemingly start to work:

2019-11-20 16:40:51,288 DEBUG    awx.main.dispatch listening on ['awx_private_queue [direct]', 'example-tower-tower-task-f8845c46b-xl287 [direct]', 'tower_broadcast_all_example-tower-tower-task-f8845c46b-xl287 [fanout]']
RESULT 2
OKREADY

But for some reason it takes another restart of that container for things to go well.

What I might consider doing is adding an initContainer to the task Deployment that runs awx-manage migrate --noinput... that way the migration would always run before the task container comes up. Plus I could drop the manual "Migrate the database if the K8s resources were updated." task in the tower role, because the migration would be part of the Kubernetes resource itself. (I'm not quite sure why it's not done that way in the official OpenShift installer...?)
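A rough sketch of what that could look like in the task Deployment template (the variable names, settings volume, and mount path below are assumptions about what this operator's templates would need, not verified contents of tower_task.yaml.j2):

# Hypothetical initContainer for the task Deployment template.
spec:
  template:
    spec:
      initContainers:
        - name: migrate-database
          image: "{{ tower_task_image }}"   # assumed variable name
          command: ["awx-manage", "migrate", "--noinput"]
          volumeMounts:
            - name: "{{ meta.name }}-application-config"   # assumed volume name
              mountPath: /etc/tower
      containers:
        # ...existing task container definition stays as-is...
        - name: "{{ meta.name }}-task"
          image: "{{ tower_task_image }}"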

When trying to use an initContainer for the task, I got:

    raise dj_exc_value.with_traceback(traceback) from exc_value
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
    self.connect()
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/base/base.py", line 195, in connect
    self.connection = self.get_new_connection(conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
    connection = Database.connect(**conn_params)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: could not translate host name "example-tower-postgres.example-tower.svc.cluster.local" to address: Name or service not known

Trying another route—delete the task Pod any time a migration is run.
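For reference, deleting the Pod from the operator's playbook would look roughly like this (the label selector and variable names here are assumptions, not the operator's actual labels):

# Hypothetical sketch of the "delete the Pod after migration" approach.
- name: Find the running tower task Pod(s).
  k8s_info:
    api_version: v1
    kind: Pod
    namespace: "{{ meta.namespace }}"
    label_selectors:
      - "app={{ meta.name }}-tower-task"   # assumed label
  register: task_pods
  when: migrate_result and migrate_result.changed

- name: Delete each task Pod so the Deployment recreates it.
  k8s:
    state: absent
    api_version: v1
    kind: Pod
    namespace: "{{ meta.namespace }}"
    name: "{{ item.metadata.name }}"
  loop: "{{ task_pods.resources | default([]) }}"
  when: migrate_result and migrate_result.changed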

Deleting it causes a number of weird issues, and tons of annoying log entries in both the Ansible playbook output and the operator container. So trying another route: the equivalent of kubectl scale, scaling the deployment down to 0 replicas and then back up to 1.

Yay, so all I had to do was:

- name: Scale the tower_task deployment to 0 replicas after migration.
  k8s:
    definition: "{{ lookup('template', 'tower_task.yaml.j2') | from_yaml_all | list }}"
  vars:
    tower_task_replicas: "0"
  when: migrate_result and migrate_result.changed

- name: Scale the tower_task deployment back to 1 replica after migration.
  k8s:
    definition: "{{ lookup('template', 'tower_task.yaml.j2') | from_yaml_all | list }}"
  vars:
    tower_task_replicas: "1"
  when: migrate_result and migrate_result.changed

Basically, scale it down and up very quickly, and the Pod is replaced. The replacement Pod is very happy to start accepting and running jobs.

I still see the celery error, but that's not a big deal, as it seems to recover from it gracefully...? I noticed this is listed under 'known issues' in the Tower 3.3.0 release notes: https://docs.ansible.com/ansible-tower/3.3.0/html/release-notes/known_issues.html#celery-beat-database-on-disk-becomes-corrupted

Worst case, if it does become an issue in the future, we'll have to figure out how to delete that file on container start, maybe using a postStart lifecycle handler? https://kubernetes.io/docs/tasks/configure-pod-container/attach-handler-lifecycle-event/
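If it comes to that, the handler itself would be tiny; a minimal sketch on the task container (the file path is taken from the error above):

# Minimal sketch: remove the stale schedule file when the container starts.
# Note that Kubernetes does not guarantee postStart runs before the
# container's entrypoint, so this could still race with celery beat.
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "rm -f /var/lib/awx/beat.db"]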

Pushed version 0.1.1. First working version for AWX and Tower out-of-the-box!