svsticky/sadserver

Await backup success on migration

SilasPeters opened this issue · 2 comments

Koala's main task makes a backup of the database before migrating it. However, the Ansible script currently only ensures that the backup task has started, and then immediately starts migrating. We should both await the backup's completion and ensure that it was successful.

To achieve this, the systemd service must report failure through its exit code. It probably already does, but I wanted to note it here.

Problematic code: [roles/koala/tasks/main.yml]

      - name: "make pre-upgrade database backup to S3"
        become_user: "root"
        become: true
        ansible.builtin.systemd:
          name: "backup-postgres.service"
          state: "started"

      - name: "check whether the database exists"
        ansible.builtin.shell: nix-shell --run 'dotenv rails db:version'
        args:
          chdir: "/var/www/koala.{{ canonical_hostname }}"
          executable: "/bin/bash"
        register: "koala_database_version"

      - name: "parse database version message"
        ansible.builtin.set_fact:
          database_exists: "{{ koala_database_version.stdout_lines[-1] != 'Current version: 0' }}"

      - name: "run database setup if database does not exist"
        ansible.builtin.shell: nix-shell --run 'dotenv rails db:setup'
        args:
          chdir: "/var/www/koala.{{ canonical_hostname }}"
          executable: "/bin/bash"
        when: "not database_exists"

backup-postgres.service is a oneshot service. According to the systemd man pages:

Note that if this option is used without RemainAfterExit= the service will never enter "active" unit state, but directly
transition from "activating" to "deactivating" or "dead" since no process is configured that shall run continuously. In particular this means that
after a service of this type ran (and which has RemainAfterExit= not set) it will not show up as started afterwards, but as dead.

I assume this means Ansible will only consider the service started once the backup script has finished successfully (exit code 0): `systemctl start` blocks until a oneshot unit's process exits, and reports failure if it exits non-zero. The backup script does use exit code 1 for specific error cases, and uses Bash 'strict mode', so I would reckon this all works.
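If an explicit check is still wanted, something like the following could go directly after the backup task. This is only a sketch, not tested against this repository: the task names and the registered variable `backup_result` are made up for illustration. It assumes a systemd recent enough to support `systemctl show --value`, and it relies on the unit's `Result` property, which reports the outcome of the most recent run.

```yaml
# Hypothetical follow-up tasks; names and the variable are illustrative only.
- name: "read the result of the last backup run"
  become: true
  become_user: "root"
  ansible.builtin.command:
    cmd: "systemctl show --property=Result --value backup-postgres.service"
  register: "backup_result"
  # Reading a property never changes anything on the host.
  changed_when: false

- name: "abort the migration unless the backup succeeded"
  ansible.builtin.assert:
    that:
      - "backup_result.stdout == 'success'"
    fail_msg: "backup-postgres.service did not finish successfully"
```

Since the start task should already fail when the backup script exits non-zero, this would mostly serve as a guard against someone later changing the unit type or adding `no_block`.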

I tried out several things, and I believe you are right! If the main process exits with a non-zero exit code, the service fails. And like you said, there is no 'started' state for oneshots, and you can tell Ansible awaits the service because there is a delay after running the task. Thanks for your help!