Room for improvement on handling failures in migrations when deploying Pycon
dpoirier opened this issue · 2 comments
I noticed this yesterday: running migrations during a deploy (highstate) is conditional on the code having changed, which makes a lot of sense given that the highstate runs many, many times a day. However, if the migration fails, it fails that highstate run, but subsequent runs don't try the migration again (the code was already updated in the failed run, so it doesn't change in later runs). As a result, subsequent runs appear to succeed even though things are no longer in the proper state.
I'm not sure what the best fix is though. We could hack up the way we run migrations so we run them when the code has changed or the previous migration failed (keeping track of that somehow), but that's pretty kludgey. And anyway, unless someone has fixed something manually, migrations aren't suddenly going to start working without a code change.
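For illustration, the "code changed or the previous migration failed" idea could look something like the sketch below. It is not what our Salt states do; the function name, the marker file, and the injectable `run` callable are all made up for the example, and the default command just reuses the path from our deploy and assumes it runs from the project root.

```python
import subprocess
from pathlib import Path


def _default_migrate():
    # Assumption: invoked from the project root, using the virtualenv path
    # that appears in our deploy command.
    return subprocess.call(
        ["/srv/pycon/env/bin/python", "manage.py", "migrate", "--noinput"]
    )


def run_migrations_if_needed(code_changed, marker, run=None):
    """Run migrations if the code changed or an earlier attempt did not finish.

    `marker` is a hypothetical file that records an unfinished migration.
    Returns "skipped", "failed", or "ok".
    """
    marker = Path(marker)
    if not (code_changed or marker.exists()):
        return "skipped"

    # Create the marker *before* running, so even a hard crash of the
    # migrate process leaves it behind for the next run.
    marker.touch()

    returncode = (run or _default_migrate)()
    if returncode != 0:
        return "failed"

    # Only a clean exit clears the marker.
    marker.unlink()
    return "ok"


# Hypothetical usage from a deploy script:
#   run_migrations_if_needed(code_changed=True, marker="/srv/pycon/.migrate_pending")
```

This is exactly the kind of extra bookkeeping I called kludgey above, and it only retries; it does nothing to make a broken migration succeed.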
Or we could just bite the bullet and remove the condition, so Django checks whether any migrations need to run on each highstate. Maybe we should also consider whether deploys should run in a frequent periodic highstate... but at least this way, if something was wrong each highstate would fail until it was fixed.
What we'd want ideally would be for all the changes in a deploy to happen in a transaction (somehow), so if anything fails, no changes take effect. The previous system with Chef was set up that way with regard to the source code, but not the database or the virtualenv, so things could still get out of sync when there was a failure. And I don't think anyone has a great solution for that.
We deployed a fix on Friday afternoon to our staging server, but we were still getting errors showing tracebacks from the previous code this morning. I verified that the code checked out on the server was the updated code, so apparently the server processes (or at least one of them) were still running the old code.
Here's a bit of the minion log from Friday:
2015-09-18 21:32:09,753 [salt.loaded.int.module.cmdmod][ERROR ] Command '/srv/pycon/env/bin/python manage.py migrate --noinput && /srv/pycon/env/bin/python manage.py compress --force && /srv/pycon/env/bin/python manage.py collectstatic -v0 --noinput' failed with return code: 139
2015-09-18 21:32:09,757 [salt.loaded.int.module.cmdmod][ERROR ] stderr: /bin/bash: line 1: 9661 Segmentation fault /srv/pycon/env/bin/python manage.py migrate --noinput
2015-09-18 21:32:09,757 [salt.loaded.int.module.cmdmod][ERROR ] retcode: 139
2015-09-18 21:32:09,760 [salt.state ][ERROR ] {'pid': 9660, 'retcode': 139, 'stderr': '/bin/bash: line 1: 9661 Segmentation fault /srv/pycon/env/bin/python manage.py migrate --noinput', 'stdout': ''}
I think here's what happened:
- We updated the staging branch
- The next deploy checked out the updated code
- A later state segfaulted while trying to run migrations (I don't know why it segfaulted), so Salt never got to the server restart step
- The next time the deploy ran, the code didn't need to be updated, so none of the remaining steps, including the server restart, ever ran
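One way to picture the sequence above, and a possible way around it, is to key the follow-up steps on "the last revision for which every step succeeded" rather than on "did the checkout change during this run". The sketch below is only an assumption about how that could be tracked; `deploy`, the state file, and the step callables are hypothetical names, not part of our Salt setup.

```python
from pathlib import Path


def deploy(current_rev, state_file, steps):
    """Run all deploy steps unless `current_rev` was already fully deployed.

    `state_file` (hypothetical) stores the last revision for which every
    step succeeded. `steps` is a list of (name, callable) pairs; each
    callable returns True on success. Returns the names of the steps run.
    """
    state_file = Path(state_file)
    last_ok = state_file.read_text().strip() if state_file.exists() else None
    if last_ok == current_rev:
        return []  # nothing to do: this revision is fully deployed

    ran = []
    for name, step in steps:
        ran.append(name)
        if not step():
            # Leave the state file untouched so the next run retries
            # every step, including the server restart.
            raise RuntimeError(f"deploy step {name!r} failed")

    # Record the revision only after migrate, restart, etc. all succeeded.
    state_file.write_text(current_rev)
    return ran
```

With this, a run where the migrate step fails leaves no record of success, so the next run repeats the migrate and restart steps even though the checked-out code did not change. It still is not a transaction: a failure partway through leaves the new code on disk with the old processes running.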
I still don't have any wonderful ideas for fixing this kind of problem. See my previous comments in this issue.
We now deploy the pycon site via Heroku, so this is no longer relevant.