openstreetmap/operations

Make swapping the master DB easier


The question of automatic failover has come up before (#11), but this issue is about the manual version of that.

At the moment, flipping the master between sites is so onerous that we put off doing it whenever possible (#119).

What simple things can we do to make this process easier and less scary (although still manual)? Ideally, we'd be able to do it without downtime. Is there a way to do that (e.g. have pgbouncer flip connections on the fly)?
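A rough sketch of what "have pgbouncer flip connections" could look like, under stated assumptions. `PAUSE`, `RELOAD` and `RESUME` are real PgBouncer admin-console commands. Everything else here is hypothetical: the database entry name `openstreetmap`, the assumption that all clients go through one PgBouncer entry, and the idea of editing `host=` in `pgbouncer.ini` while paused. This would only hide the switch from clients as a short stall. It does nothing about promoting the replica, which is still the risky part.

```python
from typing import List


def pgbouncer_flip_plan(db_name: str, old_host: str, new_host: str) -> List[str]:
    """Return PgBouncer admin-console commands, plus one manual step, that
    would repoint clients from old_host to new_host. Dry run only: nothing
    here connects to PgBouncer. db_name is a hypothetical entry name."""
    return [
        # Waits until each server connection for db_name is released (per the
        # pool mode) and closed; clients wait rather than being disconnected,
        # subject to PgBouncer's timeouts.
        f"PAUSE {db_name};",
        # Out-of-band step (hypothetical): promote new_host, then edit the
        # [databases] entry in pgbouncer.ini to point at it.
        f"MANUAL: promote {new_host}; set host={new_host} (was {old_host}) in pgbouncer.ini",
        # Re-read pgbouncer.ini so the changed host takes effect.
        "RELOAD;",
        # Let waiting clients through; new server connections go to new_host.
        f"RESUME {db_name};",
    ]


if __name__ == "__main__":
    for cmd in pgbouncer_flip_plan("openstreetmap", "ramoth", "katla"):
        print(cmd)
```

The key constraint is ordering: the config change and `RELOAD` must happen between `PAUSE` and `RESUME`, so that no client is connected to either server while the master moves.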

It's such a high-risk operation that I'm not sure I'd really want to try to automate it. If anything goes wrong, you can end up having to reload from backup.

Let's try and figure out if there's a way of making it less risky.

To be clear, I'm not suggesting that we'd want to do this on a regular basis. But I'm worried that it's currently such an onerous and risky task that we'd delay moving to karm for two months to avoid flipping twice. That makes me think it would be too difficult and risky to attempt if the master database's site were having problems.

Well, it's more like one month really; it's not as if we were about to move to karm tomorrow.

For the record, this is the process I followed the last time we switched the master from ramoth to katla:

  • Stop chef on ramoth
  • Stop postgres on ramoth
  • Stop chef on katla
  • Stop postgres on katla
  • Switch ramoth and katla roles in chef
  • Start chef on katla and wait for run to complete
    • Check postgres is up and working on katla
  • Start chef on ramoth and wait for run to complete
    • Check postgres is up on ramoth and connected to master
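The steps above can be written down as a dry-run checklist so the ordering is fixed and reviewable before anyone touches a server. This is a sketch, not a tested runbook. The systemd unit names (`chef-client`, `postgresql`), running `chef-client` once in the foreground to "start chef", and the `MANUAL:` role-swap step are assumptions. `pg_is_in_recovery()` and `pg_stat_replication` are real PostgreSQL functions/views, used here to make the "check postgres" steps concrete.

```python
from typing import List, Tuple

# (host, command or manual action) pairs, meant to be run strictly in order.
Step = Tuple[str, str]


def switch_master_plan(old_master: str, new_master: str) -> List[Step]:
    """Ordered steps for a manual master switch. Dry run only; the unit
    names and the role-swap wording are hypothetical."""
    return [
        # 1. Quiesce both sites so chef cannot restart postgres mid-switch.
        (old_master, "systemctl stop chef-client"),
        (old_master, "systemctl stop postgresql"),
        (new_master, "systemctl stop chef-client"),
        (new_master, "systemctl stop postgresql"),
        # 2. Swap the master/replica roles in chef (manual step).
        ("chef", f"MANUAL: swap roles of {old_master} and {new_master}"),
        # 3. Bring up the new master first; it must not be in recovery.
        (new_master, "chef-client"),
        (new_master, "sudo -u postgres psql -Atc 'SELECT pg_is_in_recovery()'  # expect f"),
        # 4. Then the old master, which should come back as a replica.
        (old_master, "chef-client"),
        (old_master, "sudo -u postgres psql -Atc 'SELECT pg_is_in_recovery()'  # expect t"),
        # 5. Confirm from the new master that the replica is streaming.
        (new_master, "sudo -u postgres psql -Atc 'SELECT client_addr, state FROM pg_stat_replication'  # expect streaming"),
    ]


def dry_run(plan: List[Step]) -> None:
    """Print the numbered plan without executing anything."""
    for n, (host, action) in enumerate(plan, start=1):
        print(f"{n:2d}. [{host}] {action}")


if __name__ == "__main__":
    dry_run(switch_master_plan("ramoth", "katla"))
```

Keeping the plan as data makes it easy to review the ordering (both databases stopped before the role swap, new master up before the old one) before running anything.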

So it's not actually that hard, just very nerve-wracking...