davidbanham/kewpie_go

Postgres Locks not being released.

Closed this issue · 5 comments

Commit 0531daa introduced a change whereby, if the context does not already carry a transaction, one is begun and then committed in a deferred function call. Since this change we observe that Postgres is unable to obtain all the table locks it needs to run migrations (lwait.granted stays false) because of the transaction the change leaves open. It appears to be a timing issue: if the migrations run before the subscriber starts up, there is no problem. The following SQL can be used to observe the locks when the Kewpie process starts up; it shows locks being acquired and never released.

select sa.query, sa.query_start, sa.pid, sa.state, lwait.pid, lwait.mode, lwait.granted
from pg_stat_activity sa
inner join pg_locks lwait on (sa.pid = lwait.pid)
where sa.query like '%kewpie%'
and sa.query not like '%pg_locks%';
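
For context, the begin-and-defer-commit pattern that commit introduced looks roughly like the sketch below. This is illustrative only, not kewpie's actual code; the txFromContext helper, the context key, and the package name are assumptions.

package sketch

import (
	"context"
	"database/sql"
)

type txKey struct{}

// txFromContext reports whether the context already carries a transaction.
// This is a hypothetical helper for illustration, not kewpie's actual API.
func txFromContext(ctx context.Context) (*sql.Tx, bool) {
	tx, ok := ctx.Value(txKey{}).(*sql.Tx)
	return tx, ok
}

// withTx sketches the pattern from commit 0531daa: if the caller's context
// has no transaction, begin one and commit it in a deferred call. While
// that transaction stays open, the locks it holds can block other sessions,
// e.g. migrations trying to alter the kewpie tables.
func withTx(ctx context.Context, db *sql.DB, fn func(*sql.Tx) error) (err error) {
	tx, ok := txFromContext(ctx)
	if !ok {
		if tx, err = db.BeginTx(ctx, nil); err != nil {
			return err
		}
		defer func() {
			if err != nil {
				tx.Rollback()
				return
			}
			err = tx.Commit()
		}()
	}
	return fn(tx)
}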

Okay, so to restate my understanding: we can't run migrations while Kewpie subscribers are active, because the migrations attempt to take locks they cannot get while the subscribers hold their long-running locks.

Long-running locks are a pretty integral part of the design of the Postgres backend. The idea is that the subscriber opens a transaction in which it deletes the job from the queue. That creates a row-level lock on that row. Future subscribers skip the locked row and move on to the next one.
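
In SQL terms the pattern looks roughly like the sketch below. This is a sketch of the technique, not kewpie's actual code: the kewpie_jobs table name, the id and body columns, and the handle function are assumptions, and SELECT ... FOR UPDATE SKIP LOCKED is the standard Postgres idiom for this kind of queue, though kewpie's real queries may differ.

package sketch

import (
	"context"
	"database/sql"
	"fmt"
)

// consumeOne claims and processes a single job using the
// long-running-transaction pattern described above.
func consumeOne(ctx context.Context, db *sql.DB) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	// If the subscriber vanishes mid-job, the transaction is rolled back,
	// the row-level lock is released, and the job becomes visible again.
	// (Rollback after a successful Commit is a harmless no-op.)
	defer tx.Rollback()

	// Claim the next available job. FOR UPDATE takes a row-level lock;
	// SKIP LOCKED makes other subscribers pass over rows already claimed.
	var id int64
	var body string
	err = tx.QueryRowContext(ctx, `
		DELETE FROM kewpie_jobs
		WHERE id = (
			SELECT id FROM kewpie_jobs
			ORDER BY id
			LIMIT 1
			FOR UPDATE SKIP LOCKED
		)
		RETURNING id, body`).Scan(&id, &body)
	if err != nil {
		return err
	}

	if err := handle(body); err != nil {
		// Returning here rolls back, which un-deletes the row so
		// another subscriber can retry the job.
		return err
	}
	// Committing finalises the DELETE; the job is gone for good.
	return tx.Commit()
}

// handle is a stand-in for the subscriber's job handler.
func handle(body string) error { fmt.Println("handling", body); return nil }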

In the case where a subscriber vanishes partway through a job, the transaction will be rolled back, the row-level lock will be released, and the job will be available to be consumed by another subscriber (i.e. retried).

To avoid holding these locks, I think we'd need to come up with another strategy that achieves the same outcomes in terms of retry safety.

Other possible solutions to the migration issue might be:

a) Gracefully shut down the subscribers before running migrations
b) Don't attempt to take locks on the kewpie tables unless there's actually a migration specifically for that table (see the sketch after this list). Given that the tables are automatically generated and managed by the kewpie postgres backend, I'd wager we'll never actually write a migration for those tables.
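
Option b could look something like the following. This is purely a sketch of the idea: the migration runner, its table-locking step, and the "kewpie_" table-name prefix are all assumptions about your setup, not kewpie APIs.

package sketch

import "strings"

// kewpieManaged reports whether a table is one of the automatically
// generated kewpie tables. The "kewpie_" prefix is an assumption for
// illustration; match whatever naming the backend actually uses.
func kewpieManaged(table string) bool {
	return strings.HasPrefix(table, "kewpie_")
}

// tablesToLock filters the tables a hypothetical migration runner would
// lock, skipping kewpie-managed tables unless a pending migration
// actually targets one of them.
func tablesToLock(allTables []string, migrationTargets map[string]bool) []string {
	var out []string
	for _, t := range allTables {
		if kewpieManaged(t) && !migrationTargets[t] {
			continue // no migration touches this kewpie table; don't lock it
		}
		out = append(out, t)
	}
	return out
}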

What do you think?

Yeah, I got how the subscribers work, and that all makes sense. But if we create a startup sequence where first the migrations are run and then the subscribers start, will that not fix this issue? Or is this yet another timing issue when multiple pods start up and attempt the initialisation process?

I think we're both saying the same thing here:

But if we create a startup sequence where first the migrations are run and then the subscribers start, will that not fix this issue?

a) Gracefully shut down the subscribers before running migrations

Oh right, then from what I understand option a and what I'm suggesting are the same thing? If so, that could work.

Yep. Shut down the subscribers from the previously deployed version, run the migration, start the subscribers from the new version.

I suspect that all the synchronisation plumbing could end up being pretty complex, as opposed to option b, which is just "ignore these tables when migrating", but it's your call which way you solve it.

In either case, I don't think we have any work to do in this repo. If you want to go with option a, we probably need to add a SIGTERM handler to Sonic that performs a graceful shutdown via the kewpie.Disconnect method, something like the sketch below.
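
For option a, the handler could look something like this. It's a sketch only: kewpie.Disconnect is the real method mentioned above, but its signature, the concrete queue type, and how Sonic wires its lifecycle are assumptions, so the queue is hedged behind an interface here.

package sketch

import (
	"log"
	"os"
	"os/signal"
	"syscall"
)

// disconnecter is whatever exposes kewpie's Disconnect; the concrete type
// and the error return are assumptions for this sketch.
type disconnecter interface {
	Disconnect() error
}

// handleSIGTERM blocks until the process is asked to stop, then shuts the
// queue down gracefully. SIGTERM is what Kubernetes sends a pod before
// killing it, so this runs before the next deploy's migrations.
func handleSIGTERM(queue disconnecter) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
	<-sigs

	// Disconnecting lets in-flight jobs finish, ending their transactions
	// and releasing the row-level locks before migrations try to lock
	// the tables.
	if err := queue.Disconnect(); err != nil {
		log.Printf("kewpie disconnect: %v", err)
	}
}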