graphile/migrate

Hangs when trying migration on production

markhalonen opened this issue · 2 comments

Hey Benjie, just looking for places to look on something that we haven't been able to reproduce.

Twice now on production with <10k requests per minute, we run an innocent migration such as

alter table steelhead.work_order add column if not exists archived_by_user_id integer references steelhead.user(id);
create index if not exists work_order_archived_by_user_id_idx on steelhead.work_order(archived_by_user_id);

alter table steelhead.work_order drop constraint if exists work_order_archived_by_user_id_domain_id_fkey;
alter table steelhead.received_order drop constraint if exists received_order_archived_by_user_id_domain_id_fkey;

alter table steelhead.user drop constraint if exists user_id_domain_id_unique;

alter table steelhead.user add constraint user_id_domain_id_unique unique (id, domain_id);

alter table steelhead.work_order add constraint work_order_archived_by_user_id_domain_id_fkey foreign key (archived_by_user_id, domain_id) references steelhead.user(id, domain_id);

alter table steelhead.received_order add column if not exists archived_by_user_id integer references steelhead.user(id);
create index if not exists received_order_archived_by_user_id_idx on steelhead.received_order(archived_by_user_id);

alter table steelhead.received_order add constraint received_order_archived_by_user_id_domain_id_fkey foreign key (archived_by_user_id, domain_id) references steelhead.user(id, domain_id);

and the migration hangs, and it stops all Postgres operations, and we have to restart our db and customers are impacted, etc.

Any ideas on what could be causing something like this? Long-running transactions or something?

Thanks for any ideas!

benjie commented

Most likely you have a long running transaction that this transaction is waiting on, yes.

a standard index build locks out writes (but not reads) on the table until it's done.
-- https://www.postgresql.org/docs/15/sql-createindex.html

Although most forms of ADD table_constraint require an ACCESS EXCLUSIVE lock, ADD FOREIGN KEY requires only a SHARE ROW EXCLUSIVE lock. Note that ADD FOREIGN KEY also acquires a SHARE ROW EXCLUSIVE lock on the referenced table, in addition to the lock on the table on which the constraint is declared.
-- https://www.postgresql.org/docs/15/sql-altertable.html

Here are some key lines from your migration:

-- create index: expensive
create index if not exists work_order_archived_by_user_id_idx
  on steelhead.work_order(archived_by_user_id);

-- create unique index: expensive
alter table steelhead.user add constraint user_id_domain_id_unique
  unique (id, domain_id);

-- Add foreign key - needs to check every single row in the table matches an
-- equivalent row on the other side - expensive
alter table steelhead.work_order add constraint work_order_archived_by_user_id_domain_id_fkey
  foreign key (archived_by_user_id, domain_id) references steelhead.user(id, domain_id);

-- Another index - expensive
create index if not exists received_order_archived_by_user_id_idx
  on steelhead.received_order(archived_by_user_id);

-- Another FK, expensive
alter table steelhead.received_order add constraint received_order_archived_by_user_id_domain_id_fkey
  foreign key (archived_by_user_id, domain_id) references steelhead.user(id, domain_id);

Creating indexes takes time, and foreign keys take significant time to check too. You might want to create the indexes CONCURRENTLY in one migration, wait for the indexes to complete, and then run the next migration. Alternatively, you might find that putting an explicit LOCK statement at the top of your migration would solve the "hang":

LOCK work_order, user, received_order IN SHARE ROW EXCLUSIVE MODE;

Ok great, thank you for the insight! The real answer here is probably a way to replay production traffic in a test environment. Will close. 👍