reorg/pg_repack

Data corruption in concurrent delete

dvarrazzo opened this issue · 1 comments

Analysis and bugfix by Josh. Opening the issue for reference: Josh email is not available in ML archive. The bug has already been fixed and ready to be released in pg_repack 1.2.0.

OK, I found some time to revisit this bug. Here's a test script with
which I managed to reproduce the problem fairly consistently (see
comments for initial table setup, etc.) I was testing with pg_repack
master, with just a few sleep calls like so sprinkled in to encourage
races (extra_sleeps.diff debugging patch attached). To reproduce, just
start up that repack_bug.py script, let it run for a few seconds to
start populating rows, then kick off an extra_sleeps-patched pg_repack
of table "parent", and afterwards you should get some broken FKs.

AFAICT, the problem only occurs when the test script is using more
than one child process, i.e. concurrent transactions are queueing rows
into the repack log table. So where's the bug? It seems the sql_pop
query used in repack_apply() is to blame. It tries to bulk-delete the
rows it has already processed from the log table via a query like:

DELETE FROM repack.log_22520137 WHERE id <= $1

assuming that any "id" value less than the first "id" it started
processing in its current batch is safe to delete. But this assumption
may not hold if we have say:

T1: BEGIN
T1: DELETE FROM table_being_repacked WHERE ...
T1: ... hangs out for a while ...

T2: BEGIN
T2: INSERT INTO table_being_repacked ...
T2: COMMIT

... later ...

T1: COMMIT
/* repack_apply's sql_pop may run here */

In this example repack_apply() may delete T1's entry in the repack log
table without having actually carried out T1's DELETE on the repack
temp table. Attached is a patch to turn that bulk-delete into a delete
for every row we process out of the log table
(delete_individual_rows.diff). That'll be slower of course, but it
could be optimized into a bulk delete by keeping track of all the IDs
we have processed, and performing a single DELETE FROM repack.log_xyz
WHERE id IN (...).

At least, I hope this is the bug which I got bitten by back in
February. Other thoughts are welcome of course.

Fixed in 5061046