pypgstac load items partially without exceptions
Opened this issue · 0 comments
The pypgstac load items
command does not handle item loading atomically. There are scenarios where some items fail to load without pypgstac
raising any exceptions. This is a serious issue because it creates the illusion that everything has been loaded correctly from the client's perspective, while in reality, some items are missing. Below is a reproducible test case that demonstrates this behavior:
- Create 2 collections with a month partition
- Prepare two files with items:
items1.njson
anditems2.njson
. Each file contains 10 items with the same datetime, belonging tocollection1
(items1.njson
) and 10 tocollection2
(items2.njson
). - Ingest them concurrently:
$ pypgstac load items items1.njson --dsn postgresql://username:password@pgstac:5432/postgis & \
pypgstac load items items2.njson --dsn postgresql://username:password@pgstac:5432/postgis
- Check the count. There should be 20 items, but only 10 were ingested:
postgis=# SELECT COUNT(1) FROM items;
count
-------
10
(1 row)
Logs indicate a deadlock during the update_partition_stats function execution:
pgstac | 2024-07-01 20:44:12.047 UTC [141] ERROR: deadlock detected
pgstac | 2024-07-01 20:44:12.047 UTC [141] DETAIL: Process 141 waits for AccessExclusiveLock on relation 18386 of database 16384; blocked by process 140.
pgstac | Process 140 waits for AccessExclusiveLock on relation 18972 of database 16384; blocked by process 141.
pgstac | Process 141: SELECT update_partition_stats_q($1);
pgstac | Process 140: SELECT update_partition_stats_q($1);
pgstac | 2024-07-01 20:44:12.047 UTC [141] HINT: See server log for query details.
pgstac | 2024-07-01 20:44:12.047 UTC [141] CONTEXT: SQL statement "REFRESH MATERIALIZED VIEW partitions"
pgstac | PL/pgSQL function update_partition_stats(text,boolean) line 35 at SQL statement
pgstac | SQL statement "SELECT update_partition_stats('_items_3_202003', 'f');"
pgstac | PL/pgSQL function run_or_queue(text) line 11 at EXECUTE
pgstac | SQL statement "SELECT run_or_queue(
pgstac | format('SELECT update_partition_stats(%L, %L);', _partition, istrigger)
pgstac | )"
pgstac | PL/pgSQL function update_partition_stats_q(text,boolean) line 4 at PERFORM
pgstac | 2024-07-01 20:44:12.047 UTC [141] STATEMENT: SELECT update_partition_stats_q($1);
I understand that concurrent data ingestion into the same partition is not recommended. However, it would be helpful to receive a clear exception when something goes wrong. Additionally, if I'm correct, in this example we are ingesting data into different partitions.