stac-utils/pgstac

pypgstac load items partially without exceptions

Opened this issue · 0 comments

The pypgstac load items command does not handle item loading atomically. There are scenarios where some items fail to load without pypgstac raising any exceptions. This is a serious issue because it creates the illusion that everything has been loaded correctly from the client's perspective, while in reality, some items are missing. Below is a reproducible test case that demonstrates this behavior:

  1. Create 2 collections with a month partition
  2. Prepare two files with items: items1.njson and items2.njson. Each file contains 10 items with the same datetime, belonging to collection1 (items1.njson) and 10 to collection2 (items2.njson).
  3. Ingest them concurrently:
$ pypgstac load items items1.njson --dsn postgresql://username:password@pgstac:5432/postgis & \
  pypgstac load items items2.njson --dsn postgresql://username:password@pgstac:5432/postgis
  1. Check the count. There should be 20 items, but only 10 were ingested:
postgis=# SELECT COUNT(1) FROM items;
 count 
-------
    10
(1 row)

Logs indicate a deadlock during the update_partition_stats function execution:

pgstac      | 2024-07-01 20:44:12.047 UTC [141] ERROR:  deadlock detected
pgstac      | 2024-07-01 20:44:12.047 UTC [141] DETAIL:  Process 141 waits for AccessExclusiveLock on relation 18386 of database 16384; blocked by process 140.
pgstac      | 	Process 140 waits for AccessExclusiveLock on relation 18972 of database 16384; blocked by process 141.
pgstac      | 	Process 141: SELECT update_partition_stats_q($1);
pgstac      | 	Process 140: SELECT update_partition_stats_q($1);
pgstac      | 2024-07-01 20:44:12.047 UTC [141] HINT:  See server log for query details.
pgstac      | 2024-07-01 20:44:12.047 UTC [141] CONTEXT:  SQL statement "REFRESH MATERIALIZED VIEW partitions"
pgstac      | 	PL/pgSQL function update_partition_stats(text,boolean) line 35 at SQL statement
pgstac      | 	SQL statement "SELECT update_partition_stats('_items_3_202003', 'f');"
pgstac      | 	PL/pgSQL function run_or_queue(text) line 11 at EXECUTE
pgstac      | 	SQL statement "SELECT run_or_queue(
pgstac      | 	        format('SELECT update_partition_stats(%L, %L);', _partition, istrigger)
pgstac      | 	    )"
pgstac      | 	PL/pgSQL function update_partition_stats_q(text,boolean) line 4 at PERFORM
pgstac      | 2024-07-01 20:44:12.047 UTC [141] STATEMENT:  SELECT update_partition_stats_q($1);

I understand that concurrent data ingestion into the same partition is not recommended. However, it would be helpful to receive a clear exception when something goes wrong. Additionally, if I'm correct, in this example we are ingesting data into different partitions.