paradedb/pg_analytics

Inconsistency in the FDW execution order

Opened this issue · 2 comments

data.csv

What happens?

using FDW on join tables. As you can see, the result is different with the same query.

pg_analytics=# set enable_nestloop = on;
SET
pg_analytics=# select count(*) from t1, t2 left join t3 on true  where t1.value = t2.value ;
 count
-------
     6
(1 row)

pg_analytics=# explain select count(*) from t1, t2 left join t3 on true  where t1.value = t2.value ;
                               QUERY PLAN
------------------------------------------------------------------------
 Aggregate  (cost=69.39..69.40 rows=1 width=8)
   ->  Nested Loop Left Join  (cost=0.00..63.01 rows=2550 width=0)
         ->  Nested Loop  (cost=0.00..2.01 rows=1 width=0)
               Join Filter: (t1.value = t2.value)
               ->  Foreign Scan on t1  (cost=0.00..1.00 rows=1 width=0)
                     DuckDB Scan: SELECT value FROM public.t1
               ->  Foreign Scan on t2  (cost=0.00..1.00 rows=1 width=0)
                     DuckDB Scan: SELECT value FROM public.t2
         ->  Seq Scan on t3  (cost=0.00..35.50 rows=2550 width=0)
(9 rows)
pg_analytics=# set enable_nestloop = off;
SET
pg_analytics=# explain select count(*) from t1, t2 left join t3 on true  where t1.value = t2.value ;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Aggregate  (cost=10000000069.40..10000000069.41 rows=1 width=8)
   ->  Nested Loop Left Join  (cost=10000000001.01..10000000063.03 rows=2550 width=0)
         ->  Hash Join  (cost=1.01..2.03 rows=1 width=0)
               Hash Cond: (t1.value = t2.value)
               ->  Foreign Scan on t1  (cost=0.00..1.00 rows=1 width=0)
                     DuckDB Scan: SELECT value FROM public.t1
               ->  Hash  (cost=1.00..1.00 rows=1 width=0)
                     ->  Foreign Scan on t2  (cost=0.00..1.00 rows=1 width=0)
                           DuckDB Scan: SELECT value FROM public.t2
         ->  Seq Scan on t3  (cost=0.00..35.50 rows=2550 width=0)
(10 rows)

pg_analytics=# select count(*) from t1, t2 left join t3 on true  where t1.value = t2.value ;
 count
-------
  2077
(1 row)

To Reproduce

CREATE FOREIGN TABLE t1()
SERVER csv_server
OPTIONS (files '/home/kyss/labs/paradedb/data.csv');

CREATE FOREIGN TABLE t2()
SERVER csv_server
OPTIONS (files '/home/kyss/labs/paradedb/data.csv');

CREATE TABLE t3(a int);

see above

OS:

x86

ParadeDB Version:

0.9.0

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Docker Image

Full Name:

kysshsy

Affiliation:

NA

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have

I think this is because now we have a global variable Arrow. This indicates the execution of a FDW scan fetch all tuples at a time. But this doesn't seem to be the case for Postgres. Different FDW scans might alternate execution.
(Please ensure that the join of foreign tables uses a nested loop join, as this will help guarantee alternate execution.)
Maybe we should change the global variable pattern of Statement and Arrow too. Store them in FDW state.

I think this is because now we have a global variable Arrow. This indicates the execution of a FDW scan fetch all tuples at a time. But this doesn't seem to be the case for Postgres. Different FDW scans might alternate execution. (Please ensure that the join of foreign tables uses a nested loop join, as this will help guarantee alternate execution.) Maybe we should change the global variable pattern of Statement and Arrow too. Store them in FDW state.

Hmmm. This is a good idea. We've had another user report this issue. I've bumped up the priority level