Inconsistency in the FDW execution order
Opened this issue · 2 comments
What happens?
using FDW on join tables. As you can see, the result is different with the same query.
pg_analytics=# set enable_nestloop = on;
SET
pg_analytics=# select count(*) from t1, t2 left join t3 on true where t1.value = t2.value ;
count
-------
6
(1 row)
pg_analytics=# explain select count(*) from t1, t2 left join t3 on true where t1.value = t2.value ;
QUERY PLAN
------------------------------------------------------------------------
Aggregate (cost=69.39..69.40 rows=1 width=8)
-> Nested Loop Left Join (cost=0.00..63.01 rows=2550 width=0)
-> Nested Loop (cost=0.00..2.01 rows=1 width=0)
Join Filter: (t1.value = t2.value)
-> Foreign Scan on t1 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT value FROM public.t1
-> Foreign Scan on t2 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT value FROM public.t2
-> Seq Scan on t3 (cost=0.00..35.50 rows=2550 width=0)
(9 rows)
pg_analytics=# set enable_nestloop = off;
SET
pg_analytics=# explain select count(*) from t1, t2 left join t3 on true where t1.value = t2.value ;
QUERY PLAN
--------------------------------------------------------------------------------------
Aggregate (cost=10000000069.40..10000000069.41 rows=1 width=8)
-> Nested Loop Left Join (cost=10000000001.01..10000000063.03 rows=2550 width=0)
-> Hash Join (cost=1.01..2.03 rows=1 width=0)
Hash Cond: (t1.value = t2.value)
-> Foreign Scan on t1 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT value FROM public.t1
-> Hash (cost=1.00..1.00 rows=1 width=0)
-> Foreign Scan on t2 (cost=0.00..1.00 rows=1 width=0)
DuckDB Scan: SELECT value FROM public.t2
-> Seq Scan on t3 (cost=0.00..35.50 rows=2550 width=0)
(10 rows)
pg_analytics=# select count(*) from t1, t2 left join t3 on true where t1.value = t2.value ;
count
-------
2077
(1 row)
To Reproduce
CREATE FOREIGN TABLE t1()
SERVER csv_server
OPTIONS (files '/home/kyss/labs/paradedb/data.csv');
CREATE FOREIGN TABLE t2()
SERVER csv_server
OPTIONS (files '/home/kyss/labs/paradedb/data.csv');
CREATE TABLE t3(a int);
see above
OS:
x86
ParadeDB Version:
0.9.0
Are you using ParadeDB Docker, Helm, or the extension(s) standalone?
ParadeDB Docker Image
Full Name:
kysshsy
Affiliation:
NA
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include the code required to reproduce the issue?
- Yes, I have
Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?
- Yes, I have
I think this is because now we have a global variable Arrow. This indicates the execution of a FDW scan fetch all tuples at a time. But this doesn't seem to be the case for Postgres. Different FDW scans might alternate execution.
(Please ensure that the join of foreign tables uses a nested loop join, as this will help guarantee alternate execution.)
Maybe we should change the global variable pattern of Statement and Arrow too. Store them in FDW state.
I think this is because now we have a global variable Arrow. This indicates the execution of a FDW scan fetch all tuples at a time. But this doesn't seem to be the case for Postgres. Different FDW scans might alternate execution. (Please ensure that the join of foreign tables uses a nested loop join, as this will help guarantee alternate execution.) Maybe we should change the global variable pattern of Statement and Arrow too. Store them in FDW state.
Hmmm. This is a good idea. We've had another user report this issue. I've bumped up the priority level