yugabyte/taqo

Produce Leading and join method hints with ANY_subquery alias for IN-subquery

mtakahar opened this issue · 0 comments

Postgres optimizer tries to convert an ANY- and IN- subquery expression to an inner join when it satisfies certain conditions, and when it does so, it assigns an internally generated table alias: ANY_subquery to the subquery.

Example:
An IN-subquery (the same as =ANY (...) ) expression, c2 in (select c1 from t100 where ...) was converted to an inner join, but the Leading hint was not followed. The optimizer actually generated a plan: Leading (( t10000w ( t1000 t100 ) )) NestLoop(t10000w t100) NestLoop(t10000w t100 t1000).

explain /*+ Leading (( ( t10000w t100 ) t1000 ))  HashJoin(t10000w t100) HashJoin(t10000w t100 t1000) SeqScan(t10000w) SeqScan(t100) SeqScan(t1000) */select * from t10000w where c2 in (select c1 from t100 where exists (select 0 from t1000 where c1 = t100.c2));

                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=20000000000.00..20000017610.50 rows=400 width=16396)
   Join Filter: (t10000w.c2 = t100.c1)
   ->  YB Seq Scan on t10000w  (cost=0.00..1000.00 rows=10000 width=16396)
   ->  Materialize  (cost=10000000000.00..10000001610.75 rows=100 width=4)
         ->  Nested Loop  (cost=10000000000.00..10000001610.25 rows=100 width=4)
               Join Filter: (t100.c2 = t1000.c1)
               ->  YB Seq Scan on t1000  (cost=0.00..100.00 rows=1000 width=4)
               ->  Materialize  (cost=0.00..10.50 rows=100 width=8)
                     ->  YB Seq Scan on t100  (cost=0.00..10.00 rows=100 width=8)
(9 rows)

We can get HashJoins by specifying ANY_subquery in place of t100, however, the Leading hint is still not followed:

explain /*+ Leading (( ( t10000w ANY_subquery ) t1000 ))  HashJoin(t10000w ANY_subquery) HashJoin(t10000w ANY_subquery t1000) SeqScan(t10000w) SeqScan(t100) SeqScan(t1000) */select * from t10000w where c2 in (select c1 from t100 where exists (select 0 from t1000 where c1 = t100.c2));
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Hash Semi Join  (cost=118.50..1149.20 rows=400 width=16396)
   Hash Cond: (t10000w.c2 = t100.c1)
   ->  YB Seq Scan on t10000w  (cost=0.00..1000.00 rows=10000 width=16396)
   ->  Hash  (cost=117.25..117.25 rows=100 width=4)
         ->  Hash Join  (cost=11.25..117.25 rows=100 width=4)
               Hash Cond: (t1000.c1 = t100.c2)
               ->  YB Seq Scan on t1000  (cost=0.00..100.00 rows=1000 width=4)
               ->  Hash  (cost=10.00..10.00 rows=100 width=8)
                     ->  YB Seq Scan on t100  (cost=0.00..10.00 rows=100 width=8)
(9 rows)

We need more investigations before we can implement this to the framework.