pingcap/tispark

[BUG] An exception occurred while hive table join tidb table

shanhai3000 opened this issue · 0 comments

Describe the bug

When executing a query of a hive partition table (big one) inner join a tidb table(small one), the hive partition table is auto broadcasted, which leads an error.
The query is somelike

select hive_table.col1,tidb_table.col2 from hive_table inner join tidb_table on hive_table.col2=tidb_table.col3 where ...

== Physical Plan ==
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [... 109 more fields]
+- Generate HiveGenericUDTF#udf.json.JsonExtractValueUDTF(xxx), [, ... 101 more fields], false, [...]
+- Project [, ... 102 more fields]
+- BroadcastHashJoin [xxx#94], [xxxx#475], Inner, BuildRight, false
:- TiKV CoprocessorRDD{[table: xxx] TableReader, Columns: xxxx(): { TableRangeScan: { RangeFilter: [], Range: [([t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\004\253_s\000\000\000\000\000\000\000\000])([t\200\000\000\000\000\000\004\253_r\000\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\004\253_r\200\000\000\000\000\000\000\000])] } }, startTs: 440854942292115639} EstimatedCount:20837
+- BroadcastExchange HashedRelationBroadcastMode(List(input[107, string, false]),false), [plan_id=32]
+- Filter isnotnull(xxx#475)
+- Scan hive xx.xxxxxxxx [, ... 100 more fields], HiveTableRelation [xx.xxx, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, Data Cols: [..., Partition Cols: [#520, #521, #522, #523], Pruned Partitions: [(, , , )]], [isnotnull(), (), (xx = xx)]

Here I got some log info maybe helpful.
The plan.stats.sizeInBytes of the LogicalPlan of the hive table is too small and the plan.stats.sizeInBytes of LogicalPlan of the tidb table is too big.
The stats of the LogicalPlans of the two seems reversed.

Spark and TiSpark version info
Spark 3.2.3
TiSpark 3.1.2(with a profile of spark-3.2)
Additional context