pingcap/tispark

tispark 3.x version read speed is slower than tispark 2.5

wfxxh opened this issue · 2 comments

wfxxh commented

TiDB version: 5.4.2
Spark version: 3.0 to 3.2
TiSpark version: 3.0 to 3.1

I used TiSpark 2.5 before, but after upgrading to TiSpark 3.x I find that reading from TiKV is slower than with the TiSpark 2.5 version.

table info: (screenshot)

spark conf: (screenshot)
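For reference (the conf above is only a screenshot), a typical TiSpark 3.x catalog configuration looks roughly like the following; the addresses are placeholders, not the reporter's actual settings:

spark.sql.extensions                          org.apache.spark.sql.TiExtensions
spark.sql.catalog.tidb_catalog                org.apache.spark.sql.catalyst.catalog.TiCatalog
spark.sql.catalog.tidb_catalog.pd.addresses   <pd-host>:2379
spark.tispark.pd.addresses                    <pd-host>:2379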

tispark 2.5: (screenshot)

tispark 3.x: (screenshot)

wfxxh commented

I have found the reason. In v3.x nothing calls the StatisticsManager.loadStatisticsInfo method, so the statisticsMap in StatisticsManager is never filled. Because of that, inside TiStrategy.filterToDAGRequest the call val tblStatistics: TableStatistics = StatisticsManager.getTableStatistics(source.table.getId) returns null, and TiKVScanAnalyzer.buildIndexScan cannot return the correct value.
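A minimal sketch of the behaviour described above (only loadStatisticsInfo and getTableStatistics are real TiSpark method names quoted from the issue; everything else here is a hypothetical stand-in, not the actual TiSpark source):

import scala.collection.concurrent.TrieMap

case class TableStatistics(estimatedRowCount: Long)

object StatisticsManagerSketch {
  // Filled by loadStatisticsInfo; stays empty if nothing ever calls it,
  // which is what reportedly happens in TiSpark 3.x.
  private val statisticsMap = TrieMap.empty[Long, TableStatistics]

  def loadStatisticsInfo(tableId: Long, rowCount: Long): Unit =
    statisticsMap.put(tableId, TableStatistics(rowCount))

  // Mirrors StatisticsManager.getTableStatistics(source.table.getId):
  // returns null when the map was never populated.
  def getTableStatistics(tableId: Long): TableStatistics =
    statisticsMap.get(tableId).orNull
}

object ScanChoiceSketch {
  // Rough stand-in for the cost decision in TiKVScanAnalyzer.buildIndexScan:
  // with null statistics there is no row-count estimate, so the index path
  // wins even when a full table scan would be cheaper.
  def choosePlan(tableId: Long, allowIndexRead: Boolean): String = {
    val stats = StatisticsManagerSketch.getTableStatistics(tableId)
    if (!allowIndexRead || stats != null) "TableScan (CoprocessorRDD)"
    else "IndexLookUp (FetchHandleRDD under RegionTaskExec)"
  }
}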

CREATE TABLE `perio_art_project` (
  `record_id` int(11) DEFAULT NULL,
  `article_id` varchar(255) DEFAULT NULL,
  `project_seq` int(11) DEFAULT NULL,
  `project_id` varchar(255) DEFAULT NULL,
  `project_name` longtext DEFAULT NULL,
  `batch_id` int(11) DEFAULT NULL,
  `primary_partition` int(4) GENERATED ALWAYS AS ((crc32(`article_id`)) % 9999) STORED NOT NULL,
  `last_modify_time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `spark_update_time` datetime DEFAULT NULL,
  KEY `article_id` (`article_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
sparkSession
  .sql(
    """select * from tidb_catalog.qk_chi.perio_art_project
      |""".stripMargin)
  .groupBy("project_name")
  .count()
  .explain()
== 2.5 Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[project_name#7], functions=[count(1)])
   +- Exchange hashpartitioning(project_name#7, 8192), true, [id=#14]
      +- HashAggregate(keys=[project_name#7], functions=[partial_count(1)])
         +- TiKV CoprocessorRDD{[table: perio_art_project] TableScan, Columns: project_name@VARCHAR(4294967295), KeyRange: [([t\200\000\000\000\000\000\003\017_r\000\000\000\000\000\000\000\000], [t\200\000\000\000\000\000\003\017_s\000\000\000\000\000\000\000\000])], startTs: 437048830927044635} EstimatedCount:18072492
== 3.x Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[project_name#7], functions=[specialsum(count(1)#34L, LongType, 0)])
   +- Exchange hashpartitioning(project_name#7, 8192), true, [id=#13]
      +- HashAggregate(keys=[project_name#7], functions=[partial_specialsum(count(1)#34L, LongType, 0)])
         +- TiSpark RegionTaskExec{downgradeThreshold=1000000000,downgradeFilter=[]
            +- TiKV FetchHandleRDD{[table: perio_art_project] IndexLookUp, Columns: project_name@VARCHAR(4294967295): { {IndexRangeScan(Index:article_id(article_id)): { RangeFilter: [], Range: [([t\200\000\000\000\000\000\003\017_i\200\000\000\000\000\000\000\001\000], [t\200\000\000\000\000\000\003\017_i\200\000\000\000\000\000\000\001\372])] }}; {TableRowIDScan, Aggregates: Count(1), First(project_name@VARCHAR(4294967295)), Group By: [project_name@VARCHAR(4294967295) ASC]} }, startTs: 437048801724203020}

A user can set spark.tispark.plan.allow_index_read=false to avoid this.
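For example, a sketch of setting that flag when building the session (the property name is the one quoted above; the rest is standard Spark API):

import org.apache.spark.sql.SparkSession

// Disable TiSpark index reads so the planner falls back to the table-scan plan.
val spark = SparkSession
  .builder()
  .config("spark.tispark.plan.allow_index_read", "false")
  .getOrCreate()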

So it sounds like it's a bug?