delta-io/connectors

Performance Improvement: cache per-partition results during partition pruning

scottsand-db opened this issue · 5 comments

We perform partition pruning inside of FilteredDeltaScanImpl by

  • having some input query / expression expr
  • creating a PartitionRowRecord, partitionRowRecord, for a given AddFile
  • evaluating partitionRowRecord against expr

The result of this evaluation is a function of the input expr (fixed), the partitionSchema (fixed), and the AddFile partition values (variable per AddFile). We perform this evaluation here.

Well, many AddFiles will have the same partition values. Thus, we can easily cache the evaluation result per unique set of partition values, and save on some computation.

@scottsand-db I would like to contribute. Can you assign this issue to me?

@sonhmai - great to hear! I will assign it to you. Cheers.

@scottsand-db
I'm assuming that the number of unique combinations of partition values fits in memory.

Hence, I'm thinking about just adding a mutable.Map[Map[String, String], Boolean] in FilteredDeltaScanImpl to cache the evaluation result for each unique partitionValues combination (keyed by the immutable map from addFile.partitionValues). What do you think?
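
For illustration, here is a minimal sketch of the proposed cache. It is not the actual FilteredDeltaScanImpl code: the AddFile case class, CachingPartitionPruner, the matches function (standing in for evaluating expr against a PartitionRowRecord), and PartitionCacheSketch are all hypothetical names made up for this example.

```scala
import scala.collection.mutable

// Hypothetical stand-in for Delta's AddFile; only the fields needed here.
final case class AddFile(path: String, partitionValues: Map[String, String])

// Hypothetical pruner. `matches` stands in for evaluating the fixed query
// expression against a PartitionRowRecord built from the partition values.
final class CachingPartitionPruner(matches: Map[String, String] => Boolean) {
  // One entry per distinct partitionValues combination.
  private val cache = mutable.Map.empty[Map[String, String], Boolean]

  // getOrElseUpdate takes its second argument by name, so `matches`
  // only runs when the partition values have not been seen before.
  def accept(addFile: AddFile): Boolean =
    cache.getOrElseUpdate(addFile.partitionValues, matches(addFile.partitionValues))
}

object PartitionCacheSketch {
  // Returns the paths that survive pruning and how often the predicate ran.
  def run(): (Seq[String], Int) = {
    var evaluations = 0
    val pruner = new CachingPartitionPruner(pv => {
      evaluations += 1
      pv.get("date").contains("2021-01-01")
    })
    val files = Seq(
      AddFile("a.parquet", Map("date" -> "2021-01-01")),
      AddFile("b.parquet", Map("date" -> "2021-01-02")),
      AddFile("c.parquet", Map("date" -> "2021-01-01"))
    )
    val kept = files.filter(f => pruner.accept(f)).map(_.path)
    (kept, evaluations)
  }
}
```

In this sketch, three files with two distinct partition value combinations trigger only two predicate evaluations. Keying on the immutable Map works because Scala maps compare by content, so two AddFiles with equal partition values hit the same cache entry.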

Hi @sonhmai - that SGTM. @tdas - I don't expect any memory issues here, unless the user has more than hundreds of thousands of partitions.

@scottsand-db should this issue be closed as the MR was merged?