Performance Improvement: cache per-partition results during partition pruning
scottsand-db opened this issue · 5 comments
We perform partition pruning inside of FilteredDeltaScanImpl by
- having some input query / expression, expr
- creating a PartitionRowRecord for a given AddFile, partitionRowRecord
- evaluating partitionRowRecord against expr
The result of this evaluation is a function of the input expr (fixed), the partitionSchema (fixed), and the AddFile partition values (variable per AddFile). We perform this evaluation here.
Many AddFiles will have the same partition values. Thus, we can cache the evaluation result per unique set of partition values and skip re-evaluating expr for every AddFile that shares them.
@scottsand-db I would like to contribute. Can you assign this issue to me?
@sonhmai - great to hear! I will assign it to you. Cheers.
@scottsand-db
I assumed that the number of unique combinations of partition values can be contained in memory.
Hence, I'm thinking about just adding a mutable.Map[Map[String, String], Boolean] in FilteredDeltaScanImpl to cache the evaluation result for each unique partitionValues combination (the immutable map from addFile.partitionValues). What do you think?
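A minimal sketch of that idea, not the actual FilteredDeltaScanImpl code. The AddFile case class here is a simplified stand-in, and CachedPartitionFilter, accept, and cachedEntries are hypothetical names used only for illustration. The evaluate function stands in for building the PartitionRowRecord and evaluating expr against it.

```scala
import scala.collection.mutable

// Simplified stand-in for the real AddFile; only partitionValues matters here.
final case class AddFile(path: String, partitionValues: Map[String, String])

// Hypothetical wrapper: caches the result of a partition predicate
// per unique partitionValues map.
final class CachedPartitionFilter(evaluate: Map[String, String] => Boolean) {
  // Keyed by the immutable partitionValues map, which has structural equality.
  private val cache = mutable.Map.empty[Map[String, String], Boolean]

  // The second argument of getOrElseUpdate is by-name,
  // so evaluate runs only on a cache miss.
  def accept(addFile: AddFile): Boolean =
    cache.getOrElseUpdate(addFile.partitionValues, evaluate(addFile.partitionValues))

  def cachedEntries: Int = cache.size
}

object Demo {
  def main(args: Array[String]): Unit = {
    var evaluations = 0
    // Assumed predicate for illustration: keep files where date = 2021-01-01.
    val pruner = new CachedPartitionFilter({ pv =>
      evaluations += 1
      pv.get("date").contains("2021-01-01")
    })

    val files = Seq(
      AddFile("a.parquet", Map("date" -> "2021-01-01")),
      AddFile("b.parquet", Map("date" -> "2021-01-01")),
      AddFile("c.parquet", Map("date" -> "2021-01-02"))
    )

    val kept = files.filter(f => pruner.accept(f)).map(_.path)
    println(s"kept=$kept evaluations=$evaluations cached=${pruner.cachedEntries}")
  }
}
```

With three files spread over two distinct partition value combinations, the predicate runs twice rather than three times. The cache is a plain mutable.Map, so this sketch assumes a single thread consumes the scan; concurrent use would need a synchronized or concurrent map.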
@scottsand-db should this issue be closed as the MR was merged?