awslabs/deequ

[FEATURE] Improve performance of KLLSketch and DataType Analyzer

zeotuan opened this issue · 0 comments

Is your feature request related to a problem? Please describe.
Currently, the KLLSketch and DataType analyzers are implemented using UserDefinedAggregateFunction:

private[sql] class StatefulDataType extends UserDefinedAggregateFunction {

UserDefinedAggregateFunction is deprecated (since Spark 3.0) and should be replaced with Aggregator, which offers much better performance, as outlined in apache/spark#25024 (comment).

Describe the solution you'd like
Reimplement StatefulDataType and StatefulKLLSketch using Aggregator
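
To illustrate the general shape of such a migration, here is a minimal sketch of a typed Aggregator registered through `functions.udaf` (Spark 3.0+). It is not deequ's actual logic: `TypeCounts`, `TypeCountAggregator`, and the integer-matching regex are hypothetical names and simplifications chosen for this example, and the real StatefulDataType and StatefulKLLSketch state would be more involved.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{col, udaf}

// Hypothetical aggregation buffer; not deequ's real state layout.
case class TypeCounts(integral: Long, other: Long)

// Hypothetical aggregator: counts values that look like integers vs. everything else.
object TypeCountAggregator extends Aggregator[String, TypeCounts, TypeCounts] {
  // Empty buffer used as the starting point on every partition.
  def zero: TypeCounts = TypeCounts(0L, 0L)

  // Fold one input value into the buffer; nulls are skipped defensively.
  def reduce(buffer: TypeCounts, value: String): TypeCounts =
    if (value == null) buffer
    else if (value.matches("[+-]?\\d+")) buffer.copy(integral = buffer.integral + 1)
    else buffer.copy(other = buffer.other + 1)

  // Combine partial buffers produced on different partitions.
  def merge(a: TypeCounts, b: TypeCounts): TypeCounts =
    TypeCounts(a.integral + b.integral, a.other + b.other)

  // The final result is the buffer itself in this sketch.
  def finish(reduction: TypeCounts): TypeCounts = reduction

  def bufferEncoder: Encoder[TypeCounts] = Encoders.product[TypeCounts]
  def outputEncoder: Encoder[TypeCounts] = Encoders.product[TypeCounts]
}

object AggregatorSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregator-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("1", "42", "abc").toDF("value")

    // udaf wraps the typed Aggregator so it can be applied to untyped columns.
    val countTypes = udaf(TypeCountAggregator)
    df.select(countTypes(col("value")).as("counts")).show(truncate = false)

    spark.stop()
  }
}
```

The buffer here is a plain case class that `reduce` and `merge` operate on directly, in contrast to the Row-based buffer that UserDefinedAggregateFunction reads and updates on each input row.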

I am happy to help with this implementation.