awslabs/deequ

Incremental profiling to be merged with older result

nihal-laliwala-a opened this issue · 0 comments

Spark: 3.3

My Scenario:

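I run the column profiler on the full data set, which consists of 10 million records:

import com.amazon.deequ.profiles.ColumnProfilerRunner

// Profile every column of the full data set
val result = ColumnProfilerRunner()
  .onData(dataFrame)
  .withLowCardinalityHistogramThreshold(200)
  .run()

I then store the output at some location.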

However, my data is incremental: I receive 100K new records every day.

I don't want to recompute the profile for the full data set, because that has already been done.

Instead, I want to profile only the newly arrived 100K records and merge that output with the stored profiling output for the full data.

The profiling output contains metrics such as avg, count and approx count distinct, which can't simply be overwritten with the values from the new batch. They have to be recalculated so that they reflect both the old and the new data.
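To illustrate what I mean by merging rather than overwriting, here is a minimal sketch in plain Scala. It does not use Deequ; `MeanState`, `fromValues` and `IncrementalMeanExample` are hypothetical names I made up for this example. The idea is that avg and count can be merged if the intermediate state (sum and count) is stored instead of the final average. As far as I understand, approx count distinct would need its underlying sketch to be merged in a similar way, which this example does not cover.

```scala
// Hypothetical illustration only; none of these names belong to Deequ's API.
final case class MeanState(sum: Double, count: Long) {
  // Combine two partial states by adding their sums and counts.
  def merge(other: MeanState): MeanState =
    MeanState(sum + other.sum, count + other.count)

  // Derive the average from the state only when it is needed.
  def mean: Double = if (count == 0L) Double.NaN else sum / count
}

object MeanState {
  // Build a state from one batch of values.
  def fromValues(values: Seq[Double]): MeanState =
    MeanState(values.sum, values.size.toLong)
}

object IncrementalMeanExample {
  def main(args: Array[String]): Unit = {
    val full = MeanState.fromValues(Seq(1.0, 2.0, 3.0, 4.0)) // stored earlier
    val daily = MeanState.fromValues(Seq(10.0, 20.0))        // new batch
    val merged = full.merge(daily)
    println(s"count=${merged.count}, mean=${merged.mean}")
  }
}
```

Averaging the two averages directly would give the wrong answer when the batches differ in size, which is why the sum and count have to be kept.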

Is this supported in Deequ? If so, what is the recommended approach?