awslabs/deequ

Performance impact when trying to generate profiling report for more than 200 columns

eapframework opened this issue · 2 comments

I am encountering performance issues when generating a profiling report for a dataset with more than 200 columns and 5 million records. I am applying almost all of the available metrics: datatype, entropy, minimum, maximum, sum, standard deviation, mean, maxlength, minlength, histogram, completeness, distinctness, uniquevalueratio, uniqueness, countdistinct, and correlation. I am trying to generate a report similar to ydata-profiling (https://github.com/ydataai/ydata-profiling).

The job has been running for over 3 hours despite attempts to optimize the Spark configuration. The logs show that each metric is calculated sequentially, and this sequential computation appears to be what is causing the prolonged runtime. Is it possible to parallelize this operation for improved efficiency?

Thanks for the feedback @eapframework
We will investigate this issue and get back to you with an update.

Hi @rdsharma26, I did some more testing. From analyzing the Spark execution tasks, I believe the performance issue arises because, for metrics such as CountDistinct and Histogram, the calculation is done for each column sequentially. As a result, the more columns the DataFrame has, the longer the job runs. Parallelizing these calculations would improve efficiency.
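
To make the request concrete, here is a minimal sketch of the pattern in plain Python. It is not deequ's API, and it is not how deequ computes these metrics internally. The `rows` data and the `profile_column`, `profile_sequential`, and `profile_parallel` helpers are all hypothetical, and null handling is simplified. It only illustrates per-column grouping metrics being computed one column at a time versus submitted concurrently:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a DataFrame: a list of row dicts.
rows = [
    {"id": 1, "city": "Oslo", "tier": "a"},
    {"id": 2, "city": "Oslo", "tier": "b"},
    {"id": 3, "city": "Rome", "tier": "a"},
    {"id": 4, "city": None, "tier": "a"},
]


def profile_column(column):
    """Compute grouping-style metrics for one column.

    A single pass builds a frequency table; the distinct count and the
    histogram are both derived from it. Nulls are skipped (a simplification).
    """
    freq = Counter(row[column] for row in rows if row[column] is not None)
    return column, {"count_distinct": len(freq), "histogram": dict(freq)}


def profile_sequential(columns):
    """One column after another, as observed in the job logs."""
    return dict(profile_column(c) for c in columns)


def profile_parallel(columns, max_workers=4):
    """The same work, with the per-column tasks submitted concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(profile_column, columns))


columns = ["id", "city", "tier"]
report = profile_parallel(columns)
```

With Spark, the analogous idea would be to submit the per-column jobs from separate threads on a shared session, since Spark's scheduler accepts jobs from multiple threads. Whether that can be done through deequ's own runner is exactly the open question here.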