Qbeast-io/qbeast-spark

Add CDFQuantile type

Right now, we split the Transformations (and Transformers) into:

  • LinearTransformation
  • HashTransformation
  • StringHistogramTransformation
  • NullToZeroTransformation
  • IdentityToZeroTransformation

We wanted to implement a QuantileTransformation (see closed issue #338), which would make indexing more flexible by using a streaming algorithm to update and provide the rank of a specific point while writing new data. But while trying to implement it, we noticed a few things:

  1. The HistogramTransformation was already mapping the elements as if they were quantiles.
  2. Computing the histogram requires calling an external method before indexing:
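import io.qbeast.spark.utils.QbeastUtils

// Pre-compute the histogram for the "brand" column
// (df is an existing DataFrame; 50 is assumed to be the number of bins)
val brandStats = QbeastUtils.computeHistogramForColumn(df, "brand", 50)
// Wrap the result as JSON under the key passed to the "columnStats" option
val statsStr = s"""{"brand_histogram":$brandStats}"""

// Write the data, indexing "brand" with the pre-computed histogram
(df
  .write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "brand:histogram")
  .option("columnStats", statsStr)
  .save(targetPath))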
  3. For computing the quantiles in PR #413, we were implementing the same methodology.
  4. We need a common abstraction for Histogram, Quantiles, and other algorithms based on a CDF (Cumulative Distribution Function).
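
As a rough illustration of what such a CDF-based abstraction could look like, here is a minimal sketch in plain Scala. The names `CDFTransformationSketch` and `CDFQuantilesSketch` are hypothetical and not taken from qbeast-spark. The sketch assumes the transformation is given a sorted sequence of bin boundaries and maps a value to its relative position among them, which approximates the CDF in [0, 1].

```scala
// Hypothetical sketch: these names are NOT part of qbeast-spark.
// A CDF-based transformation maps a value to an estimate of P(X <= value) in [0, 1].
trait CDFTransformationSketch[T] {
  def transform(value: T): Double
}

// Shared logic for any ordered type (numbers, strings, ...):
// find the last bin boundary that is <= value and return its relative position.
final class CDFQuantilesSketch[T](bins: IndexedSeq[T])(implicit ord: Ordering[T])
    extends CDFTransformationSketch[T] {

  require(bins.length >= 2, "at least two bin boundaries are needed")
  require(
    bins.zip(bins.tail).forall { case (a, b) => ord.lteq(a, b) },
    "bin boundaries must be sorted")

  override def transform(value: T): Double = {
    // Index of the last boundary <= value, or -1 if value is below all of them
    val idx = bins.lastIndexWhere(b => ord.lteq(b, value))
    // Values below the first boundary are clamped to 0.0;
    // values at or above the last boundary map to 1.0
    math.max(idx, 0).toDouble / (bins.length - 1)
  }
}
```

With the boundaries `Vector(0, 10, 20, 30, 40)`, `transform(25)` returns 0.5, and the same class works for strings because only an `Ordering` is required. A linear scan is used here for brevity; a binary search over the boundaries would be the natural choice for many bins.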

This issue is to reorganize the Transformers and Transformations to use the following nomenclature:

  • CDFQuantilesTransformation
  • CDF<implementation>Transformation

Of these, we would only have an implementation for QuantilesTransformation, covering both the String and Numeric cases, with a different initialization of the bins for each case.
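
The issue does not spell out how the bins are initialized, so the following is only one possible reading: equi-depth boundaries picked from a sorted sample. The helper `QuantileBinsSketch.computeBins` is hypothetical and not part of qbeast-spark. It is written against an `Ordering`, so the String and Numeric cases differ only in the element type and the sample supplied; the real initialization may well differ per case.

```scala
// Hypothetical helper: NOT part of qbeast-spark.
object QuantileBinsSketch {

  // Picks numberOfBins + 1 boundaries from the sorted sample so that each bin
  // holds roughly the same number of sample elements (equi-depth bins).
  def computeBins[T](sample: Seq[T], numberOfBins: Int)(implicit
      ord: Ordering[T]): IndexedSeq[T] = {
    require(sample.nonEmpty, "sample must not be empty")
    require(numberOfBins >= 1, "numberOfBins must be positive")

    val sorted = sample.sorted.toIndexedSeq
    val last = sorted.length - 1
    // The i-th boundary is the element at the (i / numberOfBins) quantile,
    // computed with integer arithmetic over the sorted sample
    (0 to numberOfBins).map(i => sorted(i * last / numberOfBins))
  }
}
```

For example, `QuantileBinsSketch.computeBins(Seq(5, 1, 4, 2, 3, 9, 7, 8, 6), 4)` yields the boundaries 1, 3, 5, 7, 9, and a sample of strings yields string boundaries in the same way. On a large dataset an approximate quantile algorithm would be preferable to sorting the full column.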

With an API such as:

df.write.format("qbeast").option("columnsToIndex", "id:quantiles").save(..)