Add CDFQuantile type
osopardo1 commented
Right now, we split the `Transformations` (and `Transformers`) into:
- `LinearTransformation`
- `HashTransformation`
- `StringHistogramTransformation`
- `NullToZeroTransformation`
- `IdentityToZeroTransformation`
We wanted to implement a `QuantileTransformation` (see closed issue #338), which would make indexing more flexible by calling a streaming algorithm to update and provide the `Rank` of a specific point while writing new data. But while trying to implement it, we noticed a few things:
- The `HistogramTransformation` was mapping the elements as if they were `Quantiles`.
- For computing the `histogram`, we required an external method to be called before indexing:
```scala
import io.qbeast.spark.utils.QbeastUtils

val brandStats = QbeastUtils.computeHistogramForColumn(df, "brand", 50)
val statsStr = s"""{"brand_histogram":$brandStats}"""

(df
  .write
  .mode("overwrite")
  .format("qbeast")
  .option("columnsToIndex", "brand:histogram")
  .option("columnStats", statsStr)
  .save(targetPath))
```
- For computing the `quantiles` in PR #413, we were implementing the same methodology.
- We need a common abstraction for both Histogram and Quantiles, and for other algorithms related to a CDF (Cumulative Distribution Function).
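To illustrate the first observation, here is a minimal sketch of why histogram bins can behave like quantiles. It assumes the bins are equal-frequency boundaries taken from a sample. That is an assumption for illustration only, not a description of what `computeHistogramForColumn` does, and the names `EqualFrequencyBins`, `boundaries` and `approxCdf` are hypothetical.

```scala
// Hypothetical helper; not part of the qbeast-spark API.
object EqualFrequencyBins {

  /** Picks up to `numBins + 1` boundaries so that each bin holds roughly
    * the same number of sample values (duplicates are dropped). */
  def boundaries[T](sample: Seq[T], numBins: Int)(implicit ord: Ordering[T]): IndexedSeq[T] = {
    require(sample.nonEmpty && numBins > 0)
    val sorted = sample.sorted.toIndexedSeq
    (0 to numBins).map { i =>
      // Position of the i-th quantile in the sorted sample.
      val idx = math.round(i.toDouble / numBins * (sorted.size - 1)).toInt
      sorted(idx)
    }.distinct
  }

  /** Fraction of boundaries that are <= value: a rough estimate of CDF(value) in [0, 1]. */
  def approxCdf[T](bounds: IndexedSeq[T], value: T)(implicit ord: Ordering[T]): Double =
    bounds.count(b => ord.lteq(b, value)).toDouble / bounds.size
}
```

Under that assumption, looking up a value in the bins gives an estimate of its rank, which is what a quantile-based mapping provides. The same code works for strings and numbers because it only relies on an `Ordering`.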
This issue is to reorganize the Transformers and Transformations to use the following nomenclature:
- `CDFQuantilesTransformation`
- `CDF<implementation>Transformation`
For now, we would only implement the `QuantilesTransformation`, for both the String and Numeric cases, with a different initialization of the bins for each case.
With an API such as:
```scala
df.write.format("qbeast").option("columnsToIndex", "id:quantiles").save(..)
```
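One possible shape for the proposed abstraction is sketched below: a shared trait holds the quantile lookup, and the String and Numeric variants differ only in how values are compared and how the default bins are initialized. All names (`CDFQuantilesSketch`, `NumericQuantilesSketch`, `StringQuantilesSketch`) and both default bin choices are assumptions made for illustration. They are not the actual qbeast-spark classes.

```scala
// Hypothetical sketch: names and default bins are assumptions for illustration,
// not the actual qbeast-spark classes.
sealed trait CDFQuantilesSketch[T] {

  /** Sorted bin boundaries (the quantiles). */
  def quantiles: IndexedSeq[T]

  /** Type-specific comparison, so the trait needs no implicit Ordering. */
  protected def compare(a: T, b: T): Int

  /** Maps a value to [0, 1] using the position of the first quantile >= value. */
  def transform(value: T): Double = {
    val last = quantiles.size - 1
    if (last <= 0) 0.0
    else {
      val idx = quantiles.indexWhere(q => compare(q, value) >= 0)
      // Values above every quantile map to the upper end of the range.
      (if (idx < 0) last else idx).toDouble / last
    }
  }
}

final case class NumericQuantilesSketch(quantiles: IndexedSeq[Double])
    extends CDFQuantilesSketch[Double] {
  protected def compare(a: Double, b: Double): Int = java.lang.Double.compare(a, b)
}

object NumericQuantilesSketch {

  /** Assumed default: evenly spaced boundaries over [min, max]. */
  def default(min: Double, max: Double, numBins: Int): NumericQuantilesSketch = {
    require(numBins > 0)
    NumericQuantilesSketch((0 to numBins).map(i => min + (max - min) * i / numBins))
  }
}

final case class StringQuantilesSketch(quantiles: IndexedSeq[String])
    extends CDFQuantilesSketch[String] {
  protected def compare(a: String, b: String): Int = a.compareTo(b)
}

object StringQuantilesSketch {

  /** Assumed default: one boundary per lowercase ASCII letter. */
  val default: StringQuantilesSketch =
    StringQuantilesSketch(('a' to 'z').map(_.toString))
}
```

The lookup here is a linear scan to keep the sketch short. With many bins, a binary search over the sorted quantiles would be the natural replacement.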