swoop-inc/spark-alchemy

hll_cardinality always returns 1 when running with pyspark


When I run the hll_cardinality function with PySpark, this library always returns 1.

Code example:

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr


def test_process():
    spark = SparkSession.builder.master('local[1]') \
        .config('spark.jars.packages',
                'com.swoop:spark-alchemy_2.12:1.1.0') \
        .getOrCreate()

    # Register the spark-alchemy HLL functions on the underlying JVM session
    sc = SparkContext.getOrCreate()
    sc._jvm.com.swoop.alchemy.spark.expressions.hll.HLLFunctionRegistration.registerFunctions(spark._jsparkSession)

    df = spark.range(5).toDF("id").select(expr("hll_cardinality(hll_init(id)) as cd"))
    df.show(truncate=False)

will display:

+---+
|cd |
+---+
|1  |
|1  |
|1  |
|1  |
|1  |
+---+

Env:

spark: 3.1.2
pyspark: 3.1.2
scala: 2.12.10
pidge commented

Hi @dwyzlic-ias, thanks for the clear example. I think it's behaving as expected here; since hll_init builds a separate sketch for every row, you're basically doing:

+----------------------------+
|cd                          |
+----------------------------+
|hll_cardinality(hll_init(0))|
|hll_cardinality(hll_init(1))|
|hll_cardinality(hll_init(2))|
|hll_cardinality(hll_init(3))|
|hll_cardinality(hll_init(4))|
+----------------------------+

Each of those per-row sketches contains only one value, which is why every cardinality comes out as 1. You need to throw an hll_merge in there as an aggregation to combine the sketches.
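For example, a minimal sketch of that approach, reusing the session and registered functions from the example above (the intermediate column alias hll is just illustrative):

# Build one sketch per row with hll_init, then merge them into a single
# sketch with hll_merge before computing the cardinality.
df = (
    spark.range(5).toDF("id")
    .select(expr("hll_init(id) as hll"))
    .agg(expr("hll_cardinality(hll_merge(hll)) as cd"))
)
df.show(truncate=False)

which should report a single row with cd = 5 for this data.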

pidge commented

Also, in this case you can just use hll_cardinality(hll_init_agg(id)) to combine the hll_init and hll_merge operations.
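A minimal sketch of that one-step form, again assuming the registered session from above:

# hll_init_agg builds and merges the sketches in a single aggregation,
# so no intermediate sketch column is needed.
df = spark.range(5).toDF("id").agg(expr("hll_cardinality(hll_init_agg(id)) as cd"))
df.show(truncate=False)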