swoop-inc/spark-alchemy

Outdated documentation

Closed this issue · 5 comments

Hi,

I'm just looking for some clarification regarding Postgres interoperability:

Native HyperLogLog functions that offer reaggregatable fast approximate distinct counting capabilities far beyond those in OSS Spark with interoperability to Postgres and even JavaScript.
https://github.com/swoop-inc/spark-alchemy#for-spark-users

They enable interoperability at the HLL sketch level with other data processing systems. We use an open-source HLL library with an independent storage specification and built-in support for Postgres-compatible databases and even JavaScript. This allows Spark to serve as a universal data (pre-)processing platform for systems that require fast query turnaround times, e.g., portals & dashboards.
https://github.com/swoop-inc/spark-alchemy/wiki/Spark-HyperLogLog-Functions#spark-alchemys-approach-to-approximate-counting

It looks like you are no longer using aggregateknowledge/java-hll, but have migrated to com.clearspring.analytics:stream (Spark's own dependency). As far as I can see, that library does not use the same storage spec and is therefore not compatible. Right?

I quite liked the idea of quickly re-aggregating data in a more performant data store.
Sounds like you've moved on to a different approach?

Cheers,
Moritz

Hi!

I'm currently exploring different HLL implementations and am also interested in this. I found the same as OP, and from scanning the code for the addthis/clearspring lib it does not look like they are using the same serialization implementation as the java-hll storage spec.

I'm slightly confused, as I found this repo via a recently published Databricks post. Is this repo out of sync with the implementation referenced in that post, or are we missing something?

Thanks,
Alex

Hi, guys.

First of all, thanks for open-sourcing this work. The lib works like a charm.

However, regarding interoperability with Postgres: I've been trying to write Spark DataFrames containing spark-alchemy's HLL sketches to a Postgres database equipped with postgresql-hll, without success. It's not clear how the implementations in these two projects are interoperable.

By "without success" I mean that I'm not able to operate with basic functionality on the Postgres side, such as applying hll_cardinality() to the HLL column coming from Spark. More specifically, trying that results in this schema version error.

Any hint would be greatly appreciated.

We are in the process of updating the implementation to use two HLL libraries: a PG-compatible one that is HLL-only, and an HLL++ one that is not PG-compatible. There will be (a) a way to specify which one is used, and (b) a way to convert sketches from the non-PG-compatible library to the PG-compatible one. This way, library users will get the best possible precision when PG compatibility is not required, without losing the option of PG compatibility.

The docs in the current version of the code are, unfortunately, misleading: the only HLL library hooked up right now is the more precise but non-PG-compatible one. We were hoping to sync the changes sooner but were delayed.

cozos commented

Hi,

This project looks amazing and thanks for open-sourcing it.

Is there any update on Postgres-HLL compatibility? It seems like something is already in progress, but I'd be willing to help out if required (really desperate for this feature :P)

pidge commented

We've just released a version (0.5.5) with support for the Aggregate Knowledge implementation, which enables Postgres-HLL compatibility.

See the HLL wiki page for usage info. In particular, you'll likely want to switch the library's default implementation from StreamLib to Aggregate Knowledge:

spark.conf.set("com.swoop.alchemy.hll.implementation", "AGGREGATE_KNOWLEDGE")

There's also a new hll_convert function you can use to convert existing StreamLib sketches to the Aggregate Knowledge representation. Note that you SHOULD NOT add new values to a sketch post-conversion, as doing so will result in double-counting; stick to re-aggregation and cardinality estimation.
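To make the above concrete, here is a minimal sketch of the workflow. This is illustrative rather than authoritative: the `local[*]` session and the sample data are assumptions, the import path follows the project's wiki, and the exact signature of `hll_convert` should be checked against the 0.5.5 docs (it is shown commented out below).

```scala
import org.apache.spark.sql.SparkSession
import com.swoop.alchemy.spark.expressions.hll.functions._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Switch the default implementation to the Postgres-compatible
// Aggregate Knowledge library (the config key from the comment above).
spark.conf.set("com.swoop.alchemy.hll.implementation", "AGGREGATE_KNOWLEDGE")

// Build sketches with spark-alchemy's HLL aggregate; the resulting binary
// column should now be readable by postgresql-hll after writing via JDBC.
val sketches = Seq("a", "b", "b", "c").toDF("id")
  .agg(hll_init_agg("id").as("id_hll"))

// Estimate cardinality on the Spark side as a sanity check.
sketches.select(hll_cardinality($"id_hll")).show()

// Hypothetical call shape: convert pre-existing StreamLib sketches to the
// Aggregate Knowledge representation. Per the warning above, only
// re-aggregate or estimate cardinality on converted sketches.
// existing.select(hll_convert($"old_hll"))
```

After writing the converted or natively Aggregate Knowledge sketches to Postgres, `hll_cardinality()` on that column should no longer raise the schema version error described earlier in the thread.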

Happy counting!