
spark-to-clickhouse-sink

A thick, write-only client for writing to several ClickHouse MergeTree tables located on different shards.
It is a good alternative to writing via the ClickHouse Distributed engine, which has proven to be problematic for several reasons.
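The key idea behind a client that targets shards directly is that the routing decision moves from the server (the Distributed engine) to the writer. As a hypothetical illustration (not this library's actual API), a common approach is to hash a sharding key and take it modulo the shard count:

```java
import java.util.List;

public class ShardRouter {
    // Hypothetical sketch: pick which shard's local MergeTree table
    // receives a row, based on a sharding key. The shard endpoints and
    // key below are made up for illustration.
    static int shardFor(String shardingKey, int shardCount) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(shardingKey.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        List<String> shards = List.of("ch-shard-0:8123", "ch-shard-1:8123", "ch-shard-2:8123");
        String userId = "user-42";
        // The same key always routes to the same shard
        System.out.println(shards.get(shardFor(userId, shards.size())));
    }
}
```

Because the mapping is deterministic, rows with the same key always land on the same shard, which keeps shard-local queries and deduplication meaningful.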

The core functionality is the writer. It works on top of Apache Spark and takes a DataFrame as input.

Streaming and Exactly-Once Semantics

The writer can also write data without duplicates from a repeatable source. For example, this is very useful for achieving exactly-once semantics when a Kafka-to-ClickHouse sink is needed. It is a good alternative to the ClickHouse Kafka engine.
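The reason a repeatable source matters is that a re-delivered micro-batch covers exactly the same offset range, so an identifier derived from those coordinates is identical on retry and lets the sink recognize and skip the duplicate insert. A minimal sketch of that general idea (not this library's actual mechanism; the method and class names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BatchToken {
    // Derive a deterministic token from the Kafka coordinates of a
    // micro-batch. Replaying the same offset range yields the same token,
    // which a dedup-aware sink can use to drop the duplicate insert.
    static String tokenFor(String topic, int partition, long fromOffset, long toOffset) {
        String key = topic + "/" + partition + "/" + fromOffset + "-" + toOffset;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }

    public static void main(String[] args) {
        // A retry of the same offset range produces the same token
        System.out.println(tokenFor("events", 0, 100, 200)
                .equals(tokenFor("events", 0, 100, 200)));
    }
}
```

With a non-repeatable source there is no stable coordinate to derive such a token from, which is why exactly-once delivery cannot be guaranteed there.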

To make this work, the writer needs the following:

Here is pseudo-code, written with Spark Structured Streaming (ver. 2.4+), showing how the writer could be used to consume data from Kafka and insert it into ClickHouse:

Dataset<Row> streamDF = spark.readStream()
                             .format("kafka")
                             .option(<kafka brokers and topics>)
                             .load();

SparkToClickHouseWriter writer = new SparkToClickHouseWriter(<my_conf>);

streamDF.writeStream()
        .foreachBatch((df, batchId) -> writer.write(df))
        .start();

Refs

There is a talk about the problems with the ClickHouse Distributed and Kafka engines and the reasons that led us to implement this utility library.