scylladb/kafka-connect-scylladb

How to write multiple topics to a single Scylla table

dhgokul opened this issue · 2 comments

At present we have three topics: topic1, topic2, and topic3, with a separate sink connector (on a separate server) for each topic.
Instead of writing each topic's messages to a different Scylla table, we would like to write all topics into one common table.

At present we use the Confluent regex property to achieve this, but it is not efficient, as overwriting happens on the sink connectors that use the regex.

Is there a more efficient way to achieve this?

Could you be more specific about what the problem is? Are you observing poor performance of RegexRouter (the "regex property")? Or is another part of the system slow (the connector itself)? It seems (after looking at the RegexRouter source code and doing some micro-benchmarks) that this transform should not add a significant amount of overhead.
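Just to make sure we are talking about the same transform: by "regex property" I assume you mean the RegexRouter single message transform, which rewrites the topic name on each record so that records from several topics can be routed to one table name. A minimal sketch (the transform alias and regex below are illustrative, not taken from your setup):

transforms=dropPrefix
transforms.dropPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.dropPrefix.regex=sub1-(.*)
transforms.dropPrefix.replacement=$1

With this, a record from topic sub1-topic1 is written as if it came from topic1, while topic names that do not match the regex are left unchanged.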

We are trying to write multiple topics from Redpanda to a single Scylla table using the sink connector.
In our case we ran a 100-million-message test using 5 topics and 5 sink connectors; the topics are topic1, sub1-topic1, sub2-topic1, sub3-topic1, and sub4-topic1.
Our Redpanda cluster has 3 nodes and each topic has 10 partitions. We tried both with and without replicas in Redpanda.

**Connect Worker Config [JSON]:**

bootstrap.servers=redpanda_cluster_1_ip:9092,redpanda_cluster_2_ip:9092,redpanda_cluster_3_ip:9092
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=target/components/packages/

**Sink Connector-1 Config [Json]:**
name=scylladb-sink-connector
connector.class=io.connect.scylladb.ScyllaDbSinkConnector
tasks.max=56
topics=topic1
scylladb.contact.points=scylla_cluster_1_ip,scylla_cluster_2_ip,scylla_cluster_3_ip
scylladb.port=9042
scylladb.keyspace=streamprocess
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
transforms=createKey
transforms.createKey.fields=id
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey



**Sink Connector-2 Config [Json]:**

name=scylladb-sink-connector2
connector.class=io.connect.scylladb.ScyllaDbSinkConnector
tasks.max=56
topics=sub1-topic1
scylladb.contact.points=scylla_cluster_1_ip,scylla_cluster_2_ip,scylla_cluster_3_ip
scylladb.consistency.level=QUORUM
scylladb.keyspace.replication.factor=3
scylladb.port=9042
scylladb.keyspace=streamprocess
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
transforms=createKey,dropPrefix
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.dropPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.dropPrefix.regex=sub1-(.*)
transforms.dropPrefix.replacement=$1

Using the Sink Connector-1 config for topic1 and the Sink Connector-2 config for the rest of the topics, running on 5 separate machines,
we are facing 2 issues:

  1. Compared to the Sink Connector-1 config, the Sink Connector-2 configs are slower.
  2. Overwriting of messages is happening in the Scylla cluster, i.e. once the 100 million messages had been dumped into Scylla, we restarted only the sink connectors, but messages were overwritten again; when we checked nodetool tablestats, instead of a local write count of 10 million it showed 100+ million.

Are there any changes needed in the config?
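For example, would consolidating everything into a single sink connector that consumes all five topics, with one RegexRouter stripping the subN- prefix, be a more efficient setup? A rough sketch of what we have in mind (the connector name, tasks.max value, and regex here are illustrative assumptions; RegexRouter would leave topic1 unchanged since it does not match the pattern):

name=scylladb-sink-connector-all
connector.class=io.connect.scylladb.ScyllaDbSinkConnector
# 5 topics x 10 partitions = 50 partitions, so at most 50 tasks can do useful work
tasks.max=50
topics=topic1,sub1-topic1,sub2-topic1,sub3-topic1,sub4-topic1
scylladb.contact.points=scylla_cluster_1_ip,scylla_cluster_2_ip,scylla_cluster_3_ip
scylladb.port=9042
scylladb.keyspace=streamprocess
scylladb.consistency.level=QUORUM
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=true
value.converter.schemas.enable=true
transforms=createKey,dropPrefix
transforms.createKey.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.createKey.fields=id
transforms.dropPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
# only topic names matching the pattern are renamed; topic1 keeps its name
transforms.dropPrefix.regex=sub[0-9]+-(.*)
transforms.dropPrefix.replacement=$1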