ClickHouse/clickhouse-kafka-connect

Losing events out of around a million events

abhishekgahlot2 opened this issue · 20 comments

Hi

We ran a script to do a sanity check, and we see some events missing from the JSON array, which led us to believe that something in the record processing is the problem. Not sure what it could be, but it's very random.

For example, we sent 2-3 million events over a couple of days but lost around 10-15 events from the JSON payloads themselves.

Ex: sending [1,2,3,4,5] in one payload, we randomly tend to lose the events at the end of the array, e.g. 5 here.

Running on AWS MSK.

tasks.max=2
topics.regex=[0-9a-f]{8}{12}
ssl=false
hostname=
database=default
password=
port=80
value.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
exactlyOnce=true
username=clickhouse
schemas.enable=false

Any ideas?

In the connector logs

We see a missing count; for example we sent 29 events in the JSON array, but the logs mention:

[Worker-04c10fb697e20c0fc] [2024-05-03 12:22:46,399] INFO [prod-clickhouse-connector-main|task-0] Trying to insert [9] records  (QueryId: [ac4516f7-772f-4e98-9343-3dfd52d0876c]) (com.clickhouse.kafka.connect.sink.db.ClickHouseWriter:161)

I believe this could be happening due to mismatched minOffset and maxOffset values from the replicated connect state table and ZooKeeper lag, because only the last few events are missed out of a couple of million. I am still not sure, but I find this the most probable reason. Looking forward to your ideas :)

By last or first I mean that within a payload like [1,2,3,4,5] we lose either 1 or 5 from the array.

#380

@Paultagoras

I also had rare gaps in data after errors related to offset commits in Kafka that occurred when running backups on the Kafka cluster, which caused the consumer to start reading messages from an old offset again. Since the size of the received message batches is not deterministic, this led to the case mentioned in #291 (comment). I have had to use a custom build with a fix for more than half a year.

I have run the script and the processing again with just one node to confirm.

For now I have removed the extra nodes from the load balancer and am trying to figure out whether this could be a problem with the replicated table and offsets. Since I am now pushing to one node, ZooKeeper lag can be ruled out as a cause.

@ne1r0n I also believe the offset handling seems to be causing a few events to be dropped.

Fascinating - so to summarize (please correct me if I'm wrong):

  • Payload sent with some data (e.g. min-offset: 1 and max-offset: 10)
  • A hiccup happens in Kafka (as they sometimes do)
  • Due to replicating the connect-state across nodes, the min-offset and max-offset may not be 100% accurate (from either a prior or a new call), so chopping happens and some records are dropped (sketched below)
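To make the "chopping" in the last point concrete: suppose offsets up to 110 are actually in ClickHouse, but the state that gets read back claims 120 (a stale or over-advanced entry from a prior call or another replica). A re-delivered batch covering offsets 111-125 would then be trimmed to 121-125, and offsets 111-120 would be silently lost. Below is a minimal, self-contained sketch of that trimming logic; it is illustrative only and is not the connector's actual code (class and method names are made up).

    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.LongStream;

    // Illustrative only: shows how trimming an overlapping batch against a stale
    // stored offset window can silently drop records.
    public class OverlapTrimSketch {

        // Keep only the records whose offset is beyond the stored max offset.
        static List<Long> trimAgainstState(List<Long> batchOffsets, long storedMaxOffset) {
            return batchOffsets.stream()
                    .filter(offset -> offset > storedMaxOffset)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            // A re-delivered batch after a Kafka hiccup: offsets 111..125
            List<Long> batch = LongStream.rangeClosed(111, 125).boxed().collect(Collectors.toList());

            // Accurate state: offsets up to 110 are already in ClickHouse.
            System.out.println(trimAgainstState(batch, 110)); // keeps 111..125, nothing lost

            // Stale / over-advanced state: claims 120 is already written.
            System.out.println(trimAgainstState(batch, 120)); // keeps only 121..125, 111..120 dropped
        }
    }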

Is this happening in a self-hosted instance, or a cloud instance?

Yes, this is correct. It's been 2 days and more than 3 million records, and I haven't seen any missing records yet, because I removed 2 nodes from the 3-node cluster so data is going to only 1 node now. We are processing millions of records every few hours.

Will update if I see the problem happening on even 1 node too. Trying to figure out a solution where reading from the connect state table is consistent.

Yes, the infra is AWS Managed Kafka (MSK), the ClickHouse Kafka Connector, and a 3-node ClickHouse cluster behind an AWS load balancer, with Keeper running for coordination.

It looks like it happened with 1 node too; it took millions of events, but it did happen after 3.5 million events.

So it could be Keeper lag, the cause @ne1r0n mentioned, or a combination of the two.

@ne1r0n did it happen to you without any broker restart, i.e. while Kafka was running normally? I am looking to apply some solution because we are running in production, so missing events is a big deal.

Do you happen to have the config for the self-hosted setup, the connector, and some sample data? I'd be curious to know if I can a) replicate this locally and b) replicate this in the cloud.

Sure, it's pretty standard, but here are the files.

Python script to generate sample data
https://gist.github.com/abhishekgahlot2/c43d0d4509f6f4fbb536ae910d9b6f9a

sample data json
https://gist.github.com/abhishekgahlot2/01555c6586510bbca90acf9538624487

docker-compose for ClickHouse
https://gist.github.com/abhishekgahlot2/568adcafbb04be4ef95425213990a757

config.xml
https://gist.github.com/abhishekgahlot2/879d6d40028b38487018f0af7e48cd57

Please let me know if you need anything else.

It would be wise to log records from the ProxySinkTask.java file to rule out other potential problems; I will do that.
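If it helps, here is a hedged sketch of the kind of per-batch offset logging I have in mind; the class is hypothetical and not the actual ProxySinkTask internals, it only uses the standard Kafka Connect SinkRecord accessors:

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.connect.sink.SinkRecord;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical helper: log per-partition offset ranges for each batch handed to put(),
    // so gaps between consecutive batches show up directly in the connector logs.
    public class BatchOffsetLogger {
        private static final Logger log = LoggerFactory.getLogger(BatchOffsetLogger.class);

        // Remembers the last max offset seen per "topic-partition" key.
        private final Map<String, Long> lastMaxOffset = new HashMap<>();

        public void logBatch(Collection<SinkRecord> records) {
            Map<String, long[]> ranges = new HashMap<>(); // key -> {min, max, count}
            for (SinkRecord record : records) {
                String key = record.topic() + "-" + record.kafkaPartition();
                long offset = record.kafkaOffset();
                ranges.compute(key, (k, r) -> r == null
                        ? new long[]{offset, offset, 1}
                        : new long[]{Math.min(r[0], offset), Math.max(r[1], offset), r[2] + 1});
            }
            ranges.forEach((key, r) -> {
                Long previousMax = lastMaxOffset.put(key, r[1]);
                if (previousMax != null && r[0] > previousMax + 1) {
                    log.warn("Offset gap on {}: previous max {}, batch starts at {}", key, previousMax, r[0]);
                }
                log.info("Batch for {}: min={}, max={}, count={}", key, r[0], r[1], r[2]);
            });
        }
    }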

I believe this newly merged code:

        long actualMinOffset = rangeContainer.getMinOffset();
        long actualMaxOffset = rangeContainer.getMaxOffset();

        // SAME State [0,10] Actual [0,10]
        if (maxOffset == actualMaxOffset && minOffset == actualMinOffset)
            return RangeState.SAME;
        // NEW State [0,10] Actual [11,20]
        if (actualMinOffset > maxOffset)
            return RangeState.NEW;
        // CONTAINS [0,10] Actual [1, 10]
        if (actualMaxOffset <= maxOffset && actualMinOffset >= minOffset)
            return RangeState.CONTAINS;
        // ZEROED [10, 20] Actual [0, 10]
        if (actualMinOffset < minOffset && actualMinOffset == 0)
            return RangeState.ZERO;
        // ERROR [10,20] Actual [8, X]
        if (actualMinOffset < minOffset)
            return RangeState.ERROR;
        // OVER_LAPPING
        return RangeState.OVER_LAPPING;
    }

seems to address the problem of missing events in the overlap case. I am deploying the new code to see if it resolves the issue.
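For reference, here is how those branches map onto the scenarios discussed in this thread, as a small self-contained sketch; the enum values and comparisons mirror the snippet above, but the wrapper class and main() are illustrative rather than the merged code verbatim:

    // Standalone illustration of the range comparison above.
    public class RangeStateSketch {
        enum RangeState { SAME, NEW, CONTAINS, ZERO, ERROR, OVER_LAPPING }

        // minOffset/maxOffset: stored state; actualMinOffset/actualMaxOffset: incoming batch.
        static RangeState compare(long minOffset, long maxOffset,
                                  long actualMinOffset, long actualMaxOffset) {
            if (maxOffset == actualMaxOffset && minOffset == actualMinOffset) return RangeState.SAME;
            if (actualMinOffset > maxOffset) return RangeState.NEW;
            if (actualMaxOffset <= maxOffset && actualMinOffset >= minOffset) return RangeState.CONTAINS;
            if (actualMinOffset < minOffset && actualMinOffset == 0) return RangeState.ZERO;
            if (actualMinOffset < minOffset) return RangeState.ERROR;
            return RangeState.OVER_LAPPING;
        }

        public static void main(String[] args) {
            // State [0,10] vs. a re-delivered batch [8,15]: the overlap case from this issue.
            System.out.println(compare(0, 10, 8, 15));  // OVER_LAPPING
            // State [0,10] vs. batch [11,20]: clean continuation.
            System.out.println(compare(0, 10, 11, 20)); // NEW
            // State [10,20] vs. batch [0,10]: offsets reset back to zero.
            System.out.println(compare(10, 20, 0, 10)); // ZERO
        }
    }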

The latest code merged was a fix provided by @ne1r0n - hopefully it turns out to solve the issue 🙂

Another cause I suspect is having tasks.max = 2.

Also, in KeeperStateProvider.java this could happen in the future:

 String key = String.format("%s-%d", topic, partition);
 String selectStr = String.format("SELECT * FROM `%s` WHERE `key`= '%s'", csc.getZkDatabase(), key);

in the case of the new code where virtual topics are allowed: the same key can be stored in the connect_state table with different min and max offsets.

For example, if I have the same table in two different databases, the keeper state for both would look like
tablename-0 | minOffset | maxOffset
I think it should be changed to account for multiple databases too, because for a virtual topic only the last part is taken as the table key (see the sketch below).
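For illustration, here is a small sketch of the collision I mean; the "%s-%d" key format comes from the snippet above, while the virtual-topic format and the database-qualified key are hypothetical suggestions:

    // Sketch of the key collision: if only the table part of a virtual topic is used,
    // two tables with the same name in different databases share a connect_state key.
    public class StateKeySketch {
        // Mimics taking "only the last part" of a virtual topic as the table key.
        static String tablePart(String virtualTopic) {
            return virtualTopic.substring(virtualTopic.lastIndexOf('.') + 1);
        }

        public static void main(String[] args) {
            int partition = 0;
            String topicA = "analytics.events"; // hypothetical: database "analytics", table "events"
            String topicB = "billing.events";   // hypothetical: database "billing", table "events"

            // Current-style key on the table part alone: both collide on "events-0".
            System.out.println(String.format("%s-%d", tablePart(topicA), partition)); // events-0
            System.out.println(String.format("%s-%d", tablePart(topicB), partition)); // events-0

            // A suggested (hypothetical) disambiguation that includes the database:
            System.out.println(String.format("%s.%s-%d", "analytics", tablePart(topicA), partition)); // analytics.events-0
            System.out.println(String.format("%s.%s-%d", "billing", tablePart(topicB), partition));   // billing.events-0
        }
    }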

Please let me know if I am wrong here. Can the connector run into a race condition when using tasks.max = 2?

It shouldn't get into a race condition, because of the way the Connect framework actually handles tasks - essentially, it's 1:1 with partitions (unless a task fails, I believe).
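For context, the framework tells each sink task which partitions it exclusively owns through SinkTask.open(), which is why two tasks shouldn't process the same partition concurrently. A minimal sketch (this is a generic Kafka Connect sink task skeleton, not the ClickHouse connector's code):

    import java.util.Collection;
    import java.util.Map;

    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.connect.sink.SinkRecord;
    import org.apache.kafka.connect.sink.SinkTask;

    // Minimal sketch: the framework hands each task an exclusive set of partitions via
    // open(), so with tasks.max=2 the two tasks never share a partition.
    public class PartitionAwareSinkTask extends SinkTask {
        @Override
        public void open(Collection<TopicPartition> partitions) {
            // Called on (re)assignment; "partitions" is this task's exclusive set.
            System.out.println("Task assigned partitions: " + partitions);
        }

        @Override
        public void put(Collection<SinkRecord> records) {
            // Only records from the partitions passed to open() arrive here.
        }

        @Override
        public void start(Map<String, String> props) { }

        @Override
        public void stop() { }

        @Override
        public String version() { return "sketch"; }
    }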

I see, thanks - in that case it shouldn't happen. I have now deployed the new main branch to production to see whether I still get overlapping issues with the new code.

I am using this PR: #389

Once I verify that it's working, I can add multiple nodes so that the ZooKeeper lag question can be figured out too. :)

An update: we still got the missing-event issue after adding multiple nodes to the load balancer. It looks like ZooKeeper lag is still a problem.

We hadn't had any case for 4-5 days with a single node, but as soon as we added multiple nodes, events started going missing again. They are less frequent since the main branch was deployed, but still there.

cc @Paultagoras

Hmm, interesting - so just to be clear, are you adding Keeper nodes or ClickHouse nodes?

And are you using something like https://clickhouse.com/docs/en/sql-reference/table-functions/cluster with your SELECT statement to make sure you're getting all the entries across replicas? (It's eventual consistency, I believe, not immediate. There are also more performance details on that doc page around Distributed tables that could help.)
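If it helps, here is a hedged sketch of what a cross-replica read could look like, modeled on the KeeperStateProvider query quoted earlier in the thread; the cluster, database, and table names are placeholders, and this is not the connector's actual query:

    // Hypothetical variant of the state lookup that reads via clusterAllReplicas(), so the
    // SELECT is not limited to whichever replica the load balancer happens to hit.
    public class ClusterStateQuerySketch {
        static String buildSelect(String cluster, String database, String stateTable,
                                  String topic, int partition) {
            String key = String.format("%s-%d", topic, partition);
            return String.format(
                    "SELECT * FROM clusterAllReplicas('%s', `%s`.`%s`) WHERE `key` = '%s'",
                    cluster, database, stateTable, key);
        }

        public static void main(String[] args) {
            System.out.println(buildSelect("default_cluster", "default", "connect_state", "events", 0));
        }
    }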

@abhishekgahlot2, if the issue still happens, can you ping us on the ClickHouse community Slack so we can discuss it?

I'm going to close this since we haven't heard anything, but if it's continuing to happen please let us know!

Thanks @Paultagoras @mzitnik, I will let you guys know on Slack. As of now I haven't had the issue; one instance was probably due to a timeout, but apart from that nothing.

Thanks again for helping here :)