Invalid negative sequence number used
RagingPuppies opened this issue · 0 comments
Subject of the issue
On high-load Brooklin clusters, once flushless mode is enabled, I get tons of these errors.
Producers get killed, but on some partitions the lag keeps growing until I restart the service.
Your environment
- Ubuntu 18.04
- Brooklin version 1.1.0
- Java version 1.8.0_152
- Kafka version 2.5.2
- ZooKeeper version 3.4
Steps to reproduce
My default producer settings:
default['brooklin_v2']['default_provider_properties']['buffer.memory'] = '61600000'
default['brooklin_v2']['default_provider_properties']['batch.size'] = '80000'
default['brooklin_v2']['default_provider_properties']['linger.ms'] = '15000'
default['brooklin_v2']['default_provider_properties']['request.timeout.ms'] = '300000'
default['brooklin_v2']['default_provider_properties']['compression.type'] = 'gzip'
default['brooklin_v2']['default_provider_properties']['producersPerTask'] = '3'
default['brooklin_v2']['default_provider_properties']['numProducersPerConnector'] = '15'
default['brooklin_v2']['default_provider_properties']['producerRateLimiter'] = '0.05'
default['brooklin_v2']['default_provider_properties']['retries'] = '100'
default['brooklin_v2']['default_provider_properties']['retry.backoff.ms'] = '300'
default['brooklin_v2']['default_provider_properties']['enable.idempotence'] = 'true'
default['brooklin_v2']['default_provider_properties']['acks'] = 'all'
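As a sanity check on the producer settings above, here is a minimal sketch that compares a property hash with the prerequisites Kafka documents for `enable.idempotence=true` (`acks=all`, `retries` greater than 0, and at most 5 in-flight requests per connection). The helper name `idempotence_problems` is hypothetical, and the sketch assumes these properties reach the Kafka producer unchanged; it is not part of Brooklin.

```ruby
# Hypothetical helper (not a Brooklin or Kafka API): lists violations of the
# documented Kafka prerequisites for enable.idempotence=true.
def idempotence_problems(props)
  return [] unless props['enable.idempotence'].to_s == 'true'

  problems = []
  problems << 'acks must be "all" (or -1)' unless %w[all -1].include?(props['acks'].to_s)
  # Assumption: an unset value falls back to a client default that is valid.
  problems << 'retries must be greater than 0' unless props.fetch('retries', '1').to_i > 0
  in_flight = props.fetch('max.in.flight.requests.per.connection', '5').to_i
  problems << 'max.in.flight.requests.per.connection must be at most 5' if in_flight > 5
  problems
end

# Subset of the settings listed in this issue.
producer_props = {
  'retries' => '100',
  'retry.backoff.ms' => '300',
  'enable.idempotence' => 'true',
  'acks' => 'all'
}

puts idempotence_problems(producer_props).inspect
```

For the settings in this issue the list comes back empty, so the idempotence prerequisites themselves do not look like the cause.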
My default consumer settings:
default['brooklin_v2']['default_connector_properties']['consumer.partition.assignment.strategy'] = 'org.apache.kafka.clients.consumer.RoundRobinAssignor'
default['brooklin_v2']['default_connector_properties']['factoryClassName'] = 'com.linkedin.datastream.connectors.kafka.mirrormaker.KafkaMirrorMakerConnectorFactory'
default['brooklin_v2']['default_connector_properties']['assignmentStrategyFactory'] = 'com.linkedin.datastream.server.assignment.BroadcastStrategyFactory'
default['brooklin_v2']['default_connector_properties']['consumer.receive.buffer.bytes'] = '4096000'
default['brooklin_v2']['default_connector_properties']['consumer.session.timeout.ms'] = '120000'
default['brooklin_v2']['default_connector_properties']['consumer.heartbeat.interval.ms'] = '9000'
default['brooklin_v2']['default_connector_properties']['consumer.request.timeout.ms'] = '60000'
default['brooklin_v2']['default_connector_properties']['consumer.auto.offset.reset'] = 'latest'
default['brooklin_v2']['default_connector_properties']['commitIntervalMs'] = '30000'
default['brooklin_v2']['default_connector_properties']['consumer.fetch.min.bytes'] = '500000'
default['brooklin_v2']['default_connector_properties']['consumer.fetch.max.bytes'] = '4000000'
default['brooklin_v2']['default_connector_properties']['consumer.fetch.max.wait.ms'] = '250'
default['brooklin_v2']['default_connector_properties']['pausePartitionOnError'] = 'true'
default['brooklin_v2']['default_connector_properties']['pauseErrorPartitionDurationMs'] = '180000'
Other than that, I've added "isFlushlessModeEnabled": "true" to my connector settings.
My Brooklin cluster is processing around 600 MBps with 3 datastreams, 450 consumers, and around 1000 producers.
I have 3 more clusters suffering from the same issue, which started about 2 weeks after the implementation.
Expected behaviour
A producer that hits OUT_OF_ORDER_SEQUENCE_NUMBER should retry and die at some point.
A new producer should then come up and get a new PID, and all partitions should heal once the new producers have spawned.
Actual behaviour
Overall this is what happens, but one or a few partitions don't heal.
There are no errors related to these partitions; it just seems that Brooklin "forgets" about them.
In this screenshot, you can see that most of the partitions are consumed again after the issue occurs, but one only resumed consumption after a service restart. On the Brooklin side, the lag metric doesn't even show that there is lag, which is strange.
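Because the Brooklin lag metric does not show the problem, one way to spot a "forgotten" partition from outside is to compare two offset snapshots taken a few minutes apart. The sketch below is only an illustration: the helper `stalled_partitions`, the snapshot layout, and the offset values are all hypothetical, and collecting the offsets (for example with Kafka's consumer group tooling) is left out.

```ruby
# Hypothetical helper: each snapshot maps a partition id to
# { committed: <committed consumer offset>, end: <log end offset> }.
def stalled_partitions(before, after)
  after.select do |partition, now|
    earlier = before[partition]
    next false if earlier.nil?

    # Stalled: new records arrived, but the committed offset did not advance.
    now[:end] > earlier[:end] && now[:committed] <= earlier[:committed]
  end.keys
end

# Made-up offsets for illustration only.
before = { 0 => { committed: 100, end: 120 }, 7 => { committed: 500, end: 510 } }
after  = { 0 => { committed: 180, end: 200 }, 7 => { committed: 500, end: 900 } }

puts stalled_partitions(before, after).inspect
```

In this example only partition 7 is reported: its log end offset grew while its committed offset stayed put, which matches the symptom described above.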