linkedin/brooklin

Invalid negative sequence number used

RagingPuppies opened this issue · 0 comments

Subject of the issue

On high-load Brooklin clusters, once flushless mode is enabled I get tons of these errors;
producers get killed, but some partitions keep growing lag until I restart the service.

Your environment

  • Ubuntu 18.04
  • Brooklin version 1.1.0
  • Java version 1.8.0_152
  • Kafka version 2.5.2
  • ZooKeeper version 3.4

Steps to reproduce

my default producer settings:

default['brooklin_v2']['default_provider_properties']['buffer.memory'] = '61600000'
default['brooklin_v2']['default_provider_properties']['batch.size'] = '80000'
default['brooklin_v2']['default_provider_properties']['linger.ms'] = '15000'
default['brooklin_v2']['default_provider_properties']['request.timeout.ms'] = '300000'
default['brooklin_v2']['default_provider_properties']['compression.type'] = 'gzip'
default['brooklin_v2']['default_provider_properties']['producersPerTask'] = '3'
default['brooklin_v2']['default_provider_properties']['numProducersPerConnector'] = '15'
default['brooklin_v2']['default_provider_properties']['producerRateLimiter'] = '0.05'
default['brooklin_v2']['default_provider_properties']['retries'] = '100'
default['brooklin_v2']['default_provider_properties']['retry.backoff.ms'] = '300'
default['brooklin_v2']['default_provider_properties']['enable.idempotence'] = 'true'
default['brooklin_v2']['default_provider_properties']['acks'] = 'all'

my default consumer:

default['brooklin_v2']['default_connector_properties']['consumer.partition.assignment.strategy'] = 'org.apache.kafka.clients.consumer.RoundRobinAssignor'
default['brooklin_v2']['default_connector_properties']['factoryClassName'] = 'com.linkedin.datastream.connectors.kafka.mirrormaker.KafkaMirrorMakerConnectorFactory'
default['brooklin_v2']['default_connector_properties']['assignmentStrategyFactory'] = 'com.linkedin.datastream.server.assignment.BroadcastStrategyFactory'
default['brooklin_v2']['default_connector_properties']['consumer.receive.buffer.bytes'] = '4096000'
default['brooklin_v2']['default_connector_properties']['consumer.session.timeout.ms'] = '120000'
default['brooklin_v2']['default_connector_properties']['consumer.heartbeat.interval.ms'] = '9000'
default['brooklin_v2']['default_connector_properties']['consumer.request.timeout.ms'] = '60000'
default['brooklin_v2']['default_connector_properties']['consumer.auto.offset.reset'] = 'latest'
default['brooklin_v2']['default_connector_properties']['commitIntervalMs'] = '30000'
default['brooklin_v2']['default_connector_properties']['consumer.fetch.min.bytes'] = '500000'
default['brooklin_v2']['default_connector_properties']['consumer.fetch.max.bytes'] = '4000000'
default['brooklin_v2']['default_connector_properties']['consumer.fetch.max.wait.ms'] = '250'
default['brooklin_v2']['default_connector_properties']['pausePartitionOnError'] = 'true'
default['brooklin_v2']['default_connector_properties']['pauseErrorPartitionDurationMs'] = '180000'

Other than that, I've added "isFlushlessModeEnabled": "true" to my connector settings.
My Brooklin cluster is processing around 600 MB/s with 3 datastreams, 450 consumers, and around 1000 producers.
I have 3 more clusters suffering from the same issue, which started about 2 weeks after the implementation.
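For context, the flushless flag is set per datastream rather than cluster-wide; a sketch of what that looks like in a datastream's metadata (the datastream name and connector name below are illustrative, only the isFlushlessModeEnabled key is from my setup):

```json
{
  "name": "my-mirror-datastream",
  "connectorName": "kafkaMirroringConnector",
  "metadata": {
    "isFlushlessModeEnabled": "true"
  }
}
```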

Expected behaviour

A producer that hits OUT_OF_ORDER_SEQUENCE_NUMBER should retry and eventually die; a new producer should then spawn and get a new PID, and all partitions should heal once the new producers are up.

Actual behaviour

This mostly happens, but one to a few partitions don't heal. There are no errors related to these partitions; Brooklin just seems to "forget" about them. In the screenshot below, you can see that most partitions resume consumption after the issue occurs, but one of them came back only after a service restart. Strangely, on the Brooklin side the lag metric doesn't even show any lag for that partition.

Screen Shot 2022-03-11 at 18 37 59
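Since Brooklin's own lag metric shows nothing for the stuck partition, one way to catch it is to compare broker end offsets with the group's committed offsets (e.g. as reported by `kafka-consumer-groups.sh --describe`). A minimal sketch; `compute_stuck_partitions` is a hypothetical helper and the offset dicts are assumed inputs, not part of any Brooklin API:

```python
def compute_stuck_partitions(end_offsets, committed_offsets, threshold=0):
    """Return {partition: lag} for partitions whose lag exceeds threshold.

    end_offsets / committed_offsets map partition -> offset, e.g. scraped
    from `kafka-consumer-groups.sh --describe` output (assumed input shape).
    A partition with no committed offset is treated as never consumed.
    """
    lag = {p: end_offsets[p] - committed_offsets.get(p, 0) for p in end_offsets}
    return {p: l for p, l in lag.items() if l > threshold}
```

Running something like this periodically against the mirrored topics would flag the "forgotten" partition even when Brooklin's lag metric reports zero.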