awslabs/amazon-kinesis-client

assertAllParentShardsAreClosed check interacts poorly with high-shard-count DynamoDB streams

zerth opened this issue · 1 comment

zerth commented

This assertion during shard sync can prevent worker initialization for workers processing DynamoDB streams associated with tables that have a large number of partitions:

assertAllParentShardsAreClosed(shardIdToChildShardIdsMap, shardIdToShardMap);

Example error message:

Sep 13, 2017 4:40:31 PM com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncTask call
SEVERE: Caught exception while sync'ing Kinesis shards and leases
com.amazonaws.services.kinesis.clientlibrary.exceptions.internal.KinesisClientLibIOException: Parent shardId shardId-00000001505307450567-xxxxxxxx is not closed. This can happen due to a race condition between describeStream and a reshard operation.
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.assertAllParentShardsAreClosed(ShardSyncer.java:161)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.syncShardLeases(ShardSyncer.java:117)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncer.checkAndCreateLeasesForNewShards(ShardSyncer.java:88)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShardSyncTask.call(ShardSyncTask.java:68)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.initialize(Worker.java:427)
    at com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker.run(Worker.java:356)
    at com.amazonaws.services.kinesis.multilang.MultiLangDaemon.call(MultiLangDaemon.java:111)
    at com.amazonaws.services.kinesis.multilang.MultiLangDaemon.call(MultiLangDaemon.java:58)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

This is with KCL 1.7.5 and dynamodb-streams-kinesis-adapter 1.2.1.

I believe the cause is that paginating the DynamoDB stream description for a very large table takes more than several seconds. By the time the later pages are fetched, a new child shard is likely to have been created. That child references a parent that was seen as still open on an earlier page of the response, which triggers this assertion.

This commit attempts to fix this against 1.7.5: AdRoll@09dcc99

The attempted fix adds a configuration parameter that controls whether the assertion is made. During shard sync it also prevents new leases from being created for child shards whose parents are still open.
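
A minimal sketch of the behaviour described above, assuming a simplified shard model. This is not the code from the linked commit. The class `ShardSyncSketch`, the `Shard` stand-in, the method `shardsEligibleForLeases` and the flag `assertParentsClosed` are all hypothetical names chosen for illustration, and a shard is treated as closed when it has an ending sequence number.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the KCL.
final class ShardSyncSketch {

    /** Simplified stand-in for a stream shard (not the SDK's Shard class). */
    static final class Shard {
        final String shardId;
        final String parentShardId;        // null for a shard with no parent
        final String endingSequenceNumber; // null while the shard is still open

        Shard(String shardId, String parentShardId, String endingSequenceNumber) {
            this.shardId = shardId;
            this.parentShardId = parentShardId;
            this.endingSequenceNumber = endingSequenceNumber;
        }

        boolean isClosed() {
            return endingSequenceNumber != null;
        }
    }

    /**
     * Returns the shards from a (possibly stale) listing that may receive new leases.
     * If assertParentsClosed is true, an open parent aborts the sync with an exception.
     * If it is false, children of open parents are skipped and left for a later sync.
     */
    static List<Shard> shardsEligibleForLeases(List<Shard> listing, boolean assertParentsClosed) {
        Map<String, Shard> shardsById = new HashMap<>();
        for (Shard shard : listing) {
            shardsById.put(shard.shardId, shard);
        }

        List<Shard> eligible = new ArrayList<>();
        for (Shard shard : listing) {
            Shard parent = shard.parentShardId == null ? null : shardsById.get(shard.parentShardId);
            boolean parentStillOpen = parent != null && !parent.isClosed();
            if (parentStillOpen) {
                if (assertParentsClosed) {
                    throw new IllegalStateException(
                            "Parent shardId " + parent.shardId + " is not closed.");
                }
                continue; // lenient mode: no lease for this child yet
            }
            eligible.add(shard);
        }
        return eligible;
    }

    public static void main(String[] args) {
        // Stale listing: the parent was read on an early page while still open,
        // and the child appeared on a later page after a reshard.
        List<Shard> listing = Arrays.asList(
                new Shard("shardId-parent", null, null),
                new Shard("shardId-child", "shardId-parent", null));

        for (Shard shard : shardsEligibleForLeases(listing, false)) {
            System.out.println("lease candidate: " + shard.shardId);
        }
    }
}
```

With the flag set to `false`, the stale listing yields a lease candidate only for the parent, and the child is picked up by a later sync once the parent is observed as closed. With the flag set to `true`, the same listing throws, which corresponds to the failure shown in the stack trace above.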

zerth commented

PR was merged.