awslabs/amazon-kinesis-client

MultiLangDaemon Throws NullPointerException Going From One Shard To Two When Multiple Daemons Are Running

eyesoftime opened this issue · 14 comments

If a stream with one shard is split into two shards and two daemons are started from the trim horizon of the shards, the daemon that is processing the parent shard dies with a NullPointerException when it reaches the end of that shard. The second daemon takes over processing of the child shards, but one of the daemons has already exited. This happens with 1.2.0 as well as 1.4.0.

Jun 29, 2015 11:41:47 AM com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor stopProcessing
SEVERE: Encountered an error while trying to shutdown child process
java.lang.NullPointerException
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.shutdown(MultiLangRecordProcessor.java:154)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Jun 29, 2015 11:41:47 AM com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor stopProcessing
SEVERE: Encountered error while trying to shutdown
java.lang.NullPointerException
        at com.amazonaws.services.kinesis.multilang.MessageWriter.close(MessageWriter.java:163)
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.childProcessShutdownSequence(MultiLangRecordProcessor.java:186)
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.stopProcessing(MultiLangRecordProcessor.java:249)
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.shutdown(MultiLangRecordProcessor.java:164)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Thanks for reporting. I'll try to reproduce this.

Hi eyesoftime,

I'm not able to reproduce this. Can you provide the steps you took to produce the problem?

What language are you using to process the records? Are you using one of the official multilang KCLs?

Thanks

I was loading records into the stream, each one about 130 KB of arbitrary data with indexes applied for tracking. Initially I started out with one shard. After about 2500 records I split the shard into two, and after another 2500 records I merged the new shards. So there are 4 shards altogether. There were no consumers running at the time (but that doesn't really change the outcome).
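For reference, a minimal sketch of that resharding sequence, assuming boto3 and a hypothetical stream name (the original test may have used different tooling; waiting for the stream to return to ACTIVE between operations is omitted):

import boto3

kinesis = boto3.client("kinesis")
STREAM = "test-stream"  # hypothetical stream name

def open_shards():
    # Shards without an ending sequence number are still open.
    shards = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"]
    return [s for s in shards if "EndingSequenceNumber" not in s["SequenceNumberRange"]]

# Stage 1: one open shard; after ~2500 records, split it into two.
parent = open_shards()[0]
midpoint = (int(parent["HashKeyRange"]["StartingHashKey"])
            + int(parent["HashKeyRange"]["EndingHashKey"])) // 2
kinesis.split_shard(StreamName=STREAM,
                    ShardToSplit=parent["ShardId"],
                    NewStartingHashKey=str(midpoint))

# Stage 2: two open child shards; after another ~2500 records, merge them back.
left, right = sorted(open_shards(),
                     key=lambda s: int(s["HashKeyRange"]["StartingHashKey"]))
kinesis.merge_shards(StreamName=STREAM,
                     ShardToMerge=left["ShardId"],
                     AdjacentShardToMerge=right["ShardId"])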

So then I started one daemon with the Python sample application (with additional logging added, again for tracking). While it was consuming records from the first shard, I started the second daemon, which sat idle because the first shard hadn't been fully consumed yet and the second and third shards are its children. When the consumer of the first shard reached the end, the first daemon died with the NPE. This happened repeatedly, whether the daemons ran on the same EC2 instance or one ran in parallel on my local machine. The same thing happened when the test was done with shards 2-1-2, i.e. merging and then splitting; in that case it also died when going from one shard to two.

Hope it helps you.

Thanks for the information. I did not do the shard merge in my own test, so that might be the problem. I will do so to see if that reproduces the problem.

I have reproduced the problem. The problem isn't with MultiLangRecordProcessor per se, but rather with the Worker implementation.

A Worker will sometimes call shutdown on an IRecordProcessor even if initialize has not been called on the same instance. Since MultiLangRecordProcessor uses its initialize method to construct certain fields, and its shutdown method assumes that those fields have been initialized, an NPE occurs.

Once again thank you for reporting the problem. It will be fixed in a future release.

@kevincdeng

I'm using the Python wrapper for this package, and I seem to be running into the same issue. Is there a workaround you can recommend to guarantee that initialize always gets called?

If your code doesn't need the shard id, you might be able to place that setup in the constructor of the class instead. Do you absolutely need initialize to be called? If it's just to ensure proper functioning of the shutdown method, adding a flag to check whether initialization has happened might be sufficient, as in the sketch below.
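A rough illustration of that flag-based guard, assuming the older amazon_kclpy v1-style record processor interface (method signatures may differ in the version you run):

from amazon_kclpy import kcl

class GuardedRecordProcessor(kcl.RecordProcessorBase):
    def __init__(self):
        # Setup that does not depend on the shard id can live here instead of
        # initialize(), since the constructor always runs.
        self.initialized = False
        self.shard_id = None

    def initialize(self, shard_id):
        self.shard_id = shard_id
        self.initialized = True

    def process_records(self, records, checkpointer):
        for record in records:
            pass  # normal record handling goes here

    def shutdown(self, checkpointer, reason):
        if not self.initialized:
            # shutdown can arrive without initialize ever having been called
            # (the situation described in this issue), so skip any cleanup
            # that relies on state created in initialize().
            return
        if reason == 'TERMINATE':
            checkpointer.checkpoint()

if __name__ == '__main__':
    kcl.KCLProcess(GuardedRecordProcessor()).run()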

Was this fixed in a recent version?

This remains unfixed. The MultiLangRecordProcessor has not been changed since Oct 2014 https://github.com/awslabs/amazon-kinesis-client/blob/73ac2c0e25a25776cbc88f2c685223fb049e6757/src/main/java/com/amazonaws/services/kinesis/multilang/MultiLangRecordProcessor.java

I was able to reproduce this issue on 1.6.1 (the current latest version)

@kevincdeng ETA here?

@kevincdeng @findchris FWIW, I've found success by setting the failoverTimeMillis property in the .properties file to a high value (e.g. 100 seconds).
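For example, something along these lines in the MultiLangDaemon .properties file (the value is in milliseconds, so 100 seconds looks like this):

# raise the lease failover time to 100 seconds
failoverTimeMillis = 100000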

This apparently shipped in https://github.com/awslabs/amazon-kinesis-client#release-162-march-23-2016 @manango can you close this issue please? It's confusing to leave it open.

The issue has been resolved in the 1.6.2 release. Closing the issue.

I am facing a similar issue in https://github.com/awslabs/amazon-kinesis-client-net. Can someone please help?