awslabs/amazon-kinesis-client

MultiLangDaemon Throws NullPointerException Going From One Shard To Two When Multiple Daemons Are Running

eyesoftime opened this issue · 14 comments

If a stream with one shard is split into two shards and two daemons are started from the trim horizon of the shards, the daemon that is processing the parent shard dies with a NullPointerException when it reaches the end of that shard. The second daemon takes over processing of the child shards, but one of the daemons has already exited. This happens with 1.2.0 as well as 1.4.0.

Jun 29, 2015 11:41:47 AM com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor stopProcessing
SEVERE: Encountered an error while trying to shutdown child process
java.lang.NullPointerException
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.shutdown(MultiLangRecordProcessor.java:154)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Jun 29, 2015 11:41:47 AM com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor stopProcessing
SEVERE: Encountered error while trying to shutdown
java.lang.NullPointerException
        at com.amazonaws.services.kinesis.multilang.MessageWriter.close(MessageWriter.java:163)
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.childProcessShutdownSequence(MultiLangRecordProcessor.java:186)
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.stopProcessing(MultiLangRecordProcessor.java:249)
        at com.amazonaws.services.kinesis.multilang.MultiLangRecordProcessor.shutdown(MultiLangRecordProcessor.java:164)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
        at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Thanks for reporting. I'll try to reproduce this.

Hi eyesoftime,

I'm not able to reproduce this. Can you provide the steps you took to produce the problem?

What language are you using to process the records? Are you using one of the official multilang KCLs?

Thanks

I was loading records into the stream, each one about 130 KB of arbitrary data with indexes applied for tracking. Initially I started out with one shard. After about 2500 records I split the shard into two, and after another 2500 records I merged the new shards. So there are 4 shards altogether. There were no consumers running at the time (but that doesn't really change the outcome).
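For reference, a minimal sketch of that resharding sequence, assuming boto3 and a hypothetical stream name (the original test may have used different tooling; waiting for the stream to return to ACTIVE between operations is omitted):

import boto3

kinesis = boto3.client("kinesis")
STREAM = "test-stream"  # hypothetical stream name

def open_shards():
    # Shards without an ending sequence number are still open.
    shards = kinesis.describe_stream(StreamName=STREAM)["StreamDescription"]["Shards"]
    return [s for s in shards if "EndingSequenceNumber" not in s["SequenceNumberRange"]]

# Stage 1: one open shard; after ~2500 records, split it into two.
parent = open_shards()[0]
midpoint = (int(parent["HashKeyRange"]["StartingHashKey"])
            + int(parent["HashKeyRange"]["EndingHashKey"])) // 2
kinesis.split_shard(StreamName=STREAM,
                    ShardToSplit=parent["ShardId"],
                    NewStartingHashKey=str(midpoint))

# Stage 2: two open child shards; after another ~2500 records, merge them back.
left, right = sorted(open_shards(),
                     key=lambda s: int(s["HashKeyRange"]["StartingHashKey"]))
kinesis.merge_shards(StreamName=STREAM,
                     ShardToMerge=left["ShardId"],
                     AdjacentShardToMerge=right["ShardId"])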

So then I started one daemon with the Python sample application (with additional logging added, again for tracking). While it was consuming records from the first shard, I started the second daemon, which sat idle because the first shard hadn't been fully consumed yet and the second and third shards are its children. When the consumer of the first shard reached the end, the first daemon died with the NPE. This happened repeatedly, whether the daemons ran on the same EC2 instance or one ran in parallel on my local machine. The same thing happened when the test was done with shards 2-1-2, i.e. merging and then splitting; in that case it also died when going from one shard to two.

Hope it helps you.

Thanks for the information. I did not do the shard merge in my own test, so that might be the problem. I will do so to see if that reproduces the problem.

I have reproduced the problem. The problem isn't with MultiLangRecordProcessor per se, but rather with the Worker implementation.

A Worker will sometimes call shutdown on an IRecordProcessor even if initialize has not been called on the same instance. Since MultiLangRecordProcessor uses its initialize method to construct certain fields, and its shutdown method assumes that those fields have been initialized, an NPE occurs.

Once again thank you for reporting the problem. It will be fixed in a future release.

@kevincdeng

I'm using the Python wrapper for this package, and I seem to be running into the same issue. Is there a workaround you can recommend to guarantee that initialize always gets called?

If your code doesn't need the shard id, you might be able to place that setup in the constructor of the class instead. Do you absolutely need initialize to be called? If it's just to ensure proper functioning of the shutdown method, adding a flag to check whether initialization has happened might be sufficient, as in the sketch below.
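A rough illustration of that flag-based guard, assuming the older amazon_kclpy v1-style record processor interface (method signatures may differ in the version you run):

from amazon_kclpy import kcl

class GuardedRecordProcessor(kcl.RecordProcessorBase):
    def __init__(self):
        # Setup that does not depend on the shard id can live here instead of
        # initialize(), since the constructor always runs.
        self.initialized = False
        self.shard_id = None

    def initialize(self, shard_id):
        self.shard_id = shard_id
        self.initialized = True

    def process_records(self, records, checkpointer):
        for record in records:
            pass  # normal record handling goes here

    def shutdown(self, checkpointer, reason):
        if not self.initialized:
            # shutdown can arrive without initialize ever having been called
            # (the situation described in this issue), so skip any cleanup
            # that relies on state created in initialize().
            return
        if reason == 'TERMINATE':
            checkpointer.checkpoint()

if __name__ == '__main__':
    kcl.KCLProcess(GuardedRecordProcessor()).run()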

Was this fixed in a recent version?

This remains unfixed. The MultiLangRecordProcessor has not been changed since Oct 2014 https://github.com/awslabs/amazon-kinesis-client/blob/73ac2c0e25a25776cbc88f2c685223fb049e6757/src/main/java/com/amazonaws/services/kinesis/multilang/MultiLangRecordProcessor.java

I was able to reproduce this issue on 1.6.1 (the current latest version)

@kevincdeng ETA here?

@kevincdeng @findchris FWIW, I've found success by setting the failoverTimeMillis property in the .properties file to a high value (e.g. 100 seconds).
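For example, something along these lines in the MultiLangDaemon .properties file (the value is in milliseconds, so 100 seconds looks like this):

# raise the lease failover time to 100 seconds
failoverTimeMillis = 100000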

This apparently shipped in https://github.com/awslabs/amazon-kinesis-client#release-162-march-23-2016 @manango can you close this issue please? It's confusing to leave it open.

The issue has been resolved in the 1.6.2 release. Closing the issue.

I am facing a similar issue in https://github.com/awslabs/amazon-kinesis-client-net. Can someone please help?