miguno/wirbelsturm

Some hostnames in the cluster are not resolvable; workers throw UnresolvedAddressException

bstrand opened this issue · 4 comments

Summary

Some of the hosts in the cluster are not resolvable, which leads to connection failures. The /etc/hosts entries are incomplete and inconsistent across the cluster. The specific symptom observed is a worker throwing UnresolvedAddressException when trying to connect to other supervisors.

Workaround

Of course, it is easily worked around by amending the hosts file on each machine to add entries for all machines in the cluster.
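For reference, a complete hosts file on supervisor1 would look roughly like the sketch below. The nimbus1, supervisor1, and zookeeper1 entries match the addresses shown under Environment Info; the supervisor2 and kafka1 addresses are placeholders, since their actual IPs are not captured in this report and depend on what Vagrant assigned.

[vagrant@supervisor1 ~]$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 supervisor1
10.0.0.251 nimbus1
10.0.0.101 supervisor1
10.0.0.241 zookeeper1
10.0.0.x supervisor2   # placeholder - replace with the IP Vagrant assigned to supervisor2
10.0.0.x kafka1        # placeholder - replace with the IP Vagrant assigned to kafka1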

Example error

For example, a worker running on supervisor1 is unable to connect to supervisor2:

2015-02-10 08:01:35 b.s.m.n.StormClientErrorHandler [INFO] Connection failed Netty-Client-supervisor2:6700
java.nio.channels.UnresolvedAddressException: null
        at sun.nio.ch.Net.checkAddress(Net.java:127) ~[na:1.7.0_75]
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:644) ~[na:1.7.0_75]
        at org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:108) [storm-core-0.9.3.jar:0.9.3]
        at org.apache.storm.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:70) [storm-core-0.9.3.jar:0.9.3]
        at org.apache.storm.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendDownstream(DefaultChannelPipeline.java:779) [storm-core-0.9.3.jar:0.9.3]
        at org.apache.storm.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:54) [storm-core-0.9.3.jar:0.9.3]
        at org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:591) [storm-core-0.9.3.jar:0.9.3]
        at org.apache.storm.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:582) [storm-core-0.9.3.jar:0.9.3]
        at org.apache.storm.netty.channel.Channels.connect(Channels.java:634) [storm-core-0.9.3.jar:0.9.3]
        at org.apache.storm.netty.channel.AbstractChannel.connect(AbstractChannel.java:207) [storm-core-0.9.3.jar:0.9.3]
        at org.apache.storm.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:229) [storm-core-0.9.3.jar:0.9.3]
        at org.apache.storm.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:182) [storm-core-0.9.3.jar:0.9.3]
        at backtype.storm.messaging.netty.Client.connect(Client.java:152) [storm-core-0.9.3.jar:0.9.3]
        at backtype.storm.messaging.netty.Client.access$000(Client.java:43) [storm-core-0.9.3.jar:0.9.3]
        at backtype.storm.messaging.netty.Client$1.run(Client.java:107) [storm-core-0.9.3.jar:0.9.3]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_75]
        at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_75]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) [na:1.7.0_75]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) [na:1.7.0_75]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]

Environment Info

Current machine states:
zookeeper1 running (virtualbox)
nimbus1 running (virtualbox)
supervisor1 running (virtualbox)
supervisor2 running (virtualbox)
kafka1 running (virtualbox)

[vagrant@supervisor1 ~]$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 supervisor1
10.0.0.251 nimbus1
10.0.0.101 supervisor1
10.0.0.241 zookeeper1

[vagrant@nimbus1 ~]$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 nimbus1
10.0.0.251 nimbus1
10.0.0.241 zookeeper1

Apologies, I see now this has been raised in issue #4.
I will add that the hosts were all brought up together with 'vagrant up', except for kafka1, which was brought up afterwards and has a complete hosts file.
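For clarity, the bring-up sequence was roughly as follows. This is only a sketch: the machine names match the status output under Environment Info, and the exact invocation for the first batch may have differed.

$ vagrant up zookeeper1 nimbus1 supervisor1 supervisor2    # first batch, brought up together; their hosts files ended up incomplete
$ vagrant up kafka1                                        # brought up afterwards; its hosts file was complete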

Thanks for the detailed report.

Can you retry with the ./deploy script?

Have you been able to sort out your problem? If so, please feel free to report back what the required fix was.

I relied on the previously reported workaround of manually updating the hosts files. I have not yet had the chance to circle back, reproduce the issue, and retry with the suggested alternative fix.