Spark workers cannot connect to master
Closed this issue · 3 comments
Hi,
I'm trying to run Spark with HDFS on our cluster using Magpie through Slurm. The Spark master appears to be running fine and I can reach it through the WebUI, but the workers are unable to connect to it.
The worker logs have the following error:
2019-04-11 14:47:39 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(ch741); groups with view permissions: Set(); users with modify permissions: Set(ch741); groups with modify permissions: Set()
2019-04-11 14:47:39 INFO Utils:54 - Successfully started service 'sparkWorker' on port 43718.
2019-04-11 14:47:39 INFO Worker:54 - Starting Spark worker 10.43.0.149:43718 with 32 cores, 186.4 GB RAM
2019-04-11 14:47:39 INFO Worker:54 - Running Spark version 2.4.0
2019-04-11 14:47:39 INFO Worker:54 - Spark home: /usr/local/software/spark/spark-2.4.0-bin-hadoop2.7
2019-04-11 14:47:39 INFO log:192 - Logging initialized @3374ms
2019-04-11 14:47:39 INFO Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
2019-04-11 14:47:39 INFO Server:419 - Started @3413ms
2019-04-11 14:47:39 INFO AbstractConnector:278 - Started ServerConnector@32941ff2{HTTP/1.1,[http/1.1]}{0.0.0.0:8081}
2019-04-11 14:47:39 INFO Utils:54 - Successfully started service 'WorkerUI' on port 8081.
2019-04-11 14:47:39 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3109a02{/logPage,null,AVAILABLE,@Spark}
2019-04-11 14:47:39 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5c2dd2f3{/logPage/json,null,AVAILABLE,@Spark}
2019-04-11 14:47:39 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@51086c25{/,null,AVAILABLE,@Spark}
2019-04-11 14:47:39 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@623923fe{/json,null,AVAILABLE,@Spark}
2019-04-11 14:47:39 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5031cf00{/static,null,AVAILABLE,@Spark}
2019-04-11 14:47:39 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@320a869a{/log,null,AVAILABLE,@Spark}
2019-04-11 14:47:39 INFO WorkerWebUI:54 - Bound WorkerWebUI to 0.0.0.0, and started at http://cpu-e-149.data.cluster:8081
2019-04-11 14:47:39 INFO Worker:54 - Connecting to master cpu-e-83.data.cluster:7177...
2019-04-11 14:47:39 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7f7e4d4e{/metrics/json,null,AVAILABLE,@Spark}
2019-04-11 14:47:39 INFO TransportClientFactory:267 - Successfully created connection to cpu-e-83.data.cluster/10.43.0.83:7177 after 25 ms (0 ms spent in bootstraps)
2019-04-11 14:47:39 WARN Worker:87 - Failed to connect to master cpu-e-83.data.cluster:7177
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:253)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)
at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)
at org.apache.spark.network.sasl.SaslRpcHandler.receive(SaslRpcHandler.java:90)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:181)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:207)
at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
... 1 more
2019-04-11 14:47:45 INFO Worker:54 - Retrying connection to master (attempt # 1)
The Slurm output doesn't show any issues, apart from the following, which doesn't seem relevant:
-mkdir: java.net.URISyntaxException: Expected scheme-specific part at index 5: hdfs:
Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
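(For what it's worth, my reading of that exception is that the mkdir was invoked with a bare "hdfs:" URI, i.e. nothing after the scheme. A well-formed call would look something like the following, with the NameNode host/port and directory as placeholders:

hadoop fs -mkdir -p hdfs://<namenode-host>:<port>/some/dir

)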
I've tried changing a few different settings in conf/spark/spark-defaults.conf, such as spark.authenticate.enableSaslEncryption, but no luck so far.
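For reference, the SASL-related settings I've been toggling look roughly like this (illustrative excerpt of what I experimented with, not a known-good configuration):

# conf/spark/spark-defaults.conf (excerpt)
spark.authenticate                        true
spark.authenticate.enableSaslEncryption   true
spark.network.sasl.serverAlwaysEncrypt    true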
I'm testing with the WordCount example; I also tried the SparkPi example.
I've previously run Spark jobs through Slurm manually, simply by launching the master on the SLURMD_NODENAME node and workers on all the others, without issues (roughly the approach sketched below).
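For context, that manual approach is essentially this (simplified sketch with placeholder paths and node counts, not the exact script I used):

#!/bin/bash
#SBATCH --nodes=3
# Start the master on the node running the batch script.
MASTER_HOST=$SLURMD_NODENAME
$SPARK_HOME/sbin/start-master.sh
# Start one worker per allocated node, each registering with that master.
srun --ntasks=$SLURM_JOB_NUM_NODES --ntasks-per-node=1 \
    $SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker \
    spark://$MASTER_HOST:7077 &
sleep 10
# Submit the application against the standalone master.
$SPARK_HOME/bin/spark-submit --master spark://$MASTER_HOST:7077 <application>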
I'm using Magpie 2.1 with Spark 2.4.0, Hadoop 2.7.7 and JDK 8u141.
Any help/pointers are appreciated.
Many thanks,
Chris
Hi, a guess right off the bat: is the hostname the workers are trying to connect to correct? By default Magpie connects to the hosts as listed by Slurm, but those may not be the actual hosts/IPs you want to connect on. There are options to work around this (I'm on my phone, I can find the option names later).
Now that I'm sitting down, I see the "Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)" error, which is perplexing.
Given you've run Spark under Slurm before, is it possible you have an alternate spark-defaults.conf file lying around that your Spark 2.4.0 could accidentally be picking up? Did you apply the patches/spark/spark-2.4.0-bin-hadoop2.7-alternate.patch patch?
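Off the top of my head, something like this should show whether a stray config is being read and apply the patch (adjust paths to your setup; here SPARK_HOME is your Spark 2.4.0 install and MAGPIE_HOME is your Magpie checkout, and I may be misremembering the exact patch level):

# Check where Spark is reading its config from.
echo $SPARK_CONF_DIR
ls -l $SPARK_HOME/conf/spark-defaults.conf

# Apply the Magpie patch to the Spark install.
cd $SPARK_HOME
patch -p1 < $MAGPIE_HOME/patches/spark/spark-2.4.0-bin-hadoop2.7-alternate.patch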
Thanks for the quick response. Patching fixed it!