LLNL/magpie

Spark workers cannot connect to master


Hi,

I'm trying to run Spark with HDFS on our cluster using Magpie through SLURM. The Spark master appears to be running fine and I can connect to it through the WebUI, but the workers are unable to connect.
The worker logs show the following error:

2019-04-11 14:47:39 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(ch741); groups with view permissions: Set(); users  with modify permissions: Set(ch741); groups with modify permissions: Set()
2019-04-11 14:47:39 INFO  Utils:54 - Successfully started service 'sparkWorker' on port 43718.
2019-04-11 14:47:39 INFO  Worker:54 - Starting Spark worker 10.43.0.149:43718 with 32 cores, 186.4 GB RAM                                                                                    
2019-04-11 14:47:39 INFO  Worker:54 - Running Spark version 2.4.0
2019-04-11 14:47:39 INFO  Worker:54 - Spark home: /usr/local/software/spark/spark-2.4.0-bin-hadoop2.7                                                                                        
2019-04-11 14:47:39 INFO  log:192 - Logging initialized @3374ms
2019-04-11 14:47:39 INFO  Server:351 - jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown                                                                                     
2019-04-11 14:47:39 INFO  Server:419 - Started @3413ms
2019-04-11 14:47:39 INFO  AbstractConnector:278 - Started ServerConnector@32941ff2{HTTP/1.1,[http/1.1]}{0.0.0.0:8081}                                                                        
2019-04-11 14:47:39 INFO  Utils:54 - Successfully started service 'WorkerUI' on port 8081.
2019-04-11 14:47:39 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3109a02{/logPage,null,AVAILABLE,@Spark}                                                                 
2019-04-11 14:47:39 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5c2dd2f3{/logPage/json,null,AVAILABLE,@Spark}                                                           
2019-04-11 14:47:39 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@51086c25{/,null,AVAILABLE,@Spark}                                                                       
2019-04-11 14:47:39 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@623923fe{/json,null,AVAILABLE,@Spark}                                                                   
2019-04-11 14:47:39 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@5031cf00{/static,null,AVAILABLE,@Spark}                                                                 
2019-04-11 14:47:39 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@320a869a{/log,null,AVAILABLE,@Spark}                                                                    
2019-04-11 14:47:39 INFO  WorkerWebUI:54 - Bound WorkerWebUI to 0.0.0.0, and started at http://cpu-e-149.data.cluster:8081                                                                   
2019-04-11 14:47:39 INFO  Worker:54 - Connecting to master cpu-e-83.data.cluster:7177...
2019-04-11 14:47:39 INFO  ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7f7e4d4e{/metrics/json,null,AVAILABLE,@Spark}                                                           
2019-04-11 14:47:39 INFO  TransportClientFactory:267 - Successfully created connection to cpu-e-83.data.cluster/10.43.0.83:7177 after 25 ms (0 ms spent in bootstraps)                       
2019-04-11 14:47:39 WARN  Worker:87 - Failed to connect to master cpu-e-83.data.cluster:7177
org.apache.spark.SparkException: Exception thrown in awaitResult:
        at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)
        at org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:253)                                       
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: java.lang.IllegalStateException: Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)
        at org.apache.spark.network.sasl.SaslMessage.decode(SaslMessage.java:69)
        at org.apache.spark.network.sasl.SaslRpcHandler.receive(SaslRpcHandler.java:90)
        at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:181)
        at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:103)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:748)

        at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:207)
        at org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:120)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        ... 1 more
2019-04-11 14:47:45 INFO  Worker:54 - Retrying connection to master (attempt # 1)

The Slurm output doesn't show any issues, apart from the following, which doesn't seem relevant:

-mkdir: java.net.URISyntaxException: Expected scheme-specific part at index 5: hdfs:
Usage: hadoop fs [generic options] -mkdir [-p] <path> ...

I've tried changing a few different settings in conf/spark/spark-defaults.conf, such as spark.authenticate.enableSaslEncryption, but no luck so far.
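For reference, the authentication-related properties I've been toggling look roughly like this (these are just experiments on my part, not a known-good configuration):

# conf/spark/spark-defaults.conf (excerpt of what I experimented with)
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true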

I'm testing with the WordCount example; I also tried the SparkPi example.

I've previously run Spark jobs through Slurm manually, simply by launching the master on the SLURMD_NODENAME node and workers on all the others, without issues.
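That manual approach looked roughly like the sketch below (paths, ports, and srun flags are illustrative, not my exact script):

#!/bin/bash
#SBATCH --nodes=3
# Rough sketch of a manual Spark standalone launch under Slurm.
SPARK_HOME=/usr/local/software/spark/spark-2.4.0-bin-hadoop2.7

# The batch script runs on the first allocated node, so SLURMD_NODENAME
# there is that node's name; use it as the master.
MASTER_NODE="$SLURMD_NODENAME"

# Master on that node, kept in the foreground so Slurm tracks it
srun --nodes=1 --ntasks=1 -w "$MASTER_NODE" \
    "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.master.Master &
sleep 10

# One worker per remaining node, pointed at the master's default port
srun --exclude="$MASTER_NODE" --ntasks-per-node=1 \
    "$SPARK_HOME/bin/spark-class" org.apache.spark.deploy.worker.Worker \
    "spark://$MASTER_NODE:7077" &
wait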

I'm using Magpie 2.1 with Spark 2.4.0, Hadoop 2.7.7 and JDK 8u141.
Any help/pointers are appreciated.

Many thanks,
Chris

chu11 commented

Hi, a guess right off the bat: is the hostname the workers are trying to connect to correct? By default Magpie connects to the resource listed by Slurm, but that may not be the actual host/IP you want to connect to. There are options to work around this (I'm on my phone, so I can find the option names later).

chu11 commented

Now that I'm sitting down, I see the "Expected SaslMessage, received something else (maybe your client does not have SASL enabled?)" error, which is perplexing.

Given you've run Spark under Slurm before, is it possible you have an alternate spark-defaults.conf file lying around that your Spark 2.4.0 could be accidentally picking up? Did you apply the patches/spark/spark-2.4.0-bin-hadoop2.7-alternate.patch patch?

Thanks for the quick response. Patching fixed it!
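For anyone else who hits this, applying the patch went roughly like this (the Magpie checkout path and the -p level below are assumptions; adjust them for your layout):

# Apply Magpie's alternate patch to the Spark distribution
cd /usr/local/software/spark/spark-2.4.0-bin-hadoop2.7
patch -p1 < /path/to/magpie/patches/spark/spark-2.4.0-bin-hadoop2.7-alternate.patch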