databricks/simr

Getting ClosedChannelException:null from master


I attempted to run simr --shell with 10 nodes. The 9 slave nodes came up, and the JobTracker UI shows they are still running, but the master instance died. The master's mapper log contains several exceptions like the following:

2014-02-18 20:30:34,990 WARN akka.actor.ActorSystemImpl: RemoteClientWriteFailed@akka://SimrRelay@127.0.0.1:45121: MessageClass[scala.Tuple3] Error[java.nio.channels.ClosedChannelException:null
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:703)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromUserCode(AbstractNioWorker.java:426)
at org.jboss.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:116)
at org.jboss.netty.channel.Channels.write(Channels.java:733)
at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:65)
at org.jboss.netty.channel.Channels.write(Channels.java:733)
at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:65)
at org.jboss.netty.handler.execution.ExecutionHandler.handleDownstream(ExecutionHandler.java:185)
at org.jboss.netty.channel.Channels.write(Channels.java:712)
at org.jboss.netty.channel.Channels.write(Channels.java:679)
at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:246)
at akka.remote.netty.RemoteClient.send(Client.scala:76)
at akka.remote.netty.RemoteClient.send(Client.scala:63)
at akka.remote.netty.NettyRemoteTransport.send(NettyRemoteSupport.scala:154)
at akka.remote.RemoteActorRef.$bang(RemoteActorRefProvider.scala:247)
at org.apache.spark.simr.RelayServer$$anonfun$receive$1.apply(RelayServer.scala:180)
at org.apache.spark.simr.RelayServer$$anonfun$receive$1.apply(RelayServer.scala:151)
at akka.actor.Actor$class.apply(Actor.scala:318)
at org.apache.spark.simr.RelayServer.apply(RelayServer.scala:50)
at akka.actor.ActorCell.invoke(ActorCell.scala:626)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:197)
at akka.dispatch.Mailbox.run(Mailbox.scala:179)
at akka.dispatch.ForkJoinExecutorConfigurator$MailboxExecutionTask.exec(AbstractDispatcher.scala:516)
at akka.jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:259)
at akka.jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:975)
at akka.jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1479)
at akka.jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
]

Is this a configuration error, or some other issue?
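For reference, the launch sequence was roughly the following. This is a sketch from memory, assuming the SIMR convention that --slots=N controls the number of MapReduce slots (one mapper becomes the Spark driver/relay, the rest host executors); the job id below is the one from the task logs:

```shell
# Launch the Spark shell on the MapReduce cluster with 10 slots.
./simr --shell --slots=10

# When the master mapper dies, pulling the attempt ids of the job's map
# tasks helps locate the failed attempt's full log in the TaskTracker UI
# (Hadoop 1.x syntax):
hadoop job -list-attempt-ids job_201309101252_50100 MAP completed
```

The ClosedChannelException itself only says that the Akka remote connection was closed underneath the relay; the root cause is usually in the log of whichever JVM exited first.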

Here is the log from one of the slaves:

Task Logs: 'attempt_201309101252_50100_m_000000_0'

stderr logs

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [file:/ngs4/app/mapred/tt/taskTracker/edwaetlt/distcache/-3665564613369269577_249905135_1185992748/ma-gbit-lnn11.corp.apple.com/user/edwaetlt/.staging/job_201309101252_50100/libjars/spark-assembly-hadoop-1.0.4.jar/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

syslog logs

2014-02-18 20:30:25,361 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2014-02-18 20:30:25,509 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2014-02-18 20:30:25,527 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink ganglia started
2014-02-18 20:30:25,572 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2014-02-18 20:30:25,573 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2014-02-18 20:30:25,573 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system started
2014-02-18 20:30:25,573 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2014-02-18 20:30:25,576 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm registered.
2014-02-18 20:30:25,668 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2014-02-18 20:30:25,684 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1dacccc
2014-02-18 20:30:38,647 INFO akka.event.slf4j.Slf4jEventHandler: Slf4jEventHandler started
2014-02-18 20:30:38,791 INFO org.apache.spark.executor.CoarseGrainedExecutorBackend: Connecting to driver: akka://spark@ma-gbit-ldn2002.corp.apple.com:35581/user/CoarseGrainedScheduler
2014-02-18 20:30:38,925 INFO org.apache.spark.executor.CoarseGrainedExecutorBackend: Successfully registered with driver
2014-02-18 20:30:38,936 INFO org.apache.spark.executor.Executor: Using REPL class URI: http://17.169.56.37:47094
2014-02-18 20:30:38,972 INFO akka.event.slf4j.Slf4jEventHandler: Slf4jEventHandler started
2014-02-18 20:30:38,989 INFO org.apache.spark.SparkEnv: Connecting to BlockManagerMaster: akka://spark@ma-gbit-ldn2002.corp.apple.com:35581/user/BlockManagerMaster
2014-02-18 20:30:38,989 INFO akka.actor.ActorSystemImpl: RemoteServerStarted@akka://spark@17.169.56.50:36362
2014-02-18 20:30:39,014 INFO org.apache.spark.storage.DiskBlockManager: Created local directory at /ngs4/app/mapred/tt/taskTracker/edwaetlt/jobcache/job_201309101252_50100/attempt_201309101252_50100_m_000000_0/work/tmp/spark-local-20140218203039-e9c8
2014-02-18 20:30:39,020 INFO org.apache.spark.storage.MemoryStore: MemoryStore started with capacity 1334.8 MB.
2014-02-18 20:30:39,048 INFO org.apache.spark.network.ConnectionManager: Bound socket to port 34921 with id = ConnectionManagerId(17.169.56.50,34921)
2014-02-18 20:30:39,052 INFO org.apache.spark.storage.BlockManagerMaster: Trying to register BlockManager
2014-02-18 20:30:39,055 INFO akka.actor.ActorSystemImpl: RemoteClientStarted@akka://spark@ma-gbit-ldn2002.corp.apple.com:35581
2014-02-18 20:30:39,066 INFO org.apache.spark.storage.BlockManagerMaster: Registered BlockManager
2014-02-18 20:30:39,091 INFO org.apache.spark.SparkEnv: Connecting to MapOutputTracker: akka://spark@ma-gbit-ldn2002.corp.apple.com:35581/user/MapOutputTracker
2014-02-18 20:30:39,101 INFO org.apache.spark.HttpFileServer: HTTP File server directory is /ngs4/app/mapred/tt/taskTracker/edwaetlt/jobcache/job_201309101252_50100/attempt_201309101252_50100_m_000000_0/work/tmp/spark-d7043dca-e00a-4972-a006-c36a28e8830d
2014-02-18 20:30:39,154 INFO org.eclipse.jetty.server.Server: jetty-7.x.y-SNAPSHOT
2014-02-18 20:30:39,171 INFO org.eclipse.jetty.server.AbstractConnector: Started SocketConnector@0.0.0.0:59212

I'm running into the same problem.