apache-spark-on-k8s/spark

Unsupported RPCMessage and then not able to spin up worker

leletan opened this issue · 4 comments

I was trying to run the job on minikube v0.22.3 inside VirtualBox on macOS, simulating Kubernetes v1.7.5. The master was spun up successfully, but the worker was not.

I looked into the driver log and saw the following error message:
2017-12-26 09:09:44 ERROR Inbox:91 - Ignoring error
org.apache.spark.SparkException: Unsupported message RpcMessage(172.17.0.10:45720,RetrieveSparkAppConfig(1),org.apache.spark.rpc.netty.RemoteNettyRpcCallContext@39e8f5ab) from 172.17.0.10:45720
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:106)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1$$anonfun$apply$mcV$sp$1.apply(Inbox.scala:105)
    at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint$$anonfun$receiveAndReply$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:155)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105)
    at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
    at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
    at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

After this error message, things seem to go back to normal:
2017-12-26 09:10:10 INFO KubernetesClusterSchedulerBackend:54 - SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
2017-12-26 09:10:10 INFO SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/spark/work-dir/spark-warehouse').
2017-12-26 09:10:10 INFO SharedState:54 - Warehouse path is 'file:/opt/spark/work-dir/spark-warehouse'.
2017-12-26 09:10:10 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@c1fca2a{/SQL,null,AVAILABLE,@Spark}
2017-12-26 09:10:10 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7c447c76{/SQL/json,null,AVAILABLE,@Spark}
2017-12-26 09:10:10 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@6107165{/SQL/execution,null,AVAILABLE,@Spark}
2017-12-26 09:10:10 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@11ebb1b6{/SQL/execution/json,null,AVAILABLE,@Spark}

However, later, when worker tasks are launched, warnings like the following appear in the log, indicating that the cluster does not have enough resources, which is not true:
2017-12-26 09:10:29 WARN KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2017-12-26 09:10:44 WARN KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
2017-12-26 09:10:59 WARN KubernetesTaskSchedulerImpl:66 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I ran into this issue several times. I tried deleting minikube and reinstalling it a couple of times. Only once did I not hit the issue, and that time the Spark job ran successfully.

It seems to be a VM-related issue. I upgraded VirtualBox and the issue is gone. Closing.

This time I am seeing this on 1.8.5-gke.0 as well.
Any idea?

This is due to a conflict between the Spark distribution bundled in the fat jar and the one in the base image. Shadowing the Spark dependencies in the fat jar works. Closing.
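
For anyone hitting the same symptom, one way to check for this kind of conflict is to print which jar a given Spark class is actually loaded from inside the driver. The following is only a minimal sketch, not something from the original report: `ClassOrigin` and `originOf` are hypothetical names, and the two class names are taken from the stack trace above and from the standard Spark API. It uses only standard JDK reflection calls.

```java
import java.net.URL;
import java.security.CodeSource;

public class ClassOrigin {
    /**
     * Returns the location (usually a jar URL) that the named class is
     * loaded from, or a placeholder string when it cannot be determined.
     */
    public static String originOf(String className) {
        try {
            // Load without running static initializers; only the origin matters here.
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            URL location = (src == null) ? null : src.getLocation();
            // JDK core classes typically report no code source.
            return (location == null) ? "<bootstrap or unknown>" : location.toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return "<not on classpath>";
        }
    }

    public static void main(String[] args) {
        // Defaults are illustrative; pass other class names as arguments.
        String[] names = args.length > 0
                ? args
                : new String[] {
                    "org.apache.spark.SparkContext",
                    "org.apache.spark.rpc.netty.Inbox"
                };
        for (String name : names) {
            System.out.println(name + " -> " + originOf(name));
        }
    }
}
```

If the printed location points at the application fat jar rather than the Spark jars shipped in the base image, the driver and the executors may be running different Spark builds, which would be consistent with the unsupported `RetrieveSparkAppConfig` message above.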