JahstreetOrg/spark-on-kubernetes-helm

How does the Livy Spark driver start Spark executor pods?

Closed this issue · 9 comments

hi:
I have installed the chart following your guide, but the Spark application started by sparkmagic seems to contain only a livy-spark driver pod. How can I get the executor pods to run in k8s through the livy-spark driver pod?

Have you installed it locally in a Minikube-like environment? If so, there might not be enough CPU/memory to launch the Executors. Can you share your Spark Driver Pod logs so I can check that?
Also, tomorrow I will attach more details on how to launch the session from Jupyter and debug it.
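
If Executors never show up, one quick check is whether their Pods are stuck in the Pending phase, which usually points to insufficient CPU or memory on the nodes. The following is a minimal sketch, not part of the chart: it assumes `kubectl` is configured for the cluster and that executor Pods carry the `spark-role=executor` label that Spark on Kubernetes normally sets. The function names are hypothetical.

```python
import json
import subprocess


def pending_executors(pod_list):
    """Return the names of executor Pods whose phase is Pending."""
    names = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        labels = metadata.get("labels") or {}
        phase = pod.get("status", {}).get("phase")
        # Assumption: Spark labels executor Pods with spark-role=executor.
        if labels.get("spark-role") == "executor" and phase == "Pending":
            names.append(metadata.get("name"))
    return names


def fetch_pods(namespace):
    """Run `kubectl get pods -o json` in the namespace (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

For example, `pending_executors(fetch_pods("jhub"))` would list the stuck executor Pods in the `jhub` namespace; `kubectl describe pod <name>` on any of them should then show the scheduling reason.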

I have checked my k8s environment; it really was a lack of resources, and I have solved it. But I am now facing the following issues:

  1. When I start the PySpark kernel, the logs look like this:

20/04/14 03:23:09 INFO BlockManagerMasterEndpoint: Registering block manager remotesparkmagics-sample-1586834577660-driver-svc.jhub.svc:7079 with 413.9 MB RAM, BlockManagerId(driver, remotesparkmagics-sample-1586834577660-driver-svc.jhub.svc, 7079, None)
20/04/14 03:23:09 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, remotesparkmagics-sample-1586834577660-driver-svc.jhub.svc, 7079, None)
20/04/14 03:23:09 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, remotesparkmagics-sample-1586834577660-driver-svc.jhub.svc, 7079, None)
20/04/14 03:23:15 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.42.6.131:59440) with ID 1
20/04/14 03:23:15 INFO BlockManagerMasterEndpoint: Registering block manager 10.42.6.131:35347 with 413.9 MB RAM, BlockManagerId(1, 10.42.6.131, 35347, None)
20/04/14 03:23:15 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.42.5.145:50008) with ID 2
20/04/14 03:23:15 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
20/04/14 03:23:15 INFO SparkEntries: Spark context finished initialization in 8524ms
20/04/14 03:23:15 INFO SparkEntries: Created Spark session.
20/04/14 03:23:15 INFO BlockManagerMasterEndpoint: Registering block manager 10.42.5.145:46056 with 413.9 MB RAM, BlockManagerId(2, 10.42.5.145, 46056, None)
20/04/14 03:23:23 WARN Session: Fail to start interpreter pyspark
java.io.IOException: Cannot run program "python": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.livy.repl.PythonInterpreter$.apply(PythonInterpreter.scala:75)
    at org.apache.livy.repl.Session.liftedTree1$1(Session.scala:106)
    at org.apache.livy.repl.Session.interpreter(Session.scala:98)
    at org.apache.livy.repl.Session.setJobGroup(Session.scala:353)
    at org.apache.livy.repl.Session.$anonfun$execute$1(Session.scala:164)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 16 more
20/04/14 03:23:23 WARN Session: Fail to start interpreter pyspark
java.io.IOException: Cannot run program "python": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.livy.repl.PythonInterpreter$.apply(PythonInterpreter.scala:75)
    at org.apache.livy.repl.Session.liftedTree1$1(Session.scala:106)
    at org.apache.livy.repl.Session.interpreter(Session.scala:98)
    at org.apache.livy.repl.Session.$anonfun$execute$1(Session.scala:168)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:659)
    at scala.util.Success.$anonfun$map$1(Try.scala:255)
    at scala.util.Success.map(Try.scala:213)
    at scala.concurrent.Future.$anonfun$map$1(Future.scala:292)
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
    at scala

  2. Do you have any idea how to start the driver pod in a specified namespace? Is there any feature related to this?

I really appreciate your help.

  1. I assume you used your own image for the Spark Pods, didn't you? From the logs, it looks like your Spark image doesn't have python in PATH. Please share the steps you took to install the chart, along with the list of customizations you made, including the Dockerfile, so I can reproduce your issue and come up with a solution. If you follow this guide there shouldn't be any issues with PySpark:
  • Create PySpark Kernel notebook

Screenshot 2020-04-14 at 12 51 52

  • Run the code

Screenshot 2020-04-14 at 12 52 15

Screenshot 2020-04-14 at 12 54 12

Ref: Databricks docs.
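
The `Cannot run program "python"` error means that no executable named `python` could be found when the interpreter was launched. A minimal sketch for checking this inside the Spark image (for example via `kubectl exec` into the driver Pod) could look like the following. The function name is hypothetical, and the script assumes a Python 3 binary exists somewhere in the image to run it. `PYSPARK_PYTHON` is the standard PySpark variable for choosing the interpreter; whether the Livy build in use honors it is an assumption worth verifying.

```python
import os
import shutil


def resolve_python(candidates=("python", "python3"), env=None):
    """Return {name: resolved path or None} for each candidate executable."""
    env = os.environ if env is None else env
    search_path = env.get("PATH", "")
    found = {name: shutil.which(name, path=search_path) for name in candidates}
    # If PYSPARK_PYTHON is set, check that it resolves to an executable too.
    override = env.get("PYSPARK_PYTHON")
    if override:
        found["PYSPARK_PYTHON"] = shutil.which(override, path=search_path)
    return found


if __name__ == "__main__":
    for name, path in resolve_python().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If only `python3` resolves, adding a `python` symlink in the Dockerfile or pointing `PYSPARK_PYTHON` at `python3` are two possible ways to address it.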

  2. By default, the Spark namespace is determined by the Helm Release namespace. There are several ways to change it:
  • Through env vars: add --set livy.env.LIVY_SPARK_KUBERNETES_NAMESPACE.value=<default-namespace> when installing the spark-cluster chart.
  • Configure Livy session (POST /sessions request body) in the notebook's first cell and execute it before calling any other commands. This will start Spark containers in the selected namespace only for this session:
%%configure -f 
{
  "name": "app-name",
  "executorMemory": "4G",
  "executorCores": 4,
  ...
  "conf": {
    "spark.kubernetes.namespace": "<target-namespace>"
  }
}
  • NOTE: with both options, if you have an RBAC-enabled cluster, make sure that a ServiceAccount named <release-name>-livy-spark (the default name, which can be reconfigured if needed) is configured in the Spark Driver namespaces with the privileges that allow the Spark Driver to request Executor Pods from the Kubernetes API. Please refer to the SA and RBAC resources created in the Release namespace by default. Also see how these values are configured here.
  • For other Livy customization options, refer to the linked docs.
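
The `%%configure` body above is what ends up in the Livy `POST /sessions` request, so the same per-session namespace override can be built programmatically. The following is a minimal sketch that only constructs and prints the request body. The helper name and the `spark-jobs` namespace are hypothetical, and the field names follow the Livy REST API as I understand it.

```python
import json


def livy_session_body(name, namespace, executor_memory="4G",
                      executor_cores=4, kind="pyspark"):
    """Build a Livy POST /sessions body that pins Spark Pods to a namespace."""
    return {
        "kind": kind,
        "name": name,
        "executorMemory": executor_memory,
        "executorCores": executor_cores,
        # Per-session override of the namespace used for driver and executors.
        "conf": {"spark.kubernetes.namespace": namespace},
    }


# "spark-jobs" is a placeholder; use a namespace that has the required RBAC setup.
body = livy_session_body("app-name", "spark-jobs")
print(json.dumps(body, indent=2))
```

The resulting JSON can be pasted after `%%configure -f` in the notebook's first cell or sent to the Livy endpoint with any HTTP client.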

hi:
Because of the 403 issue, I built a Spark 2.4.5 image based on your Docker repo. It looks like this:
https://github.com/cyliu0204/spark-on-k8s-docker/blob/master/spark-base/2.4.5_2.12-without-hadoop/Dockerfile

When I start the PySpark driver pod, the log looks like this:

20/04/20 07:26:32 INFO KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
20/04/20 07:26:32 INFO SparkEntries: Spark context finished initialization in 31942ms
20/04/20 07:26:32 INFO SparkEntries: Created Spark session.
Exception in thread "Thread-24" java.lang.NoClassDefFoundError: org/apache/spark/sql/hive/HiveContext
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetPublicMethods(Class.java:2902)
    at java.lang.Class.getMethods(Class.java:1615)
    at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:345)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:305)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.hive.HiveContext
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 12 more
20/04/20 07:26:40 ERROR PythonInterpreter: Process has died with 1
20/04/20 07:26:40 ERROR PythonInterpreter: ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1159, in send_command
    raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 985, in send_command
    response = connection.send_command(command)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1164, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "/tmp/2775406139211743686", line 714, in <module>
    sys.exit(main())
  File "/tmp/2775406139211743686", line 589, in main
    jsc = gateway.entry_point.sc()
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 336, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling t.sc
20/04/20 08:21:38 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.42.4.209:45974) with ID 1
20/04/20 08:21:38 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.42.5.192:45748) with ID 2
20/04/20 08:21:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.42.4.209:43098 with 413.9 MB RAM, BlockManagerId(1, 10.42.4.209, 43098, None)
20/04/20 08:21:38 INFO BlockManagerMasterEndpoint: Registering block manager 10.42.5.192:38398 with 413.9 MB RAM, BlockManagerId(2, 10.42.5.192, 38398, None)
Any idea what I am missing?

  1. I also tried to build a Spark 3.0 image, since we are really interested in some Spark 3.0 features, but it does not seem to work well with this Livy version. I have already seen this PR: https://github.com/apache/incubator-livy/pull/289. Besides that, does anything else need to be done to support Spark 3.0 through Livy on k8s?

Ok, I see that you do not have Livy installed in the image. Also, Scala 2.12 may have compatibility issues. Let me prepare a guide on Docker image customization for you. I will try to get to it by the end of the week.

Do you mean the livy-spark image, which has Livy installed on top of the Spark image? The Dockerfile I linked is for the Spark base image. I also built the livy-spark image; it looks like this: https://github.com/cyliu0204/spark-on-k8s-docker/blob/master/livy-spark/0.7.0-incubating-spark_2.4.5_2.12-hadoop_3.2.1/Dockerfile

The log seems to point to a version issue. Have you seen this PR: apache/incubator-livy#289? It replaces HiveContext with SQLContext, which may solve this issue. I would really like to have a Livy version that merges these two features together.
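
The `NoClassDefFoundError` for `org/apache/spark/sql/hive/HiveContext` in the log above indicates that the class is not on the driver's classpath. `HiveContext` lives in Spark's `spark-hive` module, so one way to confirm whether the image ships it is to look for that jar. The following is a minimal sketch; the function name is hypothetical, and the default `/opt/spark` location is taken from the paths in the traceback.

```python
import glob
import os


def find_hive_jars(spark_home):
    """Return the spark-hive jars found under <spark_home>/jars, sorted."""
    # The trailing underscore skips spark-hive-thriftserver_*.jar.
    pattern = os.path.join(spark_home, "jars", "spark-hive_*.jar")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    home = os.environ.get("SPARK_HOME", "/opt/spark")
    jars = find_hive_jars(home)
    print(jars if jars else f"no spark-hive jar found under {home}/jars")
```

An empty result would suggest the Spark distribution was built without Hive support, in which case either the image needs the Hive module or Livy must not reference `HiveContext`.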

Hi @cyliu0204, I've updated the Docker image build for Spark 2.4.5 and updated the Helm charts accordingly. If you need customizations, please follow these steps:

  • Build spark image (base)
  • Build livy-spark image (based on spark, with Livy jars included)
  • Build livy image (with Livy entrypoint, based on livy-spark)
  • Change used images for Livy and Spark

I've tested these images locally with the updated Helm charts. Please refer to spark-cluster v0.7.0.
As for the Livy features to include, we can always rebase the Kubernetes support PR onto master once they are merged.

Hi @cyliu0204, has your issue been resolved?

Yes, thank you sincerely for your help.