Spark 3.0.0 incompatibilities

Question

Wadaboa opened this issue 4 years ago · 20 comments

Hi,
I'm using Flintrock to deploy a Scala/Spark application to a simple EC2 cluster (master + 1 slave), but I'm having some problems. In particular, I'm using the following custom Flintrock configuration to launch the cluster:

provider: ec2

services:
  spark:
    version: 3.0.0

launch:
  num-slaves: 1
  install-spark: True

providers:
  ec2:
    key-name: my-key-pair
    identity-file: my-key-pair.pem
    instance-type: m4.large
    region: us-east-1
    ami: ami-00b882ac5193044e4
    user: ec2-user
    instance-profile-name: ec2-s3

debug: true

As you can see, I'm using Spark 3.0.0. Once I have packaged my Scala application (using sbt), I copy the JAR to the cluster with Flintrock's copy-file command. Then, when I try to run the JAR using spark-submit, I run into issues:

If I run spark-submit from my local machine using option --master spark://<ec2-master-name>:7077 and --deploy-mode cluster, I get a Java error saying Cannot run program "/usr/local/opt/openjdk@11/bin/java" (in directory "/home/ec2-user/spark/work/driver-20200718133018-0005")
If I run spark-submit from the master node (using either Flintrock's run-command --master-only or login commands), the job gets submitted correctly, but the Spark UI does not report it (at <ec2-master-name>:8080). Instead, the application execution can be viewed at <ec2-master-name>:4040, even though in the Executors page only the master seems to be active.

I can't wrap my head around this problem. I don't know if it is some kind of Flintrock's compatibility issue with Spark 3.0.0, or I'm doing something wrong.
I apologize if this is the wrong place to post this kind of question.

For reference, my environment is the following:

Flintrock version: 1.0.0
Python version: 3.8.4
OS: macOS Catalina 10.15.5
SBT version: 1.3.13
Scala version: 2.12.10
OpenJDK version: 1.8

Answer 1 · 2020-07-20T16:27:33.000Z

Thanks for sharing this report, @Wadaboa. This is the correct place to report issues like this.

I haven't tested Flintrock myself with Spark 3.0 yet, so I'm not surprised there are issues.

Cannot run program "/usr/local/opt/openjdk@11/bin/java" (in directory "/home/ec2-user/spark/work/driver-20200718133018-0005")

This looks interesting. Flintrock currently only ensures that Java 8 is available, not Java 11. Spark 3.0 should still work with Java 8, though. Something is causing Spark to look at Java 11.

Are you sure you've built your application against Java 8?

Another possibility is that there is some setting we need to provide to tell Spark 3.0 to use Java 8. The migration guide doesn't mention anything like that, but I haven't looked closely. Maybe try setting JAVA_HOME to point specifically at the Java 8 home directory.

Answer 2 · 2020-07-20T18:13:24.000Z

I'm sure I've built my Scala application using Java 8, since running sbt package gives me the following output on the very first line:

[info] welcome to sbt 1.3.13 (AdoptOpenJDK Java 1.8.0_252)

(I'm using jenv to manage different Java versions).

Moreover, building the application on my local machine with Java 8 or Java 11 makes no difference: spark-submit works properly locally.

I also tried to build the application using Java 11, installing Java 11 on the master node (I tried with both sudo amazon-linux-extras install java-openjdk11 and Amazon Corretto 11) and setting Java 11 as the default Java version (by using alternatives --config java), but it didn't work out.

I assume that to make this "hack" work I would also have to install Java 11 on each slave node, but I was expecting at least some kind of different output/error when launching spark-submit on the cluster, which wasn't the case.

One thing I noticed while looking for Spark's brew formula is that it requires openjdk@11, even though Spark documentation clearly states that version 3.0.0 is still compatible with Java 1.8.

Answer 3 · 2020-07-20T20:45:46.000Z

I would try on each cluster node to set JAVA_HOME in spark-env.sh to the Java 8 home and see if that solves the problem.

Answer 4 · 2020-07-21T15:27:20.000Z

I tried setting JAVA_HOME in spark-env.sh on each cluster node, by adding the following line:

export JAVA_HOME="/usr/lib/jvm/jre"

The openjdk@11 error is still there. I tried to execute spark-submit both from my local machine and from within the master node, but I get the same behavior as I originally reported.

By the way, I also had another try in which I added the following two lines to spark-env.sh on each cluster node, but everything remained the same:

export JAVA_HOME="/usr/lib/jvm/jre"
export PATH="$JAVA_HOME/bin:$PATH"

I also tried to restart the master and the slave nodes, by using the scripts located in spark/sbin/ (stop-all.sh and start-all.sh).
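
To rule out a bad path in the spark-env.sh lines above, one option is to confirm on each node that the directory really contains an executable java binary. A minimal sketch, assuming a POSIX shell on the node; check_java_home is a hypothetical helper, not part of Spark or Flintrock:

```shell
#!/bin/sh
# check_java_home: hypothetical helper, not part of Spark or Flintrock.
# Succeeds only if DIR/bin/java exists and is executable.
check_java_home() {
    dir="$1"
    if [ -x "$dir/bin/java" ]; then
        echo "ok: $dir/bin/java"
    else
        echo "missing: $dir/bin/java" >&2
        return 1
    fi
}

# Example (path taken from this thread; adjust for your AMI):
# check_java_home /usr/lib/jvm/jre && /usr/lib/jvm/jre/bin/java -version
```

If the check passes and the error persists, the path in the error message is probably coming from somewhere other than the node's own spark-env.sh.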

Answer 5 · 2020-07-21T17:23:49.000Z

This is strange. I will give things a shot myself later this week and see if I can get to the bottom of what's going on.

Answer 6 · 2020-07-29T22:52:15.000Z

I managed to make it work by doing something sketchy. First of all, I changed my custom Flintrock configuration to be:

provider: ec2

services:
  spark:
    version: 3.0.0
  hdfs:
    version: 2.8.5

launch:
  num-slaves: 1
  install-spark: True
  install-hdfs: True

providers:
  ec2:
    key-name: my-key-pair
    identity-file: my-key-pair.pem
    instance-type: t2.xlarge
    region: us-east-1
    ami: ami-0f84e2a3635d2fac9
    user: ec2-user
    instance-profile-name: ec2-s3

debug: true

Then, after launching the cluster, I ran the following commands:

flintrock run-command <cluster-name> "sudo amazon-linux-extras install -y java-openjdk11"
flintrock run-command <cluster-name> "yes 2 | sudo alternatives --config java"
flintrock run-command <cluster-name> "sudo mkdir -p /usr/local/opt/openjdk@11/bin/"
flintrock run-command <cluster-name> "sudo ln -s /usr/lib/jvm/jre/bin/java /usr/local/opt/openjdk@11/bin/"

Obviously, this is not the go-to solution, but at least it's something. My biggest doubt is still about the /usr/local/opt/openjdk@11/bin/java error, which I think could be related to my local machine, but I really don't know how.

Answer 7 · 2020-07-30T15:39:19.000Z

Thank you for looking into this and sharing your workaround. I still plan to look into this as well; sorry I haven't been able to do it yet.

Answer 8 · 2020-08-03T02:27:31.000Z

Just did some testing. I cannot reproduce the error you reported with PySpark (REPL or spark-submit) or with the Scala REPL. Everything seems to work fine.

So I guess the problem is triggered by a combination of sbt-built jobs and Spark 3.0. I know that sbt made some breaking changes in the jump from 0.13 to 1.0. I see you're using sbt 1.3.

How about if you try building your application with sbt 0.13? Or some other version of sbt < 1.3?

Answer 9 · 2020-08-08T17:35:23.000Z

I tried building my application with sbt 0.13, but nothing changed. I'm almost convinced that this error is related to how Homebrew installs the apache-spark package: inside /usr/local/Cellar/apache-spark/3.0.0/bin/load-spark-env.sh I found the following:

#!/bin/bash
JAVA_HOME="/usr/local/opt/openjdk@11" exec "/usr/local/Cellar/apache-spark/3.0.0/libexec/bin/load-spark-env.sh"  "$@"

I tried removing that initial JAVA_HOME string, but I had no luck.
I will keep investigating and if I find the solution I will post it here.

Answer 10 · 2020-08-12T21:42:39.000Z

Oh, hmm... I'm not a Scala guy so humor me here:

Is it possible that when you build your Spark application, the build process somehow references the Homebrew-installed Spark, which includes a dependency on JDK 11? ~~Can you somehow try to build your application against a version of Spark that's built against JDK 8 and see if that helps?~~ Not sure this idea makes sense, actually.

I don't mean to make so much work for you. 😄 If JDK 11 is going to be frequently used with Spark 3.0, perhaps Flintrock should just support this out of the box somehow.

The challenge will be to figure out how to make Flintrock gracefully support both users who use JDK 8 and those who use JDK 11. Or, at the very least, provide some new option or note in the README that JDK 11 users can reference.

Answer 11 · 2020-08-14T14:21:43.000Z

Actually, I'm not much of a Scala guy either. I'm currently learning the language.

Anyway, I just did some more testing:
I don't think that my issue is related to Flintrock at all, but it could be related to how Spark and/or its Homebrew installation handles the --deploy-mode cluster option of spark-submit.

I tried to build Spark 3.0.0 using a custom Homebrew formula to support Java 8 (in this case my JAVA_HOME variable was set to /usr/local/opt/jenv/versions/1.8/, since I'm using jenv to manage multiple Java versions). Then, I created the cluster using Flintrock, as I did before, and I launched spark-submit from my local machine with --deploy-mode cluster: this time the error

Cannot run program "/usr/local/opt/openjdk@11/bin/java" (in directory "/home/ec2-user/spark/work/driver-...")

became

Cannot run program "/usr/local/opt/jenv/versions/1.8/bin/java" (in directory "/home/ec2-user/spark/work/driver-...")

meaning that something related to the driver is not fully executed on the cluster. So, I really don't understand how the --deploy-mode cluster option currently works, since it should just launch the driver directly inside the cluster, instead of running it on the submitter.
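
The comparison described above (the path in the error follows the submitter's JAVA_HOME) can be checked mechanically. This is only a diagnostic sketch; both function names are hypothetical, and it assumes the error text has the exact 'Cannot run program "..."' shape quoted in this thread:

```shell
#!/bin/sh
# java_path_from_error: hypothetical helper that pulls the quoted program
# path out of a 'Cannot run program "..."' message.
java_path_from_error() {
    printf '%s\n' "$1" | sed -n 's/.*Cannot run program "\([^"]*\)".*/\1/p'
}

# came_from_local_java_home: succeeds if the path in the error message
# is exactly $2/bin/java, where $2 is the submitter's JAVA_HOME.
came_from_local_java_home() {
    path="$(java_path_from_error "$1")"
    home="${2%/}"   # drop one trailing slash, if any
    [ -n "$path" ] && [ "$path" = "$home/bin/java" ]
}

# Example, run on the machine that called spark-submit:
# came_from_local_java_home "$error_line" "$JAVA_HOME" && echo "matches local JAVA_HOME"
```

A match suggests the submitting machine's JAVA_HOME is being carried into the driver launch on the cluster, which is consistent with what I observed, though I have not confirmed the mechanism in Spark's source.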

I'm no longer using the hack that forces JAVA_HOME on the cluster to match the one on my local machine (as I reported above). Instead, I'm running spark-submit directly from the master node of the cluster, like so:

ssh -i <key-pair> <ec2-user>@<ec2-master-name> -t "spark-submit --master spark://<ec2-master-name>:7077 ..."

where <key-pair> is the path to an EC2 key pair .pem file, <ec2-user> is the name of the user in the cluster (Flintrock's user option in the .yaml config) and <ec2-master-name> is the DNS name of the EC2 master node.
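
For repeated use, the ssh invocation above could be wrapped in a small function. A minimal sketch: remote_submit_cmd is a hypothetical name, it only prints the command line rather than running it, and it does no quoting, so arguments containing spaces or quotes would need extra care:

```shell
#!/bin/sh
# remote_submit_cmd: hypothetical helper that only prints the ssh command
# line; it does not run anything. Arguments:
#   $1 key pair path, $2 cluster user, $3 master DNS name,
#   remaining args: appended to spark-submit as-is (no extra quoting).
remote_submit_cmd() {
    key="$1"; user="$2"; master="$3"
    shift 3
    printf 'ssh -i %s %s@%s -t "spark-submit --master spark://%s:7077 %s"\n' \
        "$key" "$user" "$master" "$master" "$*"
}

# Example: print the command, inspect it, then run it by hand.
# remote_submit_cmd my-key-pair.pem ec2-user <ec2-master-name> --class Main app.jar
```

Printing instead of executing keeps the sketch safe to try; piping the output to sh would run it.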

Another thing that I noticed is that the reported command works, while the following Flintrock command does not:

flintrock run-command --master-only <ec2-cluster-name> "spark-submit --master spark://<ec2-master-name>:7077 ..."

where <ec2-cluster-name> is the name of the EC2 cluster, given by flintrock describe.

Ideally, the two commands should be doing the same thing, shouldn't they?

About out-of-the-box Flintrock's support of Java 11, what about providing a new option inside the .yaml configuration?

Answer 12 · 2020-08-25T18:36:27.000Z

Ideally, the two commands should be doing the same thing, shouldn't they?

They should. Do you get an error with flintrock run-command, or just no output?

About out-of-the-box Flintrock's support of Java 11, what about providing a new option inside the .yaml configuration?

That could work, though I am more inclined to just have Flintrock automatically make things work whether Java 8 or Java 11 is being used, if that's not too complicated.

Answer 13 · 2020-09-02T01:55:28.000Z

Hi Nicholas,

Just noticed this issue. I have been routinely using Flintrock to run Spark 3.0 on Java 11 for many months. My multi-jdk-support is still a bit ugly and I am wary of the "Don't expand the support matrix" guideline.

BTW, generally I run Spark from the master node rather than my desktop FWIW.

Answer 14 · 2020-09-02T07:24:05.000Z

Replying to @nchammas, flintrock run-command just hangs, while the ssh command reported above works like a charm.

Replying to @sfcoy, now I also run Spark from the master node. Running from my desktop does not seem to work consistently.

Answer 15 · 2020-09-11T01:27:49.000Z

@sfcoy - That's good to know, and I appreciate the restraint about adding more supported configurations.

Do you know if there is any reason to install Java 8 if the cluster is being launched with Spark 3+? If apps built with Java 8 can be deployed to a Java 11 runtime, perhaps we can just install Java 11 automatically when Spark is set to 3.0 or newer. That saves us from needing to add a user-facing option.

@Wadaboa - That definitely sounds like a bug. Does that happen just with spark-submit, or does it also happen if you try something more mundane like flintrock run-command --master-only <ec2-cluster-name> "ls /tmp"?

Also, are you sure the command hangs, or is it just that it doesn't return any output unless there is an error? See #135 for a potentially relevant issue.

Answer 16 · 2020-09-13T00:42:55.000Z

Hi @nchammas, as far as Java 11 goes, we cannot use it if the user chooses to use HDFS. Hadoop 3.3 (released July 2020) is required for Java 11 and as yet the Apache Spark team have not discussed publicly (that I have noticed) when they will build releases that include this. Of course it is trivial to build one's own release.

I'm not sure it is a good idea to auto select the JDK because some users may have their own libraries that are not JDK 11 compatible.

Answer 17 · 2020-09-13T13:42:01.000Z

@nchammas - You are right, it works, it just doesn't return any output unless there is an error. I wasn't aware of the actual workings of the command.

Anyway, I agree with issue #135 that it could be valuable to actually see the output produced by run-command. Maybe implementing a new module, different from run-command, could be useful.
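
In the meantime, one possible stopgap is to redirect the job's output to a file on the master and read it back afterwards. A minimal sketch, assuming run-command passes the quoted string to a remote shell (the pipe used earlier in this thread suggests it does); with_log and /tmp/job.log are hypothetical names:

```shell
#!/bin/sh
# with_log: hypothetical helper that appends a redirect to a command string.
#   $1: log file path on the remote node; remaining args: the command.
# It only prints the resulting string; nothing is executed here.
with_log() {
    log="$1"
    shift
    printf '%s > %s 2>&1\n' "$*" "$log"
}

# Example (placeholders as used elsewhere in this thread):
# flintrock run-command --master-only <ec2-cluster-name> "$(with_log /tmp/job.log spark-submit ...)"
# ssh -i <key-pair> <ec2-user>@<ec2-master-name> "cat /tmp/job.log"
```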

Answer 18 · 2020-10-18T22:54:32.000Z

@Wadaboa - I think run-command could perhaps be refactored or rewritten as a wrapper around parallel-ssh. Let's continue that conversation over on #135.

@sfcoy - Would you like to submit a PR that adds an option to Flintrock allowing the user to specify the Java version they want deployed to the cluster? I took a quick look at your branch and it looks like a very good start to me. I think with a few refinements we can get this in and release it as part of Flintrock 1.1.

Answer 19 · 2020-10-21T06:11:05.000Z

Sure thing. I also think that "no JDK" should be an option as well. That way users can choose an AMI which includes the required JDK.

Answer 20 · 2021-01-08T01:55:42.000Z

#316 was merged in, which addresses the Java 11 issue. I'm closing this.

If there are any remaining problems here not covered by existing issues (like #135), please speak up!