Spark 3.0.0 incompatibilities
Wadaboa opened this issue · 20 comments
Hi,
I'm using Flintrock to deploy a Scala/Spark application to a simple EC2 cluster (master + 1 slave), but I'm having some problems. In particular, I'm using the following custom Flintrock configuration to launch the cluster:
provider: ec2
services:
spark:
version: 3.0.0
launch:
num-slaves: 1
install-spark: True
providers:
ec2:
key-name: my-key-pair
identity-file: my-key-pair.pem
instance-type: m4.large
region: us-east-1
ami: ami-00b882ac5193044e4
user: ec2-user
instance-profile-name: ec2-s3
debug: true
As you can see, I'm using Spark 3.0.0. Once I have packaged my Scala application (using sbt
) I perform a Flintrock's copy-file
operation. Then, when I want to run the JAR using spark-submit
, I have issues:
- If I run
spark-submit
from my local machine using option--master spark://<ec2-master-name>:7077
and--deploy-mode cluster
, I get a Java error sayingCannot run program "/usr/local/opt/openjdk@11/bin/java" (in directory "/home/ec2-user/spark/work/driver-20200718133018-0005")
- If I run
spark-submit
from inside the master cluster (using either Flintrock'srun-command --master-only
orlogin
commands), the job gets correctly submitted, but the spark UI is not reporting it (at<ec2-master-name>:8080
). Instead, the application execution can be viewed at<ec2-master-name>:4040
, even though in theExecutors
page only the master seems to be active.
I can't wrap my head around this problem. I don't know if it is some kind of Flintrock's compatibility issue with Spark 3.0.0, or I'm doing something wrong.
I apologize if this is the wrong place where to post this kind of doubts.
However, my environment is the following:
- Flintrock version:
1.0.0
- Python version:
3.8.4
- OS:
macOS Catalina 10.15.5
- SBT version:
1.3.13
- Scala version:
2.12.10
- OpenJDK version:
1.8
Thanks for sharing this report, @Wadaboa. This is the correct place to report issues like this.
I haven't tested Flintrock myself with Spark 3.0 yet, so I'm not surprised there are issues.
Cannot run program "/usr/local/opt/openjdk@11/bin/java" (in directory "/home/ec2-user/spark/work/driver-20200718133018-0005")
This looks interesting. Flintrock currently only ensures that Java 8 is available, not Java 11. Spark 3.0 should still work with Java 8, though. Something is causing Spark to look at Java 11.
Are you sure you've built your application against Java 8?
Another possibility is that there is some setting we need to provide to tell Spark 3.0 to use Java 8. The migration guide doesn't mention anything like that, but I haven't looked closely. Maybe try setting JAVA_HOME
to point specifically at the Java 8 home directory.
I'm sure I've built my Scala application using Java 8, since running sbt package
gives me the following output on the very first line:
[info] welcome to sbt 1.3.13 (AdoptOpenJDK Java 1.8.0_252)
(I'm using jenv to manage different Java versions).
Moreover, building the application on my local machine with Java 8 or Java 11 makes no difference: spark-submit
works properly locally.
I also tried to build the application using Java 11, installing Java 11 (I tried with both sudo amazon-linux-extras install java-openjdk11
and Amazon Corretto 11) on the master node and setting Java 11 as the standard Java version (by using alternatives --config java
), but It didn't work out.
I assume that to make this "hack" work I would also have to install Java 11 on each slave node, but I was expecting at least some kind of different output/error when launching spark-submit
on the cluster, which wasn't the case.
One thing I noticed while looking for Spark's brew formula is that it requires openjdk@11
, even though Spark documentation clearly states that version 3.0.0 is still compatible with Java 1.8.
I would try on each cluster node to set JAVA_HOME
in spark-env.sh to the Java 8 home and see if that solves the problem.
I tried setting JAVA_HOME
in spark-env.sh on each cluster node, by adding the following line:
export JAVA_HOME="/usr/lib/jvm/jre"
The openjdk@11
error is still there. I tried to execute spark-submit
both from my local machine and from within the master node, but I get the same behavior as I originally reported.
By the way, I also had another try in which I added the following two lines to spark-env.sh on each cluster node, but everything remained the same:
export JAVA_HOME="/usr/lib/jvm/jre"
export PATH="$JAVA_HOME/bin:$PATH"
I also tried to restart the master and the slave nodes, by using the scripts located in spark/sbin/
(stop-all.sh
and start-all.sh
).
This is strange. I will give things a shot myself later this week and see if I can get to the bottom of what's going on.
I managed to make it work by doing something sketchy. First of all, I changed my custom Flintrock configuration to be:
provider: ec2
services:
spark:
version: 3.0.0
hdfs:
version: 2.8.5
launch:
num-slaves: 1
install-spark: True
install-hdfs: True
providers:
ec2:
key-name: my-key-pair
identity-file: my-key-pair.pem
instance-type: t2.xlarge
region: us-east-1
ami: ami-0f84e2a3635d2fac9
user: ec2-user
instance-profile-name: ec2-s3
debug: true
Then, after launching the cluster, I ran the following commands:
flintrock run-command <cluster-name> "sudo amazon-linux-extras install -y java-openjdk11"
flintrock run-command <cluster-name> "yes 2 | sudo alternatives --config java"
flintrock run-command <cluster-name> "sudo mkdir -p /usr/local/opt/openjdk@11/bin/"
flintrock run-command <cluster-name> "sudo ln -s /usr/lib/jvm/jre/bin/java /usr/local/opt/openjdk@11/bin/"
Obviously, this is not the go-to solution, but at least it's something. My biggest doubt is still about the /usr/local/opt/openjdk@11/bin/java
error, which I think could be related to my local machine, but I really don't know how.
Thank you for looking into this and sharing your workaround. I still plan to look into this as well; sorry I haven't been able to do it yet.
Just did some testing. I cannot reproduce the error you reported with PySpark (REPL or spark-submit) or with the Scala REPL. Everything seems to work fine.
So I guess the problem is triggered by a combination of sbt-built jobs and Spark 3.0. I know that sbt made some breaking changes in the jump from 0.13 to 1.0. I see you're using sbt 1.3.
How about if you try building your application with sbt 0.13? Or some other version of sbt < 1.3?
I tried building my application with sbt 0.13
, but nothing changed. I'm almost convinced that this error is related to how Homebrew
installs the apache-spark
package: inside /usr/local/Cellar/apache-spark/3.0.0/bin/load-spark-env.sh
I found the following:
#!/bin/bash
JAVA_HOME="/usr/local/opt/openjdk@11" exec "/usr/local/Cellar/apache-spark/3.0.0/libexec/bin/load-spark-env.sh" "$@"
I tried removing that initial JAVA_HOME
string, but I had no luck.
I will keep investigating and if I find the solution I will post it here.
Oh, hmm... I'm not a Scala guy so humor me here:
Is it possible that when you build your Spark application, the build process somehow references the Homebrew-installed Spark, which includes a dependency on JDK 11? Can you somehow try to build your application against a version of Spark that's built against JDK 8 and see if that helps? Not sure this idea makes sense, actually.
I don't mean to make so much work for you. 😄 If JDK 11 is going to be frequently used with Spark 3.0, perhaps Flintrock should just support this out of the box somehow.
The challenge will be to figure out how to make Flintrock gracefully support both users who use JDK 8 and those who use JDK 11. Or, at the very least, provide some new option or note in the README that JDK 11 users can reference.
Actually, I'm not that much of a Scala guy too.. I'm currently learning the language...
Anyway, I just did some more testing:
I don't think that my issue can be related to Flintrock at all, but it could be related to how Spark and/or it's Homebrew installation manages the --deploy-mode cluster
option of spark-submit
.
I tried to build Spark 3.0.0 using a custom Homebrew formula to support Java 8 (in this case my JAVA_HOME
variable was set to /usr/local/opt/jenv/versions/1.8/
, since I'm using jenv
to manage multiple Java versions). Then, I created the cluster using Flintrock, as I did before, and I launched spark-submit
from my local machine with --deploy-mode cluster
: this time the error
Cannot run program "/usr/local/opt/openjdk@11/bin/java" (in directory "/home/ec2-user/spark/work/driver-...")
became
Cannot run program "/usr/local/opt/jenv/versions/1.8/bin/java" (in directory "/home/ec2-user/spark/work/driver-...")
meaning that something related to the driver is not fully executed on the cluster. So, I really don't understand how the --deploy-mode cluster
option currently works, since it should just launch the driver directly inside the cluster, instead of running it on the submitter.
Now, I'm not using anymore the hack related to forcing JAVA_HOME
on the cluster to be the one on my local machine (as I reported above), but instead I'm running spark-submit
directly from the master node of the cluster, like so:
ssh -i <key-pair> <ec2-user>@<ec2-master-name> -t "spark-submit --master spark://<ec2-master-name>:7077 ..."
where <key-pair>
is the path to an EC2 key pair .pem
file, <ec2-user>
is the name of the user in the cluster (Flintrock's user
option in the .yaml
config) and <ec2-master-name>
is the DNS name of the EC2 master node.
Another thing that I noticed is that the reported command works, while the following Flintrock command does not:
flintrock run-command --master-only <ec2-cluster-name> "spark-submit --master spark://<ec2-master-name>:7077 ..."
where <ec2-cluster-name>
is the name of the EC2 cluster, given by flintrock describe
.
Ideally, the two commands should be doing the same thing, shouldn't they?
About out-of-the-box Flintrock's support of Java 11, what about providing a new option inside the .yaml
configuration?
Ideally, the two commands should be doing the same thing, shouldn't they?
They should. Do you get an error with flintrock run-command
, or just no output?
About out-of-the-box Flintrock's support of Java 11, what about providing a new option inside the .yaml configuration?
That could work, though I am more inclined to just have Flintrock automatically make things work whether Java 8 or Java 11 are being used, if that's not too complicated.
Hi Nicholas,
Just noticed this issue. I have been routinely using Flintrock to run Spark 3.0 on Java 11 for many months. My multi-jdk-support is still a bit ugly and I am wary of Don't expand the support matrix.
BTW, generally I run Spark from the master node rather than my desktop FWIW.
@sfcoy - That's good to know, and I appreciate the restraint to add more supported configurations.
Do you know if there is any reason to install Java 8 if the cluster is being launched with Spark 3+? If apps built with Java 8 can be deployed to a Java 11 runtime, perhaps we can just install Java 11 automatically when Spark is set to 3.0 or newer. That saves us from needing to add a user-facing option.
@Wadaboa - That definitely sounds like a bug. Does that happen just with spark-submit
, or does it also happen if you try something more mundane like flintrock run-command --master-only <ec2-cluster-name> "ls /tmp"
?
Also, are you sure the command hangs, or is it just that it doesn't return any output unless there is an error? See #135 for a potentially relevant issue.
Hi @nchammas, as far as Java 11 goes, we cannot use it if the user chooses to use HDFS. Hadoop 3.3 (released July 2020) is required for Java 11 and as yet the Apache Spark team have not discussed publicly (that I have noticed) when they will build releases that include this. Of course it is trivial to build one's own release.
I'm not sure it is a good idea to auto select the JDK because some users may have their own libraries that are not JDK 11 compatible.
@nchammas - You are right, it works, it just doesn't return any output unless there is an error. I wasn't aware of the actual workings of the command.
Anyway, I agree with issue #135 that it could be valuable to actually see the output produced by run-command
. Maybe implementing a new module, different from run-command
, could be useful.
@Wadaboa - I think run-command
could perhaps be refactored or rewritten as a wrapper around parallel-ssh. Let's continue that conversation over on #135.
@sfcoy - Would you like to submit a PR that adds an option to Flintrock allowing the user to specify the Java version they want deployed to the cluster? I took a quick look at your branch and it looks like a very good start to me. I think with a few refinements we can get this in and release it as part of Flintrock 1.1.
Sure thing. I also think that "no JDK" should be an option as well. That way users can choose an AMI which includes the required JDK.