Launch cluster with AMI that already has Spark
pwsiegel opened this issue · 5 comments
- Flintrock version: 0.11.0
- Python version: 3.7.5
- OS: OSX
I tend to use Flintrock with custom builds of Spark. Normally I host the build somewhere and point the `download-source` configuration parameter in the Flintrock config at it, and this works fine. But I thought it might be convenient to create an AMI (starting from Amazon Linux 2) with Java and Spark already installed and set `install-spark` to `False` in the Flintrock config, so I gave this a try. The cluster launched as expected, but when I tried to start a Spark session I got:
`WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master [IP-ADDRESS]`
It retried for a while and eventually errored out. Is there a known way to get this to work?
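For reference, here's roughly what the relevant parts of my config look like (the AMI ID, key details, region, and versions are placeholders, and the exact keys may differ slightly from my real file):

```yaml
# Rough sketch of my flintrock config -- AMI ID, key, paths, and versions are placeholders.
provider: ec2

providers:
  ec2:
    key-name: my-key
    identity-file: /path/to/my-key.pem
    region: us-east-1
    ami: ami-0123456789abcdef0   # custom AMI built on Amazon Linux 2 with Java + Spark baked in
    user: ec2-user

services:
  spark:
    version: 2.4.4   # normally I would set download-source here instead

launch:
  num-slaves: 2
  install-spark: False
```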
One last note: when I allow Flintrock to install Spark, there is normally a message at the end of the launch process that says `Configuring Spark master...`. I didn't get that when I set `install-spark` to `False`.
Thank you!
It's probably because Spark isn't being configured with the addresses of the nodes in your cluster (which would happen as part of the "Configuring Spark master" step).
Off the top of my head, I think to get this to work you'd need to set `install-spark` to `True` and then update the `install()` method of the `Spark` service class to skip trying to download Spark. It's a hack, but it will get you going.
A more proper fix would perhaps be to add a new `download-destination` configuration option and have `install()` skip the download if the destination is already populated. That way you can keep `install-spark` enabled and Flintrock will skip the download but still do the necessary configuration.
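Very roughly, and purely hypothetically since the option doesn't exist yet, the config might end up looking something like this:

```yaml
# Hypothetical sketch -- download-destination is the proposed option,
# not something Flintrock supports today. Values are illustrative.
services:
  spark:
    version: 2.4.4
    download-source: "https://example.com/builds/spark-2.4.4-custom.tgz"
    download-destination: /home/ec2-user/spark   # if already populated, skip the download
```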
Got it, thank you. I'm not sure if/when I'll have the time, but would you consider a PR in the direction of your second suggestion?
And one last question, for my own understanding: what is the intended use of the `install-spark` parameter?
Yes, I would consider a PR along those lines.
If you just want a cluster with HDFS, or plan to do the Spark config yourself, then setting `install-spark` to `False` is what you want.
In other words, you can make things work with Flintrock as it is today by setting `install-spark` to `False`, but you then also need to call `flintrock run-command ...` and do the Spark configuration yourself.
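For example, something along these lines -- the Spark install path and master address are placeholders, and the scripts shown are the standard Spark 2.x standalone scripts rather than anything Flintrock-specific:

```shell
# Tell every node where the master is (assumes Spark lives at $HOME/spark on the AMI).
flintrock run-command my-cluster \
    'echo "export SPARK_MASTER_HOST=<master-private-ip>" >> $HOME/spark/conf/spark-env.sh'

# Then start the standalone daemons yourself, e.g. over SSH:
#   on the master:    $HOME/spark/sbin/start-master.sh
#   on each slave:    $HOME/spark/sbin/start-slave.sh spark://<master-private-ip>:7077
```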
The `download-destination` idea would make it easier to separate downloading Spark from configuring Spark. That would be better for users in your situation, since Flintrock could still do the configuration instead of putting that on the user. Right now you have to either enable the download and the configuration together, or disable both together.
That makes sense. I might have a little time to tinker this week; I'll be back if I make sufficient progress. Thanks for your help, and for Flintrock itself!