Launch cluster with AMI that already has Spark
pwsiegel opened this issue · 5 comments
- Flintrock version: 0.11.0
- Python version: 3.7.5
- OS: OSX
I tend to use Flintrock with custom builds of Spark. Normally I host the build somewhere and point the `download-source` configuration parameter in the Flintrock config at it, and this works fine. But I thought it might be convenient to create an AMI (starting from Amazon Linux 2) with Java and Spark already installed and set `install-spark` to `False` in the Flintrock config, so I gave this a try. The cluster launched as expected, but when I tried to start a Spark session I got:
`WARN StandaloneAppClient$ClientEndpoint: Failed to connect to master [IP-ADDRESS]`
It retried for a while and eventually errored out. Is there a known way to get this to work?
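For reference, here's roughly what the relevant parts of my config look like (the AMI ID, key details, region, and versions are placeholders, and the exact keys may differ slightly from my real file):

```yaml
# Rough sketch of my flintrock config -- AMI ID, key, paths, and versions are placeholders.
provider: ec2

providers:
  ec2:
    key-name: my-key
    identity-file: /path/to/my-key.pem
    region: us-east-1
    ami: ami-0123456789abcdef0   # custom AMI built on Amazon Linux 2 with Java + Spark baked in
    user: ec2-user

services:
  spark:
    version: 2.4.4   # normally I would set download-source here instead

launch:
  num-slaves: 2
  install-spark: False
```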
One last note: when I allow Flintrock to install Spark, there is normally a message at the end of the launch process that says `Configuring Spark master...`. I didn't get that when I set `install-spark` to `False`.
Thank you!
It's probably because Spark isn't being configured with the addresses of the nodes in your cluster (which would happen as part of the "Configuring Spark master" step).
Off the top of my head, I think to get this to work you'd need to set `install-spark` to `True` and then update the `install()` method of the `Spark` service class to skip trying to download Spark. It's a hack, but it will get you going.
A more proper fix would perhaps be to add a new `download-destination` configuration option and have `install()` skip the download if the destination is already populated. That way you can keep `install-spark` enabled and Flintrock will skip the download but still do the necessary configuration.
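Very roughly, and purely hypothetically since the option doesn't exist yet, the config might end up looking something like this:

```yaml
# Hypothetical sketch -- download-destination is the proposed option,
# not something Flintrock supports today. Values are illustrative.
services:
  spark:
    version: 2.4.4
    download-source: "https://example.com/builds/spark-2.4.4-custom.tgz"
    download-destination: /home/ec2-user/spark   # if already populated, skip the download
```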
Got it, thank you. I'm not sure if/when I'll have the time, but would you consider a PR in the direction of your second suggestion?
And one last question, for my own understanding: what is the intended use of the `install-spark` parameter?
Yes, I would consider a PR along those lines.
If you just want a cluster with HDFS, or plan to do the Spark config yourself, then setting `install-spark` to `False` is what you want.
In other words, you can make things work with Flintrock as it is today by setting `install-spark` to `False`, but you then also need to call `flintrock run-command ...` and do the Spark configuration yourself.
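For example, something along these lines -- the Spark install path and master address are placeholders, and the scripts shown are the standard Spark 2.x standalone scripts rather than anything Flintrock-specific:

```shell
# Tell every node where the master is (assumes Spark lives at $HOME/spark on the AMI).
flintrock run-command my-cluster \
    'echo "export SPARK_MASTER_HOST=<master-private-ip>" >> $HOME/spark/conf/spark-env.sh'

# Then start the standalone daemons yourself, e.g. over SSH:
#   on the master:    $HOME/spark/sbin/start-master.sh
#   on each slave:    $HOME/spark/sbin/start-slave.sh spark://<master-private-ip>:7077
```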
The `download-destination` idea would make it easier to separate downloading Spark from configuring Spark. That would be better for users in your situation, since Flintrock could still do the configuration instead of putting that on the user. Right now you have to either enable the download and the configuration together, or disable both together.
That makes sense. I might have a little time to tinker this week; I'll be back if I make sufficient progress. Thanks for your help, and for Flintrock itself!