d2iq-archive/spark-build

Add CLI support for getting list of `submissionIds`


At the moment, the DCOS Spark CLI (as described here) allows you to:

  • get status for a given Spark driver, via `dcos spark status <submissionId>`
  • kill a given Spark driver, via `dcos spark kill <submissionId>`.

But in order to know your submissionId, you currently have to capture it from the output of the original `dcos spark run` command when the driver is first submitted.

Our use case is for long running Spark drivers; if we update our code for a given driver, we want our CI/CD pipeline to replace the existing Spark driver with the newly built one. It's trivial to automate starting the newly built Spark driver (just run dcos spark run) but killing the old one is more complex.
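To make the pain point concrete, here is a minimal sketch of the capture step our pipeline would need today. The regex for the submissionId format and the assumption that `dcos spark run` prints the ID on stdout are ours, not guaranteed by the CLI; adjust to match your version's output.

```python
# Hypothetical CI/CD helper: capture the submissionId from `dcos spark run` output.
# The ID pattern (driver-<timestamp>-<counter>) is an assumption about the format.
import re
import subprocess

SUBMISSION_ID_RE = re.compile(r"driver-\d+-\d+")

def extract_submission_id(run_output):
    """Return the first submissionId-looking token in the CLI output, or None."""
    match = SUBMISSION_ID_RE.search(run_output)
    return match.group(0) if match else None

def run_driver(submit_args):
    # Requires the dcos CLI on PATH; shown for illustration only.
    output = subprocess.run(
        ["dcos", "spark", "run", "--submit-args=" + submit_args],
        capture_output=True, text=True, check=True,
    ).stdout
    return extract_submission_id(output)
```

The point is that the pipeline must stash the returned ID somewhere durable in order to kill the driver later, which is exactly the state we'd rather not maintain.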

Given that we don't want our CI/CD pipeline to maintain state about the last submissionId for a given driver, we need some mechanism for inspecting which drivers are currently running in Mesosphere Spark and identifying any existing submissions that match a specified name (i.e. whatever name was provided in the original `dcos spark run` command).

I realise there is no existing Spark mechanism to provide this information (I see from browsing the DCOS Spark CLI code that the commands described above just delegate to `spark-submit --status` and `spark-submit --kill`).

However, having inspected the data in Zookeeper (both using Exhibitor and zkCli.sh on a master node), it's clear that Spark stores the information we need in /spark_mesos_dispatcher/launchedDrivers, with each driver stored as a child node under that path containing a binary-serialized MesosClusterSubmissionState. To explore this yourself on the Zookeeper CLI:

  • `ls /spark_mesos_dispatcher/launchedDrivers` to get a list of submission IDs
  • `get /spark_mesos_dispatcher/launchedDrivers/<driverId>` to print a string representation of the MesosClusterSubmissionState object for that submission.
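Programmatically, the same listing can be sketched with a ZooKeeper client. This assumes the kazoo library, a reachable ZooKeeper ensemble, and the znode layout described above; the idea that submission IDs embed a sortable timestamp is likewise an assumption about their format, not something Spark guarantees.

```python
# Sketch only: list launched-driver submission IDs straight from ZooKeeper,
# assuming the /spark_mesos_dispatcher/launchedDrivers layout described above.

LAUNCHED_DRIVERS_PATH = "/spark_mesos_dispatcher/launchedDrivers"

def newest_submission(submission_ids):
    """Pick the most recently launched submission, assuming IDs embed a
    lexically sortable timestamp (e.g. driver-20230101123456-0001)."""
    return max(submission_ids) if submission_ids else None

def list_submission_ids(zk_hosts="leader.mesos:2181"):
    # Requires a live cluster; kazoo is a third-party ZooKeeper client.
    from kazoo.client import KazooClient
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    try:
        return zk.get_children(LAUNCHED_DRIVERS_PATH)
    finally:
        zk.stop()

if __name__ == "__main__":
    print(list_submission_ids())
```

Matching a submission back to the name given at `dcos spark run` time would still require deserializing each MesosClusterSubmissionState, which is the part we'd rather not reimplement ourselves.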

At this point I see us having three options:

  1. Work with Apache Spark to add support in spark-submit for this, and then work with this project to expose that functionality.
  2. Work with this project to directly add support for a dcos spark list or similar command to list running drivers, which under the covers hits Zookeeper directly.
  3. Build our own mechanism - probably a simple service running in a Docker container inside Mesos - which inspects information in Zookeeper and serves it via a REST API, which we can then expose outside the cluster.
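For what it's worth, option (3) would amount to little more than the following stdlib-only sketch, where `fetch_submission_ids` is a stub standing in for the ZooKeeper lookup described above (endpoint path and payload shape are invented for illustration):

```python
# Minimal sketch of option (3): an HTTP endpoint serving submission IDs.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def fetch_submission_ids():
    # Placeholder; a real service would read /spark_mesos_dispatcher/launchedDrivers.
    return ["driver-20230101123456-0001"]

def drivers_payload():
    return json.dumps({"submissionIds": fetch_submission_ids()}).encode()

class DriversHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/v1/drivers":
            body = drivers_payload()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("", 8080), DriversHandler).serve_forever()
```

It's doable, but it duplicates knowledge of Spark's internal ZooKeeper layout outside this project, which is why we'd prefer option (2).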

Given the chain of dependencies, I suspect that option (1) - if it is in fact considered acceptable by all parties - would take a long time to become available to us in our Mesosphere cluster; I therefore thought I'd explore the possibility of option (2) with you - hence this issue! - before we resort to option (3).

Looking forward to your thoughts.