USCDataScience/sparkler

Sparkler cannot be executed on Databricks because the SparkContext is not pulled from the SparkSession

mattvryan-github opened this issue

Issue Description

When trying to run Sparkler on a Databricks cluster, it fails to see the worker nodes. This is because of the way the Databricks image sets up the Spark environment: the SparkContext must be obtained from the existing SparkSession rather than created directly.
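A minimal sketch of the change being described, assuming Sparkler's crawl job currently constructs its own SparkContext from a SparkConf; the names here are illustrative, not Sparkler's actual code:

```scala
import org.apache.spark.sql.SparkSession

// On Databricks a SparkSession already exists for the cluster, so
// getOrCreate() returns it; when running standalone it creates one.
val spark = SparkSession.builder()
  .appName("sparkler-crawl") // illustrative app name
  .getOrCreate()

// Pull the SparkContext from the session instead of calling
// `new SparkContext(conf)`. On Databricks this yields the cluster's
// pre-configured context, so the job can see the worker nodes.
val sc = spark.sparkContext
```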

How to reproduce it

Put the Sparkler fat JAR, conf, and plugin directories on the master node of a Databricks cluster and try to crawl. You will get messages like:
2020-10-05 22:50:43 INFO Injector$:97 - Injecting 1 seeds
2020-10-05 22:50:47 WARN SparkContext:69 - Please ensure that the number of slots available on your executors is limited by the number of cores to task cpus and not another custom resource. If cores is not the limiting resource then dynamic allocation will not work properly!
2020-10-05 22:51:04 WARN TaskSchedulerImpl:69 - Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Environment and Version Information

Please indicate relevant versions:

  • Java Version
  • Spark Version: 3.0.1
  • Operating System: Red Hat Linux and Ubuntu

External links for reference

https://docs.databricks.com/jobs.html

Contributing

If you'd like to help us fix the issue by contributing some code, but would like guidance or help in doing so, please mention it!
A pull request is in progress.