microsoft/SynapseVSCode

Can't run spark-shell or pyspark

fernandojpsilva opened this issue · 0 comments

Hi! I'm new to this project, and I've been struggling to run spark-shell/pyspark inside the container. Initially I tried running a simple Python script that creates a local Spark session and does some dataframe transformations, but it crashed. Running spark-shell on its own produces the same error. Here's the full log:

    sh-5.1# spark-shell
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    [WARN ] 2024-06-04 11:22:52.489 [main] NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    [WARN ] 2024-06-04 11:22:53.230 [main] SparkContext: Exception when load sparklyr connector java.lang.ClassNotFoundException: org.apache.spark.sparklyr.DefaultConnector
    Spark context Web UI available at http://156570b35212:4040
    Spark context available as 'sc' (master = local[*], app id = local-1717500173045).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.4.1
          /_/

    Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_372)
    Type in expressions to have them evaluated.
    Type :help for more information.
    scala> [ERROR] 2024-06-04 11:23:03.262 [lighter-poll-status] LighterClientState: fetch status
    java.io.IOException: Could not find Lighter configuration file: conf/lighter-config.json
    at org.apache.spark.lighter.client.JsonConfigReader.reload(JsonConfigReader.scala:20) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at org.apache.spark.lighter.client.JsonConfigReader.<init>(JsonConfigReader.scala:15) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at org.apache.spark.lighter.client.LighterClientContext.init(LighterClientContext.scala:56) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at org.apache.spark.lighter.client.LighterClientContext.<init>(LighterClientContext.scala:45) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at org.apache.spark.lighter.client.LighterClientContext$.context$lzycompute(LighterClientContext.scala:123) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at org.apache.spark.lighter.client.LighterClientContext$.context(LighterClientContext.scala:123) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at org.apache.spark.lighter.client.LighterClientContext$.getOrCreate(LighterClientContext.scala:125) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at org.apache.spark.lighter.client.LighterClientState$.fetchStatus(LighterClientState.scala:99) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at org.apache.spark.lighter.client.LighterClientState$.$anonfun$new$1(LighterClientState.scala:80) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1454) ~[spark-core_2.12-3.4.1.jar:3.4.1]
    at org.apache.spark.lighter.client.LighterClientState$$anon$1.run(LighterClientState.scala:90) ~[spark-lighter-core_2.12-2.0.8_spark-3.4.0.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_372]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) ~[?:1.8.0_372]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_372]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) ~[?:1.8.0_372]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_372]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_372]
    at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_372]

The same error occurs when running pyspark. For additional context, the only information I've found about this JSON file is on Microsoft's Create and manage Apache Spark job definitions in Visual Studio Code page for MS Fabric, which says:

> In the root folder of the source script, the system creates a subfolder named conf. Within this folder, a file named lighter-config.json contains some system metadata needed for the remote run. Do NOT make any changes to it.

However, I can't find any such file. I'm running this on WSL. The only change I've made to the Dockerfile is adding:

    RUN tdnf install -y wget tar awk procps
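To rule out the file simply living somewhere unexpected, a filesystem-wide search can be run inside the container (just plain `find`, nothing Lighter-specific; the `SEARCH_ROOT` variable is my own addition so the same command can be pointed at a narrower tree):

```shell
# Search for the Lighter config anywhere under SEARCH_ROOT (default: /),
# skipping /proc and silencing permission-denied noise.
find "${SEARCH_ROOT:-/}" -name 'lighter-config.json' -not -path '/proc/*' 2>/dev/null
```

This prints nothing in my container, so the file genuinely doesn't exist anywhere on the filesystem.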

Please let me know if you're aware of this issue and whether there's a workaround. Thank you.