Use Spark 3.0.2, as most mirrors no longer have 3.0.1
jpuris opened this issue · 2 comments
jpuris commented
Spark 3.0.1 is no longer available from the Apache mirrors, and building jupyter/pyspark-notebook fails with the following error:
❯ docker build --rm --force-rm -t jupyter/pyspark-notebook:3.0.1 .
[+] Building 3.3s (7/12)
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 3.27kB 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for docker.io/jupyter/scipy-notebook:latest 2.7s
=> [1/9] FROM docker.io/jupyter/scipy-notebook@sha256:00af391facb071b6b6191893555811a7680a3d7c40eb1bac5c17540e6131f625 0.0s
=> CACHED [2/9] RUN apt-get -y update && apt-get install --no-install-recommends -y "openjdk-11-jre-headless" ca-certificates-java && 0.0s
=> CACHED [3/9] WORKDIR /tmp 0.0s
=> ERROR [4/9] RUN wget -q $(python -c "import requests; content = requests.get('https://www.apache.org/dyn/closer.lua/spark/spark-3.0.1/spark-3.0.1-b 0.5s
------
> [4/9] RUN wget -q $(python -c "import requests; content = requests.get('https://www.apache.org/dyn/closer.lua/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz?as_json').json(); print(content['preferred']+content['path_info'])"):
------
executor failed running [/bin/bash -o pipefail -c wget -q $(python -c "import requests; content = requests.get('https://www.apache.org/dyn/closer.lua/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz?as_json').json(); print(content['preferred']+content['path_info'])")]: exit code: 8
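For reference, the failing step resolves a download mirror through Apache's closer.lua JSON API; once a release is archived off the mirrors, the resolved URL returns an error and wget exits with code 8 ("server issued an error response"). A minimal sketch of that resolution logic, using a hypothetical canned response in place of a live request:

```python
# Sketch of the Dockerfile's mirror-resolution step. The real step queries
# https://www.apache.org/dyn/closer.lua/spark/...?as_json; the dict below is
# a hypothetical example of the response shape, not live data.
sample_response = {
    "preferred": "https://downloads.apache.org/",
    "path_info": "spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz",
}

def resolve_download_url(response: dict) -> str:
    # closer.lua returns a preferred mirror plus the artifact path;
    # the Dockerfile concatenates the two to build the wget URL.
    return response["preferred"] + response["path_info"]

print(resolve_download_url(sample_response))
```

If the version named in `path_info` has been moved to the Apache archive, the concatenated URL no longer exists on the mirror, which is exactly the failure above.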
Updating the Dockerfile's spark_version and spark_checksum values fixes this issue.
Old:
ARG spark_version="3.0.1"
ARG spark_checksum="F4A10BAEC5B8FF1841F10651CAC2C4AA39C162D3029CA180A9749149E6060805B5B5DDF9287B4AA321434810172F8CC0534943AC005531BB48B6622FBE228DDC"
New:
ARG spark_version="3.0.2"
ARG spark_checksum="a9bd16d6957579bb2f539d88f83ef5a5005bfbf2909078691397f0f1590b6a0e73c7fd6d51a0b1d69251a1c4c20b9490006b8fa26ebe37b87e9c0cee98aa3338"
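The spark_checksum ARG is a SHA-512 digest of the release tarball, so when bumping the version the new value can be verified locally before editing the Dockerfile. A hedged sketch, assuming the tarball has already been downloaded to the working directory:

```python
import hashlib

def sha512_of(path: str) -> str:
    # Stream the file in chunks so large tarballs need not fit in memory.
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: compare against the Dockerfile ARG value.
# Hex digests may differ in letter case, so compare case-insensitively:
#   expected = "a9bd16d6...aa3338"  # truncated here for illustration
#   assert sha512_of("spark-3.0.2-bin-hadoop2.7.tgz").lower() == expected.lower()
```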
irenemathew commented
@jpuris: Thanks for the solution. I was able to build the Docker image, but when I run the test DAGs, they fail. I also tried running a spark-submit command directly in the container shell, with no luck. The error is not specific (the job gets killed automatically).
Airflow error:
Shell error:
21/05/15 20:40:23 INFO BlockManagerMasterEndpoint: Registering block manager spark:38413 with 366.3 MiB RAM, BlockManagerId(driver, spark, 38413, None)
21/05/15 20:40:23 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, spark, 38413, None)
21/05/15 20:40:23 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, spark, 38413, None)
21/05/15 20:40:23 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20210515204023-0003/1 is now RUNNING
21/05/15 20:40:23 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20210515204023-0003/0 is now RUNNING
21/05/15 20:40:23 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20210515204023-0003/2 is now RUNNING
21/05/15 20:40:24 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
21/05/15 20:40:24 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/bitnami/spark/spark-warehouse').
21/05/15 20:40:24 INFO SharedState: Warehouse path is 'file:/opt/bitnami/spark/spark-warehouse'.
21/05/15 20:40:26 INFO InMemoryFileIndex: It took 89 ms to list leaf files for 1 paths.
21/05/15 20:40:26 INFO InMemoryFileIndex: It took 2 ms to list leaf files for 1 paths.
Killed
Have you faced anything like this when you changed to 3.0.2?
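Not from the thread, but possibly relevant to the symptom above: a bare `Killed` with no stack trace is frequently the Linux OOM killer terminating the JVM, and the log shows the driver registered only ~366 MiB of RAM. One possible mitigation, sketched as Spark configuration under the assumption that the containers have more memory available to give:

```
# spark-defaults.conf -- illustrative values only; size these to the host.
spark.driver.memory    2g
spark.executor.memory  2g
```

The same settings can be passed per job via `spark-submit --driver-memory 2g --executor-memory 2g ...`.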
cordon-thiago commented
Hi @jpuris
Thank you for your contribution.
The repo was updated to use Spark 3.1.2: https://github.com/cordon-thiago/airflow-spark/tree/airflow1.10.7_spark3.1.2