cordon-thiago/airflow-spark

Use spark 3.0.2 as most repos do not have 3.0.1 anymore

jpuris opened this issue · 2 comments

Spark 3.0.1 is no longer available on the Apache mirrors, and building jupyter/pyspark-notebook fails with the following error:

❯ docker build --rm --force-rm -t jupyter/pyspark-notebook:3.0.1 .
[+] Building 3.3s (7/12)
 => [internal] load build definition from Dockerfile                                                                                                     0.0s
 => => transferring dockerfile: 3.27kB                                                                                                                   0.0s
 => [internal] load .dockerignore                                                                                                                        0.0s
 => => transferring context: 2B                                                                                                                          0.0s
 => [internal] load metadata for docker.io/jupyter/scipy-notebook:latest                                                                                 2.7s
 => [1/9] FROM docker.io/jupyter/scipy-notebook@sha256:00af391facb071b6b6191893555811a7680a3d7c40eb1bac5c17540e6131f625                                  0.0s
 => CACHED [2/9] RUN apt-get -y update &&     apt-get install --no-install-recommends -y     "openjdk-11-jre-headless"     ca-certificates-java &&       0.0s
 => CACHED [3/9] WORKDIR /tmp                                                                                                                            0.0s
 => ERROR [4/9] RUN wget -q $(python -c "import requests; content = requests.get('https://www.apache.org/dyn/closer.lua/spark/spark-3.0.1/spark-3.0.1-b  0.5s
------
 > [4/9] RUN wget -q $(python -c "import requests; content = requests.get('https://www.apache.org/dyn/closer.lua/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz?as_json').json(); print(content['preferred']+content['path_info'])"):
------
executor failed running [/bin/bash -o pipefail -c wget -q $(python -c "import requests; content = requests.get('https://www.apache.org/dyn/closer.lua/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz?as_json').json(); print(content['preferred']+content['path_info'])")]: exit code: 8
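
For context, the failing RUN step asks the Apache closer.lua API for a mirror URL and then wgets the tarball; once a release is dropped from the mirrors, the lookup points at a file that no longer exists and wget exits with code 8 (server error response). The same lookup can be reproduced outside Docker with the one-liner from the Dockerfile, version bumped to 3.0.2:

python -c "import requests; content = requests.get('https://www.apache.org/dyn/closer.lua/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz?as_json').json(); print(content['preferred']+content['path_info'])"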

Updating the Dockerfile's Spark version and checksum values fixes the issue.
Old

ARG spark_version="3.0.1"
ARG spark_checksum="F4A10BAEC5B8FF1841F10651CAC2C4AA39C162D3029CA180A9749149E6060805B5B5DDF9287B4AA321434810172F8CC0534943AC005531BB48B6622FBE228DDC"

New

ARG spark_version="3.0.2"
ARG spark_checksum="a9bd16d6957579bb2f539d88f83ef5a5005bfbf2909078691397f0f1590b6a0e73c7fd6d51a0b1d69251a1c4c20b9490006b8fa26ebe37b87e9c0cee98aa3338"
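
In case the version needs bumping again later, a minimal sketch for regenerating the checksum; the archive.apache.org URL is an assumption (the archive keeps releases after the mirrors drop them):

# Download the release tarball and compute its SHA-512 locally,
# then paste the digest into the spark_checksum ARG.
wget -q https://archive.apache.org/dist/spark/spark-3.0.2/spark-3.0.2-bin-hadoop2.7.tgz
sha512sum spark-3.0.2-bin-hadoop2.7.tgz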

@jpuris: Thanks for the solution. I was able to build the Docker image, but when I run the test DAGs, they fail. I tried running spark-submit directly in the container shell, but no luck. The error is not specific (the job gets killed automatically).
Airflow error: [screenshot omitted]
Shell error:

21/05/15 20:40:23 INFO BlockManagerMasterEndpoint: Registering block manager spark:38413 with 366.3 MiB RAM, BlockManagerId(driver, spark, 38413, None)
21/05/15 20:40:23 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, spark, 38413, None)
21/05/15 20:40:23 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, spark, 38413, None)
21/05/15 20:40:23 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20210515204023-0003/1 is now RUNNING
21/05/15 20:40:23 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20210515204023-0003/0 is now RUNNING
21/05/15 20:40:23 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20210515204023-0003/2 is now RUNNING
21/05/15 20:40:24 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
21/05/15 20:40:24 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/opt/bitnami/spark/spark-warehouse').
21/05/15 20:40:24 INFO SharedState: Warehouse path is 'file:/opt/bitnami/spark/spark-warehouse'.
21/05/15 20:40:26 INFO InMemoryFileIndex: It took 89 ms to list leaf files for 1 paths.
21/05/15 20:40:26 INFO InMemoryFileIndex: It took 2 ms to list leaf files for 1 paths.
Killed
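
A bare "Killed" with no stack trace usually means the kernel OOM killer terminated the process rather than Spark raising an error, and the BlockManager registering with only 366.3 MiB of RAM points the same way. A sketch of a retry with explicit memory settings, purely as a starting point (the master URL and application path are assumptions, not taken from the repo):

# Hypothetical job path and master URL; raise the container's memory
# limit as well if the process is still OOM-killed.
spark-submit \
  --master spark://spark:7077 \
  --driver-memory 1g \
  --executor-memory 1g \
  /usr/local/spark/app/my_job.py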

Have you faced anything like this when you changed to 3.0.2?

Hi @jpuris
Thank you for your contribution.
The repo was updated to use Spark 3.1.2: https://github.com/cordon-thiago/airflow-spark/tree/airflow1.10.7_spark3.1.2