Build for PySpark App

Current we build on Spark version: 3.4.1

Python as 3.10.x

Sparglim is a good tools for PySpark app and daemon Connect Server

Avaliable on dockerhub:

For PySpark app: wh1isper/pyspark-app-base
For Spark Connect Server: wh1isper/spark-executor
For Spark on k8s(Spark executor): wh1isper/spark-connector-server

You can modify the build script and dockerfile to suit your Spark version and other needs

Prepare

We use pypi to install pyspark

Prepare your jars in ./pyspark-app-base/jars, see below:"(Optional) Adding Hadoop tools to Spark(eg. s3a ...)"

I prepared jars of hadoop tools(hadoop 3.3.4) for spark 3.4.x, Try cd ./pyspark-app-base && ./download-jars.sh

Build

cd pyspark-app-base
# Prepare your jars in ./pyspark-app-base/jars, see below:"(Optional) Adding Hadoop tools to Spark(eg. s3a ...)"
# Then build spark 3.4.1 with jars
SPARK_VERSION=3.4.1;docker buildx build -t wh1isper/pyspark-app-base:${SPARK_VERSION} --platform linux/amd64,linux/arm64/v8 -f pyspark-app-base.Dockerfile --build-arg SPARK_VERSION=${SPARK_VERSION} --push .

Build for Spark on K8S

Prepare

Requires Prebuild of Spark, download from Spark download

Current we build on Spark version 3.4.1

Download Spark:

export SPARK_VERSION=3.4.1
wget https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz

Unpack Spark and configure:

# May Need sudo
tar -zxvf spark-${SPARK_VERSION}-bin-hadoop3.tgz -C /opt
mv /opt/spark-${SPARK_VERSION}-bin-hadoop3/ /opt/spark-${SPARK_VERSION}

export SPARK_HOME=/opt/spark-${SPARK_VERSION}
export PATH=${SPARK_HOME}/bin:${SPARK_HOME}/sbin:$PATH

Verified

spark-shell

(Optional) Adding Hadoop tools to Spark(eg. s3a ...)

ATTENTION: There are packages in Hadoop that may be lower than Spark's version and need to be removed

In Spark 3.4.1 + hadoop 3.3.4: These two packages are out of date zstd-jni-1.4.9-1.jar lz4-java-1.7.1.jar

From my archive

wget http://42.193.219.110:8080/hadoop-3.3.4-share-hadoop-tools-lib.tar.gz
tar -zxvf hadoop-3.3.4-share-hadoop-tools-lib.tar.gz
mv hadoop-3.3.4-tools-lib/* ${SPARK_HOME}/jars/

Official way to get hadoop tools

1 Find the specific version of hadoop used by Spark

ls $SPARK_HOME/jars | grep hadoop

hadoop-{package_name}-{version}.jar eg. hadoop-client-3.3.4.jar means hadoop version 3.3.4

2 Download hadoop, 3.3.4 as example

export HADOOP_VERSION=3.3.4
wget https://downloads.apache.org/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
tar xvzf hadoop-*.tar.gz
mv hadoop-${HADOOP_VERSION} /opt/hadoop
export HADOOP_HOME=/opt/hadoop

3 Copy tools to spark jars

cp ${HADOOP_HOME}/share/hadoop/tools/lib/* ${SPARK_HOME}/jars/

(Optional) Download spark-connect jar

Spark will be downloaded automatically, if the deployment environment does not have network, please download it first and then use it.

cd $SPARK_HOME/jars
wget https://repo1.maven.org/maven2/org/apache/spark/spark-connect_${SCALA_VERSION}/${SPARK_VERSION}/spark-connect_${SCALA_VERSION}-${SPARK_VERSION}.jar

Build Spark Executor/Connect Server

Require

Need env: SPARK_HOME

Optional env: SPARK_VERSION

Build Connect-server

You can modify the start-server.sh to suit your Spark version and other needs

pushd spark-connect-server
./build.sh
popd

Run:

docker run -it --rm \
-p 15002:15002 \
-p 4040:4040 \
wh1isper/spark-connector-server:3.4.1

Test with pyspark

pyspark --remote "sc://localhost:15002"

Code example

from datetime import datetime, date
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df.show()

Build executor

pushd spark-executor
./build.sh
popd

Note If you need to call python functions, you need the python executable to be in the same location (and have the same package installed), you can use conda-pack to move your conda env into container, just make sure unpack to the same path

TIPS

S3 secrets tokens(and others) need only be configured on the Driver or Connect Server, Configuration in Connect client take no effort.

Used BY

pyspark-sampling

sparglim

mohankrishna225/spark-build

Build for PySpark App

Prepare

Build

Build for Spark on K8S

Prepare

Download Spark:

Unpack Spark and configure:

Verified

(Optional) Adding Hadoop tools to Spark(eg. s3a ...)

From my archive

Official way to get hadoop tools

(Optional) Download spark-connect jar

Build Spark Executor/Connect Server

Require

Build Connect-server

Build executor

TIPS

Used BY