databricks-examples

Examples of Databricks notebooks


Examples of code with Databricks

Table of Contents (ToC)

Created by gh-md-toc

Overview

This Git repository features use cases illustrating good and bad practices when using Spark-based tools to process and analyze data.

References

Spark

Spark Connect

Jupyter

Quick start

  • From a dedicated terminal window/tab, launch the Spark Connect server. Note that the SPARK_REMOTE environment variable must not be set at this stage; otherwise, the server being launched would try to connect to the Spark Connect server designated by that variable and would therefore not start
$ sparkconnectstart
  • From another terminal window/tab (that is, not the one running the Spark Connect server), launch PySpark from the command line; it in turn launches JupyterLab
    • Follow the details given by PySpark to open Jupyter in a web browser
$ export SPARK_REMOTE="sc://localhost:15002"; pyspark
...
[C 2023-06-27 21:54:04.720 ServerApp] 
    
    To access the server, open this file in a browser:
        file://$HOME/Library/Jupyter/runtime/jpserver-21219-open.html
    Or copy and paste one of these URLs:
        http://localhost:8889/lab?token=dd69151c26a3b91fabda4b2b7e9724d13b49561f2c00908d
        http://127.0.0.1:8889/lab?token=dd69151c26a3b91fabda4b2b7e9724d13b49561f2c00908d
...
  • Open Jupyter in a web browser. For instance, on macOS:
$ open ~/Library/Jupyter/runtime/jpserver-*-open.html
  • Running the cells of an example notebook then displays output such as the following Spark DataFrame:
+-------+--------+-------+-------+
|User ID|Username|Browser|     OS|
+-------+--------+-------+-------+
|   1580|   Barry|FireFox|Windows|
|   5820|     Sam|MS Edge|  Linux|
|   2340|   Harry|Vivaldi|Windows|
|   7860|  Albert| Chrome|Windows|
|   1123|     May| Safari|  macOS|
+-------+--------+-------+-------+
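For reference, such a DataFrame can be produced with a notebook cell along the following lines (a minimal sketch; the actual notebook code may differ):
# Minimal sketch of a notebook cell reproducing the sample output above
from pyspark.sql import SparkSession

# With SPARK_REMOTE set, the builder returns (or reuses) a Spark Connect session
spark = SparkSession.builder.getOrCreate()

data = [
    (1580, "Barry", "FireFox", "Windows"),
    (5820, "Sam", "MS Edge", "Linux"),
    (2340, "Harry", "Vivaldi", "Windows"),
    (7860, "Albert", "Chrome", "Windows"),
    (1123, "May", "Safari", "macOS"),
]
spark.createDataFrame(data, ["User ID", "Username", "Browser", "OS"]).show()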
  • Notes:
    • The first cell stops the initial Spark session when the latter has been started by Spark without Spark Connect, for instance when the SPARK_REMOTE environment variable has not been set properly. The call is wrapped in a try-except clause, as a Spark session started through Spark Connect cannot be stopped that way; the first cell may thus be re-executed at will with no further side effect on the Spark session
    • The same first cell then starts the Spark session through Spark Connect, or reuses it when it already exists; a sketch of such a cell is given after these notes
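A first cell matching that description could look as follows (a minimal sketch; the actual notebooks may differ):
# Hypothetical first notebook cell
from pyspark.sql import SparkSession

try:
    # Only succeeds when the session was started by Spark without Spark Connect
    spark.stop()
except Exception:
    # A session started through Spark Connect cannot be stopped this way,
    # so re-executing this cell has no further side effect
    pass

# Start the Spark session through Spark Connect, or reuse it when it already exists
# (SPARK_REMOTE, e.g. sc://localhost:15002, points the builder at the server)
spark = SparkSession.builder.getOrCreate()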

Use cases

Initial setup

PySpark and Jupyter

  • As per the official Apache Spark documentation, PyPI-installed PySpark (pip install pyspark[connect]) comes with Spark Connect from Spark 3.4 onwards (a quick version check is shown below). However, as of Spark 3.4.1, the PySpark installation still lacks the two administration scripts allowing one to start and stop the Spark Connect server. For convenience, these two scripts have therefore been copied into this Git repository, in the tools/ directory. They may then simply be copied into the PySpark sbin directory once PySpark has been installed with pip

  • Install PySpark and JupyterLab, along with a few other Python libraries, from PyPI:

$ pip install -U pyspark[connect,sql,pandas_on_spark] plotly pyvis jupyterlab
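  • Optionally, check from a Python prompt that the installed PySpark is recent enough to ship Spark Connect (a minimal check; Spark Connect comes with PySpark 3.4 and later):
import pyspark
print(pyspark.__version__)   # expected to be 3.4.0 or later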
  • Add the following in the Bash/Zsh init script:
$ cat >> ~/.bashrc << _EOF

# Spark
PY_LIBDIR="$(python -mpip show pyspark|grep "^Location:"|cut -d' ' -f2,2)"
export SPARK_VERSION="\$(python -mpip show pyspark|grep "^Version:"|cut -d' ' -f2,2)"
export SPARK_HOME="\$PY_LIBDIR/pyspark"
export PATH="\$SPARK_HOME/sbin:\$PATH"
export PYSPARK_PYTHON="\$(which python3)"
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8889'

_EOF
  • Re-read the Shell init scripts:
$ exec bash
  • Copy the two Spark Connect administration scripts into the PySpark installation:
$ cp tools/st*-connect*.sh $SPARK_HOME/sbin/
  • Check that the scripts are installed correctly:
$ ls -lFh $SPARK_HOME/sbin/*connect*.sh
-rwxr-xr-x  1 user  staff   1.5K Jun 28 16:54 $PY_LIBDIR/pyspark/sbin/start-connect-server.sh*
-rwxr-xr-x  1 user  staff   1.0K Jun 28 16:54 $PY_LIBDIR/pyspark/sbin/stop-connect-server.sh*
  • Add the following Shell aliases to start and stop Spark Connect server:
$ cat >> ~/.bash_aliases << _EOF

# Spark Connect
alias sparkconnectstart='unset SPARK_REMOTE; start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:\$SPARK_VERSION,io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"'
alias sparkconnectstop='stop-connect-server.sh'
# PySpark
alias pysparkdelta='pyspark --packages io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"'

_EOF
  • Re-read the Shell aliases:
$ . ~/.bash_aliases
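With the server started through the sparkconnectstart alias, the Delta Lake package and SQL extensions configured above are available to the notebooks. For instance, a session obtained through Spark Connect may then write and read a Delta table (a minimal sketch; the /tmp path is purely illustrative):
from pyspark.sql import SparkSession

# With SPARK_REMOTE set, the builder returns a session going through the
# Spark Connect server started above
spark = SparkSession.builder.getOrCreate()

df = spark.range(5)   # small sample DataFrame
df.write.format("delta").mode("overwrite").save("/tmp/delta-example")   # illustrative path
spark.read.format("delta").load("/tmp/delta-example").show()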

Install native Spark manually

  • This section is kept for reference only; it is normally not needed

  • Install Spark/PySpark manually, e.g. with Spark 3.4.1:

$ export SPARK_VERSION="3.4.1"
  wget https://dlcdn.apache.org/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz
  tar zxf spark-$SPARK_VERSION-bin-hadoop3.tgz && \
  mv spark-$SPARK_VERSION-bin-hadoop3 ~/ && \
  rm -f spark-$SPARK_VERSION-bin-hadoop3.tgz
  • Add the following in the Bash/Zsh init script:
$ cat >> ~/.bashrc << _EOF

# Spark
export SPARK_VERSION="${SPARK_VERSION}"
export SPARK_HOME="\$HOME/spark-\$SPARK_VERSION-bin-hadoop3"
export PATH="\$SPARK_HOME/bin:\$SPARK_HOME/sbin:\${PATH}"
export PYTHONPATH=\$(ZIPS=("\$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "\${ZIPS[*]}"):\$PYTHONPATH
export PYSPARK_PYTHON="\$(which python3)"
export PYSPARK_DRIVER_PYTHON='jupyter'
export PYSPARK_DRIVER_PYTHON_OPTS='lab --no-browser --port=8889'

_EOF
  • Re-read the Shell init scripts:
$ exec bash
  • Add the following Shell aliases to start and stop Spark Connect server:
$ cat >> ~/.bash_aliases << _EOF

# Spark Connect
alias sparkconnectstart='start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:${SPARK_VERSION}'
alias sparkconnectstop='stop-connect-server.sh'

_EOF
  • Re-read the Shell aliases:
$ . ~/.bash_aliases