To start the Spark cluster:
docker-compose up
OR
docker-compose -f docker-compose-pip-req.yml up
To scale the workers:
docker-compose up --scale spark-worker=4
When you start the Spark image, you can adjust the configuration of the instance by passing one or more environment variables, either in the docker-compose file or on the docker run command line (both shown below). If you want to add a new environment variable in the docker-compose file:
spark:
  ...
  environment:
    - SPARK_MODE=master
  ...
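The same variable can also be passed on the docker run command line. A minimal sketch, assuming a pre-created Docker network (here called spark-net) that the master container is also attached to:

# spark-net is an assumed, user-defined network shared with the master container
docker run -d --name spark-worker \
  --network spark-net \
  -e SPARK_MODE=worker \
  -e SPARK_MASTER_URL=spark://spark-master:7077 \
  bitnami/spark

Available environment variables: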
- SPARK_MODE: Spark cluster mode to start. Valid values: master, worker. Default: master
- SPARK_MASTER_URL: URL where the worker can find the master. Only needed when Spark mode is worker. Default: spark://spark-master:7077
- SPARK_RPC_AUTHENTICATION_ENABLED: Enable RPC authentication. Default: no
- SPARK_RPC_AUTHENTICATION_SECRET: The secret key used for RPC authentication. No defaults.
- SPARK_RPC_ENCRYPTION_ENABLED: Enable RPC encryption. Default: no
- SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED: Enable local storage encryption. Default: no
- SPARK_SSL_ENABLED: Enable SSL configuration. Default: no
- SPARK_SSL_KEY_PASSWORD: The password to the private key in the key store. No defaults.
- SPARK_SSL_KEYSTORE_PASSWORD: The password for the key store. No defaults.
- SPARK_SSL_TRUSTSTORE_PASSWORD: The password for the trust store. No defaults.
- SPARK_SSL_NEED_CLIENT_AUTH: Whether to require client authentication. Default: yes
- SPARK_SSL_PROTOCOL: TLS protocol to use. Default: TLSv1.2
- SPARK_DAEMON_USER: Spark system user when the container is started as root. Default: spark
- SPARK_DAEMON_GROUP: Spark system group when the container is started as root. Default: spark
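As an illustration of the security-related variables above, RPC authentication can be enabled by setting the same secret on the master and on every worker. A sketch for the docker-compose file (the secret value is a placeholder you must replace):

spark:
  image: bitnami/spark
  environment:
    - SPARK_MODE=master
    - SPARK_RPC_AUTHENTICATION_ENABLED=yes
    - SPARK_RPC_AUTHENTICATION_SECRET=replace-with-your-secret  # placeholder
spark-worker:
  image: bitnami/spark
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
    - SPARK_RPC_AUTHENTICATION_ENABLED=yes
    - SPARK_RPC_AUTHENTICATION_SECRET=replace-with-your-secret  # placeholder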
More environment variables natively supported by Spark can be found in the official documentation. For example, you can still use SPARK_WORKER_CORES or SPARK_WORKER_MEMORY to configure the number of cores and the amount of memory to be used by a worker machine, as sketched below.
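For instance, a worker service could be sized in the docker-compose file like this (the core and memory values are illustrative):

spark-worker:
  image: bitnami/spark
  environment:
    - SPARK_MODE=worker
    - SPARK_MASTER_URL=spark://spark-master:7077
    - SPARK_WORKER_CORES=2     # illustrative value
    - SPARK_WORKER_MEMORY=2g   # illustrative value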
By default, this container bundles a generic set of JAR files, but the default image can be extended with as many JARs as needed for your specific use case. For instance, the following Dockerfile adds aws-java-sdk-bundle-1.11.704.jar:
FROM bitnami/spark
RUN curl https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.704/aws-java-sdk-bundle-1.11.704.jar --output /opt/bitnami/spark/jars/aws-java-sdk-bundle-1.11.704.jar
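You can then build the extended image and use it in place of the stock one, for example (the image tag is illustrative):

$ docker build -t my-spark .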
Similarly to the previous section, you may want to use a different version of the Hadoop JARs. Go to https://spark.apache.org/downloads.html, copy the download URL that bundles the Hadoop version you want and matches the Spark version of the container, and extend the Bitnami container image as below:
FROM bitnami/spark:3.0.0
USER root
RUN rm -r /opt/bitnami/spark/jars && \
    curl --location http://mirror.cc.columbia.edu/pub/software/apache/spark/spark-3.0.0/spark-3.0.0-bin-hadoop2.7.tgz | \
    tar --extract --gzip --strip=1 --directory /opt/bitnami/spark/ spark-3.0.0-bin-hadoop2.7/jars/
USER 1001
You can check the Hadoop version by running the following commands in the new container image:
$ pyspark
>>> sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion()
'2.7.4'
The Bitnami Spark Docker image sends the container logs to stdout. To view the logs:
$ docker logs spark
or using Docker Compose:
$ docker-compose logs spark
You can configure the container's logging driver using the --log-driver option if you wish to consume the container logs differently. In the default configuration Docker uses the json-file driver.
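For example, switching the service to the syslog driver in the docker-compose file might look like this (a sketch; the syslog address is a placeholder for your own endpoint):

spark:
  ...
  logging:
    driver: syslog
    options:
      syslog-address: "tcp://192.168.0.42:514"  # placeholder address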
For more information, see bitnami/bitnami-docker-spark.