Spark 2.2.1 Docker Image (2.2.1 is the version AWS EMR uses).
- Spark 2.2.1 (includes pyspark)
- Miniconda Python 3.6 installation with numpy, ipython, and pandas
- ipython set to be the default pyspark interpreter to provide convenient auto-completion
To launch the container and run a bash shell interactively, run
$ docker run --rm -it gvacaliuc/spark
Once you're in, you can run pyspark from the terminal, which will start ipython and provide a SparkContext and a SparkSession. You can also launch pyspark directly when you launch the container by running
$ docker run --rm -it gvacaliuc/spark pyspark
The image includes a Miniconda installation that is user-writable, so you can use conda or pip to install packages if need be:
$ conda install tqdm
# if you prefer
$ pip install tqdm
Bear in mind that if you launch the container with --rm, the container is removed when it exits, so any packages you install will need to be reinstalled the next time you launch it.
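One way to keep installed packages between sessions is to drop --rm and give the container a name, so it can be restarted instead of recreated. The following is only a sketch: the container name spark-dev and the helper function spark_shell are hypothetical and not part of the image.

```shell
# Hypothetical container name (an assumption, not defined by the image);
# override it by exporting SPARK_CONTAINER before calling spark_shell.
SPARK_CONTAINER="${SPARK_CONTAINER:-spark-dev}"

# Hypothetical helper: create the named container on first use and
# reattach to the same container (with its installed packages) afterwards.
spark_shell() {
    if docker container inspect "$SPARK_CONTAINER" >/dev/null 2>&1; then
        # The container already exists: start it and attach interactively.
        docker start -ai "$SPARK_CONTAINER"
    else
        # First run: no --rm, so the container survives after you exit.
        docker run -it --name "$SPARK_CONTAINER" gvacaliuc/spark
    fi
}
```

After sourcing this in your shell, calling spark_shell opens the same container each time. When you no longer need it, remove it with docker rm followed by the container name.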
To build the image yourself, clone this repository and run the following from its root directory:
$ docker build . -t gvacaliuc/spark