NOTE: This repo is purely for experimentation and is not intended to provide production-grade containers.
This repo provides a set of files to quickly bootstrap a fully Dockerized environment for doing Data Science on top of distributed Big Data components such as Apache Spark.
All Docker images can be pulled from Docker Hub.
- Docker installed & configured for your system (Windows / macOS / Linux).
- Docker compose installed.
- A recent version of `bash`. This is needed for the aliases; on Windows, Cygwin or another alternative may be used.
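As a quick sanity check, the prerequisites can be verified from a shell; the exact version numbers do not matter much, as long as the tools are found:

```bash
# Print the versions of the required tools; a "command not found"
# error here means the corresponding prerequisite is missing.
docker --version
docker-compose --version
bash --version | head -n 1
```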
The commands below start a full stack with Spark, Hadoop, Zeppelin, Jupyter, etc.:
```bash
git clone https://github.com/bwv988/datascience-playground.git
cd datascience-playground
bin/playground.sh start

# To stop the stack again:
bin/playground.sh stop
```
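After starting, it may be worth confirming that the containers actually came up. A possible check (the container name `zeppelin` is taken from the troubleshooting examples further below):

```bash
# List the names of all running containers.
docker ps --format '{{.Names}}'

# Exit non-zero if the Zeppelin container is not among them.
docker ps --format '{{.Names}}' | grep -q '^zeppelin$' && echo "Zeppelin is up"
```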
- Apache Spark + PySpark + R + other libs
- Apache Hadoop
- Apache Zeppelin
- Apache Hive
- Apache Zookeeper
- Jupyter
FIXME: Add paragraph to describe how to dynamically add interpreter settings.
Investigate issues by running a shell in a container, e.g.:

```bash
docker exec -it zeppelin bash
```

Inspect individual container logs by using the container name:

```bash
docker logs zeppelin
```
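To watch a container's output live rather than dumping it all at once, `docker logs` also supports following and tail limits:

```bash
# Stream the last 100 log lines and keep following new output (Ctrl-C to stop).
docker logs -f --tail 100 zeppelin
```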
```bash
cd datascience-playground
bin/playground.sh spark start
docker ps
bin/playground.sh spark stop
```
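Once the Spark services are up, the master's web UI can be probed from the host. The port below assumes the stack publishes Spark's default master UI port, 8080; adjust it if the compose file maps a different one:

```bash
# Print the HTTP status code of the Spark master UI (expect 200 when healthy).
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080
```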
For the subsequent examples I'll be making use of the aliases provided.
This is handy for running some quick tests in Scala.
```bash
# First, source the alias definitions.
source bin/aliases.sh

# Create the Spark logs dir in HDFS, or we get an exception.
hadoop fs -mkdir /spark-logs

spark-shell
```
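A quick smoke test can also be run non-interactively by piping a Scala expression into the shell; this assumes the `spark-shell` alias forwards stdin into the container:

```bash
# Sum the numbers 1..100 on the cluster; 5050.0 should appear in the shell output.
echo 'println(sc.parallelize(1 to 100).sum)' | spark-shell
```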
Same as above, only for PySpark:

```bash
hadoop fs -mkdir /spark-logs
pyspark
```
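The same kind of smoke test works for PySpark, under the same stdin-forwarding assumption:

```bash
# Count 100 elements on the cluster; 100 should appear in the output.
echo 'print(sc.parallelize(range(100)).count())' | pyspark
```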
This can be achieved via the host volume that the Docker container mounts:

```bash
hadoop fs -ls /
hadoop fs -mkdir /tmp
hadoop fs -ls /
```
```bash
echo "Hello world" > test.txt

# First, move the file into the shared folder.
sudo mv test.txt ~/ds-playground/workdir

# From there, we can load the data into Hadoop.
hadoop fs -put /workdir/test.txt /tmp/test.txt
hadoop fs -ls /tmp
```
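To confirm the upload worked, the file can be read back straight from HDFS:

```bash
# Print the file contents from HDFS; this should echo "Hello world".
hadoop fs -cat /tmp/test.txt
```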
Here is how to access Beeline and run SQL commands through the Docker container:

```bash
beeline
```

```
beeline> !connect jdbc:hive2://hive:10000 hiveuser hiveuser
0: jdbc:hive2://hive:10000> show tables;
```
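The same round trip can be done non-interactively via Beeline's `-u`/`-n`/`-p`/`-e` options; the table name `playground_test` below is made up for illustration:

```bash
# Connect, create a throwaway table, and list tables in one shot.
beeline -u jdbc:hive2://hive:10000 -n hiveuser -p hiveuser \
  -e "CREATE TABLE IF NOT EXISTS playground_test (id INT); SHOW TABLES;"
```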
```bash
docker exec -it postgres psql -U postgres
```

```
postgres=# \c metastore
You are now connected to database "metastore" as user "postgres".
metastore=# select * from "VERSION";
 VER_ID | SCHEMA_VERSION |         VERSION_COMMENT
--------+----------------+----------------------------------
      1 | 1.2.0          | Set by MetaStore root@172.18.0.7
(1 row)

metastore=#
```
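The rest of the metastore schema can be inspected the same way; a non-interactive variant using psql's `-d` and `-c` options:

```bash
# List all tables in the metastore database without an interactive session.
docker exec postgres psql -U postgres -d metastore -c '\dt'
```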
TBD
TBD
TBD
- Big Data Europe main GitHub repo: https://github.com/big-data-europe
- Work done by https://github.com/earthquakesan