Dockerized Hadoop/Minio/Hive/Presto stack

Big Data Stack

Big data stack running in pseudo-distributed mode with the following components:

  • Hadoop 2.8.5
  • Minio RELEASE.2019-10-12T01-39-57Z
  • Hive 2.3.6
  • Presto 326
  • Superset 0.35.1
  • Hue 4.5.0

For more details see the following post.

Quick start

Clone the repository and create a .env file based on sample.env, making sure DATADIR points to a suitable directory (it is used as persistent storage for all containers). Then bring up the base stack:

docker-compose up -d
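
The .env step above can be sketched as a small helper. This is only an illustration: the function name make_env is hypothetical, and it assumes sample.env holds plain KEY=value lines, with DATADIR either present or missing. Run it from the repository root before bringing up the stack.

```shell
# Hypothetical helper (not part of the repository): create .env from
# sample.env and point DATADIR at the given directory.
# Assumption: sample.env contains plain KEY=value lines.
make_env() {
    datadir=$1
    [ -n "$datadir" ] || { echo "usage: make_env DATADIR" >&2; return 1; }

    # Start from the sample file shipped with the repository.
    cp sample.env .env || return 1

    if grep -q '^DATADIR=' .env; then
        # Rewrite the existing DATADIR line. A temp file is used instead of
        # "sed -i" for portability. Paths containing "|" or "&" would need
        # escaping before being passed to sed.
        sed "s|^DATADIR=.*|DATADIR=${datadir}|" .env > .env.tmp || return 1
        mv .env.tmp .env
    else
        # No DATADIR line in the sample: append one.
        printf 'DATADIR=%s\n' "$datadir" >> .env
    fi

    # Make sure the persistent storage directory exists.
    mkdir -p "$datadir"
}
```

For example, make_env "$HOME/bigdata-data" would leave every other setting from sample.env untouched and only change DATADIR.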

If you also want to start Superset and Hue, then run:

docker-compose -f superset/docker-compose.yml up -d
docker-compose -f hue/docker-compose.yml up -d

and initialize:

./scripts/init-hue.sh
./scripts/init-superset.sh
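
The whole start-up sequence above can be wrapped in one function. This is a sketch rather than something the repository ships: the names run and start_stack and the DRY_RUN variable are hypothetical, and the commands are simply the ones listed above in the same order. It does not wait for the services to become ready between steps, so the init scripts may need to be re-run if the containers are still starting.

```shell
# Hypothetical wrapper (not part of the repository).
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# start_stack        -> base stack only
# start_stack yes    -> base stack plus Superset and Hue, then initialization
start_stack() {
    with_ui=${1:-no}

    run docker-compose up -d || return 1

    if [ "$with_ui" = yes ]; then
        run docker-compose -f superset/docker-compose.yml up -d || return 1
        run docker-compose -f hue/docker-compose.yml up -d || return 1
        run ./scripts/init-hue.sh || return 1
        run ./scripts/init-superset.sh || return 1
    fi
}
```

Running DRY_RUN=1 start_stack yes from the repository root is a cheap way to review the commands before executing them for real.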

The stack should now be up and running and the following services available:

Contents

The stack uses updated/modified Docker images from Big Data Europe, shawnzhu, and Cloudera. See the Dockerfiles for details.

All needed images are on Docker Hub, but if you want to build the updated/modified images yourself, just run build-local.sh in the different sub-directories.
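
Building every image in one go could look like the loop below. It is a sketch under two assumptions that the text above does not confirm: each build-local.sh sits exactly one level below the repository root, and each is executable and expects to be run from its own directory. The function name build_all is hypothetical.

```shell
# Hypothetical helper (not part of the repository): run every
# build-local.sh found one level below the given root directory.
build_all() {
    root=${1:-.}
    for script in "$root"/*/build-local.sh; do
        # Skip the unexpanded pattern when no sub-directory has a script.
        [ -f "$script" ] || continue
        dir=$(dirname "$script")
        echo "Building in $dir"
        # Run in a subshell so the caller's working directory is unchanged;
        # stop at the first failing build.
        (cd "$dir" && ./build-local.sh) || return 1
    done
}
```

Calling build_all with no argument from the repository root processes the sub-directories in alphabetical order.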

Changes compared to original images:

  • Hadoop updated to version 2.8.5
  • Hive updated to version 2.3.6
  • S3 support added
  • Presto updated to version 326
  • Presto JDBC driver added to Hue

The scripts directory contains some helper scripts:

  • beeline.sh: Launch Beeline (Hive CLI) in Hive container
  • hadoop-client.sh: Start container with Hadoop utilities (host filesystem mounted as /host). Useful for moving files to HDFS.
  • init-hue.sh: Create the admin home folder in HDFS to avoid an error in the Hue File Browser.
  • init-superset.sh: Initialize the Superset database and add Presto as a data source.
  • presto-cli.sh: Launch Presto CLI (downloads jar if needed)
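
The "downloads jar if needed" behaviour of presto-cli.sh can be illustrated with the pattern below. This is not the actual script: the function name ensure_presto_cli, the default file name, and the Maven Central URL layout are assumptions made for the sketch, and the real script may fetch the jar from somewhere else.

```shell
# Hypothetical sketch of a download-if-missing step (not the real
# presto-cli.sh). Assumed URL layout: Maven Central, io.prestosql group.
ensure_presto_cli() {
    jar=${1:-presto-cli.jar}
    version=${2:-326}
    url="https://repo1.maven.org/maven2/io/prestosql/presto-cli/${version}/presto-cli-${version}-executable.jar"

    # Nothing to do when the jar is already present.
    [ -f "$jar" ] && return 0

    # Download to a temporary name first so an interrupted transfer
    # never leaves a truncated jar behind.
    if curl -fsSL -o "$jar.part" "$url"; then
        mv "$jar.part" "$jar" && chmod +x "$jar"
    else
        rm -f "$jar.part"
        return 1
    fi
}
```

Because the check is on the file's existence only, deleting the jar is enough to force a fresh download on the next call.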