/data-environment-poc

A simple proof of concept using Hadoop, Hive, Presto, PostgreSQL and Sqoop.

Data Environment cluster

Simple data infrastructure environment built with docker and docker-compose.

The environment is composed of:

Building

Just run make build and the project will take care of building the required images.

[Building and] Running

Running make builds everything necessary and uses docker-compose to deploy the environment.

Proof of Concept

  • Start the environment

  • The postgres database is initialized with two tables that represent a soccer tournament, one for Teams and another for Matches. See setupdb.sh

  • During deployment, one of the hadoop datanodes will use sqoop to import the Matches table into hive. See the hadoop-datanode startup script

  • Run the sbaldrich/presto-consumer-example image, attaching it to the network that compose created. The URL of the presto coordinator is passed as an argument to the container:

docker run -it --network data-environment_default sbaldrich/presto-consumer-example jdbc:presto://coordinator:8080

  • The sbaldrich/presto-consumer-example image contains a simple Java application that runs a SQL script joining data from two different data sources (postgres and hive) to produce a result.
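
The steps above can be sketched as a small shell script. This is only a sketch under a few assumptions: the run helper and the DRY_RUN, NETWORK and COORDINATOR_URL variables are illustrative names that are not part of this project, and the network name is the one used in this README, which may differ if compose picked another project name for your checkout. By default the script only prints the commands; set DRY_RUN=0 to actually execute them.

```shell
#!/bin/sh
# Sketch of the proof-of-concept walkthrough (hypothetical helper script,
# not shipped with this repository).

# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it only when DRY_RUN=0.
run() {
    printf '%s\n' "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Assumption: compose named the network as in this README. Override NETWORK
# if `docker network ls` shows a different name on your machine.
NETWORK="${NETWORK:-data-environment_default}"
COORDINATOR_URL="${COORDINATOR_URL:-jdbc:presto://coordinator:8080}"

# 1. Build the images and deploy the environment with docker-compose.
run make

# 2. Confirm that the compose network exists before attaching to it.
run docker network inspect "$NETWORK"

# 3. Run the consumer, passing the presto coordinator URL as its argument.
run docker run -it --network "$NETWORK" sbaldrich/presto-consumer-example "$COORDINATOR_URL"
```

Checking the network first makes a wrong network name fail early with a clear message, instead of surfacing later as a connection error inside the consumer.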

Acknowledgements

Thanks to Lewuathe, the owner of the docker-presto-cluster code on which this project is heavily based.