all-in-one-docker-bigdataops

all-in-one-docker-bigdataops is a comprehensive Docker Compose environment that simplifies Big Data operations by bundling Hadoop, Spark, Hive, Hue, and Airflow into a single ready-to-run stack. It ships with example workflows, a quick setup, and easy customization, making it ideal for learning, development, and testing in Big DataOps.

Key Features:

  • Fully Integrated Big Data Tools: Pre-configured and ready to use, including Hadoop for distributed storage and processing, Spark for big data analytics, Hive for SQL-based querying, Hue for a web-based UI, and Airflow for orchestrating complex data pipelines.
  • Quick Setup: Get started with a single make start-all command. No more dealing with tedious configurations or dependency issues.
  • Versatile Usage: Ideal for learning, development, testing, and small-scale production environments.
  • Example Workflows: Includes sample jobs and scripts to demonstrate the capabilities of the stack, helping you kickstart your big data projects (a minimal PySpark sketch of such a job follows this list).
  • Customizable: Easily extend or modify the stack to fit your specific use cases.
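
To illustrate the kind of sample job the stack is built for, here is a minimal PySpark word-count sketch. It targets the Spark master at spark://spark-master:7077 (the same endpoint used for the Airflow connection further down); the HDFS port and input path are assumptions and may differ in your deployment.

  # Minimal word-count sketch; the HDFS port and file path below are assumptions.
  from pyspark.sql import SparkSession
  from pyspark.sql.functions import col, explode, split

  spark = (
      SparkSession.builder
      .appName("example-wordcount")
      .master("spark://spark-master:7077")  # Spark master exposed by the compose stack
      .getOrCreate()
  )

  # Hypothetical HDFS location served by the namenode container; adjust port/path to your setup.
  lines = spark.read.text("hdfs://namenode:9000/user/root/input.txt")

  counts = (
      lines.select(explode(split(col("value"), r"\s+")).alias("word"))
      .groupBy("word")
      .count()
      .orderBy(col("count").desc())
  )

  counts.show(20)
  spark.stop()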

Why Use This Repository?

  • Ease of Use: Simplifies the setup process, so you can focus on developing and testing your big data solutions.
  • All-in-One Solution: No need to piece together different tools. Everything is bundled in a single, cohesive package.
  • Community-Driven: Designed to be accessible, well-documented, and easy to discover, with the goal of becoming a go-to resource for Big DataOps on Docker.

How to Use?

  • Ensure you have Docker and make installed.
  • You can find useful commands in the Makefile.
  • When using Airflow, remember to add the following connections (a CLI sketch and an example DAG that uses them appear after this list):
    • conn_id 'spark_docker' --conn-type spark --conn-host spark://spark-master --conn-port 7077
    • conn_id 'ssh_hadoop' --conn-type ssh --conn-host namenode --conn-port 22 --conn-login root --conn-password root123
  • To deploy the full infrastructure, just run:
make start-all
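
The two connections listed above can be registered from the command line; this is a sketch assuming the Airflow 2.x airflow connections add command, run from a shell inside the Airflow container:

  airflow connections add 'spark_docker' \
      --conn-type spark \
      --conn-host 'spark://spark-master' \
      --conn-port 7077

  airflow connections add 'ssh_hadoop' \
      --conn-type ssh \
      --conn-host namenode \
      --conn-port 22 \
      --conn-login root \
      --conn-password 'root123'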
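
As a rough sketch of how those connections might be consumed in a pipeline, here is a hypothetical Airflow DAG using the SSH and Spark provider operators. It assumes a recent Airflow 2.x with the apache-airflow-providers-ssh and apache-airflow-providers-apache-spark packages installed; the DAG id, schedule, and application path are placeholders.

  # Hypothetical DAG: lists the HDFS root over SSH, then submits a Spark job.
  from datetime import datetime

  from airflow import DAG
  from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
  from airflow.providers.ssh.operators.ssh import SSHOperator

  with DAG(
      dag_id="example_bigdataops_pipeline",  # hypothetical DAG id
      start_date=datetime(2024, 1, 1),
      schedule=None,
      catchup=False,
  ):
      # Uses the 'ssh_hadoop' connection to run a command on the namenode container.
      list_hdfs = SSHOperator(
          task_id="list_hdfs_root",
          ssh_conn_id="ssh_hadoop",
          command="hdfs dfs -ls /",
      )

      # Uses the 'spark_docker' connection to submit a PySpark application (hypothetical path).
      run_wordcount = SparkSubmitOperator(
          task_id="run_wordcount",
          conn_id="spark_docker",
          application="/opt/airflow/dags/scripts/wordcount.py",
      )

      list_hdfs >> run_wordcount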