hadoop-docker-cluster

Hadoop based Distributed System Docker Cluster

Installed Components

  1. Airflow 2.7.0
  2. Hadoop 2.8.0
  3. Hive 2.3.2 (Scala 2.11.8, Python 2.7.9)
  4. Spark 2.1.2
  5. Kafka 7.0.0-ccs

1. Usage

  1. Start the cluster

    run_cluster.sh

  2. Stop the cluster

    stop_cluster.sh

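The two scripts above can be wrapped in a single entry point. The following is only a sketch: `cluster_ctl` and the `CLUSTER_DIR` variable are hypothetical names that are not part of this repository, and the sketch assumes both scripts are executable.

```shell
# Sketch: one entry point for run_cluster.sh and stop_cluster.sh.
# CLUSTER_DIR is a hypothetical variable pointing at the directory that
# holds both scripts; it defaults to the current directory.
cluster_ctl() {
  cluster_dir="${CLUSTER_DIR:-.}"
  case "$1" in
    start)
      "$cluster_dir/run_cluster.sh"
      ;;
    stop)
      "$cluster_dir/stop_cluster.sh"
      ;;
    restart)
      # Only start again if the stop script succeeded.
      "$cluster_dir/stop_cluster.sh" && "$cluster_dir/run_cluster.sh"
      ;;
    *)
      echo "usage: cluster_ctl start|stop|restart" >&2
      return 2
      ;;
  esac
}
```

For example, `CLUSTER_DIR=/path/to/hadoop-docker-cluster cluster_ctl restart` would stop the cluster and then start it again.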
2. Interfaces (port forwarding)

  1. Airflow
  2. Hadoop
  3. Hive
    • Server port: 10000
    • Metastore (PostgreSQL) port: 9083
  4. Spark
  5. Kafka
    • Kafka port: 9092
    • Kafdrop: http://localhost:9000
    • LISTENER_DOCKER_INTERNAL port: 19092
    • LISTENER_DOCKER_EXTERNAL port: 9092
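
Which Kafka listener a client should use depends on where the client runs. The sketch below illustrates that choice. The function name `kafka_bootstrap` is hypothetical, `kafka-server` is the hostname used in the Airflow connection examples in this README, and `localhost` for host-side clients is an assumption based on the port forwarding listed above.

```shell
# Sketch: print the Kafka bootstrap address for a client.
#   internal: client runs inside the Docker network (e.g. Airflow tasks)
#   external: client runs on the host machine (assumes forwarding to localhost)
kafka_bootstrap() {
  case "$1" in
    internal)
      echo "kafka-server:19092"   # LISTENER_DOCKER_INTERNAL
      ;;
    external)
      echo "localhost:9092"       # LISTENER_DOCKER_EXTERNAL
      ;;
    *)
      echo "usage: kafka_bootstrap internal|external" >&2
      return 1
      ;;
  esac
}
```

A consumer started on the host would then connect to `$(kafka_bootstrap external)`, while anything running in a container would use `$(kafka_bootstrap internal)`.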

3. Airflow connections (example)

  1. Hive
    1. hive_cli_conn
      • Connection Id: hive_cli_conn
      • Connection Type: Hive Client Wrapper
      • Host: localhost (host IP)
      • Port: 10000
      • Extra: {"use_beeline": true}
  2. Spark
    1. spark_conn
      • Connection Id: spark_conn
      • Connection Type: Spark
      • Host: local[*]
  3. Kafka
    1. kafka_default
      • Connection Id: kafka_default
      • Connection Type: Apache Kafka
      • Config Dict
        {
          "bootstrap.servers": "kafka-server:19092",
          "group.id": "group_1",
          "security.protocol": "PLAINTEXT",
          "auto.offset.reset": "beginning"
        }
        
    2. kafka_listener
      • Connection Id: kafka_listener
      • Connection Type: Apache Kafka
      • Config Dict
        {
          "bootstrap.servers": "kafka-server:19092",
          "group.id": "group_2",
          "security.protocol": "PLAINTEXT",
          "auto.offset.reset": "beginning"
        }
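
The connections above can also be registered from the command line instead of the Airflow UI. This is a minimal sketch, not a script shipped with the repository. It assumes the `airflow` CLI is available (for example inside the Airflow container), that the Hive, Spark and Kafka provider packages are installed, that the connection type identifiers are `hive_cli`, `spark` and `kafka`, and that the Kafka provider reads its Config Dict from the connection's Extra field. The function names `kafka_config` and `add_example_connections` are hypothetical.

```shell
# Sketch: register the example connections with the Airflow CLI.

# Build the Kafka Config Dict as JSON for the given consumer group id.
kafka_config() {
  printf '{"bootstrap.servers": "kafka-server:19092", "group.id": "%s", "security.protocol": "PLAINTEXT", "auto.offset.reset": "beginning"}\n' "$1"
}

# Add the four connections listed in this section.
# Connection type identifiers are assumptions (see the paragraph above).
add_example_connections() {
  airflow connections add hive_cli_conn \
    --conn-type hive_cli \
    --conn-host localhost \
    --conn-port 10000 \
    --conn-extra '{"use_beeline": true}'

  airflow connections add spark_conn \
    --conn-type spark \
    --conn-host 'local[*]'

  airflow connections add kafka_default \
    --conn-type kafka \
    --conn-extra "$(kafka_config group_1)"

  airflow connections add kafka_listener \
    --conn-type kafka \
    --conn-extra "$(kafka_config group_2)"
}
```

Calling `add_example_connections` once after the cluster is up should be enough. `airflow connections list` can then be used to confirm the result.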
        

4. Notice

