spark-kafka-consumer

Spark application that consumes Kafka events generated by a Python producer.

Architecture

(Architecture diagram)

How to run

  1. Clone the project
git clone https://github.com/cordon-thiago/spark-kafka-consumer
  2. Set the KAFKA_ADVERTISED_HOST_NAME variable inside docker-compose.yml to your Docker host IP. Note: do not use localhost or 127.0.0.1 as the host IP if you want to run multiple brokers. For more information about the variables you can configure for the Kafka Docker image, refer to the wurstmeister/kafka-docker repository.

  3. Start the Docker containers with Compose.

cd spark-kafka-consumer/docker
docker-compose up -d

It will start the following services:

  • zookeeper:
    • Image: wurstmeister/zookeeper
    • Port: 2181
  • kafka:
    • Image: wurstmeister/kafka:2.11-1.1.1
    • Port: 9092
  • spark:
    • Image: jupyter/all-spark-notebook
    • Port: 8888
  4. Get the Jupyter Notebook URL and token by accessing the Spark container.

Access the container's bash shell:

docker exec -it docker_spark_1 bash

Then get the notebook URL and copy and paste it into your browser.

jupyter notebook list
  5. Run the event-producer.ipynb notebook to start producing events from changes in Wikipedia pages to a Kafka topic. More information about the Wikipedia events is available in the Wikimedia EventStreams documentation.
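
For orientation, a minimal sketch of what such a producer could look like is shown below. It assumes the kafka-python and sseclient packages; the broker address, topic name, and stream URL are illustrative and may differ from the notebook.

# Hypothetical producer sketch: read Wikipedia change events from the
# Wikimedia EventStreams SSE feed and forward each JSON event to a Kafka topic.
import json
from kafka import KafkaProducer
from sseclient import SSEClient

producer = KafkaProducer(
    bootstrap_servers="<KAFKA_ADVERTISED_HOST_NAME>:9092",  # your Docker host IP
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for event in SSEClient("https://stream.wikimedia.org/v2/stream/recentchange"):
    if event.event == "message" and event.data:
        producer.send("wikipedia-events", json.loads(event.data))  # topic name is an assumption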

  6. Run the event-consumer-spark.ipynb notebook to start consuming events from the Kafka topic and write them to Parquet files.
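
A minimal sketch of the consuming side, assuming Spark Structured Streaming with the spark-sql-kafka connector on the classpath; the broker address, topic name, and output paths are illustrative.

# Hypothetical consumer sketch: read the Kafka topic with Structured
# Streaming and write the raw JSON values to Parquet files.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("event-consumer-spark").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<KAFKA_ADVERTISED_HOST_NAME>:9092")
    .option("subscribe", "wikipedia-events")               # topic name is an assumption
    .load()
    .select(col("value").cast("string").alias("json"))
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "/home/jovyan/work/events")            # output directory is an assumption
    .option("checkpointLocation", "/home/jovyan/work/checkpoint")
    .start()
)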

  7. Run the data-visualization.ipynb notebook to read the Parquet files as a stream and visualize the top 10 users with the most edits.
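
A minimal sketch of the visualization step, assuming the consumer wrote a single string column named json as in the sketch above; the schema, JSON field name, and paths are illustrative.

# Hypothetical visualization sketch: read the Parquet files as a stream,
# extract the user from each event, count edits per user, and query the
# top 10 from an in-memory table.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("data-visualization").getOrCreate()

edits = (
    spark.readStream
    .schema(StructType([StructField("json", StringType())]))  # streaming Parquet reads need an explicit schema
    .parquet("/home/jovyan/work/events")
    .select(get_json_object(col("json"), "$.user").alias("user"))
    .groupBy("user")
    .count()
)

query = (
    edits.writeStream
    .outputMode("complete")                                # keep the full running aggregate so it can be re-sorted
    .format("memory")
    .queryName("edits_per_user")
    .start()
)

# In another notebook cell, inspect the top 10 editors so far:
spark.sql("SELECT `user`, `count` FROM edits_per_user ORDER BY `count` DESC LIMIT 10").show()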