kafka_storm_pipeline

Diagram

(Pipeline architecture diagram)

Requirements

This project requires Ansible, Java 8, Apache Maven, and an AWS account. In addition, although not required, having Docker, redis-cli, and Apache Kafka installed locally is useful for exploring various parts of this project.

Tools/Services Used

  • Java
  • Ansible
  • Docker
  • Apache Zookeeper
  • Apache Kafka
  • Apache Storm
  • AWS
    • EC2
    • RDS
    • ElastiCache (Redis)
    • S3

Short Description

A real-time, dockerized Kafka event processor pipeline (utilizing an Apache Storm topology). The project is running in AWS with automated infrastructure deployment and execution, using Ansible.

Process Description

The events for this pipeline (in the format CustID,Balance) are generated by KafkaStormProducer.jar (built from the KafkaStormProducer module by Maven). KafkaStormProducer.jar publishes them to a Kafka topic from the local machine. The topic acts as a source (a Spout in Storm terms) for the StormLookup topology, which, for each event (a tuple in Storm terms), does the following (a wiring sketch follows this list):
  • LookupBolt
    • Extracts the CustID from the tuple
    • Looks up the Redis cluster and gets the SSN for that CustID
    • Passes the SSN along with the Balance to both RDSInserterBolt and S3WriterBolt
  • RDSInserterBolt
    • For each tuple, performs an upsert into the RDS PostgreSQL Balances table
  • S3WriterBolt
    • Accumulates the received tuples until either a specified count or a specified time delta is reached (whichever happens first)
    • Generates a file from the accumulated tuples and writes it to S3
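
The README does not include the topology wiring itself; the following is a minimal sketch of how the spout and bolts described above could be connected, assuming the storm-kafka-client spout and that the project's LookupBolt, RDSInserterBolt, and S3WriterBolt classes are on the classpath. The broker address, topic name, groupings, and parallelism hints are placeholders and may differ from the actual StormLookup topology.

```java
// Sketch of the StormLookup topology wiring (not the project's actual code).
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class StormLookupTopology {
    public static void main(String[] args) throws Exception {
        // The Kafka topic acts as the spout; broker and topic names are placeholders.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka-broker:9092", "cust-balance-topic").build();

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 1);

        // LookupBolt resolves the SSN for each CustID via the Redis (ElastiCache) cluster.
        builder.setBolt("lookup-bolt", new LookupBolt(), 2)
               .shuffleGrouping("kafka-spout");

        // Both downstream bolts subscribe to LookupBolt, so every (SSN, Balance)
        // tuple reaches the RDS upsert path and the S3 batching path.
        builder.setBolt("rds-inserter-bolt", new RDSInserterBolt(), 2)
               .shuffleGrouping("lookup-bolt");
        builder.setBolt("s3-writer-bolt", new S3WriterBolt(), 1)
               .shuffleGrouping("lookup-bolt");

        StormSubmitter.submitTopology("StormLookup", new Config(), builder.createTopology());
    }
}
```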

Execution

To execute, issue ansible-playbook infrastructure.yml using your own AWS user. Once the Ansible run is complete, run one or more instances of KafkaStormProducer.jar (built from the KafkaStormProducer module by Maven). At this point the pipeline will start populating RDS and writing files into S3.

Execution Process Description

  • ansible-playbook infrastructure.yml
    • Creates a dedicated VPC for this project
    • Creates two subnets inside that VPC, sets up an Internet Gateway and defines all the necessary routes
    • Creates a SecurityGroup to be assigned to different resources throughout this project
    • Spins up EC2 instances and runs Zookeeper, Kafka and Storm Docker containers on them
    • Creates an ElastiCache (Redis) cluster and populates it with lookup data
    • Creates an RDS Postgres instance to be populated by the pipeline
    • Deploys the storm topology
  • KafkaStormProducer
    • Produces events to the Kafka topic to be consumed by the pipeline (a minimal producer sketch follows this list)
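
For illustration, here is a minimal sketch of a producer that publishes CustID,Balance events, assuming the standard Kafka producer client. The broker address, topic name, and generated values are placeholders; the actual KafkaStormProducer module may be structured differently.

```java
// Sketch of a CustID,Balance event producer (not the project's actual code).
import java.util.Properties;
import java.util.Random;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CustBalanceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        Random random = new Random();
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                String custId = String.valueOf(1000 + random.nextInt(9000));
                String balance = String.valueOf(random.nextInt(100_000));
                // The pipeline expects events in the format: CustID,Balance
                producer.send(new ProducerRecord<>("cust-balance-topic",
                        custId + "," + balance));
            }
        }
    }
}
```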

To Do

  • Make KafkaStormProducer parameter-based
  • Split CustID_Balance into two separate files
  • Create a config.yml.tplt and push it instead of the original Ansible config file
  • Implement time/count-based batching for S3WriterBolt
  • Add the Storm logviewer to the supervisor Docker containers
  • Implement production-level logging and exception handling

Observations

  • If possible, always use Terraform for infrastructure creation. Ansible is good for working with already provisioned resources (e.g. spinning up a Kafka container on an EC2 instance), but Terraform is much more intuitive for creating the EC2 instance itself.

Warnings

  • The current configuration of this project uses AWS services that are beyond the Free Tier!