- Java
- Ansible
- Docker
- Apache Zookeeper
- Apache Kafka
- Apache Storm
- AWS
  - EC2
  - RDS
  - ElastiCache (Redis)
  - S3
- LookupBolt (sketches of the three bolts follow this list)
  - Extracts the CustID from the tuple
  - Queries the Redis cluster for the SSN mapped to that CustID
  - Passes the SSN along with the Balance to both RDSInserterBolt and S3WriterBolt
- RDSInserterBolt
  - Upserts each tuple into the RDS PostgreSQL Balances table
- S3WriterBolt
  - Accumulates the received tuples until either a specified count or a specified time delta is reached (whichever happens first)
  - Generates a file from the accumulated batch and writes it to S3
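
A minimal sketch of the LookupBolt, assuming the Jedis client, a Redis keyspace of CustID → SSN, and input fields named `CustID` and `Balance` (the endpoint and field names are placeholders, not necessarily what this repo uses):

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import redis.clients.jedis.Jedis;

public class LookupBolt extends BaseRichBolt {
    private OutputCollector collector;
    private transient Jedis jedis; // not serializable, so created in prepare()

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Placeholder ElastiCache endpoint; in practice this would come from config
        this.jedis = new Jedis("my-redis-cluster.example.amazonaws.com", 6379);
    }

    @Override
    public void execute(Tuple tuple) {
        String custId = tuple.getStringByField("CustID");
        String balance = tuple.getStringByField("Balance");
        // Look up the SSN for this CustID in Redis
        String ssn = jedis.get(custId);
        // Emit anchored to the input tuple so Storm can track it downstream
        collector.emit(tuple, new Values(ssn, balance));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("SSN", "Balance"));
    }
}
```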
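The per-tuple upsert in RDSInserterBolt maps naturally onto PostgreSQL's `INSERT ... ON CONFLICT`. A sketch assuming a `balances` table keyed by `ssn` and a JDBC connection opened in `prepare()` (endpoint, credentials, and column names are illustrative):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class RDSInserterBolt extends BaseRichBolt {
    // Hypothetical upsert keyed on ssn: insert, or overwrite the balance if the row exists
    private static final String UPSERT_SQL =
        "INSERT INTO balances (ssn, balance) VALUES (?, ?) "
        + "ON CONFLICT (ssn) DO UPDATE SET balance = EXCLUDED.balance";

    private OutputCollector collector;
    private transient Connection conn;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        try {
            // Placeholder RDS endpoint; in practice this would come from config
            conn = DriverManager.getConnection(
                "jdbc:postgresql://my-rds.example.amazonaws.com:5432/pipeline", "user", "password");
        } catch (SQLException e) {
            throw new RuntimeException("Could not connect to RDS", e);
        }
    }

    @Override
    public void execute(Tuple tuple) {
        try (PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            ps.setString(1, tuple.getStringByField("SSN"));
            ps.setBigDecimal(2, new java.math.BigDecimal(tuple.getStringByField("Balance")));
            ps.executeUpdate();
            collector.ack(tuple);
        } catch (SQLException e) {
            collector.fail(tuple); // let Storm replay the tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no output stream
    }
}
```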
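The count/time batching in S3WriterBolt is still on the TODO list; one common way to implement it is with Storm tick tuples for the time-based trigger. A sketch assuming the AWS SDK v1 S3 client and a placeholder bucket name:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.utils.TupleUtils;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3WriterBolt extends BaseRichBolt {
    private static final int COUNT_THRESHOLD = 1000; // flush after this many tuples...
    private static final int FLUSH_SECONDS = 60;     // ...or after this many seconds

    private OutputCollector collector;
    private transient AmazonS3 s3;
    private List<String> buffer;

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every FLUSH_SECONDS
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, FLUSH_SECONDS);
        return conf;
    }

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.s3 = AmazonS3ClientBuilder.defaultClient();
        this.buffer = new ArrayList<>();
    }

    @Override
    public void execute(Tuple tuple) {
        if (TupleUtils.isTick(tuple)) {
            flush(); // time-based trigger
            return;
        }
        buffer.add(tuple.getStringByField("SSN") + "," + tuple.getStringByField("Balance"));
        collector.ack(tuple); // simplified: a production bolt would ack only after a successful flush
        if (buffer.size() >= COUNT_THRESHOLD) {
            flush(); // count-based trigger
        }
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        String key = "batches/" + System.currentTimeMillis() + ".csv";
        // "my-pipeline-bucket" is a placeholder bucket name
        s3.putObject("my-pipeline-bucket", key, String.join("\n", buffer));
        buffer.clear();
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no output stream
    }
}
```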
- `ansible-playbook infrastructure.yml`
  - Creates a dedicated VPC for this project
  - Creates two subnets inside that VPC, sets up an Internet Gateway, and defines all the necessary routes
  - Creates a SecurityGroup to be assigned to different resources throughout this project
  - Spins up EC2 instances and runs Zookeeper, Kafka, and Storm Docker containers on them
  - Creates an ElastiCache (Redis) cluster and populates it with lookup data
  - Creates an RDS PostgreSQL instance to be populated by the pipeline
  - Deploys the Storm topology
- KafkaStormProducer (a sketch follows this list)
  - Produces events to the Kafka topic consumed by the pipeline
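A minimal sketch of the producer side, assuming String-serialized `CustID,Balance` payloads; the broker address and topic name are placeholders (the first TODO item below is about making these parameter-driven):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaStormProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; hard-coded here, parameter-driven per the TODO
        props.put("bootstrap.servers", "kafka-broker:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // Each event carries a CustID and a Balance for the pipeline to enrich and persist
            producer.send(new ProducerRecord<>("balance-events", "42", "42,1375.50"));
        }
    }
}
```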
- Make KafkaStormProducer parameter-driven
- Split CustID_Balance into two separate files
- Create a config.yml.tplt and push it instead of the original Ansible config
- Implement time/count-based batching for S3WriterBolt
- Add the Storm logviewer to the supervisor Docker containers
- Implement production-level logging and exception handling
- If possible, always use Terraform for infrastructure creation. Ansible works well with already provisioned resources (e.g., spinning up a Kafka container on an existing EC2 instance), but Terraform is much more intuitive for creating the EC2 instance itself.
- The current configuration of this project uses AWS services that are beyond the Free Tier!