/pinterest_exp_pipeline

Emulates Pinterest's streaming and batch ETL pipeline for user uploads to the platform.


User Uploads data-engineering pipeline

Click here to see an Example User Post

The process (roughly)

  1. The simulated uploads are pulled from RDS and pushed to localhost with FastAPI (see the emulation sketch after this list)
  2. A Kafka producer is created and two consumers are configured, one for batch and the other for stream processing (sketched below)
  3. The Kafka batch consumer extracts the records as dicts and writes them to S3 as JSON files (sketched below)
  4. PySpark and the hadoop-aws Maven package are used to read and clean the batch records, scheduled to run daily with Airflow (see the batch-job and DAG sketches below)
  5. The cleaned batch data is written to Cassandra
  6. The Kafka streaming consumer is used by PySpark Structured Streaming for real-time processing (sketched below)
  7. Each micro-batch is processed by PySpark and written to Postgres, ultimately to be used for real-time reporting
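
A minimal sketch of step 1, assuming a MySQL-backed RDS instance read via SQLAlchemy/pymysql and a hypothetical `/pin/` endpoint; the table name, endpoint path, and port are illustrative, and in practice the API and the emulation client run as separate processes.

```python
# Hypothetical user-emulation step: rows are pulled from RDS and POSTed to a
# local FastAPI endpoint. Table name, endpoint path, and port are assumptions.
import requests
import sqlalchemy
from fastapi import FastAPI

app = FastAPI()

@app.post("/pin/")
def receive_pin(payload: dict):
    # In the full pipeline this handler would forward the record to Kafka.
    return {"status": "received"}

def emulate_uploads(creds: dict, n_rows: int = 10):
    """Pull n_rows simulated posts from RDS and push them to the local API."""
    engine = sqlalchemy.create_engine(
        f"mysql+pymysql://{creds['USER']}:{creds['PASSWORD']}"
        f"@{creds['HOST']}:{creds['PORT']}/{creds['DATABASE']}"
    )
    with engine.connect() as conn:
        rows = conn.execute(
            sqlalchemy.text(f"SELECT * FROM pinterest_data LIMIT {n_rows}")
        )
        for row in rows:
            requests.post("http://localhost:8000/pin/", json=dict(row._mapping))
```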
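
Step 2 could look like the following, sketched with the kafka-python client; the topic name, group ids, and broker address are assumptions.

```python
# One producer and two consumers on the same topic: the batch and streaming
# consumers use separate group ids so each receives the full record stream.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

batch_consumer = KafkaConsumer(
    "pinterest_uploads",                      # topic name assumed
    bootstrap_servers="localhost:9092",
    group_id="batch_group",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

stream_consumer = KafkaConsumer(
    "pinterest_uploads",
    bootstrap_servers="localhost:9092",
    group_id="stream_group",
    auto_offset_reset="latest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

def send_record(record: dict):
    """Called from the API layer to forward an upload to Kafka."""
    producer.send("pinterest_uploads", value=record)
    producer.flush()
```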
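
Step 3, sketched with boto3: the batch consumer drains the deserialised records and writes each one to the bucket as a JSON object. Writing one object per record under a `raw/` prefix is an assumed layout, and the placeholder credentials mirror the passwords file below.

```python
# Drain the Kafka batch consumer and land each record in S3 as JSON.
import json
import uuid
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="aws_access_key_id",
    aws_secret_access_key="aws_secret_access_key",
)

def consume_to_s3(consumer, bucket: str = "aws_s3_bucket_name"):
    for message in consumer:                  # each message.value is a dict
        record = message.value
        key = f"raw/{uuid.uuid4()}.json"      # one object per record (assumed layout)
        s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(record))
```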
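
Steps 4 and 5 as a single PySpark batch job: the raw JSON is read from S3 through the hadoop-aws package, lightly cleaned, and written to Cassandra via the spark-cassandra-connector. Package versions, column names, and the keyspace/table are assumptions.

```python
# Daily batch job: read raw JSON from S3, clean it, write it to Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("batch_clean")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    .config("spark.hadoop.fs.s3a.access.key", "aws_access_key_id")
    .config("spark.hadoop.fs.s3a.secret.key", "aws_secret_access_key")
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

df = spark.read.json("s3a://aws_s3_bucket_name/raw/*.json")

cleaned = (
    df.dropDuplicates(["unique_id"])                                  # columns assumed
      .withColumn("follower_count", F.col("follower_count").cast("int"))
      .na.drop(subset=["unique_id"])
)

(
    cleaned.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="pinterest", table="batch_data")                # keyspace/table assumed
    .mode("append")
    .save()
)
```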
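
The daily Airflow schedule from step 4 might look like this, submitting the batch job via spark-submit; the DAG id and script path are placeholders.

```python
# Daily DAG that triggers the PySpark batch-cleaning job.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pinterest_batch_clean",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_batch_job = BashOperator(
        task_id="run_batch_job",
        bash_command="spark-submit /path/to/batch_clean.py",  # path assumed
    )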
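```

Steps 6 and 7 as a Spark Structured Streaming job: the Kafka topic is read as a stream, the JSON payload is parsed, and each micro-batch is appended to Postgres with foreachBatch. The schema, topic, table name, and connection details are assumptions.

```python
# Real-time leg: Kafka -> Structured Streaming -> Postgres (per micro-batch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = (
    SparkSession.builder.appName("stream_to_postgres")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "org.postgresql:postgresql:42.6.0",
    )
    .getOrCreate()
)

schema = StructType([
    StructField("unique_id", StringType()),          # fields assumed
    StructField("category", StringType()),
    StructField("follower_count", IntegerType()),
])

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "pinterest_uploads")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

def write_to_postgres(batch_df, batch_id):
    # Append each micro-batch to a reporting table over JDBC.
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/pinterest_streaming")
        .option("dbtable", "experimental_data")       # table name assumed
        .option("user", "username")
        .option("password", "password")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )

stream.writeStream.foreachBatch(write_to_postgres).start().awaitTermination()
```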

Passwords file referenced:

<!-- User emulation details -->
HOST = HOST
USER = USER
PASSWORD = PASSWORD
DATABASE = DATABASE
PORT = PORT

<!-- S3 details for writing to the S3 bucket -->
aws_access_key_id = aws_access_key_id
aws_secret_access_key = aws_secret_access_key
aws_s3_bucket_name = aws_s3_bucket_name

<!-- Postgres Details -->
postgresusername = username
postgrespassword = password
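
A minimal sketch of how the pipeline scripts might load the passwords file above, assuming it is a plain-text file of `KEY = VALUE` lines (the filename is a placeholder); comment and description lines are skipped.

```python
# Parse the credentials file into a dict, ignoring <!-- --> comments and blanks.
def load_credentials(path: str = "passwords.txt") -> dict:
    creds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("<!--") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            creds[key.strip()] = value.strip()
    return creds
```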