/pinterest_exp_pipeline

Emulates Pinterest's streaming and batch ETL pipeline for user uploads to the platform.


User Uploads data-engineering pipeline

Click here to see an Example User Post

The process (roughly)

  1. The simulated uploads are pulled from RDS and pushed to localhost with FastAPI (see the emulation sketch after this list)
  2. A Kafka producer is created and two consumers are configured, one for batch and the other for stream processing (sketched below)
  3. The Kafka batch consumer extracts the records as dicts and writes them to S3 as JSON files (sketched below)
  4. PySpark and the hadoop-aws Maven package are used to read and clean the batch records, scheduled to run daily with Airflow (see the batch-job and DAG sketches below)
  5. The cleaned batch data is written to Cassandra
  6. The Kafka streaming consumer is used by PySpark Structured Streaming for real-time processing (sketched below)
  7. Each micro-batch is processed by PySpark and written to Postgres, ultimately to be used for real-time reporting
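
A minimal sketch of step 1, assuming a MySQL-backed RDS instance read via SQLAlchemy/pymysql and a hypothetical `/pin/` endpoint; the table name, endpoint path, and port are illustrative, and in practice the API and the emulation client run as separate processes.

```python
# Hypothetical user-emulation step: rows are pulled from RDS and POSTed to a
# local FastAPI endpoint. Table name, endpoint path, and port are assumptions.
import requests
import sqlalchemy
from fastapi import FastAPI

app = FastAPI()

@app.post("/pin/")
def receive_pin(payload: dict):
    # In the full pipeline this handler would forward the record to Kafka.
    return {"status": "received"}

def emulate_uploads(creds: dict, n_rows: int = 10):
    """Pull n_rows simulated posts from RDS and push them to the local API."""
    engine = sqlalchemy.create_engine(
        f"mysql+pymysql://{creds['USER']}:{creds['PASSWORD']}"
        f"@{creds['HOST']}:{creds['PORT']}/{creds['DATABASE']}"
    )
    with engine.connect() as conn:
        rows = conn.execute(
            sqlalchemy.text(f"SELECT * FROM pinterest_data LIMIT {n_rows}")
        )
        for row in rows:
            requests.post("http://localhost:8000/pin/", json=dict(row._mapping))
```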
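
Step 2 could look like the following, sketched with the kafka-python client; the topic name, group ids, and broker address are assumptions.

```python
# One producer and two consumers on the same topic: the batch and streaming
# consumers use separate group ids so each receives the full record stream.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

batch_consumer = KafkaConsumer(
    "pinterest_uploads",                      # topic name assumed
    bootstrap_servers="localhost:9092",
    group_id="batch_group",
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

stream_consumer = KafkaConsumer(
    "pinterest_uploads",
    bootstrap_servers="localhost:9092",
    group_id="stream_group",
    auto_offset_reset="latest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

def send_record(record: dict):
    """Called from the API layer to forward an upload to Kafka."""
    producer.send("pinterest_uploads", value=record)
    producer.flush()
```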
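
Step 3, sketched with boto3: the batch consumer drains the deserialised records and writes each one to the bucket as a JSON object. Writing one object per record under a `raw/` prefix is an assumed layout, and the placeholder credentials mirror the passwords file below.

```python
# Drain the Kafka batch consumer and land each record in S3 as JSON.
import json
import uuid
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="aws_access_key_id",
    aws_secret_access_key="aws_secret_access_key",
)

def consume_to_s3(consumer, bucket: str = "aws_s3_bucket_name"):
    for message in consumer:                  # each message.value is a dict
        record = message.value
        key = f"raw/{uuid.uuid4()}.json"      # one object per record (assumed layout)
        s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(record))
```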
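
Steps 4 and 5 as a single PySpark batch job: the raw JSON is read from S3 through the hadoop-aws package, lightly cleaned, and written to Cassandra via the spark-cassandra-connector. Package versions, column names, and the keyspace/table are assumptions.

```python
# Daily batch job: read raw JSON from S3, clean it, write it to Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("batch_clean")
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    .config("spark.hadoop.fs.s3a.access.key", "aws_access_key_id")
    .config("spark.hadoop.fs.s3a.secret.key", "aws_secret_access_key")
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

df = spark.read.json("s3a://aws_s3_bucket_name/raw/*.json")

cleaned = (
    df.dropDuplicates(["unique_id"])                                  # columns assumed
      .withColumn("follower_count", F.col("follower_count").cast("int"))
      .na.drop(subset=["unique_id"])
)

(
    cleaned.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="pinterest", table="batch_data")                # keyspace/table assumed
    .mode("append")
    .save()
)
```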
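
The daily Airflow schedule from step 4 might look like this, submitting the batch job via spark-submit; the DAG id and script path are placeholders.

```python
# Daily DAG that triggers the PySpark batch-cleaning job.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pinterest_batch_clean",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_batch_job = BashOperator(
        task_id="run_batch_job",
        bash_command="spark-submit /path/to/batch_clean.py",  # path assumed
    )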
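```

Steps 6 and 7 as a Spark Structured Streaming job: the Kafka topic is read as a stream, the JSON payload is parsed, and each micro-batch is appended to Postgres with foreachBatch. The schema, topic, table name, and connection details are assumptions.

```python
# Real-time leg: Kafka -> Structured Streaming -> Postgres (per micro-batch).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = (
    SparkSession.builder.appName("stream_to_postgres")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
        "org.postgresql:postgresql:42.6.0",
    )
    .getOrCreate()
)

schema = StructType([
    StructField("unique_id", StringType()),          # fields assumed
    StructField("category", StringType()),
    StructField("follower_count", IntegerType()),
])

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "pinterest_uploads")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

def write_to_postgres(batch_df, batch_id):
    # Append each micro-batch to a reporting table over JDBC.
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/pinterest_streaming")
        .option("dbtable", "experimental_data")       # table name assumed
        .option("user", "username")
        .option("password", "password")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )

stream.writeStream.foreachBatch(write_to_postgres).start().awaitTermination()
```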

Passwords file referenced:

<!-- User emulation details -->
HOST = HOST
USER = USER
PASSWORD = PASSWORD
DATABASE = DATABASE
PORT = PORT

<!-- S3 details for writing to the S3 bucket -->
aws_access_key_id = aws_access_key_id
aws_secret_access_key = aws_secret_access_key
aws_s3_bucket_name = aws_s3_bucket_name

<!-- Postgres Details -->
postgresusername = username
postgrespassword = password
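
A minimal sketch of how the pipeline scripts might load the passwords file above, assuming it is a plain-text file of `KEY = VALUE` lines (the filename is a placeholder); comment and description lines are skipped.

```python
# Parse the credentials file into a dict, ignoring <!-- --> comments and blanks.
def load_credentials(path: str = "passwords.txt") -> dict:
    creds = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("<!--") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            creds[key.strip()] = value.strip()
    return creds
```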