- Simulated user uploads are read from RDS and pushed to a FastAPI service running on localhost.
- A Kafka producer publishes each upload, and two consumers are configured: one for batch and one for stream processing (see the first sketch after this list).
- The Kafka batch consumer deserialises the records as Python dicts and writes them to S3 as JSON objects (second sketch below).
- PySpark, together with the hadoop-aws Maven package, reads and cleans the batch records; the job is scheduled to run daily with Airflow (third sketch below).
- The cleaned batch data is written to Cassandra.
- The Kafka streaming consumer feeds PySpark streaming for real-time processing.
- Each micro-batch is processed by PySpark and written to Postgres, ultimately to be used for real-time reporting (fourth sketch below).
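
A minimal sketch of the upload entry point: a FastAPI route that receives a simulated post and forwards it to Kafka. The endpoint path, topic name (`user_posts`) and broker address are assumptions, not taken from the repository.

```python
# Hypothetical sketch: FastAPI endpoint that receives a simulated upload and
# forwards it to Kafka. Endpoint path, topic name and broker are assumptions.
import json

from fastapi import FastAPI
from kafka import KafkaProducer  # kafka-python

app = FastAPI()

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                      # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.post("/upload")                                         # assumed endpoint path
async def upload(post: dict):
    # Publish the simulated user post to the topic read by both consumers.
    producer.send("user_posts", value=post)                  # assumed topic name
    return {"status": "queued"}
```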
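
The batch consumer could look roughly like this: poll records from Kafka, treat each value as a dict, and write it to S3 as a JSON object. The topic, consumer group, bucket name and key layout are assumptions.

```python
# Hypothetical sketch of the batch consumer: Kafka records -> S3 JSON objects.
import json
import uuid

import boto3
from kafka import KafkaConsumer  # kafka-python

consumer = KafkaConsumer(
    "user_posts",                                             # assumed topic name
    bootstrap_servers="localhost:9092",
    group_id="batch_consumer",                                # assumed group id
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

s3 = boto3.client("s3")  # credentials resolved from the S3 config below

for message in consumer:
    record = message.value                                    # already a dict
    key = f"raw/posts/{uuid.uuid4()}.json"                    # assumed key layout
    s3.put_object(
        Bucket="aws_s3_bucket_name",                          # placeholder from the config
        Key=key,
        Body=json.dumps(record).encode("utf-8"),
    )
```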
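
A sketch of the daily batch job that the Airflow DAG would trigger: read the raw JSON from S3 via the hadoop-aws package, apply some cleaning, and append to Cassandra with the Spark Cassandra connector. Package versions, bucket path, column names, keyspace and table are assumptions.

```python
# Hypothetical daily batch job: S3 (s3a://) -> clean with PySpark -> Cassandra.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

spark = (
    SparkSession.builder.appName("batch_clean")
    # hadoop-aws for s3a:// access, spark-cassandra-connector for the sink (assumed versions)
    .config(
        "spark.jars.packages",
        "org.apache.hadoop:hadoop-aws:3.3.4,"
        "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1",
    )
    .config("spark.cassandra.connection.host", "localhost")   # assumed Cassandra host
    .getOrCreate()
)

raw = spark.read.json("s3a://aws_s3_bucket_name/raw/posts/")  # placeholder bucket/path

cleaned = (
    raw.dropDuplicates(["post_id"])                           # assumed id column
       .na.drop(subset=["post_id"])
       .withColumn("title", trim(col("title")))               # assumed text column
)

(
    cleaned.write.format("org.apache.spark.sql.cassandra")
    .options(keyspace="posts_ks", table="cleaned_posts")      # assumed names
    .mode("append")
    .save()
)
```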
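
For the streaming path, a plausible shape is Spark Structured Streaming reading the Kafka topic and writing each micro-batch to Postgres over JDBC via `foreachBatch`. The record schema, topic, table name and connection details are assumptions.

```python
# Hypothetical streaming job: Kafka -> Structured Streaming -> Postgres (JDBC).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = (
    SparkSession.builder.appName("stream_to_postgres")
    .config(
        "spark.jars.packages",
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"   # assumed Spark version
        "org.postgresql:postgresql:42.6.0",                    # Postgres JDBC driver
    )
    .getOrCreate()
)

schema = StructType([                                          # assumed post schema
    StructField("post_id", StringType()),
    StructField("title", StringType()),
    StructField("body", StringType()),
])

posts = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "user_posts")                         # assumed topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("post"))
    .select("post.*")
)

def write_to_postgres(batch_df, batch_id):
    # Append each micro-batch to the reporting table over JDBC.
    (
        batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/DATABASE")  # placeholder
        .option("dbtable", "realtime_posts")                         # assumed table
        .option("user", "username")
        .option("password", "password")
        .option("driver", "org.postgresql.Driver")
        .mode("append")
        .save()
    )

query = posts.writeStream.foreachBatch(write_to_postgres).start()
query.awaitTermination()
```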
<!-- User emulation details -->
HOST = HOST
USER = USER
PASSWORD = PASSWORD
DATABASE = DATABASE
PORT = PORT
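
A minimal sketch of how the user-emulation script might use these values to pull rows out of RDS, assuming a MySQL engine and a hypothetical `posts` table (both assumptions).

```python
# Hypothetical: read one random row from RDS to simulate a user upload.
import sqlalchemy

HOST, USER, PASSWORD, DATABASE = "HOST", "USER", "PASSWORD", "DATABASE"  # placeholders
PORT = 3306                                                              # placeholder

engine = sqlalchemy.create_engine(
    f"mysql+pymysql://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}"        # assumed MySQL engine
)

with engine.connect() as conn:
    # Table and random-sampling query are assumptions for illustration only.
    row = conn.execute(
        sqlalchemy.text("SELECT * FROM posts ORDER BY RAND() LIMIT 1")
    ).mappings().first()
    print(dict(row) if row else "no rows")
```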
<!-- s3 config -->
Credentials and bucket name used for writing to the S3 bucket
aws_access_key_id = aws_access_key_id
aws_secret_access_key = aws_secret_access_key
aws_s3_bucket_name = aws_s3_bucket_name
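
A short sketch of turning these placeholders into a boto3 client and writing a test object; the object key is purely illustrative.

```python
# Hypothetical: build the S3 client from the values above and write one object.
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="aws_access_key_id",            # placeholder
    aws_secret_access_key="aws_secret_access_key",    # placeholder
)
s3.put_object(
    Bucket="aws_s3_bucket_name",                      # placeholder
    Key="raw/posts/example.json",                     # hypothetical key
    Body=b'{"example": true}',
)
```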
<!-- Postgres Details -->
postgresusername = username
postgrespassword = password
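
A sketch of how the Postgres credentials might be used outside Spark, for example to sanity-check the reporting table. Host, port, database name and table are assumptions.

```python
# Hypothetical: count rows in the (assumed) reporting table using psycopg2.
import psycopg2

conn = psycopg2.connect(
    host="localhost",          # assumed host
    port=5432,                 # assumed port
    dbname="DATABASE",         # placeholder
    user="username",           # placeholder
    password="password",       # placeholder
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM realtime_posts")   # assumed table name
    print(cur.fetchone()[0])
conn.close()
```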