
Data pipeline to extract subreddit data using the Reddit API, AWS, and Airflow


Reddit API - AWS - Python - Airflow Data Pipeline

Overview

The objective of this project is to orchestrate a data pipeline with Airflow, running in Docker, that acquires data from the subreddit r/dataengineering through Reddit's API, cleanses the acquired data, and finally loads reporting-level data into an Amazon RDS MySQL table.
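As an illustration of the cleanse and aggregate stages described above, here is a minimal sketch in plain Python. It is not the project's actual code: the function names `cleanse` and `aggregate_by_day` are hypothetical, the per-day aggregation is an assumed example of "reporting-level" data, and the input is assumed to be a list of dicts carrying the usual Reddit submission fields (`id`, `title`, `score`, `num_comments`, `created_utc`).

```python
from collections import defaultdict
from datetime import datetime, timezone


def cleanse(raw_posts):
    """Drop duplicate or incomplete posts and normalise field types.

    Hypothetical helper: the field names are assumed to match the
    Reddit API's submission fields.
    """
    seen = set()
    cleaned = []
    for post in raw_posts:
        post_id = post.get("id")
        # Skip records without an id, title, or timestamp, and repeats.
        if not post_id or post_id in seen:
            continue
        if not post.get("title") or "created_utc" not in post:
            continue
        seen.add(post_id)
        created = datetime.fromtimestamp(post["created_utc"], tz=timezone.utc)
        cleaned.append({
            "id": post_id,
            "title": post["title"].strip(),
            "score": int(post.get("score", 0)),
            "num_comments": int(post.get("num_comments", 0)),
            "created_date": created.date().isoformat(),
        })
    return cleaned


def aggregate_by_day(cleaned_posts):
    """Roll cleansed posts up to one row per day (assumed reporting shape)."""
    totals = defaultdict(lambda: {"posts": 0, "score": 0, "num_comments": 0})
    for post in cleaned_posts:
        day = totals[post["created_date"]]
        day["posts"] += 1
        day["score"] += post["score"]
        day["num_comments"] += post["num_comments"]
    return [{"created_date": d, **totals[d]} for d in sorted(totals)]
```

In an Airflow deployment, each function would typically be wrapped in its own task so that a failed stage can be retried without re-running the extraction.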

Platforms Used

  1. Airflow: Workflow orchestration platform
  2. AWS S3: Object storage service that holds the raw, cleansed, and aggregated forms of the data
  3. AWS RDS: Relational Database Service that stores the final aggregated, reporting-layer data in a table
  4. AWS IAM: Identity and Access Management service used to create roles for accessing AWS S3
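Since S3 holds three forms of the data, one way to keep them apart is a key prefix per layer. The sketch below is only an assumption about how such a layout could look. The `s3_key` helper, the layer names, the file name `posts`, and the file extensions are all hypothetical and not taken from the project.

```python
from datetime import date

# Hypothetical layer names mirroring the raw / cleansed / aggregated split.
LAYERS = ("raw", "cleansed", "aggregated")


def s3_key(layer, run_date, subreddit="dataengineering"):
    """Build an object key such as raw/dataengineering/2024-03-02/posts.json."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    # Assumption: raw API responses stay as JSON, later layers are CSV.
    ext = "json" if layer == "raw" else "csv"
    return f"{layer}/{subreddit}/{run_date.isoformat()}/posts.{ext}"
```

Partitioning keys by run date keeps each pipeline run idempotent: re-running a day overwrites only that day's objects.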

Data Pipeline Architecture

[Image: Reddit data pipeline architecture diagram]

Airflow DAG

[Image: Airflow DAG screenshot]

Final RDS Table snapshot

[Image: final RDS table snapshot]