This project demonstrates how to build an end-to-end Python ETL (Extract, Transform, Load) pipeline on AWS. The pipeline extracts real estate property data from the Zillow Rapid API and processes it through a series of AWS services, using the following components:
- API: Data is extracted from the Zillow Rapid API.
- Python: A Python script connects to the API and pulls the data (see the extraction sketch after this list).
- Amazon EC2: Hosts the Apache Airflow instance that orchestrates the ETL pipeline.
- Amazon S3: Serves as the landing and intermediate zones for raw and processed data.
- AWS Lambda: Automates data movement between S3 buckets and performs the data transformation.
- Amazon Redshift: Hosts the transformed data for querying and analysis.
- Amazon QuickSight: Provides BI tooling for data visualization.
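
As a rough illustration of the extraction step, the sketch below shows how such a script might call the API with the `requests` library. The endpoint URL, header names, query parameters, and output path are assumptions and will differ depending on the RapidAPI subscription used.

```python
import json
import requests

# Hypothetical endpoint and credentials -- the actual host, path, and
# query parameters depend on the Zillow Rapid API plan you subscribe to.
API_URL = "https://zillow56.p.rapidapi.com/search"
HEADERS = {
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY",
    "X-RapidAPI-Host": "zillow56.p.rapidapi.com",
}

def extract_zillow_data(location: str, output_path: str) -> None:
    """Pull property listings for a location and write the raw JSON response to disk."""
    response = requests.get(API_URL, headers=HEADERS, params={"location": location}, timeout=30)
    response.raise_for_status()
    with open(output_path, "w") as f:
        json.dump(response.json(), f)

if __name__ == "__main__":
    extract_zillow_data("houston, tx", "/home/ubuntu/zillow_raw.json")
```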
The data moves through the pipeline as follows:
- Raw data extracted from the API is first loaded into an S3 bucket that serves as the landing zone.
- An S3 event triggers a Lambda function that copies the data to an intermediate bucket, keeping the landing-zone copy immutable (see the copy-function sketch after this list).
- A second Lambda function transforms the data (a transformation sketch also follows this list).
- Transformed data is loaded into a different S3 bucket.
- An Airflow S3KeySensor waits for the transformed file to appear before the Redshift load begins.
- Data is finally loaded into an Amazon Redshift cluster.
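
A minimal sketch of the copy Lambda, assuming it is wired to the landing bucket's `ObjectCreated` event; the destination bucket name is a placeholder.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical destination bucket -- substitute the intermediate bucket used in your setup.
DESTINATION_BUCKET = "zillow-intermediate-zone"

def lambda_handler(event, context):
    """Copy each newly landed object to the intermediate bucket, leaving the landing zone untouched."""
    for record in event["Records"]:
        source_bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        s3.copy_object(
            Bucket=DESTINATION_BUCKET,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
    return {"statusCode": 200}
```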
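The transformation Lambda could follow the same event-driven pattern, reading the raw JSON, keeping a subset of fields, and writing a CSV to the transformed bucket. The bucket name, column list, and response structure below are assumptions; note that pandas is not in the default Lambda runtime and would need to be supplied via a layer.

```python
import json
import boto3
import pandas as pd

s3 = boto3.client("s3")

# Hypothetical bucket and column names -- adjust to the fields actually returned by the API.
TRANSFORMED_BUCKET = "zillow-transformed-zone"
COLUMNS = ["bathrooms", "bedrooms", "city", "homeStatus", "homeType", "livingArea", "price", "zipcode"]

def lambda_handler(event, context):
    """Read the raw JSON payload, keep a subset of fields, and write a CSV to the transformed bucket."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    obj = s3.get_object(Bucket=bucket, Key=key)
    payload = json.loads(obj["Body"].read())

    df = pd.DataFrame(payload.get("results", []))
    df = df[[c for c in COLUMNS if c in df.columns]]

    csv_key = key.rsplit(".", 1)[0] + ".csv"
    s3.put_object(Bucket=TRANSFORMED_BUCKET, Key=csv_key, Body=df.to_csv(index=False))
    return {"statusCode": 200, "body": f"wrote {csv_key}"}
```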
Apache Airflow, running on an Amazon EC2 instance, orchestrates the pipeline with the following tasks (a DAG sketch follows this list):
- PythonOperator: Connects to the Zillow Rapid API to extract data.
- BashOperator: Moves the extracted data from the EC2 instance to the S3 landing zone.
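
A sketch of how the DAG might tie these tasks together, assuming the Amazon provider package is installed. Task IDs, bucket names, object keys, connection IDs, and the Redshift schema/table are illustrative placeholders, and `zillow_extract` refers to the extraction function sketched earlier. The transformation Lambdas are event-driven, so the DAG only waits for their output with the S3KeySensor.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator

# Hypothetical module holding the extraction function sketched above.
from zillow_extract import extract_zillow_data

with DAG(
    dag_id="zillow_analytics_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Pull raw listings from the API onto the EC2 instance.
    extract = PythonOperator(
        task_id="extract_zillow_data",
        python_callable=extract_zillow_data,
        op_kwargs={"location": "houston, tx", "output_path": "/home/ubuntu/zillow_raw.json"},
    )

    # Push the raw file into the S3 landing zone.
    load_to_landing = BashOperator(
        task_id="load_to_s3_landing_zone",
        bash_command="aws s3 cp /home/ubuntu/zillow_raw.json s3://zillow-landing-zone/",
    )

    # Wait until the event-driven Lambdas have produced the transformed CSV.
    wait_for_transformed = S3KeySensor(
        task_id="wait_for_transformed_file",
        bucket_name="zillow-transformed-zone",
        bucket_key="zillow_raw.csv",
        aws_conn_id="aws_s3_conn",
        timeout=60 * 60,
        poke_interval=60,
    )

    # COPY the transformed CSV from S3 into the Redshift table.
    load_to_redshift = S3ToRedshiftOperator(
        task_id="load_to_redshift",
        s3_bucket="zillow-transformed-zone",
        s3_key="zillow_raw.csv",
        schema="PUBLIC",
        table="zillowdata",
        copy_options=["csv", "IGNOREHEADER 1"],
        aws_conn_id="aws_s3_conn",
        redshift_conn_id="redshift_conn",
    )

    extract >> load_to_landing >> wait_for_transformed >> load_to_redshift
```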
After the data is loaded into Redshift, Amazon QuickSight connects to the cluster to visualize it, providing insights into the real estate property data.