
Airflow Data Pipeline: S3 to Redshift

This project demonstrates an end-to-end data pipeline using Apache Airflow, where data is extracted from an S3 bucket, processed with Pandas, and then loaded into Amazon Redshift.

Introduction

This repository provides an example of a data pipeline that:

  1. Extracts a CSV file from an S3 bucket.
  2. Loads the CSV content into a Pandas DataFrame.
  3. Inserts the data into a table in Amazon Redshift (a sketch of these steps follows).
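
Concretely, the extract and load steps might look like the sketch below. This is a minimal illustration, not the repo's exact code: the bucket, key, table name, and the redshift_default Airflow connection id are all placeholders.

import boto3
import pandas as pd
from airflow.providers.postgres.hooks.postgres import PostgresHook

def extract_data_from_s3() -> pd.DataFrame:
    # Placeholder bucket and key; replace with the real location of the CSV.
    obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="data.csv")
    return pd.read_csv(obj["Body"])

def load_data_into_redshift(df: pd.DataFrame) -> None:
    # Redshift speaks the Postgres wire protocol, so PostgresHook can be used.
    # "redshift_default" is an assumed Airflow connection id.
    hook = PostgresHook(postgres_conn_id="redshift_default")
    hook.insert_rows(
        table="my_table",  # placeholder target table
        rows=df.itertuples(index=False),
        target_fields=list(df.columns),
    )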

Prerequisites

  • Python 3.8+
  • Apache Airflow 2.0+
  • LocalStack (for local development; see the snippet below for pointing boto3 at it)
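
LocalStack emulates AWS locally, so during development the S3 client must target its edge endpoint (http://localhost:4566 by default) rather than real AWS. A minimal sketch, with dummy credentials and a placeholder bucket name:

import boto3

# LocalStack's default edge endpoint; it accepts any dummy credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",
    aws_access_key_id="test",
    aws_secret_access_key="test",
    region_name="us-east-1",
)
s3.create_bucket(Bucket="my-bucket")  # placeholder bucket name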

Virtual Environment

It's recommended to use a virtual environment for dependency management.

python3 -m venv .venv
source .venv/bin/activate
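
With the environment active, install the dependencies. Assuming no pinned requirements file, the pipeline needs at least Airflow, boto3, Pandas, and the Postgres provider (used here for the Redshift connection):

pip install "apache-airflow>=2.0" apache-airflow-providers-postgres boto3 pandas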

Airflow DAG

The Airflow DAG is responsible for orchestrating the data pipeline. Here’s a simplified example:

from airflow.decorators import task
from airflow.exceptions import AirflowException

@task
def run_data_pipeline():
    try:
        # Extract the CSV from S3 and load it into Redshift
        # (helper functions defined elsewhere in the project).
        data = extract_data_from_s3()
        load_data_into_redshift(data)
    except Exception as e:
        # Re-raise so Airflow marks the task as failed.
        raise AirflowException(f"Data pipeline failed: {e}") from e

# init and done are boundary tasks (see the DAG sketch below).
init >> run_data_pipeline() >> done
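
The init and done boundary tasks and the surrounding DAG are not shown above. One way to wire everything together, assuming a manually triggered DAG (the DAG id and start date are placeholders):

import pendulum
from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator  # DummyOperator on Airflow < 2.2

@dag(schedule_interval=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def s3_to_redshift():
    init = EmptyOperator(task_id="init")
    done = EmptyOperator(task_id="done")
    init >> run_data_pipeline() >> done

s3_to_redshift()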

Usage

  1. Activate the virtual environment.
  2. Set the required environment variables (see the example below).
  3. Start the Airflow webserver and scheduler.
  4. Trigger the DAG from the Airflow UI.
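
For a LocalStack setup, steps 2 and 3 might look like this (the credential values are dummies that LocalStack accepts):

export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test
export AWS_DEFAULT_REGION=us-east-1
airflow webserver --port 8080    # one terminal
airflow scheduler                # another terminal

The DAG can then be triggered from the UI at http://localhost:8080, or from the CLI with airflow dags trigger <dag_id>.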

License

This project is licensed under the MIT License.