This repo contains the files from my development work on Data Pipelines Pocket Reference (DPPR) by James Densmore.
While working through the material, I realized how important it is to be comfortable with AWS, especially with services like RDS (Relational Database Service) and Redshift. Both can incur costs, and I found that staying within the free tier can save you from unexpected charges. Configuring security groups within a VPC (Virtual Private Cloud) seemed challenging at first, but with perseverance I learned to navigate and modify inbound rules effectively.
James Densmore provides source code examples throughout the book, and it's important to approach them with a discerning eye. While these examples are immensely valuable, I encountered some syntax errors, particularly around Chapter 10, 'Measuring and Monitoring Pipeline Performance'. This taught me the importance of closely examining and debugging the provided code. For instance, troubleshooting a CSV ingestion into an S3 bucket helped me identify where an additional set of quotation marks was needed. I also learned the differences between Windows file paths and Linux file paths when working with WSL (Windows Subsystem for Linux).
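The Windows versus Linux path difference can be illustrated with a minimal sketch. It assumes WSL's default behaviour of mounting Windows drives under `/mnt/<drive letter>`; the function name `windows_to_wsl` and the sample path are hypothetical and not taken from the book.

```python
from pathlib import PurePosixPath, PureWindowsPath


def windows_to_wsl(path: str) -> str:
    """Convert an absolute Windows path to the matching path inside WSL.

    Assumes the default WSL automount location (/mnt/<drive letter>).
    """
    win = PureWindowsPath(path)
    # Only plain drive paths such as C:\... are handled here (no UNC or relative paths)
    if not win.is_absolute() or len(win.drive) != 2:
        raise ValueError(f"expected an absolute drive path, got {path!r}")
    drive_letter = win.drive[0].lower()
    # win.parts[0] is the anchor ("C:\\"); the remaining parts are folders and the file name
    return str(PurePosixPath("/mnt", drive_letter, *win.parts[1:]))


# Hypothetical file location, used only for illustration
wsl_path = windows_to_wsl(r"C:\Users\me\data\order_extract.csv")
print(wsl_path)
```

A script running inside WSL would then open the file with the converted path rather than the original Windows one.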
One of the challenges I faced was setting up Airflow, a platform for orchestrating complex data workflows. Getting my Python virtual environment to interact seamlessly with the Airflow database, a Postgres DB hosted on RDS, was a trial-and-error process. In the future, I plan to explore setting up Airflow within a Docker container for a smoother experience.
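Much of that trial and error tends to come down to the metadata database connection string. The sketch below builds a SQLAlchemy-style Postgres URI of the kind Airflow reads from its `sql_alchemy_conn` setting. The helper name, user, password, database name, and RDS hostname are all placeholders, and the exact config section (and matching environment variable name) depends on the Airflow version, so check it against your installation.

```python
from urllib.parse import quote_plus


def airflow_postgres_uri(user: str, password: str, host: str,
                         database: str, port: int = 5432) -> str:
    """Build a postgresql+psycopg2 URI for the Airflow metadata database."""
    # Percent-encode credentials so characters like '@' or '/' do not break the URI
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values: substitute your own RDS endpoint and credentials
uri = airflow_postgres_uri(
    user="airflow",
    password="p@ss/word",
    host="my-airflow-db.example.us-east-1.rds.amazonaws.com",
    database="airflow",
)
print(uri)
```

The resulting string can be placed in `airflow.cfg` or exported as an environment variable before initializing the Airflow database. The RDS security group also has to allow inbound traffic on the Postgres port from the machine running Airflow.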
The chapter on transforming data provided enlightening lessons on querying tables using complex JOIN and WHERE clauses. These insights are invaluable for anyone working with data transformation, and I found the explanations and data modeling examples especially helpful.
- Authorizing Amazon S3 Bucket operations using IAM Roles
- How to create an IAM Role for an Amazon Redshift Cluster
- Creating a role for an IAM User
- Solution to an Airflow configuration setting for email notifications
- End-to-end guide for Amazon Redshift connection with Python
- COPY command syntax for transferring data in an S3 Bucket
- Command-line instructions to properly install Airflow into your Python virtual environment
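
As an illustration of the COPY command item above, here is a minimal sketch of building a Redshift `COPY` statement that loads a CSV file from S3 using an IAM role. The table, bucket, key, and role ARN are hypothetical placeholders, and `build_copy_statement` is a helper written for this example. Note the single quotes around the S3 URI and the role ARN: both are SQL string literals, which is an easy place to drop a set of quotation marks.

```python
def build_copy_statement(table: str, bucket: str, key: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads a CSV file from S3."""
    s3_uri = f"s3://{bucket}/{key}"
    # The S3 URI and the role ARN each need their own single quotes in the SQL text
    return (
        f"COPY {table} "
        f"FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "CSV;"
    )


# Placeholder names: replace with your own table, bucket, file, and role
sql = build_copy_statement(
    table="public.orders",
    bucket="my-pipeline-bucket",
    key="order_extract.csv",
    iam_role_arn="arn:aws:iam::123456789012:role/RedshiftLoadRole",
)
print(sql)
```

The statement is assembled with plain string formatting, so it is only suitable for trusted, hard-coded values. It would be executed through whichever Redshift connection the pipeline already uses.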