Building ETL Pipelines with Python

This is the code repository for Building ETL Pipelines with Python, published by Packt.

Create and deploy enterprise-ready ETL pipelines by employing modern methods

What is this book about?

Modern extract, transform, and load (ETL) pipelines for data engineering have favored the Python language for its broad range of uses and a large assortment of tools, applications, and open source components. With its simplicity and extensive library support, Python has emerged as the undisputed choice for data processing.

This book covers the following exciting features:

  • Explore the available libraries and tools to create ETL pipelines using Python
  • Write clean and resilient ETL code in Python that can be extended and easily scaled
  • Understand the best practices and design principles for creating ETL pipelines
  • Orchestrate the ETL process and scale the ETL pipeline effectively
  • Discover tools and services available in AWS for ETL pipelines
  • Understand different testing strategies and implement them with the ETL process

If you feel this book is for you, get your copy today!

https://www.packtpub.com/

Instructions and Navigations

All of the code is organized into folders.

The code will look like the following:

import pandas as pd

# Merge the three dataframes (df, df2, df3) into a single dataframe
merge_01_df = pd.merge(df, df2, on='CRASH_RECORD_ID')
all_data_df = pd.merge(merge_01_df, df3, on='CRASH_RECORD_ID')
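
For context, here is a minimal, self-contained sketch of how such dataframes might be produced and merged. The CSV file names are hypothetical stand-ins, not the book's actual data files:

import pandas as pd

# Hypothetical source files standing in for the book's crash datasets
df = pd.read_csv('crashes.csv')
df2 = pd.read_csv('vehicles.csv')
df3 = pd.read_csv('people.csv')

# Chain the two merges; by default pd.merge keeps only rows whose
# CRASH_RECORD_ID appears in all three dataframes (inner join)
all_data_df = df.merge(df2, on='CRASH_RECORD_ID').merge(df3, on='CRASH_RECORD_ID')
print(all_data_df.shape)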

Following is what you need for this book: If you are a data engineer or software professional looking to create enterprise-level ETL pipelines using Python, this book is for you. Fundamental knowledge of Python is a prerequisite.

With the following software and hardware list, you can run all of the code files present in the book (Chapters 1-15).

Software and Hardware List

Chapter | Software required | OS required
1-15 | PyCharm 2023.2 | Any OS
1-15 | Python ≥ 3.6 | Any OS
1-15 | Jupyter Notebook 7.0.2 | Any OS

Get to Know the Author

Brij Kishore Pandey stands as a testament to dedication, innovation, and mastery in the vast domains of software engineering, data engineering, machine learning, and architectural design. His illustrious career, spanning over 14 years, has seen him wear multiple hats, transitioning seamlessly between roles and consistently pushing the boundaries of technological advancement. He has a degree in electrical and electronics engineering. His work history includes the likes of JP Morgan Chase, American Express, 3M Company, Alaska Airlines, and Cigna Healthcare. He is currently working as a principal software engineer at Automatic Data Processing Inc. (ADP). Originally from India, he resides in Parsippany, New Jersey, with his wife and daughter.

Emily Ro Schoof is a dedicated data specialist with a global perspective, showcasing her expertise as a data scientist and data engineer on international platforms. Drawing from a background rooted in healthcare and experimental design, she brings a unique perspective to her data analytics roles. Her multifaceted career ranges from working with UNICEF to design automated forecasting algorithms that identify conflict anomalies using near real-time media monitoring, to serving as a subject matter expert for General Assembly's data engineering course content and design. She holds the strong belief that providing easy access to resources that merge theory and real-world applications is the essential first step.

NOTE

Create production-ready ETL pipelines with Python and open source libraries. The book uses the Pipenv environment for dependency management and PyCharm as the recommended integrated development environment (IDE).

Table of Contents

  1. Installation
  2. Getting Started
  3. Chapter Descriptions
  4. Contributing
  5. License

Installation

To set up the development environment for Building ETL Pipelines with Python, follow the instructions below:

  1. Install Python: Ensure that Python is installed on your system. You can download the latest version of Python from the official Python website.
  2. Install Pipenv: Pipenv is used for managing dependencies. Install Pipenv by running the following command:
$ pip install pipenv
  3. Fork the Repository: Fork, then clone this repository to your local machine using Git, or download the ZIP file from the repository's main page.
  4. Install Dependencies: Some code examples in the book may require additional Python packages or libraries. These dependencies are listed in the Pipfile available in this GitHub repository. To install the required packages with Pipenv, navigate to the project directory and run the following command:
$ pipenv install --dev

This will create a virtual environment and install all the required packages specified in the Pipfile.

  5. Jupyter Notebooks: Install Jupyter Notebook (https://jupyter.org/install) to open and interact with the code examples. Jupyter Notebook provides an interactive and visual environment for running Python code. You can install it using the following command:
$ pip install notebook

To initiate and run a Jupyter Notebook instance, run the following command:

$ jupyter notebook
  6. Set Up PyCharm (optional): If you prefer to use PyCharm as your IDE, follow the PyCharm installation instructions on the official JetBrains website.

Getting Started

To start working with Building ETL Pipelines with Python, follow the steps below:

  1. Activate the Pipenv shell: Navigate to the repository's root directory in PyCharm and run the following command to activate the Pipenv shell:
$ pipenv shell
  2. Start Coding: Follow along with this book's chapters and corresponding code examples in the repository. Each chapter is organized in its respective directory and contains code files, exercises, and supporting materials.

Chapter Descriptions

Building ETL Pipelines with Python consists of the following chapters:

Part 1: Introduction to ETL, Data Pipelines, and Design Principles

Index | Title | Brief Description
Chapter 1 | A Primer on Python and the Development Environment | A brief overview of Python and setting up the development environment with an IDE and Git.
Chapter 2 | Understanding the ETL Process and Data Pipelines | An overview of the ETL process, its significance, and the difference between ETL and ELT.
Chapter 3 | Design Principles for Creating Scalable and Resilient Pipelines | How to implement design patterns using open source Python libraries for robust ETL pipelines.

Part 2: Building ETL Pipelines with Python

Index | Title | Brief Description
Chapter 4 | Sourcing Insightful Data and Data Extraction Strategies | Strategies for obtaining high-quality data from various source systems.
Chapter 5 | Data Cleansing and Transformation | Data cleansing, handling missing data, and applying transformation techniques to achieve the desired data format.
Chapter 6 | Loading Transformed Data | Best practices for data loading activities in ETL pipelines, with data loading techniques for RDBMS and NoSQL databases.
Chapter 7 | Tutorial: Building an End-to-End ETL Pipeline in Python | Guides the creation of an end-to-end ETL pipeline using different tools and technologies, with a PostgreSQL database as an example (a minimal sketch of such a pipeline follows this table).
Chapter 8 | Powerful ETL Libraries and Tools in Python | Creating ETL pipelines using the Python libraries Bonobo, Odo, mETL, and Riko, plus an introduction to the big data tools pETL, Luigi, and Apache Airflow.
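
To make the extract, transform, and load steps of Chapters 4 to 7 concrete, here is a minimal sketch of an end-to-end pipeline using pandas and SQLAlchemy. The file name, table name, and connection string are hypothetical stand-ins, not the book's actual tutorial code:

import pandas as pd
from sqlalchemy import create_engine

# Extract: read raw data from a hypothetical CSV source
raw_df = pd.read_csv('raw_crashes.csv')

# Transform: drop rows missing the join key, then normalize column names
clean_df = raw_df.dropna(subset=['CRASH_RECORD_ID']).rename(columns=str.lower)

# Load: write the result to a hypothetical PostgreSQL database
engine = create_engine('postgresql://user:password@localhost:5432/etl_demo')
clean_df.to_sql('crashes_clean', engine, if_exists='replace', index=False)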

Part 3: Creating ETL Pipelines in AWS

Index | Title | Brief Description
Chapter 9 | A Primer on AWS Tools for ETL Processes | Explains AWS tools for ETL pipelines, including strategies for tool selection, creating a development environment, deployment, testing, and automation.
Chapter 10 | Tutorial: Creating ETL Pipelines in AWS | Guides the creation of ETL pipelines in AWS using Step Functions, Bonobo, EC2, and RDS (a small boto3 sketch follows this table).
Chapter 11 | Building Robust Deployment Pipelines in AWS | Demonstrates using CI/CD tools, including AWS CodePipeline, CodeDeploy, CodeCommit, and Git integration, to create a more resilient ETL pipeline deployment environment.
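
As a small taste of the AWS side, here is a hedged sketch of staging a transformed file in S3 with boto3. The bucket and key names are hypothetical, and credentials are assumed to be configured already (for example, via the AWS CLI); the book's tutorials go much further, into Step Functions, EC2, and RDS:

import boto3

# Upload the transformed file to a hypothetical staging bucket
s3 = boto3.client('s3')
s3.upload_file(
    Filename='crashes_clean.csv',    # local file produced by the transform step
    Bucket='my-etl-staging-bucket',  # hypothetical bucket name
    Key='staging/crashes_clean.csv'  # hypothetical object key
)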

Part 4: Automating and Scaling ETL Pipelines

Index | Title | Brief Description
Chapter 12 | Orchestration and Scaling ETL Pipelines | Covers scaling strategies, creating robust orchestration, and hands-on exercises for scaling and orchestrating ETL pipelines.
Chapter 13 | Testing Strategies for ETL Pipelines | Examines the importance of ETL testing and strategies for catching bugs before production, including unit testing and external testing (see the test sketch after this table).
Chapter 14 | Best Practices for ETL Pipelines | Highlights industry best practices and common pitfalls to avoid when building ETL pipelines.
Chapter 15 | Use Cases and Further Reading | Practical exercises, mini-project outlines, and further reading suggestions, including a case study that builds a robust ETL pipeline in AWS for New York yellow taxi data and US construction market data.
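
To illustrate the kind of unit testing Chapter 13 covers, here is a hedged sketch of a pytest-style test for a simple transform function. Both the function and the test are illustrative, not the book's code:

import pandas as pd

def drop_incomplete_rows(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Illustrative transform: remove rows with a missing join key."""
    return df.dropna(subset=[key]).reset_index(drop=True)

def test_drop_incomplete_rows():
    # Arrange: a small dataframe with one row missing the key
    df = pd.DataFrame({'CRASH_RECORD_ID': ['a', None, 'c'], 'value': [1, 2, 3]})
    # Act: apply the transform under test
    result = drop_incomplete_rows(df, 'CRASH_RECORD_ID')
    # Assert: the incomplete row is gone and all remaining keys are present
    assert len(result) == 2
    assert result['CRASH_RECORD_ID'].notna().all()

Running pytest in the directory containing this file will discover and execute the test automatically.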

Each chapter directory contains code examples, exercises, and any additional resources required for that specific chapter.

Contributing

We encourage our readers to fork and clone this repository to use in tandem with each chapter. If you find any issues or have suggestions for enhancements, please feel free to submit a pull request or open an issue in the repository.

License

The Building ETL Pipelines with Python repository is released by Packt Publishing under the MIT License.

Let's get started!