Building Big Data Pipelines with Apache Beam

This is the code repository for Building Big Data Pipelines with Apache Beam, published by Packt.

Use a single programming model for both batch and stream data processing

What is this book about?

This book describes both batch processing and real-time processing pipelines. You’ll learn how to implement basic and advanced big data use cases with ease and develop a deep understanding of the Apache Beam model. In addition to this, you’ll discover how the portability layer works and the building blocks of an Apache Beam runner.

This book covers the following exciting features:

Understand the core concepts and architecture of Apache Beam
Implement stateless and stateful data processing pipelines
Use state and timers for real-time event processing
Structure your code for reusability
Use streaming SQL to process real-time data, increasing productivity and data accessibility
Run a pipeline using a portable runner and implement data processing using the Apache Beam Python SDK
Implement Apache Beam I/O connectors using the Splittable DoFn API

If you feel this book is for you, get your copy today!

Instructions and Navigation

All of the code is organized into folders.

The code will look like the following:

// Locate lorem.txt on the classpath and read all of its lines as UTF-8.
ClassLoader loader = FirstPipeline.class.getClassLoader();
String file = loader.getResource("lorem.txt").getFile();
List<String> lines = Files.readAllLines(Paths.get(file), StandardCharsets.UTF_8);
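The fragment above leaves out its imports and enclosing class. The following is a minimal, self-contained sketch of the same file-reading step using only the Java standard library. The names `ReadLinesSketch`, `resourcePath`, and `readLines` are hypothetical and do not come from the book, and the temporary file stands in for `lorem.txt` so that the example runs without any project resources.

```java
import java.io.IOException;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical class name, used for illustration only.
public class ReadLinesSketch {

  // Resolves a classpath resource to a filesystem path.
  // Assumption: the resource is a plain file on disk, not an entry inside a JAR.
  public static Path resourcePath(ClassLoader loader, String name) throws IOException {
    URL url = loader.getResource(name);
    if (url == null) {
      // getResource returns null when the resource is missing.
      throw new IOException("Resource not found on classpath: " + name);
    }
    try {
      return Paths.get(url.toURI());
    } catch (URISyntaxException e) {
      throw new IOException(e);
    }
  }

  // Reads every line of the file into memory, decoding it as UTF-8.
  public static List<String> readLines(Path path) throws IOException {
    return Files.readAllLines(path, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // A temporary file stands in for lorem.txt.
    Path tmp = Files.createTempFile("lorem", ".txt");
    Files.write(tmp, List.of("Lorem ipsum", "dolor sit amet"), StandardCharsets.UTF_8);
    List<String> lines = readLines(tmp);
    System.out.println(lines.size() + " lines read"); // prints "2 lines read"
    Files.delete(tmp);
  }
}
```

Inside a project that ships `lorem.txt` as a resource, `readLines(resourcePath(SomeClass.class.getClassLoader(), "lorem.txt"))` would play the role of the three-line fragment, with an explicit error when the resource is missing.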

This book is for data engineers, data scientists, and data analysts who want to learn how Apache Beam works. Intermediate-level knowledge of the Java programming language is assumed.

The following software and hardware list covers everything you need to run all of the code files in the book (Chapters 1-7).

Software and Hardware List

Chapter	Software required	OS required
1-7	Java 11, Python 3	Windows, Mac OS X, and Linux (Any)
1-7	Bash	Windows, Mac OS X, and Linux (Any)
1-7	Docker	Windows, Mac OS X, and Linux (Any)

We also provide a PDF file with color images of the screenshots and diagrams used in this book.

Get to Know the Author

Jan Lukavský is a freelance big data architect and engineer who is also a committer of Apache Beam. He is a certified Apache Hadoop professional. He is working on open source big data systems combining batch and streaming data pipelines in a unified model, enabling the rise of real-time, data-driven applications.

Download a free PDF

If you have already purchased a print or Kindle version of this book, you can get a DRM-free PDF version at no cost.
Use the following link to claim your free PDF:

https://packt.link/free-ebook/9781800564930
