A curated directory of awesome things related to Apache Beam. Inspired by Awesome Flink and Awesome Hadoop.
- Apache Beam in Kotlin to reduce boilerplate. Uses Kotlin's language features to make the Beam Java SDK less verbose.
- Scio - Scala wrapper for Apache Beam. Wraps Beam functionality in a simple Scala API.
- thurber - Clojure wrapper for Apache Beam. Brings Clojure's powerful, expressive toolkit (destructuring, immutability, REPL, async tools, etc.) to Apache Beam.
- (Pending)
- Apache Zeppelin - Web-based notebook that enables interactive data analytics with pluggable backends, plotting, etc.
- TensorFlow Transform - A library for preprocessing data with TensorFlow. It is built on Beam, so it inherits Beam's portability (i.e., pipelines can run on any supported runner).
- (Pending)
Various resources, such as books, websites and articles.
- Error Handling Elements in Apache Beam Pipelines - A blog post detailing how to handle individual elements that hit errors during processing without failing the rest of the pipeline.
- Beam Documentation
- Java SDK
- Python SDK
- Go SDK
- Beam Wiki
- Beam Quickstarts: Java, Python, Go.
- Apache Beam Katas - Interactive Beam coding exercises.
- Apache Beam | A Hands-On course to build Big data Pipelines.
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing - Paper introducing the Dataflow model, the predecessor to Beam. (2015)
- Official Beam Blog
- Python Development Environments for Beam on GCP - How to set up a development environment for Python Dataflow jobs.
- Java Development Environments for Beam on GCP - How to set up a development environment for Java Dataflow / Beam jobs.
- Coding Apache Beam in your Web Browser and Running it in Cloud Dataflow - How to create and run a Beam Pipeline on Dataflow using Code Editor.
- Realtime Data Processing with Apache Beam at Dailymotion
- So you want to write a Beam SDK? - Robert Bradshaw [slides] - A talk about the pieces of an SDK and the runner API.
- Robust, performant and modular APIs for data ingestion with Apache Beam - Eugene Kirpichov, Ismael Mejia [slides] - Important talk about IO, and what we think is the future of IO for Big Data systems.
- SplittableDoFn - A Transform Developer's perspective - Alex Van Boxel [slides].
- Large Scale Landuse Classification of Satellite Imagery - Suneel Marthi [slides] [code] - Excellent talk using Beam's Python SDK to run machine learning over a dataset of images.
- Beam me up, Samza! - The Beam runner for Samza - Xinyu Liu [slides].
- Python Streaming Pipelines with Beam on Flink - Aljoscha Krettek, Thomas Weise [slides] - A talk about how Beam enables Python pipelines to run on top of Flink.
- Spark Runner (R)evolution - David Moravek, Ismaël Mejía [slides] - A talk about the Spark runner's implementation, performance improvements, and roadmap.