From Simple Transformations to Highly Efficient Jobs

This code base is the accompanying material for the Spark training prepared and taught by David Vrba. It is a two-day course that covers Apache Spark from three different perspectives:

The core part is the programming interface of the DataFrame API (in Spark 3.0)
The internal processes of Spark SQL and the execution layer, together with various performance tips
The APIs of ML Pipelines and GraphFrames for advanced analytics

The course is offered in Python, and a Scala version is being prepared. The Python version is taught in a Jupyter notebook environment, while the Scala version uses Apache Zeppelin. See the installation notes for the complete stack used throughout the course.

Training Format

2 days
50% theory, 50% hands-on
Language: Python

The objectives of the training are to learn:

Basic concepts of Apache Spark and distributed computing
How to use the DataFrame API in Spark for ETL jobs or ad hoc data analysis
How the DataFrame API works under the hood
How the optimization engine works in Spark
What happens under the hood when you submit a query for execution
How a Spark application is executed
How to understand query plans and use that information to optimize queries
Basic concepts of the ML Pipelines library for machine learning
Basic concepts of the GraphFrames library for graph processing
How to process data in (nearly) real time in Spark (Structured Streaming)
What is new in Spark 2.3, 2.4, and 3.0

Training Outline

Introduction to Apache Spark
- High level introduction to Spark
- Introduction to Spark architecture
- Spark APIs: high level vs low level vs internal APIs
Structured APIs in Spark
- Basic concepts of DataFrame API
- DataFrame, Row, Column
- Operations in Spark SQL: transformations, actions
- Working with DataFrame: creating a DataFrame and basic transformations
- Working with different data types (Integer, String, Date, Timestamp, Boolean)
- Filtering
- Conditions
- Dealing with null values
- Joins
Lab I
- Simple ETL
Advanced transformations with DataFrames
- Aggregations and Window functions
- User Defined Functions
- Higher Order Functions and complex data types (new in Spark 2.4)
Lab II
- Analyzing data using DataFrame API
Metastore and Tables
- Catalog API
- Table management
- Saving data
- Caveats to be careful about
Lab III
- Saving data and working with tables
Introduction to internal processes in Spark SQL
- Catalyst - Optimization engine in Spark
- Logical Planning
- Physical Planning
Execution Layer
- Introduction to low level APIs: RDDs
- Structure of Spark job (Stages, Tasks, Shuffle)
- DAG Scheduler
- Lifecycle of Spark application
Lab IV
- Spark UI
Introduction to performance tuning in Spark
- Data persistence: caching, checkpointing
- Bucketing & Partitioning
- Most common bottlenecks in Spark applications
- Optimization tips
Introduction to advanced analytics in Spark
- Machine learning: basic concepts of ML Pipelines
- Graph processing: basic concepts of GraphFrames library
Lab V
- Machine learning & Graph processing
Structured Streaming
- Basic concepts of streaming in Spark
- Stateful vs stateless transformations
- Event time processing
- What a watermark is and how to use it to close the state
- Real time vs near real time processing
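
As a taste of the Structured Streaming part of the outline, the idea of using a watermark to close state can be sketched in plain Python. This is a simplified model for illustration only, not Spark code and not part of the course material: the function names, the 10-second tumbling window, and the 5-second allowed delay are all assumptions, and the watermark here advances per event, whereas Spark advances it between micro-batches.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # tumbling window length (illustrative assumption)
DELAY_SECONDS = 5    # allowed lateness behind the newest event (assumption)


def window_start(event_time):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def process(events):
    """Count events per window and close windows once the watermark passes them.

    events: event times in seconds, given in arrival order.
    Returns (closed, dropped, still_open):
      closed     - final counts for windows that the watermark has passed
      dropped    - events that arrived after their window was already closed
      still_open - counts for windows that are still kept in state
    """
    state = defaultdict(int)  # window start -> running count
    closed = {}
    dropped = []
    max_event_time = None

    for event_time in events:
        # Watermark = newest event time seen so far minus the allowed delay.
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - DELAY_SECONDS

        start = window_start(event_time)
        if start + WINDOW_SECONDS <= watermark:
            # The window for this event ended before the watermark: too late.
            dropped.append(event_time)
            continue
        state[start] += 1

        # Close every window whose end is at or before the watermark,
        # which frees its state and makes its count final.
        for s in sorted(state):
            if s + WINDOW_SECONDS <= watermark:
                closed[s] = state.pop(s)

    return closed, dropped, dict(state)


closed, dropped, still_open = process([1, 3, 12, 4, 16, 2, 27])
print(closed, dropped, still_open)
```

In this run the event at time 4 arrives out of order but is still counted, because its window has not been closed yet. The event at time 2 arrives after the watermark has moved past its window, so it is dropped. Without a watermark the state for every window would have to be kept forever, since a late event could always still arrive.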

For more information about the training, you can contact the lecturer directly via LinkedIn.

Data for the training are downloaded from the Stack Exchange database.
