This is a public repo documenting the "best practices" of writing PySpark code that I have learned from working with PySpark for 3 years. It focuses mainly on the Spark DataFrames and SQL library.
If you notice any typos, spelling mistakes, grammatical errors, or other possible improvements, feel free to create a PR and I'll review it 😁 (you'll most likely be right).

If there are any topics you'd like me to cover, please create an issue describing the topic and I'll try my best to address it 😁.
- 1.1 - Useful Material
- 2.1.1 - Struct Types (`StructType`)
- 2.1.2 - Arrays and Lists (`ArrayType`)
- 2.1.3 - Maps and Dictionaries (`MapType`)
- 2.1.4 - Decimals and Why did my Decimals overflow :( (`DecimalType`)
- 2.2.1 - Looking at Your Data (`collect`/`head`/`take`/`first`/`toPandas`/`show`)
- 2.2.2 - Selecting a Subset of Columns (`drop`/`select`)
- 2.2.3 - Creating New Columns and Transforming Data (`withColumn`/`withColumnRenamed`)
- 2.2.4 - Constant Values and Column Expressions (`lit`/`col`)
- 2.2.5 - Casting Columns to a Different Type (`cast`)
- 2.2.6 - Filtering Data (`where`/`filter`/`isin`)
- 2.2.7 - Equality Statements in Spark and Comparisons with Nulls (`isNotNull()`/`isNull()`)
- 2.2.8 - Case Statements (`when`/`otherwise`)
- 2.2.9 - Filling in Null Values (`fillna`/`coalesce`)
- 2.2.10 - Spark Functions aren't Enough, I Need my Own! (`udf`/`pandas_udf`)
- 2.2.11 - Unionizing Multiple Dataframes (`union`)
- 2.2.12 - Performing Joins (clean one) (`join`)
- 2.3.1 - One to Many Rows (`explode`)
- 2.3.2 - Range Join Conditions (WIP) (`join`)
- 4.1 - Clean Aggregations
- 6.1.1 - Understanding how Spark Works
- 7.1.1 - Filter Pushdown
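As a quick preview of the DataFrame topics listed above, here is a minimal sketch that strings a few of them together; the column names and sample data are made up purely for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("readme-preview").getOrCreate()

# Hypothetical sample data; the column names and values are made up.
df = spark.createDataFrame(
    [("a", 1, None), ("b", 2, 5), ("c", 3, None)],
    ["id", "x", "y"],
)

result = (
    df
    # 2.2.9 - fill nulls in `y` with a constant (coalesce + lit)
    .withColumn("y_filled", F.coalesce(F.col("y"), F.lit(0)))
    # 2.2.8 - case statement (when/otherwise)
    .withColumn("bucket", F.when(F.col("x") > 1, "big").otherwise("small"))
    # 2.2.6 - filtering (where/isin)
    .where(F.col("id").isin("a", "b"))
    # 2.2.2 - selecting a subset of columns
    .select("id", "y_filled", "bucket")
)

result.show()
```

Each of these calls is covered in more depth in its own section.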