This is a public repo documenting the "best practices" of writing PySpark code that I have learned from working with PySpark for 3 years. It focuses mainly on the Spark DataFrames and SQL library.
If you notice any typos, spelling mistakes, grammatical errors, or other possible improvements, feel free to create a PR and I'll review it 😁 (you'll most likely be right).

If there are any topics you'd like me to cover, please create an issue describing the topic and I'll try my best to address it 😁.
- 1.1 - Useful Material
- 2.1.1 - Struct Types (`StructType`)
- 2.1.2 - Arrays and Lists (`ArrayType`)
- 2.1.3 - Maps and Dictionaries (`MapType`)
- 2.1.4 - Decimals and Why did my Decimals overflow :( (`DecimalType`)
- 2.2.1 - Looking at Your Data (`collect`/`head`/`take`/`first`/`toPandas`/`show`)
- 2.2.2 - Selecting a Subset of Columns (`drop`/`select`)
- 2.2.3 - Creating New Columns and Transforming Data (`withColumn`/`withColumnRenamed`)
- 2.2.4 - Constant Values and Column Expressions (`lit`/`col`)
- 2.2.5 - Casting Columns to a Different Type (`cast`)
- 2.2.6 - Filtering Data (`where`/`filter`/`isin`)
- 2.2.7 - Equality Statements in Spark and Comparisons with Nulls (`isNotNull()`/`isNull()`)
- 2.2.8 - Case Statements (`when`/`otherwise`)
- 2.2.9 - Filling in Null Values (`fillna`/`coalesce`)
- 2.2.10 - Spark Functions aren't Enough, I Need my Own! (`udf`/`pandas_udf`)
- 2.2.11 - Unionizing Multiple Dataframes (`union`)
- 2.2.12 - Performing Joins (clean one) (`join`)
- 2.3.1 - One to Many Rows (`explode`)
- 2.3.2 - Range Join Conditions (WIP) (`join`)
- 4.1 - Clean Aggregations
- 6.1.1 - Understanding how Spark Works
- 7.1.1 - Filter Pushdown
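As a quick preview of the DataFrame topics listed above, here is a minimal sketch that strings a few of them together; the column names and sample data are made up purely for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("readme-preview").getOrCreate()

# Hypothetical sample data; the column names and values are made up.
df = spark.createDataFrame(
    [("a", 1, None), ("b", 2, 5), ("c", 3, None)],
    ["id", "x", "y"],
)

result = (
    df
    # 2.2.9 - fill nulls in `y` with a constant (coalesce + lit)
    .withColumn("y_filled", F.coalesce(F.col("y"), F.lit(0)))
    # 2.2.8 - case statement (when/otherwise)
    .withColumn("bucket", F.when(F.col("x") > 1, "big").otherwise("small"))
    # 2.2.6 - filtering (where/isin)
    .where(F.col("id").isin("a", "b"))
    # 2.2.2 - selecting a subset of columns
    .select("id", "y_filled", "bucket")
)

result.show()
```

Each of these calls is covered in more depth in its own section.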