ApacheSpark: A Python repository from pydjangoboy

Introduction

This course includes multiple sections. It mainly focuses on the Databricks Data Engineer certification exam. It contains the following tutorials:

Spark SQL ETL
PySpark ETL

DATASETS

All the datasets used in the tutorials are available at: https://github.com/martandsingh/datasets

Spark SQL

This course is the first installment of the Databricks data engineering course. In this course you will learn basic SQL concepts, which include:

Create, Select, Update, Delete tables
Create database
Filtering data
Group by & aggregation
Ordering
SQL joins
Common table expression (CTE)
External tables
Subqueries
Views & temp views
UNION, INTERSECT, EXCEPT keywords
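
Several of the topics above (filtering, grouping, ordering, CTEs, and the set operators) can be sketched in a few lines. The example below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in so that it runs without a Spark cluster, and the tables `orders_2021` and `orders_2022` and their rows are invented for this sketch. The SQL statements use standard syntax that Spark SQL also accepts, but the course notebooks may structure their queries differently.

```python
import sqlite3

# In-memory database as a stand-in for a Spark SQL session (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables and rows, invented for this sketch.
cur.execute("CREATE TABLE orders_2021 (customer TEXT, amount REAL)")
cur.execute("CREATE TABLE orders_2022 (customer TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO orders_2021 VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0), ("alice", 30.0)],
)
cur.executemany(
    "INSERT INTO orders_2022 VALUES (?, ?)",
    [("bob", 50.0), ("carol", 200.0)],
)

# CTE + GROUP BY aggregation + filtering + ordering.
cte_query = """
WITH totals AS (
    SELECT customer, SUM(amount) AS total
    FROM orders_2021
    GROUP BY customer
)
SELECT customer, total
FROM totals
WHERE total > 100
ORDER BY total DESC
"""
big_spenders = cur.execute(cte_query).fetchall()


def customers(set_operator):
    """Combine the customer columns of both tables with a set operator."""
    sql = (
        "SELECT customer FROM orders_2021 "
        f"{set_operator} "
        "SELECT customer FROM orders_2022 "
        "ORDER BY customer"
    )
    return [row[0] for row in cur.execute(sql)]


in_either_year = customers("UNION")      # customers seen in any year
in_both_years = customers("INTERSECT")   # customers seen in both years
only_in_2021 = customers("EXCEPT")       # customers in 2021 but not 2022

print(big_spenders, in_either_year, in_both_years, only_in_2021)
```

UNION, INTERSECT, and EXCEPT all remove duplicate rows from their result, which is why `alice` appears only once in the combined list even though she has two orders in 2021.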

PySpark ETL

This course will teach you how to build ETL pipelines using PySpark. ETL stands for Extract, Transform, Load. We will see how to extract data from various sources, transform it, and finally load the processed data into our destination.

This course includes:

Read files
Schema handling
Handling JSON files
Write files
Basic transformations
Partitioning
Caching
Joins
Missing value handling
Data profiling
Date time functions
String functions
Deduplication
Grouping & aggregation
User defined functions
Ordering data
Case study - sales order analysis
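
The overall shape of an ETL job (Extract, Transform, Load) can be sketched as three small functions. This is a minimal illustration in plain Python so that it runs without a Spark installation; it is not the code from the course notebooks, which use PySpark. The sample data, the column names, and the function names `extract`, `transform`, and `load` are all assumptions made for this sketch.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical raw input: a duplicated order and a missing amount.
RAW_CSV = """order_id,region,amount
1,north,100.0
2,south,
2,south,
3,north,50.5
4,south,20.0
"""


def extract(text):
    """Extract: read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: deduplicate, fill missing values, aggregate per region."""
    seen_ids = set()
    totals = defaultdict(float)
    for row in rows:
        # Deduplication (in PySpark, roughly the role of dropDuplicates).
        if row["order_id"] in seen_ids:
            continue
        seen_ids.add(row["order_id"])
        # Missing value handling (roughly the role of fillna).
        amount = float(row["amount"]) if row["amount"] else 0.0
        # Grouping & aggregation (roughly the role of groupBy + sum).
        totals[row["region"]] += amount
    # Ordering: sort the output by region name.
    return [
        {"region": region, "total_amount": total}
        for region, total in sorted(totals.items())
    ]


def load(records, sink):
    """Load: write one JSON document per line to the destination."""
    for record in records:
        sink.write(json.dumps(record) + "\n")


sink = io.StringIO()  # stand-in for a file or table destination
records = transform(extract(RAW_CSV))
load(records, sink)
print(sink.getvalue())
```

Keeping the three stages as separate functions makes each one easy to test on its own; the same separation carries over when the stages are written against Spark DataFrames.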

You can download all the notebooks from our GitHub repo: https://github.com/martandsingh/ApacheSpark

Facebook: https://www.facebook.com/codemakerz

Email: martandsays@gmail.com

SETUP folder

You will see the initial_setup & clean_up notebooks called in every notebook. It is mandatory to run both in the defined order. The initial_setup notebook creates all the tables and the database required for the demo. After you finish your notebook, run the clean_up notebook, which removes all the database objects it created.
