This repository contains PySpark examples.
Table of Contents
- Spark Installation On Windows
  This document is a step-by-step guide to installing Spark on Windows.
- Spark Installation On Ubuntu
  This document is a step-by-step guide to installing Spark on Ubuntu. To set up PySpark, refer to this document.
- PySpark DataFrames
  [Source Code]
  This example explains various data wrangling methods on PySpark DataFrames.
  - Load the data [iris.csv]
  - Display DataFrame's Columns
  - Count of DataFrame's Columns
  - Count of DataFrame's Rows
  - Rename Columns
  - Cell Selection
  - Column Selection
  - PythonStyleQuery
  - Dropping Columns
  - Dropping Rows (on condition)
  - Performing SQL Queries
  - Aggregate Methods
  - Sorting
  - Describe the Data
  - Handling Missing Values - Replace with Mean
  - Conversion: Spark DataFrames --> Pandas DataFrames
  - Conversion: Pandas DataFrames --> Spark DataFrames
- PySpark RDD
  [Source Code]
  This example explains various data wrangling methods on PySpark RDDs.
  - DataFrames and RDDs
  - Conversion: RDD --> DataFrame
  - Conversion: DataFrame --> RDD
  - Load the data [captains_ODI.csv]
- PySpark Hive on Azure HDInsight
  This document is a step-by-step guide to PySpark and Hive integration on Azure HDInsight.