This repository contains PySpark examples.

Table of Contents


  • Spark Installation On Windows
    This document is a step-by-step guide to installing Spark on Windows.

  • Spark Installation On Ubuntu
    This document is a step-by-step guide to Spark installation on Ubuntu. To set up PySpark, refer to this document.

  • PySpark DataFrames
    This example demonstrates various data wrangling methods on PySpark DataFrames. [Source Code]
    • Load the Data [iris.csv]
    • Display Dataframe's Columns
    • Count of Dataframe's Columns
    • Count of Dataframe's Rows
    • Rename Columns
    • Cell Selection
    • Column Selection
    • Python-Style Query
    • Dropping Columns
    • Dropping Rows (on Condition)
    • Performing SQL Queries
    • Aggregate Methods
    • Sorting
    • Describe the Data
    • Handling Missing Values - Replace with Mean
    • Conversion: Spark Dataframes --> Pandas Dataframes
    • Conversion: Pandas Dataframes --> Spark Dataframes
  • PySpark RDD
    This example demonstrates various data wrangling methods on PySpark RDDs. [Source Code]
    • Dataframes and RDD
    • Conversion: RDD --> Dataframe
    • Conversion: Dataframe --> RDD
    • Load the Data [captains_ODI.csv]

  • PySpark Hive on Azure HDInsight
    This document is a step-by-step guide to integrating PySpark with Hive on Azure HDInsight.