Big-Data-using-PySpark

In this project, I explored Spark as an alternative to Pandas for data cleaning and exploration. It is part of the Coursera Project Network. You can check my certification here!

Cleaning and Exploring Big Data using PySpark

Task 1 - Install Spark on Google Colab and load datasets in PySpark
Task 2 - Change column datatypes, trim whitespace and drop duplicates
Task 3 - Remove columns with more null values than a threshold
Task 4 - Group, aggregate and create pivot tables
Task 5 - Rename categories and impute missing numeric values
Task 6 - Create visualizations to gather insights
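
As a rough illustration of Tasks 2 and 3, here is a minimal sketch of how the whitespace, duplicate and null-threshold steps could look in PySpark. It is not the notebook's actual code: the function names `columns_over_null_threshold` and `clean`, and the default threshold of 0.5 (read as a share of rows), are assumptions made for this example.

```python
def columns_over_null_threshold(null_counts, total_rows, threshold=0.5):
    """Return the column names whose share of nulls exceeds `threshold`.

    `null_counts` maps column name -> number of null rows (hypothetical helper).
    """
    if total_rows == 0:
        return []
    return [name for name, n in null_counts.items() if n / total_rows > threshold]


def clean(df, threshold=0.5):
    """Trim string columns, drop duplicate rows, and drop mostly-null columns.

    `df` is expected to be a pyspark.sql.DataFrame.
    """
    # Imported here so the pure-Python helper above works without Spark installed.
    from pyspark.sql import functions as F

    # Task 2: strip leading/trailing whitespace from every string column,
    # then remove rows that are exact duplicates.
    string_cols = [name for name, dtype in df.dtypes if dtype == "string"]
    for name in string_cols:
        df = df.withColumn(name, F.trim(F.col(name)))
    df = df.dropDuplicates()

    # Task 3: count nulls for all columns in one aggregation.
    total_rows = df.count()
    null_counts = df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
    ).first().asDict()

    to_drop = columns_over_null_threshold(null_counts, total_rows, threshold)
    return df.drop(*to_drop) if to_drop else df
```

With a DataFrame already loaded (Task 1), this would be called as `cleaned = clean(df, threshold=0.5)`. Keeping the threshold decision in a plain Python function makes it easy to check without starting a Spark session.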
