Big-Data-using-PySpark

In this project, I explored Spark as an alternative to Pandas for data cleaning and exploration. It is part of the Coursera Project Network. You can check my certification here!

Cleaning and Exploring Big Data using PySpark

  • Task 1 - Install Spark on Google Colab and load datasets in PySpark
  • Task 2 - Change column datatypes, remove whitespace and drop duplicates
  • Task 3 - Remove columns whose share of Null values exceeds a threshold
  • Task 4 - Group, aggregate and create pivot tables
  • Task 5 - Rename categories and impute missing numeric values
  • Task 6 - Create visualizations to gather insights