In this project, I explored Spark as an alternative to Pandas for data cleaning and exploration. It is part of the Coursera Project Network. You can check my certificate here!
- Task 1 - Install Spark on Google Colab and load datasets in PySpark
- Task 2 - Change column datatypes, strip whitespace and drop duplicates
- Task 3 - Remove columns whose share of null values exceeds a threshold
- Task 4 - Group, aggregate and create pivot tables
- Task 5 - Rename categories and impute missing numeric values
- Task 6 - Create visualizations to gather insights
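
As an illustration of Task 3, here is a minimal sketch of how dropping mostly-null columns could look in PySpark. It is not the course's actual code: the function names `columns_over_null_threshold` and `drop_sparse_columns` are hypothetical, and the default threshold of 0.5 is an assumption. The PySpark import sits inside the second function so that the pure-Python helper can be tried without Spark installed.

```python
def columns_over_null_threshold(null_counts, total_rows, threshold=0.5):
    """Return the column names whose fraction of nulls exceeds `threshold`.

    `null_counts` maps column name -> number of null rows (hypothetical helper).
    """
    if total_rows == 0:
        return []
    return [name for name, n in null_counts.items() if n / total_rows > threshold]


def drop_sparse_columns(df, threshold=0.5):
    """Drop columns of a PySpark DataFrame that are mostly null (sketch)."""
    # Imported lazily so the helper above works without Spark installed.
    from pyspark.sql import functions as F

    total_rows = df.count()
    if total_rows == 0:
        return df

    # Single aggregation pass: count the nulls in every column.
    # Assumes column names contain no dots or other special characters.
    null_counts = df.select(
        [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
    ).first().asDict()

    to_drop = columns_over_null_threshold(null_counts, total_rows, threshold)
    return df.drop(*to_drop) if to_drop else df


# Pure-Python check of the threshold logic with made-up counts.
print(columns_over_null_threshold({"age": 1, "cabin": 8, "fare": 0}, total_rows=10))
# → ['cabin']
```

With a real DataFrame, `drop_sparse_columns(df, threshold=0.3)` would return a new DataFrame without the columns that are more than 30% null. A column sitting exactly at the threshold is kept, since the comparison is strict.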