Cleaning and Exploring Big Data using PySpark

  • Install Spark on Google Colab and load datasets in PySpark
  • Change column data types, trim whitespace, and drop duplicates
  • Remove columns whose share of null values exceeds a threshold
  • Group, aggregate and create pivot tables
  • Rename categories and impute missing numeric values
  • Create visualizations to gather insights