
Analysis of US Pollution data from 2000-2021 using PySpark


US Pollution Analysis

The US Pollution dataset chosen for this project contains data on pollution across various states and cities in the United States of America. It has around 24 columns and 600K unique observations (rows), covering four major pollutants - SO2, CO, NO2, and O3 - over the years 2000-2021. The aim of this project is to use big data tools like PySpark to work through this large dataset and extract valuable insights.

Setup

The first thing to do is to clone the repository:

$ git clone https://github.com/vedantthapa/pyspark-us-pollution.git
$ cd pyspark-us-pollution

Install the dependencies:

# using pip
$ pip install -r requirements.txt

# using Conda
$ conda create --name <env_name> --file requirements.txt
$ conda activate <env_name>

Launch the notebook:

$ jupyter notebook

Run the code in the notebooks.