Data Mining Project, MCS-203

Data Mining Proposal

By: Abhinav (1893891), Ayush (1893897), Divesh (1893900), Vrinda (1722229, Dept. of Stats)

Report PDF

Report

Dataset

Orissa Pollution Dataset

Dimensions of dataset: (2393, 13)

The dataset gives details of ambient air quality in terms of air quality parameters such as sulfur dioxide (SO2), nitrogen dioxide (NO2), Respirable Suspended Particulate Matter (RSPM) and Suspended Particulate Matter (SPM).

Source (main pollution dataset)

Data.Gov

Auxiliary Data Source (for OPD report)

International Federation of Health Information Management Associations

Vision

  • To analyze the correlation between pollutants and area.
  • To apply statistical techniques for visualizing patterns in the air quality index (AQI).
  • To predict the type of area from the AQI.

Techniques

  • Exploratory data analysis using visualization tools.
  • Clustering
  • Predictive modeling

Preliminary Analysis

The data records pollution levels across many areas of Orissa. The Agency attribute is of no use for our analysis because it holds a constant value. After examining the data, we also found that the year quarter and the season could be useful, so we wrote R code to add those attributes to the dataset that we downloaded from data.gov.
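The preprocessing described above can be sketched roughly as follows. This is a minimal illustration on a made-up data frame, not the project's actual script: the column name `SamplingDate`, the toy values, and the month-to-season mapping are all assumptions.

```r
# Toy stand-in for the data.gov file; column names and values are assumptions.
pollution <- data.frame(
  SamplingDate = as.Date(c("2015-01-15", "2015-04-20", "2015-07-05", "2015-11-30")),
  Agency = "Same agency",          # constant-valued attribute
  SO2 = c(4.2, 6.1, 3.8, 5.0)
)

# Drop the constant attribute, since it carries no information.
pollution$Agency <- NULL

# Month number (1-12) extracted from the sampling date.
month_num <- as.integer(format(pollution$SamplingDate, "%m"))

# Calendar quarter: months 1-3 -> Q1, 4-6 -> Q2, and so on.
pollution$Quarter <- paste0("Q", (month_num - 1) %/% 3 + 1)

# Season lookup indexed by month; these boundaries are an assumed,
# illustrative mapping and may differ from the one used in the project.
season_by_month <- c("Winter", "Winter", "Summer", "Summer", "Summer", "Monsoon",
                     "Monsoon", "Monsoon", "Monsoon", "Post-monsoon",
                     "Post-monsoon", "Winter")
pollution$Season <- season_by_month[month_num]

print(pollution)
```

A lookup vector indexed by month keeps the season rule in one place, so changing the assumed boundaries only means editing `season_by_month`.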

Preprocessing

Date: 4th Feb, 2019

  • Season column added, derived from the sampling date according to the seasons in Orissa.
  • Type of location column prepared for classification or regression.

Date: 5th Feb, 2019

  • Decision tree analysis [Figure: Decision Tree]
  • Classes in the pollution dataset [Figure: Classes]

Tasks

  1. To make a supervised learning model for TypeOfLocation.
  2. Clustering.
  3. Visualization, after calculating AQI.
  4. Exploratory data analysis.

Visualization

Date: 11th Feb, 2019

  • Industry-wise pollutant visualization.

Date: 20th Feb, 2019

  • Visualization part
  • NO2, Rural vs Industrial
  • SO2, Rural vs Industrial
  • RSPM.PM10, Rural vs Industrial

From these plots we can infer that industrial pollution is the major factor behind SO2 and NO2 levels, while the RSPM.PM10 concentration is almost the same for industrial and rural areas in Orissa.
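A comparison of this kind could be reproduced along the following lines. This is only a sketch: the values below are invented to mimic the pattern described, and the column names `TypeOfLocation`, `SO2`, `NO2` and `RSPM.PM10` are assumed to match the dataset.

```r
# Made-up readings, for illustration only; not values from the real dataset.
air <- data.frame(
  TypeOfLocation = c("Industrial", "Industrial", "Industrial",
                     "Rural", "Rural", "Rural"),
  SO2       = c(10, 12, 14, 4, 5, 6),
  NO2       = c(30, 32, 34, 14, 15, 16),
  RSPM.PM10 = c(90, 100, 110, 88, 100, 112)
)

# Mean of each pollutant per type of location.
group_means <- aggregate(cbind(SO2, NO2, RSPM.PM10) ~ TypeOfLocation,
                         data = air, FUN = mean)
print(group_means)

# Side-by-side boxplot of one pollutant, split by type of location.
boxplot(SO2 ~ TypeOfLocation, data = air,
        main = "SO2 by type of location", ylab = "SO2")
```

Group means give a quick numeric check; the boxplot shows whether the two distributions actually overlap.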

Date: 25th Feb, 2019

  • Search for a new dataset and associate it with the data we have.

Date: 27th Feb, 2019

Scatter plots

[Figures: scatter plots for NO2, SO2, PM2.5 and PM10]

First Analysis

[Figure: first analysis]

From this plot we can deduce that the higher the PM2.5 level, the greater the number of emphysema cases.

Date: 1st March, 2019

Yearly Analysis

[Figure: yearly analysis]

Here we had to normalize the data, since the ranges of the pollutant levels differ vastly.
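The source does not say which normalization was used. As one plausible option, a min-max rescaling to the range [0, 1] could look like the sketch below; the helper `min_max`, the column names and the yearly values are all hypothetical.

```r
# Rescale a numeric vector to [0, 1]; assumes the vector is not constant
# (a constant vector would cause a division by zero).
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up yearly averages with very different ranges.
yearly <- data.frame(
  Year = 2011:2015,
  NO2  = c(10, 12, 15, 20, 18),
  PM10 = c(80, 95, 120, 150, 140)
)

# Normalize each pollutant column independently, leaving Year untouched.
yearly_scaled <- yearly
yearly_scaled[c("NO2", "PM10")] <- lapply(yearly[c("NO2", "PM10")], min_max)

print(yearly_scaled)
```

After rescaling, both series share the same axis range, so their yearly trends can be drawn on one plot.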

[Figure: NO2 levels and Nephrology cases]

From this plot we cannot say whether NO2 levels cause an increase or a decrease in the number of Nephrology cases.

Boxplots

NO2

[Figures: NO2 boxplots]

SO2

[Figures: SO2 boxplots]

PM10

[Figures: PM10 boxplots]

PM2.5

[Figures: PM2.5 boxplots]

Piechart

[Figures: pie charts, including an animated GIF]

Geo Mapping

NO2 Geo Mapping

SO2 Geo Mapping

PM2.5 Geo Mapping

PM10 Geo Mapping

Corrplot

[Figure: correlation plot]
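A correlation plot like this is typically built from a pairwise correlation matrix. The sketch below computes one on synthetic data with base R; the data are generated only for illustration. Drawing it with the `corrplot` package is an assumption about the tooling, so that call is guarded and skipped when the package is not installed.

```r
set.seed(1)

# Synthetic readings: NO2 is constructed to track SO2, PM10 is independent.
n <- 50
so2 <- runif(n, 2, 20)
readings <- data.frame(
  SO2  = so2,
  NO2  = 2 * so2 + rnorm(n),
  PM10 = runif(n, 50, 150)
)

# Pearson correlation between every pair of pollutant columns.
corr <- cor(readings, use = "complete.obs", method = "pearson")
print(round(corr, 2))

# Optional plot, only if the corrplot package happens to be available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr)
}
```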

Prediction of city depending upon the pollution level

[Figure: prediction plot]

Clustering

We applied k-means clustering to the pollutant values so that regions with similar pollutant levels fall into the same cluster.

For SO2

For NO2

For RSPM10

For PM2.5
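Per-pollutant clustering of this kind can be sketched with base R's `kmeans`. The station names, readings and the choice of two clusters below are assumptions made for the example; the source does not state the number of clusters used.

```r
set.seed(42)

# Made-up SO2 averages for six hypothetical monitoring stations.
stations <- data.frame(
  Station = paste0("S", 1:6),
  SO2 = c(3, 4, 5, 20, 22, 21)
)

# k-means on a single pollutant; nstart repeats the random initialization
# several times and keeps the best solution.
fit <- kmeans(stations$SO2, centers = 2, nstart = 10)
stations$Cluster <- fit$cluster

print(stations)
print(fit$centers)
```

The cluster labels (1 or 2) are arbitrary; only the grouping matters. Repeating the same call for NO2, PM10 and PM2.5 would give one grouping per pollutant.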

Decision Tree

Decision tree for predicting the type of location from training set

Confusion matrix (0 = Industrial, 1 = Rural)

Confusion Matrix details
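The train, predict and confusion-matrix workflow might look like the sketch below. It assumes the `rpart` package (the source does not name the tree implementation), uses synthetic data, and follows the 0 = Industrial, 1 = Rural coding; the 70/30 split and the predictor columns are also assumptions.

```r
library(rpart)
set.seed(7)

# Synthetic data: industrial sites (0) get higher SO2 and NO2 than rural (1).
n <- 100
air <- data.frame(
  TypeOfLocation = factor(rep(c(0, 1), each = n), levels = c(0, 1)),
  SO2 = c(rnorm(n, 15, 2), rnorm(n, 5, 2)),
  NO2 = c(rnorm(n, 30, 4), rnorm(n, 15, 4))
)

# 70/30 split into training and held-out rows.
train_idx <- sample(nrow(air), size = 0.7 * nrow(air))
train_set <- air[train_idx, ]
test_set  <- air[-train_idx, ]

# Classification tree fitted on the training set only.
tree <- rpart(TypeOfLocation ~ SO2 + NO2, data = train_set, method = "class")

# Predicted class labels for the held-out rows.
pred <- predict(tree, newdata = test_set, type = "class")

# Confusion matrix: rows are actual classes, columns are predicted classes.
conf_mat <- table(Actual = test_set$TypeOfLocation, Predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

The diagonal of the confusion matrix counts correct predictions, so accuracy is the diagonal sum divided by the total.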

Logistic Regression on Type Of Location
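A logistic regression for the same target can be sketched with base R's `glm`. The data are synthetic, and the single predictor, the `Rural` indicator column and the 0.5 threshold are assumptions chosen for the example.

```r
set.seed(11)

# Synthetic data: Rural = 1 for rural sites, 0 for industrial sites.
# The two SO2 distributions overlap, so the classes are not perfectly separable.
n <- 100
air <- data.frame(
  Rural = rep(c(0, 1), each = n),
  SO2 = c(rnorm(n, 12, 3), rnorm(n, 7, 3))
)

# Binomial GLM with the default logit link.
model <- glm(Rural ~ SO2, data = air, family = binomial)
print(summary(model)$coefficients)

# Fitted probability of "rural" for each row, thresholded at 0.5.
prob_rural <- predict(model, type = "response")
pred <- as.integer(prob_rural > 0.5)

# In-sample accuracy; a held-out set would give a fairer estimate.
accuracy <- mean(pred == air$Rural)
print(accuracy)
```

A negative SO2 coefficient here means higher SO2 lowers the estimated probability that a site is rural, which matches how the toy data were generated.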