Outlier-Detection-PYSPARK

This repository shows how to identify and remove outliers using PySpark.

Primary language: Jupyter Notebook

Attribute Information:

  1. FRESH: annual spending (m.u.) on fresh products (Continuous)
  2. MILK: annual spending (m.u.) on milk products (Continuous)
  3. GROCERY: annual spending (m.u.) on grocery products (Continuous)
  4. FROZEN: annual spending (m.u.) on frozen products (Continuous)
  5. DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
  6. DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
  7. CHANNEL: customers’ channel, either Horeca (Hotel/Restaurant/Café) or Retail (Nominal)
  8. REGION: customers’ region, one of Lisbon, Oporto or Other (Nominal)

REGION Frequency

  1. Lisbon 77
  2. Oporto 47
  3. Other Region 316

Total 440

CHANNEL Frequency

  1. Horeca 298
  2. Retail 142

Total 440

Observations:

  1. The initial dataset contains 440 records with 8 features.
  2. There are 6 numerical features and 2 categorical features.
  3. Every column in the original Spark DataFrame is of string type.
  4. The numerical features were converted to IntegerType using the cast function.
  5. A custom function was created to identify outliers in each record.
  6. Applying this function gives the total number of outliers in each record, counted across the features.
  7. The dataset is filtered to keep records whose total outlier count is <= 1, which eliminates the records with more than one outlier.
  8. The new DataFrame contains 399 records after outlier removal, against 440 records in the initial DataFrame.
  9. The outliers in the original dataset and in the new dataset after outlier removal are compared using a box plot.
  10. Some outliers are still present in the dataset, even after the majority of them have been removed.
  11. This suggests that the data is skewed; however, the majority of the outliers are removed.
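
Steps 4 to 7 above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. The README does not say which rule the custom function uses, so the 1.5 x IQR fences here are an assumption. The names `iqr_bounds`, `flag_and_filter` and the `total_outliers` column are hypothetical.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the IQR rule (assumed rule)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_and_filter(df, numeric_cols, max_outliers=1):
    """Cast columns to integers, count outliers per record, then filter."""
    # Imported here so iqr_bounds stays usable without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    # Step 4: the raw columns are strings, so cast them to IntegerType.
    for c in numeric_cols:
        df = df.withColumn(c, F.col(c).cast(IntegerType()))

    # Steps 5-6: build one 0/1 flag per feature and sum the flags per record.
    total = F.lit(0)
    for c in numeric_cols:
        # relativeError=0.0 requests exact quartiles.
        q1, q3 = df.approxQuantile(c, [0.25, 0.75], 0.0)
        lower, upper = iqr_bounds(q1, q3)
        is_outlier = (F.col(c) < lower) | (F.col(c) > upper)
        total = total + F.when(is_outlier, 1).otherwise(0)
    df = df.withColumn("total_outliers", total)

    # Step 7: keep records with at most `max_outliers` flagged features.
    return df.filter(F.col("total_outliers") <= max_outliers)
```

Given a DataFrame `df` holding the six spending columns, a call such as `flag_and_filter(df, ["FRESH", "MILK", "GROCERY", "FROZEN", "DETERGENTS_PAPER", "DELICATESSEN"])` would return the filtered records. The column names must match whatever headers the loaded file actually uses.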
