DataPreProcessing

Data Preprocessing part 2: https://github.com/musama619/Data-Preprocessing

1. Extended Data Dictionary (EDD)

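A minimal sketch of how an EDD could be built with pandas, assuming it means a per-column table of summary statistics, percentiles, and missing-value counts. The toy DataFrame and its column names (`price`, `rooms`) are hypothetical and not taken from the repository.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "price": [10.0, 12.5, 11.0, 250.0, 9.5, np.nan],
    "rooms": [2, 3, 2, 4, 3, 3],
})

# describe() reports count, mean, std, min, the requested percentiles, and max
# for every numeric column; transposing gives one row per variable.
edd = df.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.99]).T

# Add the number of missing values per variable (aligned on column name).
edd["missing"] = df.isna().sum()

print(edd)
```

Comparing `max` against `99%` (or `min` against `1%`) in this table is a quick way to spot candidate outliers.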

2. Outliers

Reasons:

* Data Entry Error
* Sampling Error
* Measurement Error

Detect outliers using the EDD and visualizations such as a scatterplot, histogram, boxplot, or jointplot.
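As a numeric companion to the plots, the sketch below flags points that a boxplot would conventionally draw beyond its whiskers (more than 1.5 × IQR outside the quartiles). The 1.5 × IQR rule is the usual boxplot convention rather than something stated in these notes, and the sample values are made up.

```python
import pandas as pd

# Hypothetical sample with one obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 95], name="value")

# Quartiles and interquartile range.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Conventional boxplot whisker limits.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the limits are candidate outliers.
outliers = s[(s < lower) | (s > upper)]
print(outliers)
```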

3. Outliers Treatment

  1. Capping and Flooring: impute values above 3 × P99 and below 0.3 × P1, where P99 and P1 are the 99th and 1st percentiles
  2. Other Methods
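Capping and flooring can be sketched with `Series.clip`, assuming values above 3 × P99 are replaced by 3 × P99 and values below 0.3 × P1 by 0.3 × P1. The data here is synthetic, and the rule only makes sense for positive-valued variables, where 0.3 × P1 lies below P1.

```python
import numpy as np
import pandas as pd

# Synthetic positive-valued data with one extreme high and one extreme low value.
rng = np.random.default_rng(0)
values = pd.Series(rng.uniform(10, 20, size=200))
values.iloc[0] = 500.0
values.iloc[1] = 0.5

# 1st and 99th percentiles.
p1 = values.quantile(0.01)
p99 = values.quantile(0.99)

# Limits from the capping and flooring rule.
upper = 3 * p99
lower = 0.3 * p1

# clip() replaces anything outside [lower, upper] with the nearest limit.
capped = values.clip(lower=lower, upper=upper)

print(values.describe())
print(capped.describe())
```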

6. Transform Skewed Data

* Log Function 
  helpful_log = np.log(df.Helpful_Votes + 1)
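A self-contained version of the log transform above, with made-up `Helpful_Votes` values. It uses `np.log1p(x)`, which computes `log(1 + x)` and so matches the expression above; the `+ 1` keeps zero counts from producing `-inf`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed counts; only the column name comes from the notes.
df = pd.DataFrame({"Helpful_Votes": [0, 1, 2, 3, 5, 8, 40, 300]})

# log1p(x) == log(x + 1), so a zero count maps to 0 rather than -inf.
helpful_log = np.log1p(df["Helpful_Votes"])

# Compare skewness before and after the transform.
print("skew before:", df["Helpful_Votes"].skew())
print("skew after: ", helpful_log.skew())
```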

7. Transform and Remove Irrelevant Variables

8. Correlation Analysis and Matrix

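A minimal sketch of a correlation matrix using `DataFrame.corr()`, which computes pairwise Pearson correlations by default. The columns `x`, `y`, and `z` are hypothetical.

```python
import pandas as pd

# Hypothetical data: y is an exact multiple of x, z tends to move the other way.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 3, 4, 1, 2],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))
```

If seaborn is available, `sns.heatmap(corr, annot=True)` is a common way to draw the matrix as a plot.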