# KaggleBNP

This project is by three students with a love for data science.


# Project process

## Take R as the tool for this project

  1. Load the dataset. The "readr" package is a good choice.

  2. fread from "data.table" can also be used and is usually the fastest.

  3. read.csv is also fine, but it is slow and only suitable for small datasets. A minimal loading sketch follows.
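
A minimal loading sketch, assuming the competition files train.csv and test.csv are in the working directory:

```R
library(readr)
library(data.table)

# Option 1: readr::read_csv (fast, returns a tibble)
train <- read_csv("train.csv")
test  <- read_csv("test.csv")

# Option 2: data.table::fread, usually the fastest;
# data.table = FALSE returns a plain data frame instead
# train <- fread("train.csv", data.table = FALSE)
# test  <- fread("test.csv",  data.table = FALSE)
```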

  4. Convert the categorical data to numeric values. This is mainly for visualizing NAs (optional for XGBoost). The R code is below:

```R
# Convert character columns to factors with levels shared across
# train and test, then keep the integer codes so the values are
# numeric and encoded consistently in both datasets
for (f in names(train)) {
  if (class(train[[f]]) == "character") {
    levels <- unique(c(train[[f]], test[[f]]))
    train[[f]] <- as.integer(factor(train[[f]], levels = levels))
    test[[f]]  <- as.integer(factor(test[[f]],  levels = levels))
  }
}
```
    
  5. Visualize the NAs. Use the "VIM" package to explore the structure of the missing values (Missing Not At Random); a sketch is below.
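
A minimal sketch using VIM's aggr(), which plots the proportion of missing values per variable and the combinations in which they occur together:

```R
library(VIM)

# Left panel: share of NAs per variable; right panel: the most
# common missingness patterns across variables
aggr(train, numbers = TRUE, sortVars = TRUE,
     labels = names(train), cex.axis = 0.6)
```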

  1. Do some exploratory data analysis, e.g. as sketched below.
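
A minimal first-look sketch (the label column is assumed to be named target, as in the BNP competition data):

```R
# Dimensions, column types, class balance, and overall NA share
dim(train)
str(train, list.len = 10)
table(train$target)
mean(is.na(train))
```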

  2. Analyze duplicate variables. Manual inspection, however, is not realistic when there are thousands of variables; a programmatic check is sketched below.
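
A sketch for flagging exactly duplicated columns programmatically:

```R
# duplicated() over the list of columns marks every column whose
# contents are identical to an earlier one
dup.cols <- duplicated(as.list(train))
names(train)[dup.cols]       # the duplicate variables
train <- train[, !dup.cols]  # drop them
```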

  3. Imputation and feature engineering.

  4. Find and remove redundant variables. Use a correlation filter to find highly correlated variables; note that correlation only applies to numeric columns. The code is below:

```R
library(corrplot)
library(caret)

# Drop the first two columns (ID and target) from the numeric subset
temp <- train.num[, -1:-2]

# pairwise.complete.obs handles the NA values
corr.Matrix <- cor(temp, use = "pairwise.complete.obs")

# findCorrelation returns the columns recommended for removal;
# try different thresholds such as 0.85 and 0.9
corr.75 <- findCorrelation(corr.Matrix, cutoff = 0.75)
train.num.75 <- temp[, -corr.75]

corrplot(corr.Matrix, order = "hclust")
```
  1. Try various imputation methods, as sketched below.
     * Impute a default value of -1. This can serve as the baseline method.
     * Try knnImpute.
     * Imputation for categorical variables: how to do this in R?
     * Optional: Amelia and multiple imputation. Do some research on a multiple imputation course.
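
A minimal sketch of the baseline and the kNN option, assuming train.num is the numeric subset used above:

```R
library(caret)

# Baseline: replace every NA with the default value -1
train.base <- train.num
train.base[is.na(train.base)] <- -1

# kNN imputation via caret's preProcess: centers and scales the
# data, then fills each NA from its k nearest neighbours
pre <- preProcess(train.num, method = "knnImpute")
train.knn <- predict(pre, train.num)
```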

  2. Use entropy-based methods to choose variables related to the target variable. This can take a long time because of the heap memory limit in R; a sketch is below.
     * information.gain
     * gain.ratio
     * symmetrical.uncertainty
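
A sketch using the FSelector package, which provides all three measures (the label column is again assumed to be named target):

```R
library(FSelector)

# Score every predictor against the target with three
# entropy-based criteria (slow and memory-hungry on wide data)
ig <- information.gain(target ~ ., data = train)
gr <- gain.ratio(target ~ ., data = train)
su <- symmetrical.uncertainty(target ~ ., data = train)

# Keep, for example, the 50 best variables by information gain
top50 <- cutoff.k(ig, 50)
```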

  3. Read this paper for a deeper understanding of feature selection: "An Introduction to Variable and Feature Selection".

  4. By now there are several versions of the training dataset, produced by data cleaning, imputation, and feature selection.

  5. The baseline preprocessing method uses all the variables and imputes -1 for the NAs.

## Kaggle Forum

  1. Use the missing-value count per observation as a predictor, as sketched below.
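
A minimal sketch, assuming train and test are the loaded data frames:

```R
# The number of NAs in a row may itself carry signal
train$na_count <- rowSums(is.na(train))
test$na_count  <- rowSums(is.na(test))
```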