# KaggleBNP

This project is by three students with a love for data science.


# Project process

## Take R as the tool for this project

  1. Load the dataset. The "readr" package is a good choice.

  2. fread from "data.table" can also be used and is usually the fastest.

  3. read.csv is also fine, but it is slow and only suitable for small datasets. A minimal loading sketch follows.
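
A minimal loading sketch, assuming the competition files train.csv and test.csv are in the working directory:

```R
library(readr)
library(data.table)

# Option 1: readr::read_csv (fast, returns a tibble)
train <- read_csv("train.csv")
test  <- read_csv("test.csv")

# Option 2: data.table::fread, usually the fastest;
# data.table = FALSE returns a plain data frame instead
# train <- fread("train.csv", data.table = FALSE)
# test  <- fread("test.csv",  data.table = FALSE)
```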

  4. Convert the categorical data to numeric values. This is mainly for visualizing NAs (optional for XGBoost). The R code is below:

```R
# Convert character columns to factors with levels shared across
# train and test, then keep the integer codes so the values are
# numeric and encoded consistently in both datasets
for (f in names(train)) {
  if (class(train[[f]]) == "character") {
    levels <- unique(c(train[[f]], test[[f]]))
    train[[f]] <- as.integer(factor(train[[f]], levels = levels))
    test[[f]]  <- as.integer(factor(test[[f]],  levels = levels))
  }
}
```
    
  5. Visualize the NAs. Use the "VIM" package to explore the structure of the missing values (Missing Not At Random); a sketch is below.
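
A minimal sketch using VIM's aggr(), which plots the proportion of missing values per variable and the combinations in which they occur together:

```R
library(VIM)

# Left panel: share of NAs per variable; right panel: the most
# common missingness patterns across variables
aggr(train, numbers = TRUE, sortVars = TRUE,
     labels = names(train), cex.axis = 0.6)
```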

  1. Do some exploratory data analysis, e.g. as sketched below.
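
A minimal first-look sketch (the label column is assumed to be named target, as in the BNP competition data):

```R
# Dimensions, column types, class balance, and overall NA share
dim(train)
str(train, list.len = 10)
table(train$target)
mean(is.na(train))
```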

  2. Analyze duplicate variables. Manual inspection, however, is not realistic when there are thousands of variables; a programmatic check is sketched below.
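
A sketch for flagging exactly duplicated columns programmatically:

```R
# duplicated() over the list of columns marks every column whose
# contents are identical to an earlier one
dup.cols <- duplicated(as.list(train))
names(train)[dup.cols]       # the duplicate variables
train <- train[, !dup.cols]  # drop them
```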

  3. Imputation and feature engineering.

  4. Find and remove redundant variables. Use a correlation filter to find highly correlated variables; note that correlation only applies to numeric columns. The code is below:

```R
library(corrplot)
library(caret)

# Drop the first two columns (ID and target) from the numeric subset
temp <- train.num[, -1:-2]

# pairwise.complete.obs handles the NA values
corr.Matrix <- cor(temp, use = "pairwise.complete.obs")

# findCorrelation returns the columns recommended for removal;
# try different thresholds such as 0.85 and 0.9
corr.75 <- findCorrelation(corr.Matrix, cutoff = 0.75)
train.num.75 <- temp[, -corr.75]

corrplot(corr.Matrix, order = "hclust")
```
  1. Try various imputation methods, as sketched below.
     * Impute a default value of -1. This can serve as the baseline method.
     * Try knnImpute.
     * Imputation for categorical variables: how to do this in R?
     * Optional: Amelia and multiple imputation. Do some research on a multiple imputation course.
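
A minimal sketch of the baseline and the kNN option, assuming train.num is the numeric subset used above:

```R
library(caret)

# Baseline: replace every NA with the default value -1
train.base <- train.num
train.base[is.na(train.base)] <- -1

# kNN imputation via caret's preProcess: centers and scales the
# data, then fills each NA from its k nearest neighbours
pre <- preProcess(train.num, method = "knnImpute")
train.knn <- predict(pre, train.num)
```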

  2. Use entropy-based methods to choose variables related to the target variable. This can take a long time because of the heap memory limit in R; a sketch is below.
     * information.gain
     * gain.ratio
     * symmetrical.uncertainty
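
A sketch using the FSelector package, which provides all three measures (the label column is again assumed to be named target):

```R
library(FSelector)

# Score every predictor against the target with three
# entropy-based criteria (slow and memory-hungry on wide data)
ig <- information.gain(target ~ ., data = train)
gr <- gain.ratio(target ~ ., data = train)
su <- symmetrical.uncertainty(target ~ ., data = train)

# Keep, for example, the 50 best variables by information gain
top50 <- cutoff.k(ig, 50)
```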

  3. Read this paper for a deeper understanding of feature selection: "An Introduction to Variable and Feature Selection".

  4. By now there are several versions of the training dataset, produced by data cleaning, imputation, and feature selection.

  5. The baseline preprocessing method uses all the variables and imputes -1 for the NAs.

## Kaggle Forum

  1. Use the missing-value count per observation as a predictor, as sketched below.
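
A minimal sketch, assuming train and test are the loaded data frames:

```R
# The number of NAs in a row may itself carry signal
train$na_count <- rowSums(is.na(train))
test$na_count  <- rowSums(is.na(test))
```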