ELToulemonde/dataPreparation

Suggestion

Closed this issue · 2 comments

It would also be great to have outlier removal/imputation based on the columns, for example:

  • 6 σ
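# Assumes trainData is a data.table with a numeric column dv.
trainData[, `:=`(mean_dv = mean(dv, na.rm = TRUE), sd_dv = sd(dv, na.rm = TRUE))]
# Keep only rows within 6 standard deviations of the mean.
trainData <- trainData[dv >= mean_dv - 6 * sd_dv & dv <= mean_dv + 6 * sd_dv]
trainData[, c('mean_dv', 'sd_dv') := NULL]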
  • percentile
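# Keep only rows whose colName value lies between the two quantiles in outlierVec.
# Assumes trainData is a data.table; rows where colName is NA are dropped.
removeOneOutliersFunc <- function(trainData, colName, outlierVec = c(0.0001, 0.9999)){
  vec       <- trainData[[colName]]
  values    <- as.numeric(quantile(vec, outlierVec, na.rm = TRUE))
  trainData <- trainData[vec >= values[1] & vec <= values[2]]
  return(trainData)
}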
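
For what it's worth, the σ-based filter above could be generalized to several columns at once. This is only a rough sketch, not anything from the package: the helper name remove_sd_outliers_sketch, its arguments, and the toy data are all made up for illustration, and it assumes the input is a data.table with numeric columns.

```r
library(data.table)

# Hypothetical helper (not part of dataPreparation): drop rows where any of
# `cols` lies more than n_sigmas standard deviations from that column's mean.
remove_sd_outliers_sketch <- function(dt, cols, n_sigmas = 6) {
  keep <- rep(TRUE, nrow(dt))
  for (col in cols) {
    x <- dt[[col]]
    m <- mean(x, na.rm = TRUE)
    s <- sd(x, na.rm = TRUE)
    # Design choice for this sketch: NA values are kept, only observed
    # values outside the band are dropped.
    keep <- keep & (is.na(x) | abs(x - m) <= n_sigmas * s)
  }
  dt[keep]
}

# Toy data: 99 well-behaved values plus one extreme value.
toy <- data.table(dv = c(rep(c(-1, 1), length.out = 99), 1000))
filtered <- remove_sd_outliers_sketch(toy, "dv", n_sigmas = 6)
nrow(filtered)
```

Unlike the snippet above, this does not add and then delete helper columns, so the input table is not modified by reference.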

Great idea.

I was thinking of building a set of statistical functions to complete this package.
Those would absolutely go into this category.

I guess I will create a project for this part.

Any ideas for statistical functions to filter/preprocess data are welcome.

Hi,

The functions remove_sd_outlier, remove_percentile_outlier, and remove_rare_categorical have been implemented and will be available in the next CRAN release.
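
The first two functions correspond to the snippets earlier in this thread; rare-category filtering was not shown there. As a rough illustration of the idea only, here is a sketch using plain data.table. It is not the package's implementation: the helper name drop_rare_levels_sketch, the threshold argument, and the toy data are assumptions made for this example.

```r
library(data.table)

# Hypothetical helper (not the dataPreparation implementation): drop rows
# whose value in `col` occurs in less than `threshold` (a fraction) of all rows.
drop_rare_levels_sketch <- function(dt, col, threshold = 0.01) {
  # table() ignores NA by default, so rows with NA in `col` are dropped too.
  freq <- table(dt[[col]]) / nrow(dt)
  common <- names(freq)[freq >= threshold]
  keep <- as.character(dt[[col]]) %in% common
  dt[keep]
}

# Toy data: "c" appears in only 1% of the rows.
toy <- data.table(grp = c(rep("a", 60), rep("b", 39), "c"))
kept <- drop_rare_levels_sketch(toy, "grp", threshold = 0.05)
sort(unique(kept$grp))
```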

Feel free to test them, check whether they are helpful, and suggest any improvements.

I'm closing this issue.

Thanks.

Emmanuel-Lin