Suggestion
Closed this issue · 2 comments
dotRData commented
It would also be great to have outlier removal/imputation based on the columns
- 6 σ
trainData[, `:=`(mean_dv = mean(dv), sd_dv = sd(dv))]
trainData <- trainData[dv >= mean_dv - 6 * sd_dv & dv <= mean_dv + 6 * sd_dv]
trainData[, c('mean_dv', 'sd_dv') := NULL]
- percentile
removeOneOutliersFunc <- function(trainData, colName, outlierVec = c(0.0001, 0.9999)) {
  # Values of the column to filter on
  vec <- trainData[[colName]]
  # Lower and upper percentile bounds, ignoring NAs
  values <- as.numeric(quantile(vec, outlierVec, na.rm = TRUE))
  # Keep only rows inside the bounds
  trainData <- trainData[vec >= values[1] & vec <= values[2]]
  return(trainData)
}
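The 6 σ filter could be wrapped in a function in the same way as the percentile one. This is only a minimal sketch, assuming data.table is loaded and the column is numeric; the name removeSdOutliersFunc and the nSigmas argument are hypothetical and not part of the package.

```r
library(data.table)

# Hypothetical helper (not part of the package): keep rows whose value in
# colName lies within nSigmas standard deviations of the column mean.
removeSdOutliersFunc <- function(trainData, colName, nSigmas = 6) {
  vec <- trainData[[colName]]
  mu <- mean(vec, na.rm = TRUE)
  sigma <- sd(vec, na.rm = TRUE)
  # Rows with NA in colName are dropped explicitly
  keep <- !is.na(vec) & abs(vec - mu) <= nSigmas * sigma
  trainData[keep]
}

# Usage on toy data: 100 regular values plus one extreme value
toyData <- data.table(dv = c(rep(c(-1, 1), 50), 1000))
filtered <- removeSdOutliersFunc(toyData, "dv", nSigmas = 6)
```

Note that the mean and standard deviation are themselves affected by the outliers, so on very small samples even an extreme value may stay within 6 σ.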
ELToulemonde commented
Great idea.
I was thinking of building a set of statistical functions to complete this package.
These would absolutely fit into that category.
I guess I will create a project for this part.
Any ideas for statistical functions to filter/preprocess data are welcome.
ELToulemonde commented
Hi,
The functions remove_sd_outlier, remove_percentile_outlier, and remove_rare_categorical have been implemented and will be available in the next CRAN release.
Feel free to test them, check whether they are helpful, and suggest any improvements.
I'm closing this issue.
Thanks.
Emmanuel-Lin