paobranco/UBL

HEOM and HVDM

Closed this issue · 9 comments

Hi,

I've been trying to use the UBL package for data balancing on a medium-sized dataset. The data has 96 mixed-type predictor variables and an output variable that is an ordered factor:

> str(hd2)
'data.frame':	3244 obs. of  97 variables:
 $ FUNC_STAT_TCR          : int  1 2 996 2 2 996 996 996 996 996 ...
 $ REGION                 : Factor w/ 11 levels "1","2","3","4",..: 2 1 4 4 2 5 4 3 11 10 ...
.
.
.
 $ ANTICONV_DON           : Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 1 ...
 $ GENDER                 : Factor w/ 2 levels "F","M": 2 2 1 2 2 2 2 2 2 2 ...
 $ HIST_HYPERTENS_DON     : Factor w/ 2 levels "N","Y": 1 1 1 2 1 2 1 1 2 1 ...
 $ MALIG                  : Factor w/ 2 levels "N","Y": 1 2 1 1 1 1 1 1 1 1 ...
 $ LIFE_SUP               : Factor w/ 2 levels "N","Y": 1 2 2 1 1 1 2 2 2 2 ...
 $ OUTPUT                 : Ord.factor w/ 5 levels "y0y0.25"<"y0.25y1"<..: 5 4 4 5 5 1 1 5 5 1 ...

GaussNoiseClassif works well with this data, but SmoteClassif keeps giving me warnings:

 Warning messages:
1: In if (class(data[, tgt]) == "numeric" & p <= -4) stop("distance measure selected only available for classification tasks") :
  the condition has length > 1 and only the first element will be used
2: In if (class(data[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
3: In if (class(tgtData) != "numeric") { :
  the condition has length > 1 and only the first element will be used
4: In if (class(data[, tgt]) == "numeric" & p <= -4) stop("distance measure selected only available for classification tasks") :
  the condition has length > 1 and only the first element will be used
5: In if (class(data[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
6: In if (class(tgtData) != "numeric") { :
  the condition has length > 1 and only the first element will be used
7: In if (class(data[, tgt]) == "numeric" & p <= -4) stop("distance measure selected only available for classification tasks") :
  the condition has length > 1 and only the first element will be used
8: In if (class(data[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
9: In if (class(tgtData) != "numeric") { :
  the condition has length > 1 and only the first element will be used

I've tried both HEOM and HVDM, and both give the same warnings. The warnings themselves aren't a big problem. The real problem is when I try to use SmoteClassif with caret sampling. I've created this function:

ublsmote <- list(name = "Custom SMOTE",
                func = function (x, y) {
                  set.seed(1001)
                  library(UBL)
                  # x may not arrive as a data frame; coerce it if needed
                  dat <- if (is.data.frame(x)) x else as.data.frame(x)
                  dat$.y <- y
                  dat <- SmoteClassif(.y ~ ., dat = dat, C.perc = "balance", dist = "HEOM")
                  # drop only the helper column ".y" (exact match, so predictors
                  # whose names merely contain ".y" are kept)
                  list(x = dat[, colnames(dat) != ".y", drop = FALSE],
                       y = dat$.y)
                },
                first = TRUE)

A similar function with GaussNoiseClassif works fine, but with SmoteClassif it crashes the R session. I'd greatly appreciate any guidance on how to get rid of this error.

After experimenting a bit, I found that converting ordered factors to simple factors makes the warnings go away, but the R session still crashes.
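The "condition has length > 1" warnings are consistent with how base R reports the class of an ordered factor: `class()` returns two strings, so a comparison such as `class(x) == "numeric"` yields two logical values instead of one. The sketch below uses a small hypothetical data frame (`toy`, `out`) rather than the real data, and shows both the cause and one way to convert ordered columns to plain factors while keeping their levels.

```r
# Hypothetical ordered outcome, standing in for a column like OUTPUT.
out <- factor(c("low", "mid", "high", "mid"),
              levels = c("low", "mid", "high"), ordered = TRUE)

class(out)                 # "ordered" "factor": a length-2 vector
class(out) == "numeric"    # FALSE FALSE: two values, not one

# Scalar tests that are safe inside if():
is.factor(out)             # TRUE
inherits(out, "factor")    # TRUE

# Drop the ordering from every ordered column, keeping levels and their order.
toy <- data.frame(x = c(1.5, 2.0, 3.1, 4.0), out = out)
ord_cols <- vapply(toy, is.ordered, logical(1))
toy[ord_cols] <- lapply(toy[ord_cols], function(col) {
  factor(col, levels = levels(col), ordered = FALSE)
})
class(toy$out)             # "factor"
```

In R 3.4.0 a length-2 condition in `if()` only produces a warning; from R 4.2.0 onwards it is an error, so the same code would stop instead of warning there.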

Hi,
How many examples of each class does your data have?
Could you be trying to determine 5 neighbors (the default for all UBL functions that evaluate neighbors) when your data has fewer than 5 examples from which to compute the distances?
This is the first thing that occurs to me, but it is hard to diagnose without access to an example that reproduces the error...
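A quick way to rule this out is to compare the class counts with the number of neighbors requested. The helper below is a hypothetical sketch (`has_enough_neighbours` is not part of UBL), and it assumes that neighbors are searched within the same class, so each class needs more than `k` examples.

```r
# Hypothetical helper: TRUE when every class has more than k examples,
# so that each case has at least k other same-class cases to choose from.
has_enough_neighbours <- function(y, k = 5) {
  counts <- table(y)
  all(counts > k)
}

y_ok    <- factor(rep(c("a", "b"), times = c(20, 8)))
y_small <- factor(rep(c("a", "b"), times = c(20, 4)))

has_enough_neighbours(y_ok)      # TRUE
has_enough_neighbours(y_small)   # FALSE
```

Inside a cross-validation loop the check applies to each training fold, not to the full data set, so a rare class can fall below the threshold in a fold even when the overall count looks sufficient.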

Thanks for the fast response. Each class has plenty of data:

> table(hd2$OUTPUT)

  y0y0.25 y0.25y1.5    y1.5y4      y4y7  y7y12.75 
      788       723       741       584       408 

I get the same problem with 3, 4 or 5 classes. It can't be a problem with neighbors, because when I run the function directly it works, only throwing warnings for the ordered factors. (ublsmote$func is shown in my post above.)

> table(ublsmote$func(hd2[,1:96],hd2[,97])$y)

  y0y0.25 y0.25y1.5    y1.5y4      y4y7  y7y12.75 
      649       649       649       648       649 
There were 12 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: In if (class(dat[, col]) %in% c("factor", "character")) { ... :
  the condition has length > 1 and only the first element will be used
2: In if (class(data[, tgt]) == "numeric" & p <= -4) stop("distance measure selected only available for classification tasks") :
  the condition has length > 1 and only the first element will be used
3: In if (class(data[, col]) %in% c("factor", "character")) { ... :
  the condition has length > 1 and only the first element will be used
4: In if (class(data[, col]) %in% c("factor", "character")) { ... :
  the condition has length > 1 and only the first element will be used
5: In if (class(tgtData) != "numeric") { ... :
  the condition has length > 1 and only the first element will be used
6: In `[<-.factor`(`*tmp*`, ri, value = c(1, 2.6270190551877,  ... :
  invalid factor level, NA generated
7: In if (class(dat[, col]) %in% c("factor", "character")) { ... :
  the condition has length > 1 and only the first element will be used
8: In if (class(data[, tgt]) == "numeric" & p <= -4) stop("distance measure selected only available for classification tasks") :
  the condition has length > 1 and only the first element will be used
9: In if (class(data[, col]) %in% c("factor", "character")) { ... :
  the condition has length > 1 and only the first element will be used
10: In if (class(data[, col]) %in% c("factor", "character")) { ... :
  the condition has length > 1 and only the first element will be used
11: In if (class(tgtData) != "numeric") { ... :
  the condition has length > 1 and only the first element will be used
12: In `[<-.factor`(`*tmp*`, ri, value = c(4, 2, 4, 1, 4,  ... :
  invalid factor level, NA generated
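Warnings 6 and 12 in this listing are of a different kind from the rest: "invalid factor level, NA generated" is what base R reports when a value that is not one of a factor's levels is assigned into it, and the affected entries become NA. Where this happens inside the package is not established in this thread; the sketch below only reproduces the message on a hypothetical factor and shows how to check a result for the resulting NAs.

```r
f <- factor(c("N", "Y", "N"))

# Capture the warning text raised by assigning a value outside levels(f).
msg <- tryCatch({ f[2] <- "maybe"; "no warning" },
                warning = function(w) conditionMessage(w))
msg

# The same assignment with the warning silenced: the entry is stored as NA.
g <- f
suppressWarnings(g[2] <- "maybe")
is.na(g[2])   # TRUE
anyNA(g)      # TRUE: a quick check to run on the columns of a resampled data set
```

Running `anyNA()` over the columns returned by the sampling function would show whether any synthetic rows carry such NAs.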

If I change the ordered factors to unordered factors, the warnings don't appear. But when I use the function with caret to upsample data within a cross-validation loop, it crashes. The exact same caret call works fine with the DMwR package or with GaussNoiseClassif. The crash only happens when using the HEOM and HVDM distance metrics.

library(caret)
library(doParallel)

# start a parallel backend, leaving one core free
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
rf_grid <- expand.grid(mtry = (9:13) * 2)
rffit5 <- caret::train(OUTPUT ~ .,
                       data = hd2,
                       method = "rf",
                       ntree = 250,
                       tuneGrid = rf_grid,
                       metric = "Kappa",
                       trControl = trainControl(allowParallel = TRUE,
                                                method = "cv",
                                                number = 10,
                                                savePredictions = TRUE,
                                                sampling = ublsmote)
)
stopCluster(cl)

I have tried changing various caret parameters, but none has any effect on the crash, with or without parallelization.
I recently updated R and all packages.

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] UBL_0.0.5           doParallel_1.0.10   iterators_1.0.8     foreach_1.4.3       randomForest_4.6-12 DMwR_0.4.1         
[7] caret_6.0-76        ggplot2_2.2.1       lattice_0.20-35    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.10       compiler_3.4.0     nloptr_1.0.4       plyr_1.8.4         bitops_1.0-6       class_7.3-14      
 [7] tools_3.4.0        xts_0.9-7          rpart_4.1-10       lme4_1.1-13        tibble_1.3.0       nlme_3.1-131      
[13] gtable_0.2.0       mgcv_1.8-17        Matrix_1.2-9       SparseM_1.76       e1071_1.6-8        stringr_1.2.0     
[19] caTools_1.17.1     gtools_3.5.0       MatrixModels_0.4-1 stats4_3.4.0       nnet_7.3-12        gdata_2.17.0      
[25] minqa_1.2.4        ROCR_1.0-7         TTR_0.23-1         reshape2_1.4.2     kernlab_0.9-25     car_2.1-4         
[31] magrittr_1.5       gplots_3.0.1       scales_0.4.1       codetools_0.2-15   ModelMetrics_1.1.0 RSNNS_0.4-9       
[37] MASS_7.3-45        splines_3.4.0      quantmod_0.4-8     abind_1.4-5        pbkrtest_0.4-7     colorspace_1.3-2  
[43] quantreg_5.33      KernSmooth_2.23-15 stringi_1.1.5      lazyeval_0.2.0     munsell_0.4.3      zoo_1.8-0         

Does your data set have any features that are ordered factors?
If so, have you tried converting them to simple factors?

Yes. When I change them into simple factors, the warnings go away, but using any function in caret sampling with dist set to HEOM or HVDM causes the session to crash.

Can you please send me a script with a minimal reproducible example of your error? (Please also attach your data or a sample of it.)
I cannot reproduce your error and I would really like to understand what's happening...
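When the real data cannot be shared, a synthetic data set with the same shape is often enough for a reproducible example. The sketch below is an assumption about what such a stand-in could look like: it copies a few column names and types from the `str()` output earlier in the thread and the class counts from the `table()` output, but every value is randomly generated and `fake` is a hypothetical name.

```r
set.seed(1001)

# Class sizes taken from the table() output posted above.
n_per_class <- c(y0y0.25 = 788, y0.25y1.5 = 723, y1.5y4 = 741,
                 y4y7 = 584, y7y12.75 = 408)
n <- sum(n_per_class)   # 3244 rows in total

# Random stand-ins for a few of the original columns (values are made up).
fake <- data.frame(
  FUNC_STAT_TCR = sample(c(1L, 2L, 996L), n, replace = TRUE),
  REGION        = factor(sample(1:11, n, replace = TRUE), levels = 1:11),
  GENDER        = factor(sample(c("F", "M"), n, replace = TRUE)),
  LIFE_SUP      = factor(sample(c("N", "Y"), n, replace = TRUE)),
  OUTPUT        = factor(rep(names(n_per_class), times = n_per_class),
                         levels = names(n_per_class), ordered = TRUE)
)
str(fake)
```

If the crash also occurs with a data set like this, the script and the generated data can be attached to the issue without exposing anything confidential.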

When inspecting the functions you built, I noticed that all the factor variables are modified in caret. This means that, when you use the caret package, the nominal variables in your data set are replaced by dummy variables representing category membership.

This explains why UBL gives an error when you use dist="Euclidean" directly on your data set (it is not possible to compute the Euclidean distance when the data set has nominal variables), yet the same setting works fine inside caret.
In fact, by the time the sampling function set in caret's trainControl is called, the data set has already been modified and no longer has any nominal features...
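The kind of conversion described here can be illustrated with base R's `model.matrix()`, which expands a formula over a data frame into a numeric design matrix. This is only a sketch on a hypothetical data frame (`toy`); that caret's formula interface produces exactly these columns for the real data is an assumption.

```r
# Hypothetical data frame with one numeric and two nominal predictors.
toy <- data.frame(
  age    = c(30, 45, 52, 61),
  gender = factor(c("F", "M", "M", "F")),
  region = factor(c("1", "2", "3", "1"))
)

# Expand the factors into dummy columns and drop the intercept column.
mm <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]

colnames(mm)     # "age" "genderM" "region2" "region3"
is.numeric(mm)   # TRUE: no nominal columns are left
```

A sampling function that receives a matrix like `mm` sees only numeric columns, which is why a distance meant for mixed data has no nominal features to work with at that point. Calling `train()` with separate `x` and `y` arguments instead of a formula is the usual way to keep factors as factors.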

Still, this should not crash your session...
Please send me a minimal reproducing example on this issue.

I am not using dist="Euclidean". I have only used dist="HEOM" and dist="HVDM". I don't think the data is converted to dummy variables, although I'm not sure; I am relatively new to R. I will try using dist="Euclidean", which should solve the problem if the nominal variables have been converted to dummies in caret's train. The problem is that I cannot share my data or a sample of it, as it is confidential. I can share the code, most of which I've already posted above. I will try to see if I get the same error with some other dataset. Thanks.
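One way to settle whether the predictors reach the sampling function as factors or as dummy columns is to have a sampler report what it receives and return the data unchanged. The sketch below follows the same `name`/`func`/`first` list layout as `ublsmote` above; `inspect_sampler`, `toy_x` and `toy_y` are hypothetical names, and the function does no resampling at all.

```r
# Hypothetical diagnostic sampler: reports the input types, changes nothing.
inspect_sampler <- list(
  name = "Inspect only",
  func = function(x, y) {
    dat <- if (is.data.frame(x)) x else as.data.frame(x)
    col_classes <- vapply(dat, function(col) class(col)[1], character(1))
    message("x is a ", class(x)[1], " with column classes: ",
            paste(names(table(col_classes)), collapse = ", "))
    list(x = x, y = y)
  },
  first = TRUE
)

# Direct call on a small numeric matrix, as a formula interface might supply.
toy_x <- cbind(age = c(30, 45, 52, 61), genderM = c(0, 1, 1, 0))
toy_y <- factor(c("a", "b", "a", "b"))
res <- inspect_sampler$func(toy_x, toy_y)   # reports a matrix with numeric columns
```

Passing `sampling = inspect_sampler` to `trainControl()` in a sequential run (no parallel backend registered) should print one message per resample, which shows directly whether any factor columns survive.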

I've updated UBL to version 0.0.6. This new version includes some corrections to the distance functions that may solve your problem. Can you please check whether the problem is solved?

Hi,

Sorry for the extremely delayed response. I used the package with some other data and it worked fine. I am not sure whether this was due to the update, but I am sure the new data did have nominal variables. So whatever the problem was, whether some quirk of the data or some function in the package or in caret, it must have gone away. I have not had the chance to try it on the original data that gave me the problem.

Thanks,
Murtaza