SmoteRegress challenges

Question

Closed this issue 6 years ago · 7 comments

SmoteRegress may have some difficulty succeeding on some numeric features.

xy <- data.frame(x = 1:100, y = 1:100)

SmoteRegress(y ~ ., xy)

Error in SmoteRegress(y ~ ., xy) : All the points have relevance 0.
         Please, redefine your relevance function!

SmoteRegress may have some difficulty succeeding on some nominal and numeric features.

data("diamonds", package = "ggplot2")

SmoteRegress(z ~ ., diamonds)

Error in neighbours(tgt, dat, dist, p, k) :
  NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning messages:
1: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
2: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
3: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
4: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
5: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
6: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
7: In storage.mode(numData) <- "double" : NAs introduced by coercion

May have some difficulty succeeding on some nominal features.

SmoteRegress(cut ~ ., diamonds[,c("cut","color","clarity")])

Error in if (extr$stats[3] != r[1]) { :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.ordered(x[floor(d)], x[ceiling(d)]) :
  '+' is not meaningful for ordered factors

Another challenge.

diamonds2 <- diamonds
diamonds2$cut <- factor(diamonds2$cut, ordered = FALSE)
diamonds2$color <- factor(diamonds2$color, ordered = FALSE)
diamonds2$clarity <- factor(diamonds2$clarity, ordered = FALSE)

SmoteRegress(cut ~ ., diamonds2[,c("cut","color","clarity")])

Error in Summary.factor(c(5L, 4L, 2L, 4L, 2L, 3L, 3L, 3L, 1L, 3L, 2L,  :
  'range' not meaningful for factors
In addition: Warning message:
In Ops.factor(x[floor(d)], x[ceiling(d)]) : '+' not meaningful for factors

 SmoteRegress(z ~ ., diamonds2[,c("cut","color","clarity","z")])
Error in neighbours(tgt, dat, dist, p, k) :
  Can not compute Euclidean distance with nominal attributes!

SmoteRegress(z ~ ., diamonds2[,c("cut","color","clarity","y","z")], rel = "HEOM")
Error in SmoteRegress(z ~ ., diamonds2[, c("cut", "color", "clarity",  :
  future work!

> SmoteRegress(cut ~ ., diamonds2[,c("cut","y","z")], rel = "HEOM")
Error in SmoteRegress(cut ~ ., diamonds2[, c("cut", "y", "z")], rel = "HEOM") :
  future work!

> devtools::session_info()
Session info ------------------------------------------------------------------
 setting  value
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, mingw32
 ui       RTerm
 language (EN)
 collate  English_United States.1252
 tz       America/Chicago
 date     2018-03-17

Packages ----------------------------------------------------------------------
 package      * version date       source
 automap      * 1.0-14  2013-08-29 CRAN (R 3.4.0)
 base         * 3.4.4   2018-03-15 local
 compiler       3.4.4   2018-03-15 local
 datasets     * 3.4.4   2018-03-15 local
 devtools       1.13.5  2018-02-18 CRAN (R 3.4.3)
 digest         0.6.15  2018-01-28 CRAN (R 3.4.3)
 FNN            1.1     2013-07-31 CRAN (R 3.4.0)
 graphics     * 3.4.4   2018-03-15 local
 grDevices    * 3.4.4   2018-03-15 local
 grid           3.4.4   2018-03-15 local
 gstat        * 1.1-5   2017-03-12 CRAN (R 3.4.0)
 intervals      0.15.1  2015-08-27 CRAN (R 3.4.0)
 lattice        0.20-35 2017-03-25 CRAN (R 3.4.4)
 magrittr     * 1.5     2014-11-22 CRAN (R 3.4.0)
 MBA          * 0.0-9   2017-03-08 CRAN (R 3.4.1)
 memoise        1.1.0   2017-04-21 CRAN (R 3.4.0)
 methods      * 3.4.4   2018-03-15 local
 plyr           1.8.4   2016-06-08 CRAN (R 3.4.0)
 randomForest * 4.6-12  2015-10-07 CRAN (R 3.4.0)
 Rcpp           0.12.16 2018-03-13 CRAN (R 3.4.3)
 reshape        0.8.7   2017-08-06 CRAN (R 3.4.1)
 sp           * 1.2-7   2018-01-19 CRAN (R 3.4.3)
 spacetime      1.2-1   2017-09-24 CRAN (R 3.4.1)
 stats        * 3.4.4   2018-03-15 local
 tools          3.4.4   2018-03-15 local
 UBL          * 0.0.6   2017-07-13 CRAN (R 3.4.1)
 utils        * 3.4.4   2018-03-15 local
 withr          2.1.2   2018-03-15 CRAN (R 3.4.3)
 xts            0.10-2  2018-03-14 CRAN (R 3.4.3)
 zoo            1.8-1   2018-01-08 CRAN (R 3.4.3)

Answer 1 · 2018-03-19T15:36:28.000Z

SmoteRegress does work on other data sets beyond the example data when used correctly.
Below I explain, point by point, the reasons why each example you provided fails:

Regarding the first comment:

SmoteRegress does not work on numeric features.

xy <- data.frame(x = 1:100, y = 1:100)
SmoteRegress(y ~ ., xy)
Error in SmoteRegress(y ~ ., xy) : All the points have relevance 0.
         Please, redefine your relevance function!

SmoteRegress needs a relevance function defined over the target variable domain.
This function cannot be uniform, because it determines which cases of the domain are the most and least relevant.
When the rel parameter is not set by the user, an automatic method is used that assigns a higher relevance to the rarest cases.

In your example, the automatic method is used. However, this method is not able to assign a non-uniform relevance to the domain because the points are uniformly distributed!
Therefore, it provides an error message stating that it is not possible to use SmoteRegress
with a relevance function that assigns 0 to all existing cases.
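
To illustrate why evenly spread data leaves nothing to mark as rare, here is a minimal base R sketch. It assumes that the automatic method treats boxplot-style extremes as the rare cases (the `extr$stats` reference in one of the error traces hints at this, but it is an assumption, not a copy of the UBL code). The variable names are only illustrative.

```r
# Assumption: "rare" cases are the boxplot-style extremes of the target.
y_uniform <- 1:100
out_uniform <- boxplot.stats(y_uniform)$out   # evenly spread: no extremes found

# Same data with one clearly unusual value appended.
y_extreme <- c(1:99, 500)
out_extreme <- boxplot.stats(y_extreme)$out   # the value 500 is flagged

length(out_uniform)
out_extreme
```

With no extremes at all, an automatic method of this kind has no case it could give a high relevance to.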

There are two ways to solve this:
i) You may generate the target variable differently, for instance by sampling from a normal distribution. Then you can apply SmoteRegress without problems!

xy <- data.frame(x = 1:100, y = rnorm(100, 0, 1))
res <- SmoteRegress(y ~ ., xy)

ii) Or, alternatively, you may provide the relevance function yourself, as follows:

xy <- data.frame(x = 1:100, y = 1:100)
myrel <- matrix(c(1,0,0, 90,1,0), ncol=3, byrow=TRUE)

This is just an example of a relevance function definition that assigns higher relevance to the cases with y>=90.
Then you can simply do:

res <- SmoteRegress(y ~ ., xy, rel=myrel)

Regarding the second comment:

SmoteRegress does not work on nominal and numeric features.

data("diamonds", package = "ggplot2") 
SmoteRegress(z ~ ., diamonds)
Error in neighbours(tgt, dat, dist, p, k) :
  NA/NaN/Inf in foreign function call (arg 2)
In addition: Warning messages:
1: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
2: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
3: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
4: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
5: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
6: In if (class(dat[, col]) %in% c("factor", "character")) { :
  the condition has length > 1 and only the first element will be used
7: In storage.mode(numData) <- "double" : NAs introduced by coercion

This error is related to two issues:
i) the existence of ordered factors in the data set;
ii) the distance measure being used.

The functions provided through the UBL package are not able to deal with ordered factors.
This means that these features must be converted into unordered factors.
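
The "condition has length > 1" warnings in the trace above are consistent with how base R reports the class of an ordered factor. The sketch below uses a made-up toy factor (not the diamonds data) to show that `class()` returns two names for an ordered factor, so a test written with `class(x) %in% ...` yields a length-2 logical, whereas `is.factor()` yields a single value.

```r
# Toy ordered factor, for illustration only.
cut_toy <- factor(c("Fair", "Good", "Ideal"),
                  levels = c("Fair", "Good", "Ideal"),
                  ordered = TRUE)

cls <- class(cut_toy)                              # two class names
cls_test <- cls %in% c("factor", "character")      # length-2 logical
single_test <- is.factor(cut_toy)                  # one TRUE

cls
cls_test
single_test
```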

The second issue concerns the existence of both numeric and nominal features.
The default distance measure is the Euclidean distance, which cannot be used in this setting.

Therefore, to adequately use SmoteRegress you need to:
i) convert the necessary features into factors;

diamonds$cut <- factor(diamonds$cut, ordered=FALSE)
diamonds$color <- factor(diamonds$color, ordered=FALSE)
diamonds$clarity <- factor(diamonds$clarity, ordered=FALSE)

ii) use a suitable distance measure, such as "HEOM".
The parameter through which the distance measure is passed is the dist parameter, as described in the package documentation.

After converting the ordered factors into factors, and by selecting a suitable distance measure, you may use SmoteRegress without any error.

res <- SmoteRegress(z ~ ., diamonds, dist="HEOM")
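
The conversion in step i) can also be applied to every ordered column at once. The helper below is a hypothetical convenience function (`unorder_factors` is not part of UBL), shown on a small made-up data frame rather than on diamonds.

```r
# Hypothetical helper: turn every ordered factor column into a plain factor.
unorder_factors <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.ordered(col)) factor(col, ordered = FALSE) else col
  })
  df
}

# Toy data with one ordered factor and one numeric column.
toy <- data.frame(
  cut = factor(c("Good", "Ideal", "Good"),
               levels = c("Good", "Ideal"), ordered = TRUE),
  z = c(2.4, 2.6, 2.5)
)

toy2 <- unorder_factors(toy)
sapply(toy2, function(col) class(col)[1])   # first class name of each column
```

Numeric columns are left untouched, so the result can be passed on together with a distance measure that handles nominal features.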

Regarding the third comment:

Does not work on only nominal features.

SmoteRegress(cut ~ ., diamonds[,c("cut","color","clarity")])

Error in if (extr$stats[3] != r[1]) { :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.ordered(x[floor(d)], x[ceiling(d)]) :
  '+' is not meaningful for ordered factors

The SmoteRegress function is suitable for regression problems only (the target variable must be numeric!).
In this case you are using a nominal target variable (cut is nominal)!
For this purpose you could use the SmoteClassif function, also available in the UBL package, which is suitable for classification tasks.

If you try to use SmoteRegress with only nominal features and a numeric target variable, you could do as follows, without any errors (after converting ordered factors into factors):

SmoteRegress(y ~ ., diamonds[,c("cut","color","clarity", "y")], dist="HEOM")

Please notice that you need to select a suitable distance measure (HEOM will do) because the default Euclidean will obviously fail. Moreover, the distance measure to be used by SmoteRegress is provided through parameter dist and not rel.
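
For readers unfamiliar with HEOM (Heterogeneous Euclidean-Overlap Metric), here is a minimal sketch of the usual textbook definition: nominal attributes contribute 0 when equal and 1 otherwise, numeric attributes contribute the absolute difference divided by the attribute's range, and the contributions are combined as in the Euclidean distance. This is not the UBL implementation; the function `heom` and the toy data are assumptions made for illustration.

```r
# Sketch of the textbook HEOM between two one-row data frames a and b.
# ranges: per-column range for numeric columns (ignored for nominal ones).
heom <- function(a, b, ranges) {
  d <- mapply(function(u, v, r) {
    if (is.numeric(u)) {
      if (r == 0) 0 else abs(u - v) / r              # range-normalised difference
    } else {
      as.numeric(as.character(u) != as.character(v)) # overlap: 0 if equal, 1 otherwise
    }
  }, a, b, ranges)
  sqrt(sum(d^2))
}

toy <- data.frame(cut = factor(c("Good", "Ideal", "Good")), y = c(4, 6, 8))
ranges <- sapply(toy, function(col) if (is.numeric(col)) diff(range(col)) else NA_real_)

d12 <- heom(toy[1, ], toy[2, ], ranges)   # different cut, y differs by half the range
d13 <- heom(toy[1, ], toy[3, ], ranges)   # same cut, y differs by the full range
c(d12, d13)
```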

Regarding the fourth comment:

Does not work.

diamonds2 <- diamonds
diamonds2$cut <- factor(diamonds2$cut, ordered = FALSE)
diamonds2$color <- factor(diamonds2$color, ordered = FALSE)
diamonds2$clarity <- factor(diamonds2$clarity, ordered = FALSE)
SmoteRegress(cut ~ ., diamonds2[,c("cut","color","clarity")])

Error in Summary.factor(c(5L, 4L, 2L, 4L, 2L, 3L, 3L, 3L, 1L, 3L, 2L,  :
  'range' not meaningful for factors
In addition: Warning message:
In Ops.factor(x[floor(d)], x[ceiling(d)]) : '+' not meaningful for factors

In this case, although you converted the ordered factors into factors, you are using an unsuitable distance function (you are using the default, which is the Euclidean distance), and your target variable is nominal (SmoteRegress is only suitable for regression problems).

Regarding the fifth comment:

SmoteRegress(z ~ ., diamonds2[,c("cut","color","clarity","z")])
Error in neighbours(tgt, dat, dist, p, k) :
  Can not compute Euclidean distance with nominal attributes!

The error message should be clear: you are using an unsuitable distance measure.

Regarding the sixth comment:

SmoteRegress(z ~ ., diamonds2[,c("cut","color","clarity","y","z")], rel = "HEOM")
Error in SmoteRegress(z ~ ., diamonds2[, c("cut", "color", "clarity",  :
  future work!

The distance function "HEOM" is suitable in this case. However, it should be passed through parameter dist and not parameter rel. Parameter rel is used to provide the relevance function, and not the distance measure! The parameter rel only accepts "auto" for the automatic method of determining the relevance function, or accepts a matrix, as described in SmoteRegress documentation.

Regarding the seventh comment:

SmoteRegress(cut ~ ., diamonds2[,c("cut","y","z")], rel = "HEOM")
Error in SmoteRegress(cut ~ ., diamonds2[, c("cut", "y", "z")], rel = "HEOM") :
  future work!

Again, the distance measure is being passed to the wrong parameter! "HEOM" should be passed through parameter dist instead of parameter rel. Moreover, you cannot use a nominal target variable (cut is nominal!) with SmoteRegress. In this case you should use the SmoteClassif function instead.
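
The points above can be collected into a small pre-flight check. The function below is a hypothetical helper (`check_smote_args` does not exist in UBL) that only encodes the rules stated in this reply: numeric target, rel either "auto" or a 3-column numeric matrix, and no Euclidean distance when nominal predictors are present. It is run on a made-up data frame.

```r
# Hypothetical pre-flight check; returns a character vector of problems found.
check_smote_args <- function(form, dat, rel = "auto", dist = "Euclidean") {
  problems <- character(0)
  target <- all.vars(form)[1]                      # left-hand side of the formula
  if (!is.numeric(dat[[target]])) {
    problems <- c(problems, "target is not numeric: consider SmoteClassif")
  }
  rel_ok <- identical(rel, "auto") ||
    (is.matrix(rel) && is.numeric(rel) && ncol(rel) == 3)
  if (!rel_ok) {
    problems <- c(problems, "rel must be \"auto\" or a 3-column numeric matrix")
  }
  predictors <- dat[setdiff(names(dat), target)]
  has_nominal <- any(vapply(predictors,
                            function(col) is.factor(col) || is.character(col),
                            logical(1)))
  if (has_nominal && identical(dist, "Euclidean")) {
    problems <- c(problems, "nominal predictors need a distance such as \"HEOM\"")
  }
  problems
}

toy <- data.frame(cut = factor(c("a", "b")), y = c(1, 2), z = c(3, 4))
p_wrong_param <- check_smote_args(z ~ ., toy, rel = "HEOM")    # rel misused, dist left at default
p_nominal_tgt <- check_smote_args(cut ~ ., toy, rel = "HEOM")  # nominal target, rel misused
p_ok <- check_smote_args(z ~ ., toy, dist = "HEOM")            # nothing to report
p_wrong_param
```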

Answer 2 · 2018-03-20T16:56:08.000Z

Dear Andre Mikulec,

First of all, thank you for using our UBL package, which we have made freely available to the community.

Thank you also for testing the package and raising what you thought were problems with it. As you may have seen from Paula's careful reply, none of the 7 issues you raised is in fact an issue. They all result from: (i) not reading the documentation or disregarding it; (ii) not reading the error messages you got; or (iii) using the functions incorrectly because you apparently did not understand their purpose.

So, allow me to recommend that you be a bit more careful before you write "bombastic" comments in a public forum with words like "does not work", and that you do your homework first, as you may imagine we did ours (particularly Paula) before releasing a package to the world. I guess this is more a question of style or language, but these do matter in communication between human beings.

Thanks again for your interest in our work,
Luis

Answer 3 · 2018-03-22T00:24:28.000Z

Thank you for the wonderful responses.

I did some work.

I read Chapter 3 Utility-based Regression of

Ribeiro, R.P.: Utility-based Regression. PhD thesis, Dep. Computer Science,
Faculty of Sciences - University of Porto (2011)

One of the responses (above in this issue) produced the following error.

library(UBL)

myrel <- matrix(c(1,0,0, 90,1,0), ncol=3, byrow=TRUE)
myrel
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]   90    1    0

set.seed(1L)
xy <- data.frame(x = 1:100, y = rnorm(100, 90, 1))
range(xy$y)
# [1] 87.78530 92.40162

res <- SmoteRegress(y ~ ., xy, rel=myrel)
# Error in last:bumps[i] : argument of length 0

I have a question about why.
However, I first want to trace the SmoteRegress example (that works).

SmoteRegress has the following parameter.

 thr.rel: A number indicating the relevance threshold above which a
          case is considered as belonging to the rare "class".

I create a relevance function

rel <- matrix(0, ncol = 3, nrow = 0)
rel <- rbind(rel, c(2, 1, 0))
rel <- rbind(rel, c(3, 0, 0))
rel <- rbind(rel, c(4, 1, 0))
  
> rel
     [,1] [,2] [,3]
[1,]    2    1    0
[2,]    3    0    0
[3,]    4    1    0

I prep some data and make the call.

> ir <- iris[-c(95:130), ]

> range(ir$Sepal.Width)
[1] 2.0 4.4

I call

sP.ir <- SmoteRegress(Sepal.Width~., ir, rel = rel, dist = "HEOM",  C.perc = list(4, 0.5, 4))

with the default parameter thr.rel = 0.5

> args(SmoteRegress) 
function (form, dat, rel = "auto", thr.rel = 0.5, C.perc = "balance", k = 5, repl = FALSE, dist = "Euclidean", p = 2)

Tracing through the code I see.

Browse[2]> thr.rel
[1] 0.5

Browse[2]> range(y) # unsorted y
[1] 2.0 4.4

Browse[2]> y
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8 3.8
 21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.0 3.4
 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7
 61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80
2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6
 81  82  83  84  85  86  87  88  89  90  91  92  93  94 131 132 133 134 135 136
2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.8 3.8 2.8 2.8 2.6 3.0
137 138 139 140 141 142 143 144 145 146 147 148 149 150
3.4 3.1 3.0 3.1 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0

The sorted y values that will be sent to phi:

Browse[2]> txtplot::txtplot(sort(y)) # s.y
4.5 +-+--------+---------+--------+--------+---------+---------+
    |                                                      *   |
    |                                                     **   |
  4 +                                                    **    +
    |                                                ****      |
3.5 +                                            *****         +
    |                                     ********             |
    |                                *****                     |
  3 +                   **************                         +
    |            *******                                       |
    |        *****                                             |
2.5 +     ****                                                 +
    |  ***                                                     |
  2 +  *                                                       +
    +-+--------+---------+--------+--------+---------+---------+
      0       20        40       60       80        100

phi.control is called; it returns my matrix

> pc <- phi.control(y, method = "range", control.pts = rel)

Browse[2]> pc
$method
[1] "range"

$npts
[1] 3

$control.pts
[1] 2 1 0 3 0 0 4 1 0

The call to phi returns different data

> temp <- y.relev <- phi(s.y, pc) 
# calls UBL:::phi.setup phi.range ... somewhere called last # subroutine rtophi

Browse[2]> str(temp)
 num [1:114] 1 0.896 0.896 0.784 0.784 0.784 0.784 0.648 0.648 0.648 ...
Browse[2]> temp
  [1] 1.000 0.896 0.896 0.784 0.784 0.784 0.784 0.648 0.648 0.648 0.500 0.500
 [13] 0.500 0.500 0.352 0.352 0.352 0.352 0.216 0.216 0.216 0.216 0.216 0.104
 [25] 0.104 0.104 0.104 0.104 0.104 0.104 0.104 0.028 0.028 0.028 0.028 0.028
 [37] 0.028 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
 [49] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.028 0.028 0.028 0.028 0.028
 [61] 0.028 0.028 0.028 0.028 0.028 0.028 0.104 0.104 0.104 0.104 0.104 0.104
 [73] 0.104 0.104 0.104 0.216 0.216 0.216 0.216 0.352 0.352 0.352 0.352 0.352
 [85] 0.352 0.352 0.352 0.352 0.352 0.352 0.352 0.500 0.500 0.500 0.500 0.500
 [97] 0.500 0.648 0.648 0.648 0.784 0.784 0.784 0.896 0.896 0.896 0.896 0.896
[109] 0.972 0.972 1.000 1.000 1.000 1.000
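
The values printed above are consistent with a piecewise cubic Hermite interpolation through the control points, reading each row of the matrix as (y value, relevance, slope). The sketch below reproduces a few of them with base R's `splinefunH`. It is an illustration of that reading, not the compiled routine that UBL actually calls, and `phi_sketch` is a made-up name.

```r
# Control points from the example: (y, relevance, slope) per row.
ctrl <- matrix(c(2, 1, 0,
                 3, 0, 0,
                 4, 1, 0), ncol = 3, byrow = TRUE)

# Cubic Hermite interpolant through the control points with the given slopes.
phi_sketch <- splinefunH(ctrl[, 1], ctrl[, 2], ctrl[, 3])

y_probe <- c(2.0, 2.2, 2.3, 2.5, 3.0, 3.4)
phi_probe <- phi_sketch(y_probe)
round(phi_probe, 3)
# [1] 1.000 0.896 0.784 0.500 0.000 0.352
```

Under this reading, relevance is 1 at y = 2 and y = 4, falls smoothly to 0 at y = 3, and equals 0.5 exactly halfway between control points.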

Next, the bumps (ranges) are collected by detecting point-by-point crossings of thr.rel.

Each pair of adjacent values is compared using

Browse[2]> thr.rel
[1] 0.5

Here is the comparison

  bumps <- c()
  for (i in 1:(length(y) - 1)) { 
#     if (temp[i] * temp[i + 1] < 0) bumps <- c(bumps, i) 
    if ((temp[i] >= thr.rel && temp[i+1] < thr.rel) || 
        (temp[i] < thr.rel && temp[i+1] >= thr.rel)) {
      bumps <- c(bumps, i)
    }
   }
  nbump <- length(bumps) + 1 # number of different "classes"

Essentially, this code is 'or'ing the first expression below
with the second expression below,
counting the number of TRUEs (crossings of thr.rel)
and recording the position of each TRUE.
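
The same crossing detection can be written without a loop. The sketch below uses a short made-up relevance vector (not the iris values) and compares the loop from the package code quoted above with a vectorised form based on `diff()`; the names `bumps_loop` and `bumps_vec` are only illustrative.

```r
# Made-up relevance values and the default threshold.
temp <- c(1, 0.8, 0.5, 0.3, 0, 0.2, 0.6, 1)
thr.rel <- 0.5

# Loop form, as in the quoted code.
bumps_loop <- c()
for (i in 1:(length(temp) - 1)) {
  if ((temp[i] >= thr.rel && temp[i + 1] < thr.rel) ||
      (temp[i] < thr.rel && temp[i + 1] >= thr.rel)) {
    bumps_loop <- c(bumps_loop, i)
  }
}

# Vectorised form: positions where "at or above threshold" flips.
bumps_vec <- which(diff(temp >= thr.rel) != 0)

bumps_vec                      # positions 3 and 6
nbump <- length(bumps_vec) + 1 # number of different "classes"
```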

Browse[2]>  ( (temp[-NROW(temp)] >= thr.rel)  & (temp[seq_along(temp[-NROW(temp)]) + 1] < thr.rel) )

( (temp[-NROW(temp)] >= thr.rel.try)  & (temp[seq_along(temp[-NROW(temp)]) + 1] < thr.rel.try) )
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
 [21] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [81] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[101] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# one TRUE is found

Browse[2]> ( (temp[-NROW(temp)] <  thr.rel)  & (temp[seq_along(temp[-NROW(temp)]) + 1] >= thr.rel) )
  
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [21] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [81] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[101] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

# a second true is found

So my bump coordinates are determined

Browse[2]> str(bumps)
 int [1:2] 14 91    
Browse[2]> n
Browse[2]> str(nbump)
 num 3

Below is the result of phi (temp).
It seems that many choices are possible for thr.rel (other than the default 0.5).
What is the best choice?

Browse[2]> txtplot::txtplot(temp) # result of phi
    +---+--------------+--------------+---------------+--------------+---------------+---------------+
  1 +   *                                                                                    ***     +
    |                                                                                       *        |
    |    **                                                                             ****         |
    |                                                                                                |
0.8 +                                                                                                +
    |      ***                                                                       ***             |
    |                                                                                                |
    |                                                                                                |
    |         **                                                                   ***               |
0.6 +                                                                                                +
    |                                                                                                |
    |           ***                                                           *****                  |
    |                                                                                                |
0.4 +                                                                                                +
    |              ***                                               **********                      |
    |                                                                                                |
    |                                                                                                |
    |                 ****                                        ***                                |
0.2 +                                                                                                +
    |                                                                                                |
    |                     ******                           *******                                   |
    |                           *****              ********                                          |
  0 +                                **************                                                  +
    +---+--------------+--------------+---------------+--------------+---------------+---------------+
        0             20             40              60             80              100

So back to this.

library(UBL)

myrel <- matrix(c(1,0,0, 90,1,0), ncol=3, byrow=TRUE)
myrel
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]   90    1    0

set.seed(1L)
xy <- data.frame(x = 1:100, y = rnorm(100, 90, 1))
range(xy$y)
# [1] 87.78530 92.40162

res <- SmoteRegress(y ~ ., xy, rel=myrel)
# Error in last:bumps[i] : argument of length 0

This is the normal distribution.

Browse[2]> txtplot::txtplot(s.y)
   +---+----------------+-----------------+-----------------+-----------------+----------------+-----+
   |                                                                                           *     |
   |                                                                                          *      |
92 +                                                                                          *      +
   |                                                                                                 |
   |                                                                                       ***       |
   |                                                                                    ***          |
   |                                                                               *****             |
91 +                                                                            ***                  +
   |                                                                     *******                     |
   |                                                           **********                            |
   |                                                   ********                                      |
90 +                                            *******                                              +
   |                                  **********                                                     |
   |                            ******                                                               |
   |                   *********                                                                     |
   |              ******                                                                             |
89 +            **                                                                                   +
   |         ***                                                                                     |
   |       **                                                                                        |
   |      *                                                                                          |
   |     *                                                                                           |
88 +    *                                                                                            +
   |   *                                                                                             |
   +---+----------------+-----------------+-----------------+-----------------+----------------+-----+
       0               20                40                60                80               100

The reason for the error is that

  bumps <- c()
  for (i in 1:(length(y) - 1)) { 
#     if (temp[i] * temp[i + 1] < 0) bumps <- c(bumps, i) 
    if ((temp[i] >= thr.rel && temp[i+1] < thr.rel) || 
        (temp[i] < thr.rel && temp[i+1] >= thr.rel)) {
      bumps <- c(bumps, i)
    }
   }
  nbump <- length(bumps) + 1 # number of different "classes"

did not find any bumps,
so

> is.null(bumps)
[1] TRUE
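
A rough sketch of why no crossing is found here: with control points (1, 0) and (90, 1), every y in the observed range already has a relevance far above 0.5. The code below assumes a cubic Hermite interpolation between the control points and a relevance that stays at 1 above the last control point (hence the `pmin`). Both are assumptions for illustration, not the UBL internals, and the y values are a few hand-picked points spanning the reported range.

```r
# Assumed interpolant through the control points (1, 0) and (90, 1), zero slopes.
phi_sketch <- splinefunH(c(1, 90), c(0, 1), c(0, 0))

# A few values spanning the reported range of xy$y.
s.y <- c(87.78530, 89, 90, 91, 92.40162)
temp <- phi_sketch(pmin(s.y, 90))   # assumption: relevance is capped at the last control point
thr.rel <- 0.5

crossings <- which(diff(temp >= thr.rel) != 0)
length(crossings)                   # no crossing of thr.rel

# A guard of this kind would turn the obscure error into a readable message.
if (all(temp >= thr.rel) || all(temp < thr.rel)) {
  message("All cases fall on one side of thr.rel: nothing to resample.")
}
```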

Therefore, I need to find a better choice for thr.rel.

Browse[2]> txtplot::txtplot(temp) # phi
       +--+----------------+----------------+----------------+-----------------+----------------+----+
     1 +                               **********************************************************    +
       |                        *******                                                              |
       |                ********                                                                     |
       |             ***                                                                             |
       |             *                                                                               |
       |            *                                                                                |
0.9995 +           *                                                                                 +
       |        ***                                                                                  |
       |                                                                                             |
       |       *                                                                                     |
       |      **                                                                                     |
       |                                                                                             |
 0.999 +                                                                                             +
       |                                                                                             |
       |                                                                                             |
       |     *                                                                                       |
       |                                                                                             |
       |                                                                                             |
0.9985 +    *                                                                                        +
       |                                                                                             |
       |                                                                                             |
       |                                                                                             |
       |   *                                                                                         |
       +--+----------------+----------------+----------------+-----------------+----------------+----+
          0               20               40               60                80               100

X My questions are the following.
X Given my input,
X
X To get my bumps and avoid the error, it seems that I need to choose a larger threshold (thr.rel) than the default 0.5, so some number in the range 0.9985 - 0.9995 may work?
X What number would be the best choice?
X
X Please help.
X
X Note, reading Chapter 3 Utility-based Regression of
X

Ribeiro, R.P.: Utility-based Regression. PhD thesis, Dep. Computer Science,
Faculty of Sciences - University of Porto (2011)

X helps.
X
X Thank you,
X Andre Mikulec
X Andre_Mikulec@Hotmail.com

Never mind,
I found this on the internet (and not in the package documentation).
This explains everything else.

UBL: an R package for Utility-based Learning
Paula Branco, Rita P. Ribeiro, Luis Torgo
(Submitted on 27 Apr 2016 (v1), last revised 12 Jul 2016 (this version, v2))
https://arxiv.org/abs/1604.08079

This issue can be closed.

Thank you,
Andre Mikulec
Andre_Mikulec@Hotmail.com

Answer 4 · 2018-03-26T15:02:34.000Z

First of all, thank you for the interest in our package.
Thanks to your last comment I detected a missing verification and error message in the regression functions.

The two examples that I previously suggested (which work) were the following:

Use the normal distribution instead:

xy <- data.frame(x = 1:100, y = rnorm(100, 0, 1))
res <- SmoteRegress(y ~ ., xy)

Or, use the data set originally provided and define a relevance function by providing a matrix in the relevance parameter:

xy <- data.frame(x = 1:100, y = 1:100)
myrel <- matrix(c(1,0,0, 90,1,0), ncol=3, byrow=TRUE)
res <- SmoteRegress(y ~ ., xy, rel=myrel)

Still, it is important to clarify why the example below is not working.
Given that this may be useful to a wider audience, I will try to provide a more detailed explanation.

myrel <- matrix(c(1,0,0, 90,1,0), ncol=3, byrow=TRUE)
xy <- data.frame(x = 1:100, y = rnorm(100, 90, 1))
res <- SmoteRegress(y ~ ., xy, rel=myrel)

First of all, the resampling functions for both classification and regression tasks in the UBL package have default parameters designed for the context of imbalanced domains.

In a regression context, the user must provide a relevance score for the target variable values that sets which are the most and least important cases. In classification, it is assumed that the smaller classes are the most important ones while the most represented classes are the least important.

In regression it may be more difficult for the user to provide this information. Therefore, Ribeiro, 2011 provided an automatic method for obtaining a relevance function in a regression context, assuming that the least represented values are the most important ones (more details in Ribeiro, 2011). We use this automatic method in UBL functions that require the definition of a relevance function.

This means that by using the automatic method for the relevance function (rel = "auto"), a higher relevance score is assigned to the rare, extreme values of the target variable.

When the target variable distribution is uniform, this method is not able to assign different relevance scores across the problem domain and therefore assigns a score of zero to all target values.
When this happens, the function being used returns an error, as we can observe below:

xy <- data.frame(x = 1:100, y = 1:100)
SmoteRegress(y ~ ., xy)
Error in SmoteRegress(y ~ ., xy) : All the points have relevance 0.
         Please, redefine your relevance function!
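
A small base-R sketch of why the uniform target ends up with no relevant values. It assumes, as the `extr$stats` expression in the error trace earlier in this thread suggests, that the automatic method is built on box-plot statistics; the variable names are only for illustration and this is not the UBL implementation.

```r
# Assumption: the automatic relevance method treats box-plot outliers
# as the rare, extreme values that deserve high relevance.
uniform_y <- 1:100             # the uniform target from the failing example
iris_y <- iris$Sepal.Width     # a target with values beyond the whiskers

uniform_out <- grDevices::boxplot.stats(uniform_y)$out
iris_out <- grDevices::boxplot.stats(iris_y)$out

length(uniform_out)  # how many extreme values the uniform target has
sort(iris_out)       # the extreme values found at both ends for iris
```

With no values beyond the whiskers there is nothing for an extremes-based relevance function to single out, which matches the "All the points have relevance 0" error.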

All the functions suitable for regression problems require the definition of a relevance score in [0,1]. This may be easily solved by using the automatic method for obtaining the relevance function.

Below you can see three examples of the relevance function obtained through the automatic method described in Ribeiro, 2011.

library(UBL)
data(iris)
y <- iris$Sepal.Width

s.y <- sort(y)
#  consider both extremes of the target variable distribution as potentially important
phiF.args <- phi.control(s.y,method="extremes",extr.type="both")
y.phi <- phi(s.y, control.parms=phiF.args)
plot(s.y, y.phi,"l", xlab="y", ylab=expression(phi(y)))

# assign higher relevance only to the higher values of the target variable
phiF.args <- phi.control(s.y,method="extremes",extr.type="high")
y.phi <- phi(s.y, control.parms=phiF.args)
plot(s.y, y.phi,"l", xlab="y", ylab=expression(phi(y)))

# assign higher relevance only to the lower values of the target variable
phiF.args <- phi.control(s.y,method="extremes",extr.type="low")
y.phi <- phi(s.y, control.parms=phiF.args)
plot(s.y, y.phi,"l", xlab="y", ylab=expression(phi(y)))

As you can observe, in all these cases, the relevance ranges between zero and one.
Thus, there are always examples with a relevance score below and above 0.5, which is the default value of the relevance threshold (parameter thr.rel).
The relevance function scores and the relevance threshold are used to generate data subsets containing cases with low and high relevance in relation to the threshold.

These functions require that at least two subsets (there can be more) are generated: one containing the cases with a low relevance score and another containing the high relevance cases. These two subsets correspond, respectively, to the majority and minority class cases in binary classification problems.
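
The subset generation described above can be sketched in base R as follows. The helper `split_by_relevance` is hypothetical (it is not a UBL function) and only illustrates how consecutive runs of the sorted target fall above or below the threshold.

```r
# Hypothetical helper, not part of UBL: group positions of the sorted target
# into consecutive runs that are all >= thr.rel or all < thr.rel.
split_by_relevance <- function(phi_sorted, thr.rel = 0.5) {
  high <- phi_sorted >= thr.rel
  # a new run starts wherever the high/low flag changes
  run_id <- cumsum(c(TRUE, diff(high) != 0))
  split(seq_along(phi_sorted), run_id)
}

# toy relevance scores for a sorted target with two relevant extremes
phi_toy <- c(1, 0.8, 0.4, 0.1, 0.3, 0.7, 1)
runs <- split_by_relevance(phi_toy)
length(runs)  # number of subsets ("classes") found
runs
```

If every score were on the same side of the threshold, only one run would come out, which is the situation discussed next.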

The problem with the example below is that all cases have high relevance.

myrel <- matrix(c(1,0,0, 90,1,0), ncol=3, byrow=TRUE)
set.seed(1L)
xy <- data.frame(x = 1:100, y = rnorm(100, 90, 1))
res <- SmoteRegress(y ~ ., xy, rel=myrel)

First, the relevance provided through the matrix myrel sets the following:

the first line (1,0,0) sets that the target variable value 1 has relevance 0 and the slope of the relevance function in this point should be 0;
the second line (90, 1, 0) sets that the value of 90 has relevance 1 and the slope of the relevance function in this point should be 0.

Let us now see the relevance scores of the domain obtained by using this relevance matrix:

phiF.args <- phi.control(xy$y, method="range", control.pts=myrel)
y.phi <- phi(xy$y, control.parms=phiF.args)
plot(xy$y, round(y.phi,2))

As you can see, all cases have relevance close to 1. This is not surprising because the target variable values are being sampled from a normal distribution with mean 90 and standard deviation 1, and the value 90 is being signaled as having relevance 1 in matrix myrel.
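
A rough check of this claim that does not need UBL. It assumes the relevance function can be approximated by a cubic Hermite interpolation through the control points (value, relevance, slope), here done with stats::splinefunH; this is a stand-in for phi(), not its actual implementation.

```r
# Assumption: approximate the relevance function by cubic Hermite
# interpolation through the control points (value, relevance, slope).
myrel <- matrix(c(1, 0, 0, 90, 1, 0), ncol = 3, byrow = TRUE)
approx_phi <- stats::splinefunH(myrel[, 1], myrel[, 2], m = myrel[, 3])

set.seed(1L)
y <- rnorm(100, 90, 1)
rel_scores <- approx_phi(y)

range(rel_scores)      # all scores sit very close to 1
sum(rel_scores < 0.5)  # number of cases below the default threshold
```

Because the relevance only climbs from 0 at y = 1 to 1 at y = 90, a sample concentrated around 90 leaves no case below the default threshold.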

In this setting, and considering that thr.rel=0.5, it is not possible to use SmoteRegress, or any other function for resampling in regression from package UBL, because it is necessary to have at least two subsets of the domain, one with high and one with low relevance scores.

I was not checking if this was happening in the functions for regression.
I will add an error message for this in the next UBL version.
Thank you for reporting this!
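
Until such a message exists, a user-side guard along these lines could catch the situation before calling a resampling function. The function `check_relevance_split` is hypothetical and is not the check that UBL ships.

```r
# Hypothetical guard, not part of UBL: fail early unless the relevance
# scores fall on both sides of the threshold.
check_relevance_split <- function(rel_scores, thr.rel = 0.5) {
  n_high <- sum(rel_scores >= thr.rel)
  n_low <- length(rel_scores) - n_high
  if (n_high == 0 || n_low == 0) {
    stop("All cases fall on one side of thr.rel = ", thr.rel,
         "; redefine the relevance function or the threshold.")
  }
  invisible(c(low = n_low, high = n_high))
}

# passes: two low and two high cases
counts <- check_relevance_split(c(0.1, 0.2, 0.9, 1))

# fails: every case has relevance 1, so the error message is captured
all_high <- tryCatch(check_relevance_split(rep(1, 10)),
                     error = function(e) conditionMessage(e))
all_high
```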

Another question was related to the following:

How can we set the threshold on the relevance values (thr.rel) to ensure that a data set is precisely partitioned in some specific values?

Let us consider, for instance, the iris data set,

data(iris)
# the target variable
y <- iris$Sepal.Width
range(y)
# [1] 2.0 4.4

Let us observe what the automatic method for obtaining the relevance function provides:

s.y <- sort(y)
phiF.args <- phi.control(s.y,method="extremes",extr.type="both")
y.phi <- phi(s.y, control.parms=phiF.args)
plot(s.y, y.phi,"l", xlab="y", ylab=expression(phi(y)))

Let us assume that you want to have a relevance function with two extremes with high relevance,
and you want to partition the domain into the following subsets: [2.0, 2.4], ]2.4, 3.3[, [3.3, 4.4].

This means that the domain examples with target variable values in [2.0, 2.4] or [3.3, 4.4] are the most relevant ones, while the cases with target variable values in the interval ]2.4, 3.3[ are the less important ones.

You can define this as follows and you can maintain the relevance threshold at 0.5:

# set the higher and lower relevance values (relevance 1 and 0)
# and set the values where the relevance should be precisely 0.5 to match the threshold 
myrel <- matrix(c(2.0,1,0, 2.4,0.5,-1, 3,0,0, 3.3,0.5,1, 4.4,1,0), ncol=3, byrow=TRUE)  
myrel
     [,1] [,2] [,3]
[1,]  2.0  1.0    0
[2,]  2.4  0.5   -1
[3,]  3.0  0.0    0
[4,]  3.3  0.5    1
[5,]  4.4  1.0    0
phiF.args <- phi.control(s.y, method="range", control.pts=myrel)
y.phi <- phi(s.y, control.parms=phiF.args)
plot(s.y,y.phi, "l")
abline(h=0.5, lty=2)

By using this relevance function and a relevance threshold of 0.5, you will be able to partition your data set precisely where you wanted!
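
The partition can be cross-checked without UBL. The sketch below assumes that a cubic Hermite interpolation through the control points (stats::splinefunH) is a fair approximation of the relevance function; it is not the phi() implementation itself.

```r
# Assumption: cubic Hermite interpolation through the control points
# (value, relevance, slope) stands in for the relevance function.
myrel <- matrix(c(2.0, 1, 0,  2.4, 0.5, -1,  3, 0, 0,  3.3, 0.5, 1,  4.4, 1, 0),
                ncol = 3, byrow = TRUE)
approx_phi <- stats::splinefunH(myrel[, 1], myrel[, 2], m = myrel[, 3])

y <- iris$Sepal.Width
high <- approx_phi(y) >= 0.5   # cases at or above the threshold

# compare with the intended intervals [2.0, 2.4] and [3.3, 4.4]
table(high, intended = y <= 2.4 | y >= 3.3)
```

Placing control points with relevance exactly 0.5 at 2.4 and 3.3 is what makes the threshold cut the domain at those two values.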

References

Ribeiro, Rita P. : Utility-based Regression. PhD thesis, Dep. Computer Science, Faculty of Sciences - University of Porto, 2011

Answer 5 · 2018-03-28T02:50:14.000Z

Thank you professor,

Yes, you have successfully explained.
Thank you for the high amount of effort.

In my case, I am trying to use UBL the way a business person would use it.

Let me explain.

For example, as a "user" I have a distribution, say the
normal distribution with a mean of 90 and a standard deviation of one (1).

Values above 90 are important to me. Values below 90 are not important.
Therefore, I want new observations to be above 90.
I want some observations that are below 90, perhaps, to be removed
(so it will be unbalanced but this is the way I want it).

"you can maintain the relevance threshold at 0.5"

The "threshold itself" (or default threshold ) does not help me.
It does not have a direct/useful relation to 90.

For a 'Smote for Regression' problem, and a simple two hump problem,
I need to find a threshold (and hopefully the best/optimal threshold) such that I will get
new observations above 90 and lose observations that are below 90.
I am going to use this.

myrel
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]   90    1    0

I am going to tell the program the following:

- going from left to right, find the s.y position where the relevance function first equals one (1);
- the relevance function cannot be equal to zero (meaning it cannot be on the x-axis).

Therefore, during this last weekend, on Saturday March 28,
using much of the code of SmoteRegress, I wrote a small program that seems to do this.
It is a curve tracer.
Some trial and error was involved in figuring it out.
I have done only a tiny amount of testing.

Again, it 'currently' only works for 'two hump' 'Smote for Regression' problems,
and only for the problems that I tested.

The new logic/code is the following.

    # where the relevance function value of number 1 first appears
    temp.one.start.pos <- match(1, temp)
    
    # s.y value at temp.one.start.pos
    # prefer y.extreme.start to be less than or equal to this value: s.y.start.value.at.temp.one
    s.y.start.value.at.temp.one <- s.y[temp.one.start.pos]
    message(paste0("  s.y == ", s.y.start.value.at.temp.one," at the first 'temp == 1' at s.y position index ", temp.one.start.pos))
    
    y.phi.df <- data.frame(rn = seq_along(s.y), sorty=s.y, y.phi = temp)

    # determine where the y value ( y.extreme.start ) 
    # crosses the line of the relevance function;
    # the relevance at that position must not be on the x-axis ( not zero )
    y.phi.df.at.thr.rel.candidate <- y.phi.df[match(TRUE,(y.extreme.start <= y.phi.df$sorty) & ( 0 != y.phi.df$y.phi )),,drop = FALSE]
    
    message(paste0("  chose thr.rel == ", y.phi.df.at.thr.rel.candidate$y.phi, " at s.y == ", y.phi.df.at.thr.rel.candidate$sorty, " at s.y position index ", y.phi.df.at.thr.rel.candidate$rn))

So, I tell that 90 is important

myrel
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]   90    1    0

Here (copied from far below ), is a test/example.

set.seed(1L)
xy <- data.frame(x = 1:100, y = rnorm(100, 90, 1))
# range(xy$y)
# [1] 87.78530 92.40162
# I define where the extreme starts
RegressMngd <- SmoteRegressMngd(y ~ ., xy, rel=myrel, y.extreme.start = 90, C.perc=list(0.5,2.5))

Results ( new here and not copied from below ) follow.

Begin SmoteRegressMngd
       relevance function: phi results
         +-+---------+--------+---------+---------+--------+---+
       1 +            **************************************   +
         |       *****                                         |
         |      **                                             |
  0.9995 +     **                                              +
         |    *                                                |
p        |   *                                                 |
h  0.999 +                                                     +
i        |  *                                                  |
         |                                                     |
  0.9985 +  *                                                  +
         |                                                     |
         | *                                                   |
         +-+---------+--------+---------+---------+--------+---+
           0        20       40        60        80       100
                          s.y position index
  s.y == 90.0011053516316 at the first 'temp == 1' at s.y position index 47
  chose thr.rel == 1 at s.y == 90.0011053516316 at s.y position index 47
End   SmoteRegressMngd

> str(RegressMngd)
'data.frame':   158 obs. of  2 variables:
 $ x: num  3 54 98 74 75 80 16 35 62 100 ...
 $ y: num  89.2 88.9 89.4 89.1 88.7 ...

More observations are above 90 than below 90.

> table(RegressMngd$y > 90)

FALSE  TRUE
   23   135

Therefore, I guess that the '2 hump only' Smote for Regression program 'currently' works.
Plotted are the results.

> txtplot::txtplot(sort(RegressMngd$y))
   +-+----------------+----------------+----------------+------+
   |                                                       *   |
92 +                                                     **    +
   |                                                    **     |
   |                                                *****      |
91 +                                       *********           +
   |                             ***********                   |
   |              ****************                             |
90 +          *****                                            +
   |       ***                                                 |
   |    ***                                                    |
   |    *                                                      |
89 +   *                                                       +
   |  **                                                       |
   +-+----------------+----------------+----------------+------+
     0               50               100              150

Here is the original normal distribution ( mean = 90, sd = 1)

> txtplot::txtplot(sort(xy$y)) 
   +-+----------+----------+----------+---------+----------+---+
   |                                                       *   |
92 +                                                      *    +
   |                                                   ****    |
   |                                                ****       |
91 +                                          ******           +
   |                               ***********                 |
90 +                       ********                            +
   |                ********                                   |
   |         ********                                          |
89 +       ***                                                 +
   |   ****                                                    |
   |   *                                                       |
88 +  *                                                        +
   +-+----------+----------+----------+---------+----------+---+
     0         20         40         60        80         100
>

Program with examples/tests

# same as UBL::SmoteRegress ( note: extreme values are ONLY upper/higher/greater values )
# except the user must pass exactly one of the following: thr.rel xor y.extreme.start
# y.extreme.start: defines where the extreme values start
#                  on behalf of the user, parameter y.extreme.start 
#                  will determine the value of the parameter thr.rel 
SmoteRegressMngd <- function(form, dat, rel = "auto", thr.rel = NULL,
                         C.perc = "balance", k = 5, repl = FALSE,
                         dist   = "Euclidean", p = 2, 
                         y.extreme.start = NULL) {  
  
  message("Begin SmoteRegressMngd")
  
  # assume that the user knows what he/she is doing
  if(!is.null(thr.rel)) {
    
    # regular call
    RegressMngd <- UBL::SmoteRegress(form = form, dat = dat, rel = rel, thr.rel = thr.rel,
                                     C.perc = C.perc, k = k, repl = repl,
                                     dist = dist, p = p)
  } else {  
    
    if(is.null(y.extreme.start)) stop("  since parameter thr.rel was not provided, then parameter y.extreme.start MUST be provided")

    ops <- options()
    on.exit(options(ops), add = TRUE) # restore the user's options even if stop() is called below
    options(digits = 22) # print small values
    options(scipen = 255)  # print small values
    options(warn = 1)
    
    require(UBL)  
    # uses UBL SmoteRegress
           
    ###############################################
    # BEGIN "copy of code from UBL::SmoteRegress" #
                    
    if (any(is.na(dat))) {
      stop("The data set provided contains NA values!")
    }
  
    # the column where the target variable is
    tgt <- which(names(dat) == as.character(form[[2]]))
  
    if (tgt < ncol(dat)) {
      orig.order <- colnames(dat)
      cols <- 1:ncol(dat)
      cols[c(tgt, ncol(dat))] <- cols[c(ncol(dat), tgt)]
      dat <- dat[, cols]
    }
    
    # END   "copy of code from UBL::SmoteRegress" #
    ###############################################
    
    ###############################################
    # BEGIN "copy of code from UBL::SmoteRegress" #
    
    y <- dat[, ncol(dat)]
    attr(y, "names") <- rownames(dat)
    s.y <- sort(y)
  
    if (is.matrix(rel)) {
      pc <- phi.control(y, method = "range", control.pts = rel)
    } else if (is.list(rel)) {
      pc <- rel
    } else if (rel == "auto") {
      pc <- phi.control(y, method = "extremes")
    } else {# handle other relevance functions and not using the threshold!
      stop("future work!")
    }
  
    temp <- y.relev <- phi(s.y, pc)
    if (!length(which(temp < 1))) {
      stop("All the points have relevance 1.
           Please, redefine your relevance function!")
    }
    if (!length(which(temp > 0))) {
      stop("All the points have relevance 0.
           Please, redefine your relevance function!")
    }
    
    # END   "copy of code from UBL::SmoteRegress" #
    ###############################################

    # see
    # Chapter 3 Utility-based Regression of
    # Ribeiro, R.P.: Utility-based Regression. PhD thesis, Dep. Computer Science,
    # Faculty of Sciences - University of Porto (2011)
    # http://www.dcc.fc.up.pt/~rpribeiro/publ/rpribeiroPhD11.pdf
    # Rita P. Ribeiro
    # Faculty of Sciences, University of Porto
    # Verified email at dcc.fc.up.pt
    # https://scholar.google.com/citations?user=ptDBgpkAAAAJ&hl=en

    # see
    # 6.4 The SmoteR Algorithm
    # UBL: an R package for Utility-based Learning
    # Paula Branco, Rita P. Ribeiro, Luis Torgo
    # (Submitted on 27 Apr 2016 (v1), last revised 12 Jul 2016 (this version, v2))
    # https://arxiv.org/abs/1604.08079

    # *new* code follows
       
    writeLines("       relevance function: phi results")
    txtplot::txtplot(temp, xlab = "s.y position index", ylab = "phi")
    
    # where the relevance function value of number 1 first appears
    temp.one.start.pos <- match(1, temp)
    
    # s.y value at temp.one.start.pos
    # prefer y.extreme.start to be less than or equal to this value: s.y.start.value.at.temp.one
    s.y.start.value.at.temp.one <- s.y[temp.one.start.pos]
    message(paste0("  s.y == ", s.y.start.value.at.temp.one," at the first 'temp == 1' at s.y position index ", temp.one.start.pos))
    
    y.phi.df <- data.frame(rn = seq_along(s.y), sorty=s.y, y.phi = temp)

    # determine where the y value ( y.extreme.start ) 
    # crosses the line of the relevance function;
    # the relevance at that position must not be on the x-axis ( not zero )
    y.phi.df.at.thr.rel.candidate <- y.phi.df[match(TRUE,(y.extreme.start <= y.phi.df$sorty) & ( 0 != y.phi.df$y.phi )),,drop = FALSE]
    
    message(paste0("  chose thr.rel == ", y.phi.df.at.thr.rel.candidate$y.phi, " at s.y == ", y.phi.df.at.thr.rel.candidate$sorty, " at s.y position index ", y.phi.df.at.thr.rel.candidate$rn))
    # unfortunately I can not add this line to a txtplot::txtplot
    
    # thr.rel candidates
    thr.rel.candidate <- y.phi.df.at.thr.rel.candidate$y.phi
    
    # final decision
    thr.rel <- thr.rel.candidate
    
    ###############################################
    # BEGIN "copy of code from UBL::SmoteRegress" #
                    
  #  temp[which(y.relev >= thr.rel)] <- -temp[which(y.relev >= thr.rel)]
    bumps <- c()
    for (i in 1:(length(y) - 1)) {
  #     if (temp[i] * temp[i + 1] < 0) bumps <- c(bumps, i)
      if ((temp[i] >= thr.rel && temp[i+1] < thr.rel) ||
          (temp[i] < thr.rel && temp[i+1] >= thr.rel)) {
        bumps <- c(bumps, i)
      }
     }
    nbump <- length(bumps) + 1 # number of different "classes" 
  
    # END   "copy of code from UBL::SmoteRegress" #
    ###############################################
    
    # *new* code follows

    if(nbump == 1) { # because bumps is NULL
      
      stop("  phi failed to produce a good relevance function: This should not happen.")

    } else {
    
      # regular call
      Regress <- UBL::SmoteRegress(form = form, dat = dat, rel = rel, thr.rel = thr.rel,
                                   C.perc = C.perc, k = k, repl = repl,
                                   dist = dist, p = p)
      
      RegressMngd <- Regress
      
      options(ops)
    
    }  
  }
  
  message("End   SmoteRegressMngd")
  
  return(RegressMngd)
 
}
# old stuff ...
#
# # pass through to the default method
#
# ir  <- iris[-c(95:130), ]
# range(ir$Sepal.Width)
# # [1] 2.0000000000000000 4.4000000000000004
# rel <- matrix(0, ncol = 3, nrow = 0)
# rel <- rbind(rel, c(2, 1, 0))
# rel <- rbind(rel, c(3, 0, 0))
# rel <- rbind(rel, c(4, 1, 0))
#
# # default method
# RegressMngd <- SmoteRegressMngd(Sepal.Width~., ir, thr.rel = 0.5, dist = "HEOM", C.perc=list(0.5,2.5))
# 
# txtplot::txtplot(ir$Sepal.Width)
# txtplot::txtplot(RegressMngd$Sepal.Width)
#
# new stuff ...
#
# RegressMngd <- SmoteRegressMngd(Sepal.Width~., ir, y.extreme.start = 4, dist = "HEOM", C.perc=list(0.5,2.5))
# 
# txtplot::txtplot(ir$Sepal.Width)
# txtplot::txtplot(RegressMngd$Sepal.Width)

# # use y.extreme.start to calculate thr.rel
# 
# myrel <- matrix(c(1,0,0, 90,1,0), ncol=3, byrow=TRUE)
# myrel
# #      [,1] [,2] [,3]
# # [1,]    1    0    0
# # [2,]   90    1    0
# set.seed(1L)
# xy <- data.frame(x = 1:100, y = rnorm(100, 90, 1))
# range(xy$y)
# # [1] 87.78530 92.40162
# 
# # I define where the extreme starts
# RegressMngd <- SmoteRegressMngd(y ~ ., xy, rel=myrel, y.extreme.start = 90, C.perc=list(0.5,2.5))
# 
# txtplot::txtplot(xy$y)
# txtplot::txtplot(RegressMngd$y)

# data(ImbR)
# mysm <- SmoteRegressMngd(Tgt~., ImbR,  y.extreme.start = 20 , dist="Manhattan", C.perc= list(0.1, 8))
# txtplot::txtplot(ImbR$Tgt)
# txtplot::txtplot(mysm$Tgt)

Thanks,
Andre Mikulec
Andre_Mikulec@Hotmail.com

Answer 6 · 2018-03-28T10:03:28.000Z

Thank you for the effort in explaining your problem.

Maybe my explanation was not clear enough, because I think you can use the original SmoteRegress function as it is, as long as you can precisely specify (as you did) which values are important and which are not.
The problem is with the relevance matrix definition.

Let us consider the data set

set.seed(1L)
xy <- data.frame(x = 1:100, y = rnorm(100, 90, 1))

What you want is:

Values above 90 are important to me. Values below 90 are not important.
Therefore, I want new observations to be above 90.
I want some observations that are below 90, perhaps, to be removed

So you just need to specify this in the relevance function, as follows:

Choose a threshold on the relevance (any value will do; it only serves to set which values will be the relevant ones). Let us assume you choose the default (0.5).
Define your relevance function, specifying that:
- values below 90 are not important (matrix row: 89, 0, 0);
- 90 is the first important value: assign the threshold to this value (matrix row: 90, 0.5, 0.5);
- values above 90 are important (matrix row: 91, 1, 0)

And that's it!

This means that 90 is the first important value (it matches the selected threshold) and the values above 90 are also important. Values below 90 will have a relevance below the threshold (0.5) and are not important.

Using the example data:

set.seed(1L)
xy <- data.frame(x = 1:100, y = rnorm(100, 90, 1))

myrel <- matrix(c(89, 0, 0, 90, 0.5, 0.5, 91, 1, 0), ncol = 3, byrow = TRUE)

# undersample at 0.5 and oversample at 2.5
res <- SmoteRegress(y ~ ., xy, rel = myrel, C.perc = list(0.5, 2.5))

table(res$y > 90)

FALSE  TRUE 
   23   135
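
As a cross-check of the partition itself, independent of UBL, the sketch below assumes that a cubic Hermite interpolation through the control points (stats::splinefunH) approximates the relevance function; it is not the phi() implementation.

```r
# Assumption: cubic Hermite interpolation through the control points
# (value, relevance, slope) stands in for the relevance function.
myrel <- matrix(c(89, 0, 0, 90, 0.5, 0.5, 91, 1, 0), ncol = 3, byrow = TRUE)
approx_phi <- stats::splinefunH(myrel[, 1], myrel[, 2], m = myrel[, 3])

set.seed(1L)
y <- rnorm(100, 90, 1)
high <- approx_phi(y) >= 0.5   # cases treated as relevant at thr.rel = 0.5

# the relevant cases should be exactly those at or above 90
table(high, at_or_above_90 = y >= 90)
```

If the original split is 46 cases below 90 and 54 at or above it, as the position index 47 reported earlier in the thread suggests, the resampled counts are consistent with C.perc: 0.5 × 46 = 23 and 2.5 × 54 = 135.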

Finally, the original y distribution:

> txtplot::txtplot(sort(xy$y))
   +-+----------+-----------+----------+----------+----------+---+
   |                                                         *   |
92 +                                                        *    +
   |                                                     ***     |
   |                                                 *****       |
91 +                                           *******           +
   |                                ************                 |
90 +                        ********                             +
   |                 ********                                    |
   |          *******                                            |
89 +       ***                                                   +
   |   ****                                                      |
   |   *                                                         |
88 +  *                                                          +
   +-+----------+-----------+----------+----------+----------+---+
     0         20          40         60         80         100

and the y distribution after applying SmoteRegress:

> txtplot::txtplot(sort(res$y))
   +-+-----------------+-----------------+----------------+------+
   |                                                        **   |
92 +                                                       **    +
   |                                                      **     |
   |                                                 ******      |
91 +                                         *********           +
   |                              ***********                    |
   |              *****************                              |
90 +          *****                                              +
   |       ****                                                  |
   |     ***                                                     |
   |    *                                                        |
89 +   **                                                        +
   |  **                                                         |
   +-+-----------------+-----------------+----------------+------+
     0                50                100              150

Hope this helps!

Answer 7 · 2018-03-29T07:53:11.000Z

Professor,
Now, I completely understand.
(I know how to create a simple solution.)
Thanks,
Andre Mikulec
Andre_Mikulec@Hotmail.com