yPred from predict is not the class label

Question

Hello,
Is this the intended behavior, or have I misunderstood what yPred is?
I was using yPred as the predicted class label, which does not seem to be what it holds.
A minimal reproducible example follows.

# From package help:
library(Rborist)
# Classification example:
data(iris)
# Generic invocation:
rb <- Rborist(iris[,-5], iris[,5])
pred <- predict(rb, iris[,-5], ctgCensus = "prob")
yPred <- pred$yPred

yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2
[55] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3
[109] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Training uses column 5, Species, as the target variable.

levels(iris[,5])
[1] "setosa"     "versicolor" "virginica"
yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[56] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[111] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

predict returns numeric codes. From this, I understand that 1 corresponds to the level setosa, 2 to versicolor, and 3 to virginica.
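That reading can be checked in base R without calling Rborist. The sketch below assumes that yPred holds 1-based positions into the levels of the training response; `codes` is a made-up stand-in for pred$yPred, not real model output.

```r
# Levels of the response used for training, in their stored order.
train_levels <- levels(iris$Species)

# Hypothetical stand-in for pred$yPred: integer codes, not labels.
codes <- c(1L, 1L, 2L, 3L, 2L)

# Index the level vector by code to recover the labels,
# then rebuild a factor that carries the full training level set.
labels <- train_levels[codes]
decoded <- factor(labels, levels = train_levels)

print(decoded)
```

Keeping the result as a factor with the training levels means a class that never gets predicted still appears in tables built from it.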

What if I encode the levels as numerics?

library(dplyr)  # provides %>% and mutate()
iris_mod <- iris %>%
  mutate(species_num = as.factor(as.numeric(Species)))

rb_with <- Rborist(iris_mod[,-c(5,6)], iris_mod$species_num)

rb_with$validation$yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[67] 2 2 2 2 3 2 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3
[133] 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

pred_with <- predict(rb_with, iris_mod[,-c(5,6)])
pred_with$yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[56] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
[111] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

# Missing level 2 (versicolor)
iris_mod2 <- iris %>%
  filter(!Species %in% 'versicolor') %>%
  mutate(species_num = as.factor(as.numeric(Species)))

levels(iris_mod2$species_num)
[1] "1" "3"

Class label "2" is missing from the training dataset and thus cannot be predicted.

rb_without <- Rborist(iris_mod2[,-c(5,6)], iris_mod2$species_num, ctgCensus = "prob")

# Level "2" in yPred ?
rb_without$validation$yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
[67] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
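One plausible reading is that the 2s are positions in the level vector rather than the label "2". The base-R sketch below does not call Rborist; it only shows that a factor with levels "1" and "3" stores the integer codes 1 and 2. The `response` vector is invented for illustration.

```r
# A response whose labels skip "2", as in iris_mod2$species_num.
response <- factor(c("1", "1", "3", "3", "1"))

levels(response)       # the labels: "1" "3"
as.integer(response)   # the internal codes: 1 1 2 2 1

# Code 2 therefore refers to the second level, whose label is "3".
decoded <- levels(response)[as.integer(response)]
```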

# Labels are Ok
rb_without$validation$confusion
   1  3
1 50  0
3  1 49

# Labels are Ok
head(rb_without$validation$prob)
            1            3
[1,] 1.000000 0.0000000000
[2,] 1.000000 0.0000000000
[3,] 1.000000 0.0000000000
[4,] 1.000000 0.0000000000
[5,] 1.000000 0.0000000000
[6,] 0.999098 0.0009020076

pred_without <- predict(rb_without, iris[,-5], ctgCensus = "prob")

# Levels "2" in yPred ?
pred_without$yPred
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
[56] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2
[111] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

# Labels are Ok
head(pred_without$census)
       1 3
[1,] 500 0
[2,] 500 0
[3,] 500 0
[4,] 500 0
[5,] 500 0
[6,] 500 0

# Labels are Ok
head(pred_without$prob)
             1            3
[1,] 0.9999556 4.444507e-05
[2,] 0.9999556 4.444507e-05
[3,] 0.9999556 4.444507e-05
[4,] 0.9999556 4.444507e-05
[5,] 0.9999556 4.444507e-05
[6,] 0.9983056 1.694357e-03

But that's OK: I can recover the predicted class label with

colnames(pred_without$prob)[apply(pred_without$prob, 1, which.max)]
  [1] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
 [34] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3"
[67] "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "1"
[100] "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3"
[133] "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3" "3"
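A vectorized variant of the same lookup, offered as a sketch: base R's max.col() returns the column index of each row's maximum, and ties.method = "first" mirrors the way which.max() picks the first maximum. The matrix `prob` below is a hypothetical stand-in for pred_without$prob.

```r
# Hypothetical probability matrix with class labels as column names.
prob <- matrix(
  c(0.99, 0.01,
    0.20, 0.80,
    0.50, 0.50),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("1", "3"))
)

# Row-wise lookup, as in the question.
by_apply <- colnames(prob)[apply(prob, 1, which.max)]

# Vectorized equivalent; ties go to the first column, like which.max().
by_maxcol <- colnames(prob)[max.col(prob, ties.method = "first")]

stopifnot(identical(by_apply, by_maxcol))
```

max.col() treats entries within a small relative tolerance of the row maximum as tied, so on near-equal probabilities it may not agree exactly with which.max().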

Thanks.

Answer 1 · 2018-07-24T21:34:40.000Z

yPred has integer type, not numeric. The census and confusion matrices are decorated with the class labels, but the inferred response is not.
As you note, the inferred factor levels employ the same mapping as those used to train. The level-to-string mapping is available from the trained object, and should be applied when attempting to reconcile separately-trained cases. Perhaps we should consider offering the decorations automatically.
FWIW, when performing inference with differing _predictor_ factor levels, the necessary adjustments are made internally, with warnings issued where appropriate.
Closing this, but please feel free to reopen if there is more to discuss.

Answer 2 · 2018-07-25T19:45:52.000Z

Inferred and trained responses now include the level decorations. Thank you for the suggestion.