henckr/distRforest

subsampling

Closed this issue · 2 comments

Great project!

I've got two questions, one about the "weights" argument in rforest and one about subsampling.

  1. This code snippet from rforest:

data = substitute(data[sample.int(nrow(data), size = subsamp * nrow(data), replace = subsamp < 1)

For subsample = 1, I'd expect sampling with replacement (as in bootstrapping or bagging). In the code above, however, replace is then set to FALSE. Is this intended? How should I set subsample to get bootstrap sampling?

  2. How do I use the "weights" argument? The rforest code looks like this:

'weights' = unname(as.list(match.call())[['weights']])

Is the weights vector really shuffled as well?
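To make the first question concrete, here is a minimal base-R sketch of the two sampling rules. It does not call distRforest at all; `n` and `subsamp` are illustrative values chosen for this example, and only the `replace = subsamp < 1` rule is taken from the snippet quoted above.

```r
# Base R only; n and subsamp are illustrative values, not package defaults.
set.seed(42)
n <- 8
subsamp <- 1

# Rule quoted in the question: replacement is used only when subsamp < 1.
idx_quoted <- sample.int(n, size = subsamp * n, replace = subsamp < 1)

# Bootstrap sample: n draws with replacement.
idx_boot <- sample.int(n, size = subsamp * n, replace = TRUE)

# Without replacement and size = n, every row appears exactly once,
# so the "sample" is only a permutation of the full data set.
all_rows_once <- identical(sort(idx_quoted), seq_len(n))
print(all_rows_once)  # TRUE

# Number of distinct rows in the bootstrap sample (at most n).
n_distinct_boot <- length(unique(idx_boot))
print(n_distinct_boot)
```

With the quoted rule and subsamp = 1, each tree would therefore see the same rows in a different order, which is why a bootstrap needs replace = TRUE.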

Thanks for your feedback! Below are my answers to your questions:

1: The rforest.R code has been updated to always sample with replacement, so bootstrapping is now possible with subsample = 1.
2: To use weights, your data should contain a variable holding the weights, for example one named weights_var. You then simply specify weights = weights_var in the function call, much as you would in the original rpart function call or in most other modelling algorithms. This makes sure that the correctly sampled values are used during model training.
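The point of keeping the weights as a column can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the toy data frame `dat` and its `id` and `y` columns are made up, and `weights_var` is the example column name from the answer above.

```r
# Toy data; `dat`, `id` and `y` are hypothetical, `weights_var` follows the answer above.
set.seed(1)
dat <- data.frame(
  id = 1:6,
  y = c(0, 1, 0, 2, 1, 0),
  weights_var = c(1.0, 0.5, 2.0, 1.5, 1.0, 0.25)
)

# Draw row indices with replacement, as in a bootstrap sample.
idx <- sample.int(nrow(dat), size = nrow(dat), replace = TRUE)
boot <- dat[idx, ]

# Rows are subset as a whole, so each sampled row keeps its own weight.
aligned <- all(boot$weights_var == dat$weights_var[match(boot$id, dat$id)])
print(aligned)  # TRUE
```

A separate weights vector, by contrast, would have to be reindexed with the same `idx` for every tree to stay aligned with the sampled rows.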

I hope this answers your questions. I will close this issue for now, but feel free to reopen if you have any other questions related to this!

Wonderful - both issues brilliantly solved. Thanks a lot.