bnaras/bcaboot

Construction of Imat in bcajack() function.


I've been attempting to understand the implementation of bcajack() in order to adapt it for internal use at my workplace. In the code for the case where m < n (bcajack.R, lines 153-173), I can see that the original construction of Imat using matrix(sample(...)) has been replaced with a call to sapply(seq_len_m, sample.int, ...), as shown below:

```r
##Imat <- matrix(sample(1:n, n - r), m)
Imat <- sapply(seq_len_m, sample.int, n = n, size = n - r)
```

If these are intended to be equivalent, my two concerns are:

  1. In the example where n = 1000, m = 45, and thus r = 10 and seq_len_m = 1:45, the original Imat is a 45 x 22 matrix, whereas the new Imat is 990 x 45 (see the first sketch after this list). This appears problematic, particularly because Imat is still indexed as Imat[Ij,] on line 164, which does not appear to account for the change in dimensions. I cannot see how the two constructions are equivalent, and I have not found any justification or explanation for the change in the code, notes, or associated slides/documents. If the goal was simply to avoid 1:n, I would expect the original line to be replaced with Imat <- matrix(sample(seq_len(n), n - r), m), as is done for Iout on line 156.

  2. Because sapply() passes each element of seq_len_m to the first unmatched argument of sample.int() ("n" and "size" are already supplied by name), each value ends up being used as the replace argument. Since the values are always positive, replace is effectively TRUE and duplicate indices are produced, which I suspect was not the intention (see the second sketch after this list). If the use of sapply is necessary/desirable and sampling with replacement is not desired, then I believe Imat <- sapply(seq_len_m, function(x) sample.int(n = n, size = n - r)) should avoid this behavior.
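
To make point 1 concrete, here is a minimal standalone sketch (my own reconstruction, not the package code; the names Imat_old and Imat_new are mine) using the n = 1000, m = 45, r = 10 values from the example above:

```r
n <- 1000
m <- 45
r <- 10
seq_len_m <- seq_len(m)

## Original construction: 990 sampled indices arranged into m = 45 rows,
## giving a 45 x 22 matrix.
Imat_old <- matrix(sample(1:n, n - r), m)
dim(Imat_old)   # 45 22

## Current construction: sapply() returns one column per element of
## seq_len_m, giving a 990 x 45 matrix instead.
Imat_new <- sapply(seq_len_m, sample.int, n = n, size = n - r)
dim(Imat_new)   # 990 45
```

And a second sketch for point 2, reusing the objects above, showing that each element of seq_len_m is matched positionally to the replace argument and how an anonymous function avoids that:

```r
## Each element of seq_len_m lands in `replace` (the first formal of
## sample.int() not already supplied by name); any nonzero value counts as
## TRUE, so every column is sampled with replacement.
Imat_new <- sapply(seq_len_m, sample.int, n = n, size = n - r)
any(apply(Imat_new, 2, anyDuplicated) > 0)   # almost certainly TRUE

## An anonymous function discards the index, so `replace` keeps its default
## of FALSE and no column contains duplicates.
Imat_fix <- sapply(seq_len_m, function(x) sample.int(n = n, size = n - r))
any(apply(Imat_fix, 2, anyDuplicated) > 0)   # FALSE
```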

Thanks for your work on this package/approach, as it will greatly help us with our large-n datasets...