ramhiser/itertools2

Random selection from an iterator/iterable object

It often makes sense to randomly select from an object (or iterator) by sampling its indices. For example:

set.seed(42)
n <- nrow(iris)
indices <- seq_len(n)
train_idx <- sample(indices, round(2/3 * n))

train_data <- iris[train_idx, ]
test_data <- iris[-train_idx, ]

If n is extremely large, the indices vector is just as large and has to be held in memory before any sampling happens. To avoid this overhead, it makes sense to have an interface like:

it <- isample(iseq_len(n), 50)
as.list(it) # list of length 50

Because iterators are sequential, drawing exactly the requested number of elements is difficult when n is unknown. If n were known, each element could be accepted or skipped as it arrives, with the selection counts following a binomial distribution when sampling with replacement and a hypergeometric distribution when sampling without replacement.
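
For what it's worth, here is a rough base-R sketch of both cases without replacement. With n known, selection sampling accepts each item with probability (still needed) / (still remaining). With n unknown, reservoir sampling is one standard workaround; it is not something proposed earlier in this thread. All names here (`selection_sample`, `reservoir_sample`, `make_counter`, `next_elem`) are hypothetical and not part of itertools2.

```r
# Selection sampling: pick k of n indices without replacement in one
# sequential pass. Requires n up front; stores only the k selected indices.
selection_sample <- function(n, k) {
  stopifnot(k >= 0, k <= n)
  selected <- numeric(k)
  n_selected <- 0
  i <- 0
  while (n_selected < k) {
    i <- i + 1
    # Accept index i with probability (still needed) / (still remaining).
    if (runif(1) * (n - i + 1) < (k - n_selected)) {
      n_selected <- n_selected + 1
      selected[n_selected] <- i
    }
  }
  selected
}

# Reservoir sampling: pick k items without replacement from a stream of
# unknown length. `next_elem` is a hypothetical zero-argument function that
# returns the next item, or NULL once the stream is exhausted.
reservoir_sample <- function(next_elem, k) {
  reservoir <- vector("list", k)
  seen <- 0
  repeat {
    x <- next_elem()
    if (is.null(x)) break
    seen <- seen + 1
    if (seen <= k) {
      # Fill the reservoir with the first k items.
      reservoir[[seen]] <- x
    } else {
      # Keep the new item with probability k / seen, evicting a random slot.
      j <- floor(runif(1) * seen) + 1
      if (j <= k) reservoir[[j]] <- x
    }
  }
  reservoir[seq_len(min(seen, k))]
}

# Hypothetical stand-in for an iterator over 1..n.
make_counter <- function(n) {
  i <- 0
  function() {
    if (i >= n) return(NULL)
    i <<- i + 1
    i
  }
}

set.seed(42)
idx <- selection_sample(1e5, 50)                 # 50 increasing indices in 1..1e5
res <- reservoir_sample(make_counter(1000), 50)  # list of 50 distinct items
```

Selection sampling can stop as soon as it has k items, while the reservoir version has to consume the whole stream, so neither avoids the sequential pass.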

Hmm, think on this before moving forward. Scrap the idea?

The idea makes sense in some contexts though -- for example, randomly selecting from the Cartesian product, as in the random_product recipe in Python's itertools documentation. This makes even more sense when the Cartesian product from expand.grid is HUGE, and we care only about some random subset.
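
A minimal sketch of that recipe in base R, assuming named input vectors. The name `random_product` and its `size` argument are hypothetical here and not an existing itertools2 function. Each row picks one value from every input independently, so rows are drawn with replacement from the Cartesian product and the full expand.grid is never built.

```r
# Draw `size` random rows of expand.grid(...) without materializing the grid.
# Inputs are assumed to be named vectors; `size` must be passed by name.
random_product <- function(..., size = 1) {
  pools <- list(...)
  cols <- lapply(pools, function(pool) {
    # Index with sample.int so a length-one pool is handled correctly.
    pool[sample.int(length(pool), size, replace = TRUE)]
  })
  as.data.frame(cols, stringsAsFactors = FALSE)
}

set.seed(42)
rp <- random_product(x = 1:1000, y = letters, z = c(TRUE, FALSE), size = 5)
# rp is a 5-row data frame with columns x, y, z
```

Because the draws are independent, the same combination can appear more than once; sampling distinct rows would need extra bookkeeping.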

Passing on this issue. PITA.