question about the withReplacement param in BaggedPoint.scala
hailuoS opened this issue · 2 comments
Thanks for the great work. I have a small question: why does using PoissonDistribution mean sampling with replacement, while BinomialDistribution means sampling without replacement here?
In the code you shared above, `numSubsamples` is set to the value of the `numEstimators` parameter from the `IsolationForest` object. This is the number of isolation trees in the ensemble.
Sampling without replacement
Sampling without replacement means that each instance from the original dataset can show up a maximum of one time in the sampled dataset.
For a toy example, imagine we use `BinomialDistribution` with `subsamplingRate = 0.3`, `numEstimators = 1`, and we pass in an input that has 10 instances:
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
In our example here, we would run `BinomialDistribution(trials=1, p=subsamplingRate)` for each instance in the input array (each run is an independent biased coin flip). This gives the "weights" for each instance in the input data. On average, since `subsamplingRate = 0.3`, we would get 30% 1's and 70% 0's:
[0, 0, 1, 0, 0, 1, 1, 0, 0, 0]
Applying this mask to the input, the final sampled result would be:
[2, 5, 6].
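As a sketch of the idea (a Python/NumPy simulation, not the actual Scala code in BaggedPoint.scala; the variable names are my own), the without-replacement masking looks like this:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

subsampling_rate = 0.3
data = np.arange(10)  # the toy input [0, 1, ..., 9]

# One Binomial(trials=1, p=subsamplingRate) draw per instance:
# an independent biased coin flip producing a 0/1 weight.
weights = rng.binomial(n=1, p=subsampling_rate, size=len(data))

# Each weight is 0 or 1, so each instance appears at most once
# in the sample: this is sampling without replacement.
sample = data[weights == 1]
print(weights)
print(sample)
```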
Sampling with replacement
If we sample with replacement, we can't simply apply this masking one time to generate the final dataset, because we need to allow for the same instance in the input data to be selected multiple times.
If we wanted to draw a sample with `subsamplingRate = 0.3` with replacement, we'd need to run `BinomialDistribution(trials=N, p=subsamplingRate/N)` for each element in the input data. For example, if `N = 10`, then we would have 10 trials for each of the 10 elements in the input data, and the probability of selection for each trial would be `p = subsamplingRate / N = 0.3 / 10 = 0.03`. The key here is that multiple trials for each instance in the input array can result in an input instance being selected more than once. For example, it is possible to get the following weights:
[2, 0, 0, 0, 1, 0, 0, 0, 0, 0]
Applying this mask to the input, the final sampled result would be:
[0, 0, 4]
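The same scheme can be sketched in Python/NumPy (again an illustration with my own variable names, not the library's code):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

subsampling_rate = 0.3
data = np.arange(10)
n = len(data)

# Binomial(trials=N, p=subsamplingRate/N) per instance: each of the
# N trials can independently select this instance, so a weight > 1
# means the instance is drawn more than once (with replacement).
weights = rng.binomial(n=n, p=subsampling_rate / n, size=n)

# Expand the weights into the sampled dataset: an instance with
# weight k appears k times.
sample = np.repeat(data, weights)
print(weights)
print(sample)
```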
However, it is expensive to run `N` trials per instance when the input dataset is large. As a result, we use the Poisson approximation to the binomial distribution:
https://en.wikipedia.org/wiki/Binomial_distribution#Poisson_approximation
`lambda = N * p = N * subsamplingRate / N = subsamplingRate`
So we can use `PoissonDistribution(lambda = subsamplingRate)` to produce the weights to sample with replacement. This is much more computationally efficient.
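A quick simulation (Python/NumPy, illustrative only) shows the two weight schemes agree on average: both keep roughly `subsamplingRate * N` instances.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

subsampling_rate = 0.3
n = 10_000  # instances; lambda = N * (subsamplingRate / N) = subsamplingRate

# Poisson(lambda=subsamplingRate) per instance approximates
# Binomial(trials=N, p=subsamplingRate/N) with no per-trial work.
poisson_weights = rng.poisson(lam=subsampling_rate, size=n)
binomial_weights = rng.binomial(n=n, p=subsampling_rate / n, size=n)

# Both weight vectors have mean close to subsamplingRate.
print(poisson_weights.mean(), binomial_weights.mean())
```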
Thanks a lot for your patience. The reply is really clear, and now I understand.