Showing posts with label Reservoir sampling. Show all posts
Showing posts with label Reservoir sampling. Show all posts

Wednesday, May 27, 2015

Reservoir sampling

Reservoir sampling is algorithm to randomly choose m samples from a large sample pool. The algorithm is as following: Supposed index starts from 0 1. choose first m sample from the pool 2. for each sample after that, e.g. index at k, randomly choose number i in the range of [0, k] if(i<=m) swap number at i and k Output the numbers at index from 0 to m.


public int[] ReservoirSampling(int[] input, int m) {
if(input.length<=m)
return input;

int[] output = new int[m];
for(int i = 0; i<input.length; i++){
if(i<m)
output[i] = input[i];
else{
int pick = (int) (Math.random()*(i+1));
if(pick<m){
output[pick] = input[i];
}
}
}

return output;
}

Proof by induction:

If at step i-1, the probability of number in the list is m/i-1

The at step i, 

the probability of any number in the output remains is probability of number in step i-1 and
probability it survives the step 1: m/(i-1) * i-1/i = m/i, 

the probability of the number at i being chosen is m/i.

Thank you for my friend Peng Li for presenting this problem. Please see his link on this: http://allenlipeng47.com/PersonalPage/index/view/161/nkey