Wednesday, May 27, 2015

Reservoir sampling

Reservoir sampling is algorithm to randomly choose m samples from a large sample pool. The algorithm is as following: Supposed index starts from 0 1. choose first m sample from the pool 2. for each sample after that, e.g. index at k, randomly choose number i in the range of [0, k] if(i<=m) swap number at i and k Output the numbers at index from 0 to m.


public int[] ReservoirSampling(int[] input, int m) {
if(input.length<=m)
return input;

int[] output = new int[m];
for(int i = 0; i<input.length; i++){
if(i<m)
output[i] = input[i];
else{
int pick = (int) (Math.random()*(i+1));
if(pick<m){
output[pick] = input[i];
}
}
}

return output;
}

Proof by induction:

If at step i-1, the probability of number in the list is m/i-1

The at step i, 

the probability of any number in the output remains is probability of number in step i-1 and
probability it survives the step 1: m/(i-1) * i-1/i = m/i, 

the probability of the number at i being chosen is m/i.

Thank you for my friend Peng Li for presenting this problem. Please see his link on this: http://allenlipeng47.com/PersonalPage/index/view/161/nkey

2 comments:

  1. you can avoid "if(i<m)" if you load first m items before the loop.

    ReplyDelete
    Replies
    1. yes, i know.
      but i treat it as streaming problem..

      Delete