Reservoir sampling is algorithm to randomly choose m samples from a large sample pool.
The algorithm is as following:
Supposed index starts from 0
1. choose first m sample from the pool
2. for each sample after that, e.g. index at k, randomly choose number i in the range of [0, k]
if(i<=m) swap number at i and k
Output the numbers at index from 0 to m.
public int[] ReservoirSampling(int[] input, int m) {
if(input.length<=m)
return input;
int[] output = new int[m];
for(int i = 0; i<input.length; i++){
if(i<m)
output[i] = input[i];
else{
int pick = (int) (Math.random()*(i+1));
if(pick<m){
output[pick] = input[i];
}
}
}
return output;
}
Proof by induction:
If at step i-1, the probability of number in the list is m/i-1
The at step i,
the probability of any number in the output remains is probability of number in step i-1 and
probability it survives the step 1: m/(i-1) * i-1/i = m/i,
the probability of the number at i being chosen is m/i.
Thank you for my friend Peng Li for presenting this problem. Please see his link on this: http://allenlipeng47.com/PersonalPage/index/view/161/nkey
you can avoid "if(i<m)" if you load first m items before the loop.
ReplyDeleteyes, i know.
Deletebut i treat it as streaming problem..