Sunday, September 6, 2009

Reservoir Sampling Algorithm in Python and Perl

Algorithms that perform calculations on evolving data streams, but in fixed memory, have increasing relevance in the Age of Big Data.

The reservoir sampling algorithm outputs a sample of N lines from a file of undetermined size. It does so in a single pass, using memory proportional to N.

These two features -- (i) a constant memory footprint and (ii) a capacity to operate on files of indeterminate size -- make it ideal for working with very large data sets common to event processing.

While it has likely been multiply discovered and implemented, like many algorithms, it was codified by Knuth's The Art of Computer Programming.

The trick of this algorithm is to first fill up the sample buffer, and afterwards, to probabilistically replace it with additional lines of input.

Python version

#!/usr/bin/python
import sys
import random

if len(sys.argv) == 3:
input = open(sys.argv[2],'r')
elif len(sys.argv) == 2:
input = sys.stdin;
else:
sys.exit("Usage: python samplen.py <lines> <?file>")

N = int(sys.argv[1]);
sample = [];

for i,line in enumerate(input):
if i < N:
sample.append(line)
elif i >= N and random.random() < N/float(i+1):
replace = random.randint(0,len(sample)-1)
sample[replace] = line

for line in sample:
sys.stdout.write(line)


Perl version

#!/usr/bin/perl -sw

$IN = 'STDIN' if (@ARGV == 1);
open($IN, '<'.$ARGV[1]) if (@ARGV == 2);
die "Usage: perl samplen.pl <lines> <?file>\n" if (!defined($IN));

$N = $ARGV[0];
@sample = ();

while (<$IN>) {
if ($. <= $N) {
$sample[$.-1] = $_;
} elsif (($. > $N) && (rand() < $N/$.)) {
$replace = int(rand(@sample));
$sample[$replace] = $_;
}
}

print foreach (@sample);
close($IN);


For example, imagine we are to sample 5 lines randomly from a 6-line file. Call i the line number of the input, and N the size of sample desired. For the first 5 lines (where i < = N), our sample fills entirely. (For the non-Perl hackers: the current line number i is held by the variable $., just as the special variable $_ holds the current line value).

It's at successive lines of input that the probabilistic sampling starts: the 6th line has a 5/6th (N/i) chance of being sampled, and if chosen, it will replace one of the previously 5 chosen lines with a 1/5 chance: leaving them a (5/6 * 1/5) = 5/6 chance of being sampled. Thus all 6 lines have an equal chance of being sampled.

In general, as more lines are seen, the chance that any additional line is chosen for the sample falls; but the chance that any previously chosen line could be replaced grows. These two balance such that the probability for any given line of input to be sampled is identical.

A more sophisticated variation of this algorithm is one that can take into consideration a weighted sampling.

No comments: