Saturday, June 8, 2013

Selecting randomly from an unknown sequence

Problem:
How to choose randomly from a sequence of unknown length. The typical use is to choose a random line from a file.

Detailed description:
You are given a file (and you do not know how big the file is, nor how big the lines in the file are). Write an algorithm that will generate a random number (which will be used to index into the file) such that the output of lines will be roughly proportional? That is if the file contained 4 lines, and if I ran the program 1 million times I should get each line printed approximately 250K times.

Solution:

import random

def random_element(seq):
    """Return an element chosen at random from `seq`."""
    it = None
    for n, elem in enumerate(seq):
        if random.randint(0, n) == 0:
            it = elem
    return it

To see why this function works, consider it inductively. For the initial case, we'll always pick the first element as "it", since a random int between 0 and 0 will be 0. So after one element, we've chosen uniformly from the one element we've seen.

For the inductive case, imagine we've read N-1 elements already, and "it" is a random element chosen uniformly from those N-1 elements. The chance that the Nth element should be chosen instead is 1/N, because at this point, the chance that any particular line should be "it" is 1/N. So we choose a random number, and if it's the 1/N outcome, we take the Nth line as "it." In the case that we don't take the new line, which is (N-1)/N, the old line is kept. It was chosen randomly from the N-1 lines that preceded this one, so each line has a final chance of ((N-1)/N)/(N-1), or 1/N, or being chosen. Therefore the new line and each of the old lines is equally likely, so the distribution is uniform.

Since the initial case meets our uniform-selection criterion, and we've shown that if N-1 satisfies it then N satisfies it, then by induction, our selection will be uniformly random for any N.

Complexity:
time - O(n)
space - O(1)
Links and credits:
http://nedbatchelder.com/blog/201208/selecting_randomly_from_an_unknown_sequence.html


No comments:

Post a Comment