Thursday, June 13, 2013

Check if two huge files contain the same strings

Problem:

given two very big files , each having around 1 billion lines , u have to print all the lines that are common to both files . Please suggest a good optimal approach


Solution:
Use a Bloom filter. From Wikipedia:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.
To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.
To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions are 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple bloom filter, there is no way to distinguish between the two cases, but more advanced techniques can address this problem.

Complexity:
time - O(n)
space - O(1)
Links and credits:

Algorithms and Data Structures Development

(a discussion posted in this members only group on LinkedIn)
http://en.wikipedia.org/wiki/Bloom_filter

No comments:

Post a Comment