Given two very large files, each with around 1 billion lines, print all the lines that are common to both files. Suggest a good, optimal approach.
Solution:
Use a Bloom filter: build the filter from the lines of the first file, then stream the second file and print every line the filter reports as present. Because a Bloom filter can return false positives, the output may contain a few lines that are not actually in the first file; if exact results are required, a second pass that checks the (much smaller) candidate set against the first file directly will eliminate them. From Wikipedia:
An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.
To add an element, feed it to each of the k hash functions to get k array positions. Set the bits at all these positions to 1.
To query for an element (test whether it is in the set), feed it to each of the k hash functions to get k array positions. If any of the bits at these positions is 0, the element is definitely not in the set – if it were, then all the bits would have been set to 1 when it was inserted. If all are 1, then either the element is in the set, or the bits have by chance been set to 1 during the insertion of other elements, resulting in a false positive. In a simple Bloom filter there is no way to distinguish between the two cases, but more advanced techniques can address this problem.
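The add/query operations described above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the bit-array size m, the hash count k, and the use of MD5/SHA-1 with double hashing to derive the k positions are all assumptions chosen for clarity, not values from the original discussion.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: an m-bit array queried via k derived hash positions."""

    def __init__(self, m=8 * 1024 * 1024, k=7):
        # m and k here are illustrative; for a billion lines they must be far larger.
        self.m = m
        self.k = k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, line):
        # Derive k array positions from two digests via double hashing:
        # position_i = (h1 + i * h2) mod m.
        h1 = int.from_bytes(hashlib.md5(line.encode()).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.sha1(line.encode()).digest()[:8], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, line):
        # Set the bit at each of the k positions to 1.
        for p in self._positions(line):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, line):
        # If any position holds a 0 bit, the line is definitely absent;
        # if all are 1, it is probably present (false positives possible).
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(line))
```

For the two-file problem, one would add every line of the first file with `add`, then stream the second file and print each line for which `might_contain` returns True.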
Complexity:
time - O(n)
space - O(m), where m is the size of the bit array; m must grow with n to keep the false-positive rate low, but it is far smaller than storing the lines themselves
Links and credits:
Algorithms and Data Structures Development (a discussion posted in this members-only group on LinkedIn)
http://en.wikipedia.org/wiki/Bloom_filter
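To make the space bound concrete, the standard Bloom-filter formulas give the optimal bit-array size m and hash count k for n items and a target false-positive rate p (these formulas are standard theory, not part of the original discussion): m = -n ln p / (ln 2)^2 and k = (m/n) ln 2.

```python
import math


def bloom_params(n, p):
    """Optimal bit-array size m and hash count k for n items and false-positive rate p."""
    m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    k = round(m / n * math.log(2))
    return m, k


# For 1 billion lines at a 1% false-positive rate:
m, k = bloom_params(1_000_000_000, 0.01)
# m is roughly 9.6 billion bits (about 1.2 GB) with k = 7 hash functions,
# which fits comfortably in memory even though the files themselves do not.
```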