211
Bloom filters and HyperLogLog
In this case, you sum up all the elements in the array:
1 + 2 + 3 + 4
+ 5 = 15
! I won’t
explain
reduce
in more detail here, because there are
plenty of tutorials online.
MapReduce uses these two simple concepts
to run queries about data
across multiple machines. When you have a large dataset (billions
of rows), MapReduce can give you an answer in minutes where a
traditional database might take hours.
Bloom filters and HyperLogLog
Suppose you’re running Reddit.
When someone posts a link, you want
to see if it’s been posted before. Stories that haven’t
been posted before
are considered more valuable. So you need to figure out whether this
link has been posted before.
Or suppose you’re Google, and you’re crawling web pages. You only
want to crawl a web page if you haven’t crawled it already. So you need
to figure out whether this page has been crawled before.
Or suppose you’re running bit.ly, which is a URL shortener. You don’t
want to redirect users to malicious websites.
You have a set of URLs that
are considered malicious. Now you need to figure out whether you’re
redirecting the user to a URL in that set.
All of these examples have the same problem. You have a very large set.