The reduce function
The
reduce
function confuses people sometimes. The idea is that you
“reduce” a whole list of items down to one item. With
map
, you go from
one array to another.
With
reduce
, you transform an array to a single item.
Here’s an example:
>>> arr1 = [1, 2, 3, 4, 5]
>>> reduce(lambda x,y: x+y, arr1)
15
211
Bloom filters and HyperLogLog
In this case, you sum up all the elements in the array:
1 + 2 + 3 + 4
+ 5 = 15
! I won’t explain
reduce
in more detail here, because there are
plenty of tutorials online.
MapReduce uses these two simple concepts to run queries about data
across multiple machines. When you have a large dataset (billions
of rows), MapReduce can give you an answer in minutes where a
traditional database might take hours.
Bloom filters and HyperLogLog
Suppose you’re running Reddit. When someone posts a link, you want
to see if it’s been posted before. Stories that haven’t been posted before
are considered more valuable. So you need to figure out whether this
link has been posted before.
Or suppose you’re Google, and you’re crawling web pages. You only
want to crawl a web page if you haven’t crawled it already. So you need
to figure out whether this page has been crawled before.
Or suppose you’re running bit.ly, which is a URL shortener. You don’t
want to redirect users to malicious websites. You have a set of URLs that
are considered malicious. Now you need to figure out whether you’re
redirecting the user to a URL in that set.
All of these examples have the same problem. You have a very large set.
212
Chapter 11
Dostları ilə paylaş: |