Grokking Algorithms

Yüklə 348,95 Kb.

Pdf görüntüsü

səhifə	105/122
tarix	05.12.2023
ölçüsü	348,95 Kb.
	#173611

1 ... 101 102 103 104 105 106 107 108 ... 122

grokking-algorithms-illustrated-programmers-curious

Bloom filters

I

Where to go next
Now you have a new item, and you want to see whether it belongs in
that set. You could do this quickly with a hash. For example, suppose
Google has a big hash in which the keys are all the pages it has crawled.
You want to see whether you’ve already crawled adit.io. Look it up in
the hash.
adit.io
is a key in the hash, so you’ve already crawled it. The average
lookup time for hash tables is O(1).
adit.io
is in the hash, so you’ve
already crawled it. You found that out in constant time. Pretty good!
Except that this hash needs to be
huge
. Google indexes trillions of web
pages. If this hash has all the URLs that Google has indexed, it will take
up a lot of space. Reddit and bit.ly have the same space problem. When
you have so much data, you need to get creative!
Bloom filters
Bloom filters offer a solution. Bloom filters are
probabilistic data
structures
. They give you an answer that could be wrong but is probably
correct. Instead of a hash, you can ask your bloom filter if you’ve
crawled this URL before. A hash table would give you an accurate
answer. A bloom filter will give you an answer that’s probably correct:
• False positives are possible. Google might say, “You’ve already crawled
this site,” even though you haven’t.
• False negatives aren’t possible. If the bloom filter says, “You haven’t
crawled this site,” then you
definitely
haven’t crawled this site.
Bloom filters are great because they take up very little space. A hash
table would have to store every URL crawled by Google, but a bloom
filter doesn’t have to do that. They’re great when you don’t need an exact
answer, as in all of these examples. It’s okay for bit.ly to say, “We think
this site might be malicious, so be extra careful.”

213
The SHA algorithms
HyperLogLog
Along the same lines is another algorithm called HyperLogLog.
Suppose Google wants to count the number of
unique
searches
performed by its users. Or suppose Amazon wants to count the number
of unique items that users looked at today. Answering these questions
takes a lot of space! With Google, you’d have to keep a log of all the
unique searches. When a user searches for something, you have to see
whether it’s already in the log. If not, you have to add it to the log. Even
for a single day, this log would be massive!
HyperLogLog approximates the number of unique elements in a set.
Just like bloom filters, it won’t give you an exact answer, but it comes
very close and uses only a fraction of the memory a task like this would
otherwise take.
If you have a lot of data and are satisfied with approximate answers,
check out probabilistic algorithms!

Yüklə 348,95 Kb.

Dostları ilə paylaş:

1 ... 101 102 103 104 105 106 107 108 ... 122