Grokking Algorithms



Yüklə 348,95 Kb.
Pdf görüntüsü
səhifə105/122
tarix05.12.2023
ölçüsü348,95 Kb.
#173611
1   ...   101   102   103   104   105   106   107   108   ...   122
grokking-algorithms-illustrated-programmers-curious

 
 
I
 
 
Where to go next
Now you have a new item, and you want to see whether it belongs in 
that set. You could do this quickly with a hash. For example, suppose 
Google has a big hash in which the keys are all the pages it has crawled.
You want to see whether you’ve already crawled adit.io. Look it up in 
the hash.
adit.io
is a key in the hash, so you’ve already crawled it. The average 
lookup time for hash tables is O(1). 
adit.io
is in the hash, so you’ve 
already crawled it. You found that out in constant time. Pretty good!
Except that this hash needs to be 
huge
. Google indexes trillions of web 
pages. If this hash has all the URLs that Google has indexed, it will take 
up a lot of space. Reddit and bit.ly have the same space problem. When 
you have so much data, you need to get creative!
Bloom filters
Bloom filters offer a solution. Bloom filters are 
probabilistic data 
structures
. They give you an answer that could be wrong but is probably 
correct. Instead of a hash, you can ask your bloom filter if you’ve 
crawled this URL before. A hash table would give you an accurate 
answer. A bloom filter will give you an answer that’s probably correct:
• False positives are possible. Google might say, “You’ve already crawled 
this site,” even though you haven’t.
• False negatives aren’t possible. If the bloom filter says, “You haven’t 
crawled this site,” then you 
definitely
haven’t crawled this site. 
Bloom filters are great because they take up very little space. A hash 
table would have to store every URL crawled by Google, but a bloom 
filter doesn’t have to do that. They’re great when you don’t need an exact 
answer, as in all of these examples. It’s okay for bit.ly to say, “We think 
this site might be malicious, so be extra careful.”


213
The SHA algorithms
HyperLogLog
Along the same lines is another algorithm called HyperLogLog. 
Suppose Google wants to count the number of 
unique
searches 
performed by its users. Or suppose Amazon wants to count the number 
of unique items that users looked at today. Answering these questions 
takes a lot of space! With Google, you’d have to keep a log of all the 
unique searches. When a user searches for something, you have to see 
whether it’s already in the log. If not, you have to add it to the log. Even 
for a single day, this log would be massive! 
HyperLogLog approximates the number of unique elements in a set. 
Just like bloom filters, it won’t give you an exact answer, but it comes 
very close and uses only a fraction of the memory a task like this would 
otherwise take.
If you have a lot of data and are satisfied with approximate answers
check out probabilistic algorithms! 

Yüklə 348,95 Kb.

Dostları ilə paylaş:
1   ...   101   102   103   104   105   106   107   108   ...   122




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin