ELEMENT x HASHES INTO THESE CELLS
|
|
|
|
|
|
|
|
|
|
|
|
|
(Bits Set to 1)
|
|
|
|
|
|
|
|
|
|
h1(x)
|
|
h2(x)
|
|
|
w= 4
|
|
|
h3(x)
|
|
h4(x)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
0
|
1
|
|
1
|
0
|
|
1
|
0
|
0
|
1
|
1
|
1
|
|
0
|
1
|
|
1
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
m
MEMBERSHIP OF y (BOOLEAN RESULT) = AND { h1(y), h2(y), …, hw(y) }
Figure 12.1: The bloom filter
Algorithm BloomConstruct(Stream: S, Size: m, Num. Hash Functions: w)
begin
Initialize all elements in a bit array B of size m to 0;
repeat
Receive next stream element x ∈ S;
for i = 1 to w do
Update hi(x)th element in bit array B to 1; until end of stream S;
return B;
end
Figure 12.2: Update of bloom filter
are not possible with the bloom filter. On the other hand, if all the entries h1 (y) . . . hw(y) in the bit array have a value of 1, then it is reported that y has occurred in the data stream. This can be checked efficiently by applying an “AND” logical operator to all the bit entries corresponding to the indices h1(y) . . . hw(y) in the bit array. The overall procedure of membership checks is illustrated in Fig. 12.3. The binary result for the decision problem for checking membership is tracked by the variable BooleanF lag. This binary flag is reported at the end of the procedure.
The bloom filter approach can lead to false positives, but not false negatives. A false positive occurs, if all the w different bloom filter array elements hi(y) for i ∈ {1 . . . w} have been set to 1 by some spurious element other than y. This is a direct result of collisions. As the number of elements in the data stream increases, all elements in the bloom filter are eventually set to 1. In such a case, all set-membership queries will yield a positive response. This is, of course, not a useful application of the bloom filter. Therefore, it is instructive to bound the false positive probability in terms of the size of the filter and the number of distinct elements in the data stream.
Lemma 12.2.2 Consider a bloom filter B with m elements, and w different hash functions. Let n be the number of distinct elements in the stream S so far. Consider an element y, which has not appeared in the stream so far. Then, the probability F that an element y is reported as a false positive is given by the following:
|
|
1
|
w·n w
|
|
F = 1
|
− 1
|
−
|
|
(12.18)
|
|
m
|
|
12.2. SYNOPSIS DATA STRUCTURES FOR STREAMS
|
401
|
Algorithm BloomQuery(Element: y, Bloom Filter: B)
begin
Initialize BooleanF lag = 1;
for i = 1 to w do
BooleanF lag = BooleanF lag AND hi(y); return BooleanF lag;
end
Figure 12.3: Membership check using bloom filter
Proof: Consider a particular bit corresponding to the bit array element hr(y) for some fixed value of the index r ∈ {1 . . . w}. Each element x ∈ S sets w different bits h1(x) . . . hw(x) to
The probability that none of these bits is the same as hr(y) is given by (1 − 1/m)w. Over n distinct stream elements, this probability is (1 − 1/m)w·n. Therefore, the probability that the bit array index hr(y) is set to 1, by at least one of the n spurious elements in S is given by Q = 1 − (1 − 1/m)w·n. A false positive occurs, when all bit array indices hr(y) (over varying values of r ∈ {1 . . . w}) have been set to 1. The probability of this is F = Qw. The result follows.
While the false-positive probability is expressed above in terms of the number of distinct stream elements, it is trivially true for the total number of stream elements (including repetitions), as an upper bound.
The expression in the aforementioned lemma can be simplified by observing that (1 − 1/m)m ≈ e−1, where e is the base of the natural logarithm. Correspondingly, the expression can be rewritten as follows:
F = (1 − e−n·w/m)w.
|
(12.19)
|
Values of w that are too small or too large lead to poor performance. The value of w needs be selected optimally in terms of m and n to minimize the number of false positives. The number of false positives is minimized, when w = m · ln(2)/n. Substituting this value in Eq. 12.19, it can be shown that the probability of false positives for optimal number of hash functions is:
(12.20)
The expression above can be written purely as an expression of m/n. Therefore, for a
fixed value of the false-positive probability F , the length of the bloom filter m needs to be proportional to the number of distinct elements n in the data stream. Furthermore, the constant of proportionality for a particular false-positive probability F can be shown to be
m = ln(1/F ) . While this may not seem like a significant compression, it needs to be pointed
n (ln(2))2
out that bloom filters use elementary bits to track the membership of arbitrary elements, such as strings. Furthermore, because of bitwise operations, which can be implemented very efficiently with low-level implementations, the overall approach is generally very efficient.
It does need to be kept in mind that the value of n is not known in advance for many applications. Therefore, one strategy is to use a cascade of bloom filters for geometrically increasing values of w , and to use a logical AND of the membership query result over different bloom filters. This is a practical approach that provides more stable performance over the life of the data stream.
The bloom filter is referred to as a “filter” because it is often used to make decisions on which elements to exclude from a data stream, when they meet the membership condition.
402 CHAPTER 12. MINING DATA STREAMS
For example, if one wanted to filter all duplicates from the data stream, the bloom filter is an effective strategy. Another strategy is to filter forbidden elements from a universe of values, such as a set of spammer e-mail addresses in an e-mail stream. In such a case, the bloom filter needs to be constructed up front with the spam e-mail addresses.
Many variations of the basic bloom filter strategy provide different capabilities and trade-offs:
The bloom filter can be used to approximate the number of distinct elements in a data stream. If m0 < m is the number of bits with a value of 0 in the bloom filter, then the number of distinct elements n can be estimated as follows (see Exercise 13):
n
|
≈
|
|
m · ln(m/m0)
|
(12.21)
|
|
|
w
|
|
|
|
|
|
The accuracy of this estimate reduces drastically, as the bloom filter fills up. When m0 = 0, the value of n is estimated to be ∞, and therefore the estimate is practically useless.
The bloom filter can be used to estimate the size of the intersection and union of sets corresponding to different streams, by creating one bloom filter for each stream. To determine the size of the union, the bitwise OR of the two filters is determined. The bitwise OR of the filter can be shown to be exactly the same as the bloom filter representation of the union of the two sets. Then, the formula of Eq. 12.21 is used. However, such an approach cannot be used for determining the size of the intersection. While the intersection of two sets can be approximated by using a bitwise AND operation on the two filters, the resulting bit positions in the filter will not be the same as that obtained by constructing the filter directly on the intersection. The resulting filter might contain false negatives, and, therefore, such an approach is lossy. To estimate the size of the intersection, one can first estimate the size of the union and then use the following simple setwise relationship:
|S1 ∩ S2| = |S1| + |S2| − |S1 ∪ S2|
|
(12.22)
|
The bloom filter is primarily designed for membership queries, and is not the most space-efficient data structure, when used purely for distinct element counting. In a later section, a space-efficient technique, referred to as the Flajolet–Martin algorithm, will be discussed.
The bloom filter can allow a limited (one-time) tracking of deletions by setting the corresponding bit elements to zero, when an element is deleted. In such a case, false negatives are also possible.
Variants of the bloom filter can be designed in which the w hash functions can map onto separate bit arrays. A further generalization of this principle is to track counts of elements rather than simply binary bit values to enable richer queries. This gener-alization, discussed in the next section, is also referred to as the count-min sketch.
Bloom filters are commonly used in many streaming settings in the text domain.
12.2. SYNOPSIS DATA STRUCTURES FOR STREAMS
|
403
|
|
|
|
m
|
|
|
|
|
|
|
|
Dostları ilə paylaş: |