$$X = \sum_{i=1}^{n} X_i.$$
Then, for any δ ∈ (0, 1), we can show the following:
$$P(X < (1-\delta)E[X]) < e^{-E[X]\delta^2/2}, \qquad (12.6)$$
where e is the base of the natural logarithm.
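Before proceeding to the proof, it may help to see the bound in action. The following sketch is not part of the original text; the parameters n, p, and delta are arbitrary illustrative choices. It compares the bound of Eq. 12.6 with a Monte Carlo estimate of the true lower-tail probability:

```python
import math
import random

def chernoff_lower_bound(mean, delta):
    """Lower-tail Chernoff bound of Eq. 12.6: P(X < (1 - delta) E[X])."""
    return math.exp(-mean * delta ** 2 / 2)

# Illustrative parameters (arbitrary choices for this sketch).
n, p, delta = 100, 0.5, 0.2
mean = n * p  # E[X] = 50 for these choices

# Monte Carlo estimate of the exact lower-tail probability.
trials = 100_000
hits = sum(
    1 for _ in range(trials)
    if sum(random.random() < p for _ in range(n)) < (1 - delta) * mean
)

print(f"empirical P(X < {(1 - delta) * mean:.0f}) ~ {hits / trials:.4f}")
print(f"Chernoff bound (Eq. 12.6)        = {chernoff_lower_bound(mean, delta):.4f}")
```

For these choices the empirical tail probability is roughly 0.018, comfortably below the bound of e^{-1} ≈ 0.368, which illustrates that the bound is valid but not tight.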
Proof: The first step is to show the following inequality:
$$P(X < (1-\delta)E[X]) < \left(\frac{e^{-\delta}}{(1-\delta)^{(1-\delta)}}\right)^{E[X]}. \qquad (12.7)$$
The unknown parameter t > 0 is introduced to create a parameterized bound. The lower-tail inequality on X is converted into an upper-tail inequality on $e^{-tX}$, which can be bounded with the Markov inequality to obtain a bound that is a function of t. This function of t can then be optimized to obtain the tightest possible bound. By using the Markov inequality on the exponentiated form, the following can be derived:
$$P(X < (1-\delta)E[X]) \leq \frac{E[e^{-tX}]}{e^{-t(1-\delta)E[X]}}.$$
By expanding $X = \sum_{i=1}^{n} X_i$ in the exponent, the following can be obtained:
$$P(X < (1-\delta)E[X]) \leq \frac{\prod_i E[e^{-tX_i}]}{e^{-t(1-\delta)E[X]}}. \qquad (12.8)$$
The aforementioned simplification uses the fact that the expectation of the product of independent variables is equal to the product of the expectations. Because each $X_i$ is Bernoulli, the following can be shown by summing the probabilities over the cases $X_i = 0$ and $X_i = 1$, respectively:
$$E[e^{-tX_i}] = 1 + E[X_i](e^{-t} - 1) < e^{E[X_i](e^{-t}-1)}.$$
The inequality follows from the polynomial expansion of $e^{E[X_i](e^{-t}-1)}$, since $1 + x < e^x$ for all $x \neq 0$. By substituting this inequality back into Eq. 12.8, and using $E[X] = \sum_i E[X_i]$, the following may be obtained:
$$P(X < (1-\delta)E[X]) \leq \frac{e^{E[X](e^{-t}-1)}}{e^{-t(1-\delta)E[X]}}.$$
The inequality on the right holds for any value of t > 0. It is desired to pick the value of t that provides the tightest possible bound. Such a value may be obtained by setting the derivative of the expression with respect to t to zero. It can be shown by working out the details of this optimization that the optimal value $t = t^*$ is as follows:
$$t^* = \ln\left(\frac{1}{1-\delta}\right). \qquad (12.9)$$
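The omitted optimization step is short and is sketched below; the quantity being minimized is the logarithm of the bound above, $f(t) = E[X](e^{-t} - 1) + t(1-\delta)E[X]$:

```latex
% Sketch of the optimization step behind Eq. 12.9.
% Minimize f(t) = E[X](e^{-t} - 1) + t(1 - \delta)E[X] over t > 0:
\begin{align*}
f'(t) = E[X]\left(-e^{-t} + (1 - \delta)\right) = 0
\;\implies\; e^{-t} = 1 - \delta
\;\implies\; t^* = \ln\!\left(\frac{1}{1 - \delta}\right).
\end{align*}
% Since f''(t) = E[X] e^{-t} > 0, this stationary point is a minimum.
```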
Substituting this value of $t^*$ into the inequality above yields Eq. 12.7.
This completes the first part of the proof.
The first two terms of the Taylor expansion of the logarithm in $(1-\delta)\ln(1-\delta)$ can be used to show that $(1-\delta)^{(1-\delta)} > e^{-\delta+\delta^2/2}$. By substituting this inequality into the denominator of Eq. 12.7, the desired result is obtained.
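For completeness, the omitted expansion can be sketched as follows:

```latex
% Taylor expansion of (1 - \delta)\ln(1 - \delta) for \delta \in (0, 1):
\begin{align*}
(1 - \delta)\ln(1 - \delta)
  &= (1 - \delta)\left(-\delta - \frac{\delta^2}{2} - \frac{\delta^3}{3} - \cdots\right) \\
  &= -\delta + \frac{\delta^2}{2} + \frac{\delta^3}{6} + \cdots
  \;>\; -\delta + \frac{\delta^2}{2},
\end{align*}
% since every term beyond \delta^2/2 is positive. Exponentiating gives
% (1 - \delta)^{(1 - \delta)} > e^{-\delta + \delta^2/2}, so the denominator of
% Eq. 12.7 satisfies e^{-\delta}/(1 - \delta)^{(1 - \delta)} < e^{-\delta^2/2}.
```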
A similar result for the upper-tail Chernoff bound may be obtained that has a slightly different form.
Theorem 12.2.4 (Upper-Tail Chernoff Bound) Let X be a random variable that can be expressed as the sum of n independent binary (Bernoulli) random variables, each of which takes on the value of 1 with probability $p_i$.
$$X = \sum_{i=1}^{n} X_i.$$
Then, for any δ ∈ (0, 2e − 1), the following is true:
$$P(X > (1+\delta)E[X]) < e^{-E[X]\delta^2/4}, \qquad (12.10)$$
where e is the base of the natural logarithm.
Proof: The first step is to show the following inequality:
$$P(X > (1+\delta)E[X]) < \left(\frac{e^{\delta}}{(1+\delta)^{(1+\delta)}}\right)^{E[X]}. \qquad (12.11)$$
As before, this can be done by introducing the unknown parameter t > 0, and converting the upper-tail inequality on X into one on $e^{tX}$. This can be bounded by the Markov inequality, and it provides a bound that is a function of t. This function of t can be optimized to obtain the tightest possible bound.
It can be further shown by algebraic simplification that the inequality in Eq. 12.11 provides the desired result, when δ ∈ (0, 2e − 1).
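As a numeric aside (not part of the original text; the value of E[X] and the δ values are arbitrary illustrative choices), the following sketch compares the tighter form of Eq. 12.11 against the simplified form of Eq. 12.10, showing what the algebraic simplification gives up:

```python
import math

def chernoff_upper_raw(mean, delta):
    """Tighter form (Eq. 12.11): (e^delta / (1+delta)^(1+delta))^E[X]."""
    return (math.exp(delta) / (1 + delta) ** (1 + delta)) ** mean

def chernoff_upper_simple(mean, delta):
    """Simplified form (Eq. 12.10), valid for delta in (0, 2e - 1)."""
    return math.exp(-mean * delta ** 2 / 4)

mean = 50.0  # illustrative E[X]; an arbitrary choice for this sketch
for delta in (0.1, 0.5, 1.0):
    print(f"delta={delta}: Eq. 12.11 gives {chernoff_upper_raw(mean, delta):.3e}, "
          f"Eq. 12.10 gives {chernoff_upper_simple(mean, delta):.3e}")
```

For every δ in the valid range, the value from Eq. 12.11 is no larger than that from Eq. 12.10; the simplified form trades some tightness for a cleaner expression.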
Next, the Hoeffding inequality will be introduced. The Hoeffding inequality is a more general tail inequality than the Chernoff bound because it does not require the underlying data values to be Bernoulli. Instead, the ith data value needs to be drawn from the bounded interval $[l_i, u_i]$, and the corresponding probability bound is expressed in terms of the parameters $l_i$ and $u_i$. Thus, the scenario for the Chernoff bound is a special case of that for the Hoeffding inequality. We state the Hoeffding inequality below, for which the upper- and lower-tail inequalities are identical.
Theorem 12.2.5 (Hoeffding Inequality) Let X be a random variable that can be expressed as the sum of n independent random variables, each of which is bounded in the range $[l_i, u_i]$.
$$X = \sum_{i=1}^{n} X_i.$$
Then, for any θ > 0, the following can be shown:
$$P(X - E[X] > \theta) \leq e^{-\frac{2\theta^2}{\sum_{i=1}^{n}(u_i - l_i)^2}} \qquad (12.12)$$
$$P(E[X] - X > \theta) \leq e^{-\frac{2\theta^2}{\sum_{i=1}^{n}(u_i - l_i)^2}} \qquad (12.13)$$
Proof: The proof for the upper tail will be briefly described here. The proof of the lower-tail inequality is identical. For an unknown parameter t, the following is true:
$$P(X - E[X] > \theta) = P\left(e^{t(X - E[X])} > e^{t\theta}\right) \qquad (12.14)$$
The Markov inequality can be used to show that the right-hand probability is at most $E[e^{t(X - E[X])}]e^{-t\theta}$. The expression $E[e^{t(X - E[X])}]$ can be expanded in terms of the individual components $X_i$. Because the expectation of the product of independent random variables is equal to the product of the expectations, the following can be shown:
$$P(X - E[X] > \theta) \leq e^{-t\theta} \prod_i E\left[e^{t(X_i - E[X_i])}\right]. \qquad (12.15)$$
The key is to show that the value of $E[e^{t(X_i - E[X_i])}]$ is at most $e^{t^2(u_i - l_i)^2/8}$. This can be shown with an argument that uses the convexity of the exponential function $e^{t(X_i - E[X_i])}$ in conjunction with Taylor's theorem (see Exercise 12).
Therefore, the following is true:
$$P(X - E[X] > \theta) \leq e^{-t\theta} \prod_i e^{t^2(u_i - l_i)^2/8}. \qquad (12.16)$$
Table 12.1: Comparison of different methods used to bound tail probabilities

| Result    | Scenario                                    | Strength |
|-----------|---------------------------------------------|----------|
| Chebychev | Any random variable                         | Weak     |
| Markov    | Nonnegative random variable                 | Weak     |
| Hoeffding | Sum of independent bounded random variables | Strong   |
| Chernoff  | Sum of independent Bernoulli variables      | Strong   |
This inequality holds for any nonnegative value of t. Therefore, to find the tightest bound, the value of t that minimizes the right-hand side of the above equation needs to be determined. The optimal value $t = t^*$ can be shown to be:
$$t^* = \frac{4\theta}{\sum_{i=1}^{n}(u_i - l_i)^2}. \qquad (12.17)$$
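The optimization behind Eq. 12.17 can be sketched for completeness, writing $S = \sum_{i=1}^{n}(u_i - l_i)^2$:

```latex
% The exponent on the right-hand side of Eq. 12.16 is g(t) = -t\theta + t^2 S/8,
% where S = \sum_{i=1}^{n} (u_i - l_i)^2. Setting the derivative to zero:
\begin{align*}
g'(t) = -\theta + \frac{tS}{4} = 0
\;\implies\;
t^* = \frac{4\theta}{S},
\qquad
g(t^*) = -\frac{2\theta^2}{S}.
\end{align*}
```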
By substituting the value of t = t∗ in Eq. 12.16, the desired result may be obtained. The lower-tail bound may be derived by applying the aforementioned steps to P (E[X] − X > θ) rather than P (X − E[X] > θ).
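As a usage illustration (a sketch of our own, not from the text), a common application of Eq. 12.12 is deciding how many bounded samples suffice before a sample mean is trustworthy. Setting $\theta = n\epsilon$ for variables in [0, 1] and solving for n gives the sample size below; the parameter names eps and alpha are choices of this sketch:

```python
import math

def hoeffding_tail(n, theta, lo=0.0, hi=1.0):
    """Eq. 12.12 for n independent variables bounded in [lo, hi]:
    P(X - E[X] > theta) <= exp(-2 theta^2 / (n (hi - lo)^2))."""
    return math.exp(-2 * theta ** 2 / (n * (hi - lo) ** 2))

def samples_needed(eps, alpha, lo=0.0, hi=1.0):
    """Smallest n such that the sample mean deviates from its expectation
    by more than eps with probability at most alpha, obtained by setting
    theta = n * eps in Eq. 12.12 and solving for n."""
    return math.ceil((hi - lo) ** 2 * math.log(1 / alpha) / (2 * eps ** 2))

# Illustrative query: mean within 0.01 with failure probability 10^-3.
print(samples_needed(eps=0.01, alpha=1e-3))  # 34539 samples for these choices
```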
Thus, the different inequalities apply to scenarios of different generality, and they also differ in strength. These scenarios are summarized in Table 12.1.
12.2.2 Synopsis Structures for the Massive-Domain Scenario
As discussed in the introduction, many streaming applications contain discrete attributes whose domains contain a large number of distinct values. Classical examples include the IP address in a network stream, or the e-mail address in an e-mail stream. Such scenarios are common in data streams because the massive number of data items in the stream are often associated with discrete identifiers of different types, such as e-mail addresses and IP addresses. The streaming objects are often associated with pairs of identifiers. For example, each element of an e-mail stream may have both a sender and a recipient. In some applications, it may be desirable to store statistics over pairwise identifiers, and therefore the pairwise combination is treated as a single integrated attribute. The domain of possible values can then be very large. For example, for an e-mail application with over a hundred million distinct senders and receivers, the number of possible pairwise combinations exceeds $10^{16}$. In such cases, even storing simple summary statistics
such as set membership, frequent item counts, and distinct element counts becomes more challenging from the perspective of space constraints.
If the number of distinct elements were small, one could simply maintain an array of counts and update it with each arriving element to create an effective summary. Such a summary could address all the aforementioned queries. However, this approach is not practical in the massive-domain scenario, because an array with $10^{16}$ elements would require more than 10 petabytes. Furthermore, for many queries, such as set membership and distinct-element counting, a reservoir sample does not work well. This is because the vast majority of the stream may contain infrequent elements, and the reservoir disproportionately overrepresents the frequent elements for queries that are agnostic to absolute frequency. Set membership and distinct-element counting are examples of such queries.
It is often difficult to create a single synopsis structure that can address all queries. Therefore, different synopsis structures are designed for different classes of queries. In the
following, a number of different synopsis structures will be described. Each of these synopsis structures is optimized for different kinds of queries. For each of the synopsis structures discussed in this chapter, the relevant queries and the query-processing approach will also be described.
12.2.2.1 Bloom Filter
Bloom filters are designed for set-membership queries of discrete elements. The set-membership query is as follows:
Given a particular element, has it ever occurred in the data stream?
Bloom filters provide a way to maintain a synopsis of the stream, so that this query can be resolved with a probabilistic bound on the accuracy. One property of this data structure is that false positives are possible, but false negatives are not. In other words, if the Bloom filter reports that an element does not belong to the stream, this will always be the case. Bloom filters are referred to as “filters” because they can be used to make important selection decisions on a stream in real time. This is because knowledge of the membership of an item in a set of stream elements plays an important role in filtering decisions, such as the removal of duplicates. This will be discussed in more detail later. First, the simple case of stream membership queries will be discussed.
A Bloom filter is a binary bit array of length m. Thus, the space requirement of the Bloom filter is m/8 bytes. The elements of the bit array are indexed starting with 0 and ending at (m − 1); therefore, the index range is {0, 1, 2, . . . , m − 1}. The Bloom filter is associated with a set of w independent hash functions denoted by $h_1(\cdot) \ldots h_w(\cdot)$. The argument of each of these hash functions is an element of the data stream, and each hash function maps uniformly at random to an integer in the range {0, . . . , m − 1}.
Consider a stream of discrete elements. These discrete elements could be e-mail addresses (either individually or as sender–receiver pairs), IP addresses, or another set of discrete values drawn from a massive domain of possible values. The bits of the Bloom filter are used to keep track of the distinct values encountered, and the hash functions are used to map the stream elements to these bits. For the following discussion, it will be assumed that the Bloom filter data structure is denoted by B.
The Bloom filter is constructed from a stream S of values as follows. All bits in the Bloom filter are initialized to 0. For each incoming stream element x, the functions $h_1(x) \ldots h_w(x)$ are applied to it, and for each i ∈ {1 . . . w}, the bit at position $h_i(x)$ in the Bloom filter is set to 1. In many cases, some of these bits might already be 1, in which case their values do not need to be changed. A pictorial representation of the Bloom filter and the update process is illustrated in Fig. 12.1, and the pseudocode for the overall update process is illustrated in Fig. 12.2. In the pseudocode, the stream is denoted by S, and the Bloom filter data structure is denoted by B. The input parameters include the size of the Bloom filter m and the number of hash functions w. It is important to note that multiple elements can map onto the same bit in the Bloom filter. This is referred to as a collision. As discussed later, collisions may lead to false positives in set-membership checking.
The Bloom filter can be used to check for the membership of an item y in the data stream. The first step is to compute the hash functions $h_1(y) \ldots h_w(y)$. Then, it is checked whether the bit at position $h_i(y)$ is 1 for each i. If at least one of these bits is 0, it is guaranteed that the element has never occurred in the data stream, because if it had, every such bit would already have been set to 1. Thus, false negatives are not possible with the Bloom filter.
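A minimal sketch of the update and membership-check logic described above is given below (the chapter's own pseudocode appears in Fig. 12.2, which is not reproduced here). The use of a salted SHA-256 digest to simulate the w independent hash functions is an implementation choice of this sketch, not prescribed by the text:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: a bit array of length m with w hash
    functions. False positives are possible; false negatives are not."""

    def __init__(self, m, w):
        self.m = m           # number of bits in the array B
        self.w = w           # number of hash functions
        self.bits = [0] * m  # bit array, initialized to 0

    def _positions(self, x):
        # Simulate w independent hash functions by salting one strong hash
        # with the function index i (an implementation choice of this sketch).
        for i in range(self.w):
            digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, x):
        # Update step: set the bit at position h_i(x) to 1 for each i.
        for pos in self._positions(x):
            self.bits[pos] = 1

    def might_contain(self, y):
        # Membership check: if any bit h_i(y) is 0, y was never added.
        return all(self.bits[pos] for pos in self._positions(y))

# Illustrative usage on a stream of IP addresses.
bf = BloomFilter(m=1024, w=4)
bf.add("192.168.0.1")
print(bf.might_contain("192.168.0.1"))  # True
print(bf.might_contain("10.0.0.1"))     # False (barring a false positive)
```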