Data Mining: The Textbook




\[
X = \sum_{i=1}^{n} X_i
\]

Then, for any δ ∈ (0, 1), we can show the following:

\[
P(X < (1 - \delta)E[X]) < e^{-E[X]\delta^2/2}, \tag{12.6}
\]
where e is the base of the natural logarithm.
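For example, with illustrative numbers (not from the text): for a sum of n = 1000 fair Bernoulli trials, E[X] = 500, and the choice δ = 0.2, Eq. 12.6 yields P(X < 400) < e^{-500 \cdot (0.2)^2/2} = e^{-10} ≈ 4.5 × 10^{-5}.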

Proof: The first step is to show the following inequality:

\[
P(X < (1 - \delta)E[X]) < \left[\frac{e^{-\delta}}{(1 - \delta)^{(1-\delta)}}\right]^{E[X]} \tag{12.7}
\]
The unknown parameter t > 0 is introduced to create a parameterized bound. The lower-tail inequality on X is converted into an upper-tail inequality on e^{-tX}. This can be bounded by the Markov inequality, which yields a bound that is a function of t. This function of t can be optimized to obtain the tightest possible bound. By using the Markov inequality on the exponentiated form, the following can be derived:




\[
P(X < (1 - \delta)E[X]) \leq \frac{E[e^{-tX}]}{e^{-t(1-\delta)E[X]}}.
\]

By expanding X = \sum_{i=1}^{n} X_i in the exponent, the following can be obtained:

\[
P(X < (1 - \delta)E[X]) \leq \frac{\prod_i E[e^{-tX_i}]}{e^{-t(1-\delta)E[X]}}. \tag{12.8}
\]
The aforementioned simplification uses the fact that the expectation of the product of independent variables is equal to the product of the expectations. Because each X_i is Bernoulli, the following can be shown by summing up the probabilities over the cases where X_i = 0 and X_i = 1, respectively:


\[
E[e^{-tX_i}] = 1 + E[X_i](e^{-t} - 1) < e^{E[X_i](e^{-t} - 1)}.
\]

The second inequality follows from the polynomial expansion of e^{E[X_i](e^{-t} - 1)}, since 1 + x ≤ e^x for all x. By substituting this inequality back into Eq. 12.8, and using E[X] = \sum_i E[X_i], the following may be obtained:




\[
P(X < (1 - \delta)E[X]) \leq \frac{e^{E[X](e^{-t} - 1)}}{e^{-t(1-\delta)E[X]}}.
\]

The expression on the right is true for any value of t > 0. It is desired to pick a value of t that provides the tightest possible bound. Such a value of t may be obtained by setting the derivative of the expression with respect to t to 0. It can be shown by working out the details of this optimization process that the optimum value of t = t^* is as follows:





\[
t^* = \ln(1/(1 - \delta)). \tag{12.9}
\]
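The optimization step can be verified directly. The logarithm of the right-hand side of the previous inequality is E[X](e^{-t} - 1) + t(1 - δ)E[X], and setting its derivative with respect to t to 0 gives:

\[
-E[X]e^{-t} + (1 - \delta)E[X] = 0 \quad \Longrightarrow \quad e^{-t} = 1 - \delta \quad \Longrightarrow \quad t^* = \ln(1/(1 - \delta)).
\]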
















By using this value of t∗ in the inequality above, it can be shown to be equivalent to Eq. 12.7.

This completes the first part of the proof.


The logarithm in the exponent of (1 − δ)^{(1-δ)} = e^{(1-\delta)\ln(1-\delta)} can be expanded to the first two terms of its Taylor series to show that (1 − δ)^{(1-δ)} > e^{-\delta + \delta^2/2}. By substituting this inequality in the denominator of Eq. 12.7, the desired result is obtained.
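To see the bound in action, the following sketch (assuming NumPy is available; the parameter values n = 1000, p = 0.5, and δ = 0.2 are the same illustrative choices as above, not values from the text) compares the empirical lower-tail probability of a Bernoulli sum against the right-hand side of Eq. 12.6:

```python
import numpy as np

# Illustrative parameters (not from the text): 1000 fair Bernoulli trials.
n, p, delta = 1000, 0.5, 0.2
mean = n * p  # E[X] = sum of E[X_i] = 500

rng = np.random.default_rng(0)
samples = rng.binomial(n, p, size=100_000)  # realizations of X = sum_i X_i

empirical = np.mean(samples < (1 - delta) * mean)
bound = np.exp(-mean * delta**2 / 2)  # right-hand side of Eq. 12.6

print(f"empirical P(X < (1 - delta) E[X]) = {empirical:.2e}")
print(f"Chernoff bound (Eq. 12.6)         = {bound:.2e}")
```

The empirical frequency should never exceed the bound; in this regime the true tail probability is in fact far smaller than the bound, which illustrates that the Chernoff bound is an upper bound rather than a tight estimate.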


A similar result for the upper-tail Chernoff bound may be obtained that has a slightly different form.


Theorem 12.2.4 (Upper-Tail Chernoff Bound) Let X be a random variable that can be expressed as the sum of n independent binary (Bernoulli) random variables, each of which takes on the value of 1 with probability p_i.




\[
X = \sum_{i=1}^{n} X_i.
\]


Then, for any δ ∈ (0, 2e − 1), the following is true:



\[
P(X > (1 + \delta)E[X]) < e^{-E[X]\delta^2/4}, \tag{12.10}
\]




where e is the base of the natural logarithm.
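With the same illustrative numbers as before (E[X] = 500 and δ = 0.2, not values from the text), Eq. 12.10 yields P(X > 600) < e^{-500 \cdot (0.2)^2/4} = e^{-5} ≈ 6.7 × 10^{-3}. The upper-tail bound is weaker than its lower-tail counterpart because of the factor of 4, rather than 2, in the denominator of the exponent.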


Proof: The first step is to show the following inequality:

\[
P(X > (1 + \delta)E[X]) < \left[\frac{e^{\delta}}{(1 + \delta)^{(1+\delta)}}\right]^{E[X]}. \tag{12.11}
\]
As before, this can be done by introducing the unknown parameter t > 0, and converting the upper-tail inequality on X into one on e^{tX}. This can be bounded by the Markov inequality, which provides a bound that is a function of t. This function of t can be optimized to obtain the tightest possible bound.
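Written out, the parameterized steps mirror the lower-tail case (this fills in the algebra that the text leaves to the reader):

\[
P(X > (1 + \delta)E[X]) = P\left(e^{tX} > e^{t(1+\delta)E[X]}\right) \leq \frac{E[e^{tX}]}{e^{t(1+\delta)E[X]}} \leq \frac{e^{E[X](e^{t} - 1)}}{e^{t(1+\delta)E[X]}},
\]

and the optimizing choice t^* = ln(1 + δ) reduces the right-hand side to the bound of Eq. 12.11.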

It can be further shown by algebraic simplification that the inequality in Eq. 12.11 provides the desired result, when δ ∈ (0, 2e − 1).


Next, the Hoeffding inequality will be introduced. The Hoeffding inequality is a more general tail inequality than the Chernoff bound because it does not require the underlying data values to be Bernoulli. In this case, the ith data value needs to be drawn from the bounded interval [l_i, u_i]. The corresponding probability bound is expressed in terms of the parameters l_i and u_i. Thus, the scenario for the Chernoff bound is a special case of that for the Hoeffding inequality. We state the Hoeffding inequality below, for which both the upper- and lower-tail inequalities are identical.


Theorem 12.2.5 (Hoeffding Inequality) Let X be a random variable that can be expressed as the sum of n independent random variables, each of which is bounded in the range [li, ui].


\[
X = \sum_{i=1}^{n} X_i.
\]


Then, for any θ > 0, the following can be shown:



\[
P(X - E[X] > \theta) \leq e^{-2\theta^2 / \sum_{i=1}^{n} (u_i - l_i)^2} \tag{12.12}
\]

\[
P(E[X] - X > \theta) \leq e^{-2\theta^2 / \sum_{i=1}^{n} (u_i - l_i)^2} \tag{12.13}
\]

Proof: The proof for the upper tail will be briefly described here. The proof of the lower-tail inequality is identical. For an unknown parameter t, the following is true:





\[
P(X - E[X] > \theta) = P\left(e^{t(X - E[X])} > e^{t\theta}\right) \tag{12.14}
\]

The Markov inequality can be used to show that the right-hand probability is at most E[e^{t(X - E[X])}] e^{-t\theta}. The expression within E[e^{t(X - E[X])}] can be expanded in terms of the individual components X_i. Because the expectation of the product of independent random variables is equal to the product of the expectations, the following can be shown:





\[
P(X - E[X] > \theta) \leq e^{-t\theta} \prod_i E\left[e^{t(X_i - E[X_i])}\right]. \tag{12.15}
\]




The key is to show that the value of E[e^{t(X_i - E[X_i])}] is at most equal to e^{t^2 (u_i - l_i)^2/8}. This can be shown with the use of an argument that uses the convexity of the exponential function e^{t(X_i - E[X_i])} in conjunction with Taylor's theorem (see Exercise 12).


Therefore, the following is true:





\[
P(X - E[X] > \theta) \leq e^{-t\theta} \prod_i e^{t^2 (u_i - l_i)^2/8}. \tag{12.16}
\]

Table 12.1: Comparison of different methods used to bound tail probabilities

Result      Scenario                                        Strength
---------------------------------------------------------------------
Chebychev   Any random variable                             Weak
Markov      Nonnegative random variable                     Weak
Hoeffding   Sum of independent bounded random variables     Strong
Chernoff    Sum of independent Bernoulli variables          Strong

This inequality holds for any nonnegative value of t. Therefore, to find the tightest bound, the value of t that minimizes the right-hand side of the above equation needs to be determined. Setting the derivative of the exponent -t\theta + t^2 \sum_{i=1}^{n} (u_i - l_i)^2/8 with respect to t to 0 shows that the optimal value of t = t^* is:



\[
t^* = \frac{4\theta}{\sum_{i=1}^{n} (u_i - l_i)^2}. \tag{12.17}
\]










By substituting the value of t = t∗ in Eq. 12.16, the desired result may be obtained. The lower-tail bound may be derived by applying the aforementioned steps to P (E[X] − X > θ) rather than P (X − E[X] > θ).
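As a usage sketch with illustrative numbers (not values from the text), the following computes the bound of Eq. 12.12 for a sum of 10,000 independent stream values, each bounded in [l_i, u_i] = [0, 1]:

```python
import math

def hoeffding_tail(theta, bounds):
    """Bound on P(X - E[X] > theta) from Eq. 12.12, where X is a sum of
    independent variables with the i-th one bounded in bounds[i] = (l_i, u_i)."""
    denom = sum((u - l) ** 2 for (l, u) in bounds)
    return math.exp(-2.0 * theta**2 / denom)

# Illustrative: 10,000 stream values, each bounded in [0, 1].
n = 10_000
bounds = [(0.0, 1.0)] * n

# Probability that the sum deviates above its expectation by more than 100
# (equivalently, that the sample mean exceeds its expectation by 0.01):
print(hoeffding_tail(100.0, bounds))  # e^{-2}, approximately 0.135
```

Because each (u_i - l_i) equals 1 here, the denominator is simply n, and the bound decays as e^{-2\theta^2/n}.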


Thus, the different inequalities may apply to scenarios of different generality, and they may also have different levels of strength. These different scenarios are illustrated in Table 12.1.


12.2.2 Synopsis Structures for the Massive-Domain Scenario


As discussed in the introduction, many streaming applications contain discrete attributes whose values are drawn from a domain with a large number of distinct values. A classical example would be the value of the IP address in a network stream, or the e-mail address in an e-mail stream. Such scenarios are more common in data streams, simply because the massive number of data items in the stream are often associated with discrete identifiers of different types. E-mail addresses and IP addresses are examples of such identifiers. The streaming objects are often associated with pairs of identifiers. For example, each element of an e-mail stream may have both a sender and a recipient. In some applications, it may be desirable to store statistics using pairwise identifiers, and therefore the pairwise combination is treated as a single integrated attribute. The domain of possible values can be rather large. For example, for an e-mail application with over a hundred million different senders and receivers, the number of possible pairwise combinations is 10^16. In such cases, even storing simple summary statistics




such as set membership, frequent item counts, and distinct element counts becomes more challenging from the perspective of space constraints.

If the number of distinct elements were small, one might simply use an array, and update the counts in this array in order to create an effective summary. Such a summary could address all the aforementioned queries. However, such an approach would not be practical in the massive-domain scenario because an array with 10^16 elements would require more than 10 petabytes. Furthermore, for many queries, such as set membership and distinct element counting, a reservoir sample would not work well. This is because the vast majority of the stream may contain infrequent elements, and the reservoir will disproportionately overrepresent the frequent elements for queries that are agnostic to the absolute frequency. Set membership and distinct-element counting are examples of such queries.


It is often difficult to create a single synopsis structure that can address all queries. Therefore, different synopsis structures are designed for different classes of queries. In the






following, a number of different synopsis structures will be described. Each of these synopsis structures is optimized for different kinds of queries. For each of the synopsis structures discussed in this chapter, the relevant queries and the query-processing approach will also be described.


12.2.2.1 Bloom Filter

Bloom filters are designed for set-membership queries of discrete elements. The set-membership query is as follows:




Given a particular element, has it ever occurred in the data stream?

Bloom filters provide a way to maintain a synopsis of the stream, so that this query can be resolved with a probabilistic bound on the accuracy. One property of this data structure is that false positives are possible, but false negatives are not. In other words, if the bloom filter reports that an element does not belong to the stream, then this will always be the case. Bloom filters are referred to as “filters” because they can be used for making important selection decisions in a stream in real time. This is because the knowledge of membership of an item in a set of stream elements plays an important role in filtering decisions, such as the removal of duplicates. This will be discussed in more detail later. First, the simple case of stream membership queries will be discussed.


A bloom filter is a binary bit array of length m. Thus, the space requirement of the bloom filter is m/8 bytes. The elements of the bit array are indexed starting with 0 and ending at (m − 1). Therefore, the index range is {0, 1, 2, . . . , m − 1}. The bloom filter is associated with a set of w independent hash functions denoted by h1(·) . . . hw(·). The argument of each of these hash functions is an element of the data stream. Each hash function maps uniformly at random to an integer in the range {0 . . . m − 1}.

Consider a stream of discrete elements. These discrete elements could be e-mail addresses (either individually or sender–receiver pairs), IP addresses, or another set of discrete values drawn on a massive domain of possible values. The bits on the bloom filter are used to keep track of the distinct values encountered. The hash functions are used to map the stream elements to the bits in the bloom filter. For the following discussion, it will be assumed that the bloom filter data structure is denoted by B.


The bloom filter is constructed from a stream S of values as follows. All bits in the bloom filter are initialized to 0. For each incoming stream element x, the functions h1(x) . . . hw(x) are applied to it. For each i ∈ {1 . . . w}, the bit at position hi(x) in the bloom filter is set to 1. In many cases, the value of some of these bits might already be 1. In such cases, the value does not need to be changed. A pictorial representation of the bloom filter and the update process is illustrated in Fig. 12.1. The pseudocode for the overall update process is illustrated in Fig. 12.2. In the pseudocode, the stream is denoted by S, and the bloom filter data structure is denoted by B. The input parameters include the size of the bloom filter m, and the number of hash functions w. It is important to note that multiple elements can map onto the same bit in the bloom filter. This is referred to as a collision. As discussed later, collisions may lead to false positives in set-membership checking.
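Figures 12.1 and 12.2 are not reproduced here, but the update process just described, together with the membership check discussed in the next paragraph, can be sketched in a few lines of Python. This is a minimal illustration under assumed details, not the book's pseudocode: the BloomFilter class name and the salted SHA-256 construction used to approximate the w independent hash functions are choices made for this sketch.

```python
import hashlib

class BloomFilter:
    """A bit array B of length m with w hash functions (a minimal sketch)."""

    def __init__(self, m: int, w: int):
        self.m = m
        self.w = w
        self.bits = [0] * m  # all bits are initialized to 0

    def _hash(self, x: str, i: int) -> int:
        # Approximates the i-th hash function h_i(x) by salting a
        # standard digest with the index i; maps x to {0, ..., m - 1}.
        digest = hashlib.sha256(f"{i}:{x}".encode()).hexdigest()
        return int(digest, 16) % self.m

    def add(self, x: str) -> None:
        # Update process: set the bit at position h_i(x) to 1 for each i.
        for i in range(self.w):
            self.bits[self._hash(x, i)] = 1

    def might_contain(self, y: str) -> bool:
        # Membership check: if any bit h_i(y) is 0, y definitely never
        # occurred (no false negatives); if all are 1, y probably occurred
        # (false positives are possible because of collisions).
        return all(self.bits[self._hash(y, i)] == 1 for i in range(self.w))

# Usage sketch
B = BloomFilter(m=1000, w=5)
B.add("198.51.100.7")
print(B.might_contain("198.51.100.7"))  # always True
print(B.might_contain("203.0.113.9"))   # almost certainly False
```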


The bloom filter can be used to check for membership of an item y in the data stream. The first step is to compute the hash functions h1(y) . . . hw(y). Then, it is checked whether the hi(y)th bit is 1 for each i ∈ {1 . . . w}. If at least one of these bits is 0, we are guaranteed that the element has not occurred in the data stream. This is because, if that element had occurred in the stream, the corresponding bits would already have been set to 1. Thus, false negatives are not possible.



yükləyin