Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə242/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   238   239   240   241   242   243   244   245   ...   423
1-Data Mining tarjima

f (r, n) = e−λ(n−r)

(12.2)

The parameter λ defines the bias rate and typically lies in the range [0, 1]. In general, this parameter λ is chosen in an application-specific way. A choice of λ = 0 represents the unbiased case. The exponential bias function defines the class of memoryless functions in which the future probability of retaining a current point in the reservoir is independent of its past history or arrival time. It can be shown that this problem is interesting only in space-constrained scenarios, where the size of the reservoir k is strictly less than 1. This is because it can be shown [35] that an exponentially biased sample from a stream of infinite length, will not exceed 1 in expected size. This is also referred to as the maximum reservoir requirement. The following discussion is based on the assumption that k < 1.


The algorithm starts with an empty reservoir. The following replacement policy is used to fill up the reservoir. Assume that at the time of (just before) the arrival of the nth point, the fraction of the reservoir filled is F (n) [0, 1]. When the (n + 1)th point arrives, it is inserted into the reservoir with insertion probability1 λ · k. However, one of the old points in the reservoir is not necessarily deleted because the reservoir is only partially full. A coin is flipped with the success probability F (n). In the event of a success, one of the points in the reservoir is randomly selected and replaced with the incoming (n + 1)th point. In the event of a failure, there is no deletion, and the (n + 1)th point is added to the reservoir. In the latter case, the number of points in the reservoir (the current sample size) increases by 1. In this approach, the reservoir fills up fast early in the process, but then levels off, as it reaches near its capacity. The reader is referred to the bibliographic notes for the proof of correctness of this approach. A variant of this approach that fills up the reservoir even faster is also discussed in the same work.





1This value is always at most 1, because k < 1.



394 CHAPTER 12. MINING DATA STREAMS

12.2.1.2 Useful Theoretical Bounds for Sampling


While the reservoir method provides data samples, it is often desirable to obtain quality bounds on the results obtained with the use of the sampling process. A common applica-tion of sampling is the estimation of statistical aggregates with the reservoir sample. The accuracy of these aggregates is often quantified with the use of tail inequalities.


The simplest tail inequality is the Markov inequality. This inequality is defined for prob-ability distributions that take on only nonnegative values. Let X be a random variable with the probability distribution fX (x), a mean of E[X], and a variance of V ar[X].


Theorem 12.2.1 (Markov Inequality) Let X be a random variable that takes on only nonnegative random values. Then, for any constant α satisfying E[X] < α, the following is true:





P (X > α) ≤ E[X|/α

(12.3)

Proof: Let fX (x) represent the density function for the random variable X. Then, we have:


E[X] = x xfX (x)dx



  • 0≤x≤α xfX (x)dx + x>α xfX (x)dx

x>α xfX (x)dx x>α αfX (x)dx.

The first inequality follows from the nonnegativity of x, and the second follows from the fact that the integral is computed only over cases where x > α. The term on the right-hand side of the last line is exactly equal to αP (X > α). Therefore, the following is true:






Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   238   239   240   241   242   243   244   245   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin