Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə241/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   237   238   239   240   241   242   243   244   ...   423
1-Data Mining tarjima

n(n−1)



  1. The incoming data point is inserted into the reservoir. The probability of Case II is equal to insertion probability k/n of incoming data points. Subsequently, existing reservoir points are retained with probability (k − 1)/k because exactly one of them is ejected. Because the inductive assumption implies that any of the earlier points in the stream was originally present in the reservoir with probability k/(n − 1), it implies that the probability of a point being included in the reservoir and Case II event is given by the product p2 of the three aforementioned probabilities:




p2 =

k




k − 1




k




=

k(k − 1)

(12.1)




n

k




n − 1

n(n − 1)






















Therefore, the total probability of a stream point being retained in the reservoir after the nth data point arrival is given by the sum of p1 and p2. It can be shown that this is equal to k/n.


The major problem with this approach is that it cannot handle concept drift because the data is uniformly sampled without decay.



12.2. SYNOPSIS DATA STRUCTURES FOR STREAMS

393

12.2.1.1 Handling Concept Drift


In streaming scenarios, recent data are generally considered more important than older data. This is because the data generating process may change over time, and the older data are often considered “stale” from the perspective of analytical insights. A uniform random sample from the reservoir will contain data points that are distributed uniformly over time. Typically, most streaming applications use a decay-based framework to regulate the relative importance of data points, so that more recent data points have a higher probability to be included in the sample. This is achieved with the use of a bias function.


The bias function associated with the rth data point, at the time of arrival of the nth data point, is given by f (r, n). This function is related to the probability p(r, n) of the rth data point belonging to the reservoir at the time of arrival of the nth data point. In other words, the value of p(r, n) is proportional to f (r, n). It is reasonable to assume that the function f (r, n) decreases monotonically with n (for fixed r ), and increases monotonically with r (for fixed n). In other words, recent data points have a higher probability of belonging to the reservoir. This kind of sampling will result in a bias-sensitive sample S(n) of data points.


Definition 12.2.1 Let f (r, n) be the bias function for the rth point at the time of arrival of the nth point. A biased sample S(n) at the time of arrival of the nth point in the stream is defined as a sample such that the relative probability p(r, n) of the rth point belonging to the sample S(n) (of size n) is proportional to f (r, n).


In general, it is an open problem to perform reservoir sampling with arbitrary bias functions.


However, methods exist for the case of the commonly used exponential bias function:






Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   237   238   239   240   241   242   243   244   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin