Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	241/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 237 238 239 240 241 242 243 244 ... 423

1-Data Mining tarjima

n(n−1)

The incoming data point is inserted into the reservoir. The probability of Case II is equal to insertion probability k/n of incoming data points. Subsequently, existing reservoir points are retained with probability (k − 1)/k because exactly one of them is ejected. Because the inductive assumption implies that any of the earlier points in the stream was originally present in the reservoir with probability k/(n − 1), it implies that the probability of a point being included in the reservoir and Case II event is given by the product p₂ of the three aforementioned probabilities:

p₂ =	k	k − 1	k	=	k(k − 1)	(12.1)
	n	k	n − 1		n(n − 1)

Therefore, the total probability of a stream point being retained in the reservoir after the nth data point arrival is given by the sum of p₁ and p₂. It can be shown that this is equal to k/n.

The major problem with this approach is that it cannot handle concept drift because the data is uniformly sampled without decay.

12.2. SYNOPSIS DATA STRUCTURES FOR STREAMS

393

12.2.1.1 Handling Concept Drift

In streaming scenarios, recent data are generally considered more important than older data. This is because the data generating process may change over time, and the older data are often considered “stale” from the perspective of analytical insights. A uniform random sample from the reservoir will contain data points that are distributed uniformly over time. Typically, most streaming applications use a decay-based framework to regulate the relative importance of data points, so that more recent data points have a higher probability to be included in the sample. This is achieved with the use of a bias function.

The bias function associated with the rth data point, at the time of arrival of the nth data point, is given by f (r, n). This function is related to the probability p(r, n) of the rth data point belonging to the reservoir at the time of arrival of the nth data point. In other words, the value of p(r, n) is proportional to f (r, n). It is reasonable to assume that the function f (r, n) decreases monotonically with n (for fixed r ), and increases monotonically with r (for fixed n). In other words, recent data points have a higher probability of belonging to the reservoir. This kind of sampling will result in a bias-sensitive sample S(n) of data points.

Definition 12.2.1 Let f (r, n) be the bias function for the rth point at the time of arrival of the nth point. A biased sample S(n) at the time of arrival of the nth point in the stream is defined as a sample such that the relative probability p(r, n) of the rth point belonging to the sample S(n) (of size n) is proportional to f (r, n).

In general, it is an open problem to perform reservoir sampling with arbitrary bias functions.

However, methods exist for the case of the commonly used exponential bias function:

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 237 238 239 240 241 242 243 244 ... 423