Data Mining: The Textbook



2.4. DATA REDUCTION AND TRANSFORMATION




δt time units ago, is proportional to an exponential decay function value regulated by the decay parameter λ:

p(X) ∝ e^(−λ·δt)        (2.4)
Here e is the base of the natural logarithm. By using different values of λ, the impact of temporal decay can be regulated appropriately.
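For concreteness, the decay weight of Equation 2.4 can be computed as follows (a minimal sketch; the function and parameter names are our own illustrative choices, not from the text):

```python
import math

def decay_weight(delta_t, lam):
    """Weight of a point observed delta_t time units ago (cf. Eq. 2.4)."""
    return math.exp(-lam * delta_t)

# A larger lambda discounts old points more aggressively.
recent = decay_weight(1.0, lam=0.1)
stale = decay_weight(100.0, lam=0.1)
```

With λ = 0 all points are weighted equally; as λ grows, older points are discounted toward zero.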



  1. Stratified sampling: In some data sets, important parts of the data may not be sufficiently represented by sampling because of their rarity. A stratified sample, therefore, first partitions the data into a set of desired strata, and then independently samples from each of these strata based on predefined proportions in an application-specific way.

For example, consider a survey that measures the economic diversity of the lifestyles of different individuals in the population. Even a sample of 1 million participants may not capture a billionaire because of their relative rarity. However, a stratified sample (by income) will independently sample a predefined fraction of participants from each income group to ensure greater robustness in analysis.
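A minimal sketch of such a stratified sampler (the function name, stratum labels, and sampling fractions below are illustrative assumptions, not from the text):

```python
import random
from collections import defaultdict

def stratified_sample(points, stratum_of, fractions, seed=0):
    """Sample independently from each stratum at a predefined fraction.

    `stratum_of` maps a point to its stratum label; `fractions` maps each
    stratum label to its sampling fraction.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in points:
        strata[stratum_of(p)].append(p)
    sample = []
    for label, members in strata.items():
        # Keep at least one point so rare strata are never lost entirely.
        k = max(1, round(fractions[label] * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Example: oversample the rare "high" income stratum.
incomes = [20_000] * 990 + [5_000_000] * 10
label = lambda x: "high" if x >= 1_000_000 else "low"
s = stratified_sample(incomes, label, {"low": 0.01, "high": 0.5})
```

Here a plain 1% sample would often miss the ten high-income points entirely, while the stratified sample is guaranteed to include some of them.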


Numerous other forms of biased sampling are possible. For example, in density-biased sampling, points in higher-density regions are weighted less to ensure greater representativeness of the rare regions in the sample.


2.4.1.2 Reservoir Sampling for Data Streams


A particularly interesting form of sampling is that of reservoir sampling for data streams. In reservoir sampling, a sample of k points is dynamically maintained from a data stream. Recall that a stream is of an extremely large volume, and therefore one cannot store it on a disk to sample it. Therefore, for each incoming data point in the stream, one must use a set of efficiently implementable operations to maintain the sample.


In the static case, the probability of including a data point in the sample is k/n where k is the sample size, and n is the number of points in the “data set.” In this case, the “data set” is not static and cannot be stored on disk. Furthermore, the value of n is constantly increasing as more points arrive and previous data points (outside the sample) have already been discarded. Thus, the sampling approach works with incomplete knowledge about the previous history of the stream at any given moment in time. In other words, for each incoming data point in the stream, we need to dynamically make two simple admission control decisions:





  1. What sampling rule should be used to decide whether to include the newly incoming data point in the sample?




  2. What rule should be used to decide how to eject a data point from the sample to “make room” for the newly inserted data point?

Fortunately, it is relatively simple to design an algorithm for reservoir sampling in data streams [498]. For a reservoir of size k, the first k data points in the stream are used to initialize the reservoir. Subsequently, for the nth incoming stream data point, the following two admission control decisions are applied:





  1. Insert the nth incoming stream data point into the reservoir with probability k/n.




  2. If the newly incoming data point was inserted, then eject one of the old k data points at random to make room for the newly arriving point.
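The two admission control decisions above can be sketched in a few lines (a minimal sketch of the scheme described here, corresponding to the algorithm of [498]; the function name and seed are our own choices):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Maintain a uniform random sample of k points from a stream."""
    rng = random.Random(seed)
    reservoir = []
    for n, point in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(point)              # first k points initialize the reservoir
        elif rng.random() < k / n:               # admit the nth point with probability k/n
            reservoir[rng.randrange(k)] = point  # eject a random old point
    return reservoir

sample = reservoir_sample(range(1000), k=10)
```

Note that each stream point is examined exactly once and then discarded, so the procedure uses O(k) memory regardless of the stream length.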



It can be shown that the aforementioned rule maintains an unbiased reservoir sample from the data stream.


Lemma 2.4.1 After n stream points have arrived, the probability of any stream point being included in the reservoir is the same, and is equal to k/n.


Proof: This result is easy to show by induction. At initialization with the first k data points, the lemma is trivially true. Let us (inductively) assume that it is also true after (n − 1) data points have been received, so that the probability of each point being included in the reservoir is k/(n − 1). The probability of the arriving point being included in the reservoir is k/n, and therefore the lemma holds true for the arriving data point. It remains to prove the result for the remaining points in the data stream. There are two disjoint cases that can arise for such a point, and the final probability of the point being included in the reservoir is the sum over these two cases:


Case I: The incoming data point is not inserted into the reservoir. The probability of this event is (n − k)/n. Because, by the inductive assumption, the probability of any earlier point being included in the reservoir is k/(n − 1), the overall probability of a point being included in the reservoir together with the Case I event is the product p1 = k(n − k)/(n(n − 1)).
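Lemma 2.4.1 can also be checked empirically: repeating the reservoir procedure many times over a short stream, each point's inclusion frequency should be close to k/n (the parameters below are arbitrary illustrative choices):

```python
import random
from collections import Counter

rng = random.Random(0)
k, N, trials = 5, 50, 20_000
counts = Counter()
for _ in range(trials):
    reservoir = []
    for n in range(1, N + 1):                  # stream points labeled 1..N
        if n <= k:
            reservoir.append(n)                # first k points fill the reservoir
        elif rng.random() < k / n:             # admit with probability k/n
            reservoir[rng.randrange(k)] = n    # eject a random resident
    counts.update(reservoir)

# Each point's empirical inclusion frequency should be close to k/N = 0.1.
freqs = [counts[i] / trials for i in range(1, N + 1)]
```

Early points (which face many chances of ejection) and late points (which face low admission probability) end up with the same inclusion frequency, as the lemma asserts.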


