Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	257/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 253 254 255 256 257 258 259 260 ... 423

1-Data Mining tarjima

12.4. CLUSTERING DATA STREAMS

415

To make this decision, the cluster feature vector of M_p is used to decide if this data point falls within the maximum boundary of the microcluster M_p. If so, then the data point X_i is added to the microcluster M_p by using the additivity property of microclusters. The maximum boundary of the microcluster M_p is defined as a factor t of the root- mean-square deviation of the data points in M_p from the centroid. The value of t is a user-defined parameter, and it is typically set to 3.

If the data point does not lie within the maximum boundary of the nearest microcluster, then a new microcluster must be created containing the data point X_i. However, to create this new microcluster, the number of other microclusters must be reduced by 1 to free memory availability. This can be achieved by either deleting an old microcluster or merging two of the older clusters. This decision is made by examining the staleness of the diﬀerent clusters, and the number of points in them. The time-stamp statistics of the microclusters are examined to determine whether one of them is “suﬃciently” stale to merit removal. If this is not the case, then a merging of the two microclusters is initiated.

How is staleness of a microcluster determined? The microclusters are used to approx-imate the average time-stamp of the last m data points of the cluster M. This value is not known explicitly because the last m data points are not explicitly retained in order to minimize memory requirements. The mean μ and variance σ2 of the time-stamps in the microcluster can be used together with a normal distribution assumption of the distribution of time stamps to estimate this value. Thus, if the cluster contains m₀ > m data points, then the m/(2 · m₀)th percentile of the normal distribution with mean μ and variance σ2 may be used as the estimate. This value is referred to as the relevance stamp of cluster M. Note that μ and σ2 can be computed from the temporal components of the cluster feature vec-tors. When the smallest such relevance stamp of any microcluster is below a user-defined threshold δ , it can be eliminated. In cases where no microclusters can be safely deleted, the closest microclusters are merged. The merging operation can be eﬀectively performed because of the existence of the cluster feature vector. Distances between microclusters can be easily computed using the cluster-feature vector. When two microclusters are merged, their statistics are added together, because of the additivity property of microclusters.

12.4.2.3 Pyramidal Time Frame

The microclusters statistics are stored periodically to enable horizon-specific analysis of the clusters. This maintenance is performed during the microclustering phase. In this approach, the microcluster snapshots are stored at varying levels of granularity depending on the recency of the snapshot. Snapshots are classified into diﬀerent orders that can vary from 1 to log(T ), where T is the clock time elapsed since the beginning of the stream. The order of a snapshot regulates the level of temporal granularity at which it is stored, according to the following rules:

Snapshots of the ith order are stored at time intervals of αi, where α is an integer and α ≥ 1. Specifically, each snapshot of the ith order is stored when the clock value is exactly divisible by αi.

At any given time, only the last αl + 1 snapshots of order i are stored.

The aforementioned definition allows for considerable redundancy in storage of snapshots. For example, the clock time of 8 is divisible by 20 , 21, 22, and 23 (where α = 2) . Therefore, the state of the microclusters at a clock time of 8 simultaneously corresponds to order 0, order 1, order 2, and order 3 snapshots. From an implementation point of view, a snapshot needs to be maintained only once.

416				CHAPTER 12. MINING DATA STREAMS
Table 12.2: An example [39] of snapshots stored for α = 2 and l = 2


	Order of Snapshots		Clock times (last five snapshots)

	0	55 54 53			52 51
	1	54 52 50			48 46
	2	52 48 44			40 36
	3	48 40 32			24 16
	4	48 32			16
	5	32

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 253 254 255 256 257 258 259 260 ... 423