Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə257/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   253   254   255   256   257   258   259   260   ...   423
1-Data Mining tarjima

12.4. CLUSTERING DATA STREAMS

415

To make this decision, the cluster feature vector of Mp is used to decide if this data point falls within the maximum boundary of the microcluster Mp. If so, then the data point Xi is added to the microcluster Mp by using the additivity property of microclusters. The maximum boundary of the microcluster Mp is defined as a factor t of the root- mean-square deviation of the data points in Mp from the centroid. The value of t is a user-defined parameter, and it is typically set to 3.


If the data point does not lie within the maximum boundary of the nearest microcluster, then a new microcluster must be created containing the data point Xi. However, to create this new microcluster, the number of other microclusters must be reduced by 1 to free memory availability. This can be achieved by either deleting an old microcluster or merging two of the older clusters. This decision is made by examining the staleness of the different clusters, and the number of points in them. The time-stamp statistics of the microclusters are examined to determine whether one of them is “sufficiently” stale to merit removal. If this is not the case, then a merging of the two microclusters is initiated.


How is staleness of a microcluster determined? The microclusters are used to approx-imate the average time-stamp of the last m data points of the cluster M. This value is not known explicitly because the last m data points are not explicitly retained in order to minimize memory requirements. The mean μ and variance σ2 of the time-stamps in the microcluster can be used together with a normal distribution assumption of the distribution of time stamps to estimate this value. Thus, if the cluster contains m0 > m data points, then the m/(2 · m0)th percentile of the normal distribution with mean μ and variance σ2 may be used as the estimate. This value is referred to as the relevance stamp of cluster M. Note that μ and σ2 can be computed from the temporal components of the cluster feature vec-tors. When the smallest such relevance stamp of any microcluster is below a user-defined threshold δ , it can be eliminated. In cases where no microclusters can be safely deleted, the closest microclusters are merged. The merging operation can be effectively performed because of the existence of the cluster feature vector. Distances between microclusters can be easily computed using the cluster-feature vector. When two microclusters are merged, their statistics are added together, because of the additivity property of microclusters.


12.4.2.3 Pyramidal Time Frame


The microclusters statistics are stored periodically to enable horizon-specific analysis of the clusters. This maintenance is performed during the microclustering phase. In this approach, the microcluster snapshots are stored at varying levels of granularity depending on the recency of the snapshot. Snapshots are classified into different orders that can vary from 1 to log(T ), where T is the clock time elapsed since the beginning of the stream. The order of a snapshot regulates the level of temporal granularity at which it is stored, according to the following rules:





  • Snapshots of the ith order are stored at time intervals of αi, where α is an integer and α ≥ 1. Specifically, each snapshot of the ith order is stored when the clock value is exactly divisible by αi.




  • At any given time, only the last αl + 1 snapshots of order i are stored.

The aforementioned definition allows for considerable redundancy in storage of snapshots. For example, the clock time of 8 is divisible by 20 , 21, 22, and 23 (where α = 2) . Therefore, the state of the microclusters at a clock time of 8 simultaneously corresponds to order 0, order 1, order 2, and order 3 snapshots. From an implementation point of view, a snapshot needs to be maintained only once.





416




CHAPTER 12. MINING DATA STREAMS




Table 12.2: An example [39] of snapshots stored for α = 2 and l = 2











































Order of Snapshots

Clock times (last five snapshots)































0

55 54 53

52 51













1

54 52 50

48 46













2

52 48 44

40 36













3

48 40 32

24 16













4

48 32

16













5

32




































Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   253   254   255   256   257   258   259   260   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin