follows:
1. For each dimension, the sum of the squares of the data values is maintained in CF 2x.
Thus, CF 2x contains d values. The p-th entry of CF 2x is equal to n (xp )2.
j=1 ij
2. For each dimension, the sum of the data values is maintained in CF 1x. Thus, CF 1x
contains d values. The p-th entry of CF 1x is equal to n xp .
j=1 ij
3. The sum of the squares of the time stamps Ti1 . . . Tin is maintained in CF 2t.
The sum of the time stamps Ti1 . . . Tin is maintained in CF 1t.
The number of data points is maintained in n.
An important property of microclusters is that they are additive. In other words, the micro-clusters can be updated by purely additive operations. Note that each of the 2 ·d + 3 compo-nents of the microcluster can be expressed as a linearly separable sum over the constituent data points in the microcluster. This is an important property for enabling the efficient maintenance of the microclusters in the online streaming scenario. When a data point Xi is added to a microcluster, the corresponding statistics of the data point Xi need to be added to each of the (2 · d + 3) components. Similarly, the microclusters for the stream period (t1, t2) can be obtained by subtracting the microclusters at time t1 from those at time t2. This property is important for enabling the computation of the higher-level macroclusters over an arbitrary time horizon (t1, t2) from the microclusters stored at different times.
12.4.2.2 Microclustering Algorithm
The data stream clustering algorithm can generate approximate clusters in any user-specified length of history from the current instant. This is achieved by storing the micro-clusters at particular moments in the stream that are referred to as snapshots. At the same time, the current snapshot of microclusters is always maintained by the algorithm. The additive property can be used to extract microclusters from any time horizon. The macroclustering phase is applied to this representation.
The input to the algorithm is the number of microclusters, denoted by k. The online phase of the algorithm works in an iterative fashion, by always maintaining a current set of microclusters. Whenever a new data point Xi arrives, the microclusters are updated to reflect the changes. Each data point either needs to be absorbed by a microcluster, or it needs to be put in a cluster of its own. The first preference is to absorb the data point into a currently existing microcluster. The distance of the data point to the current microcluster centroids M1 . . . Mk is determined. The distance value of the data point Xi to the centroid of the microcluster Mj is denoted by dist (Mj , X i). Because the centroid of the microcluster can be derived from the cluster feature vector, this distance value can be computed easily. The closest centroid Mp is determined. The data point Xi is assigned to its closest cluster Mp , unless it is deemed that the data point does not “naturally” belong to that (or any other) microcluster. In such cases, the data point Xi needs to be assigned a (new) microcluster of its own. Therefore, before assigning a data point to a microcluster, it first needs to be decided whether it naturally belongs to its closest microcluster centroid Mp.
Dostları ilə paylaş: |