Data Mining: The Textbook




7.3. SCALABLE DATA CLUSTERING


The variance of any set of points can be expressed in terms of their first and second moments. It is easy to see that the centroid (vector) of the cluster is simply $LS/m$. The variance of a random variable $Z$ is defined to be $E[Z^2] - E[Z]^2$, where $E[\cdot]$ denotes the expected value. Therefore, the variance along the $i$th dimension can be expressed as $SS_i/m - (LS_i/m)^2$. Here, $SS_i$ and $LS_i$ represent the components of the corresponding moment vectors along the $i$th dimension. The sum of these dimension-specific variances yields the variance of the entire cluster. Furthermore, the distance of any point to the centroid can be computed from the cluster feature by using the computed centroid $LS/m$. Therefore, the cluster feature vector contains all the information needed to insert a data point into the CF-Tree.
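For concreteness, the following minimal Python sketch maintains such a cluster feature as the triple (m, LS, SS) and derives the centroid and variance from it. The class and method names are illustrative, not part of the BIRCH specification.

```python
import numpy as np

class ClusterFeature:
    """Cluster feature (CF): point count m, linear sum LS, squared sum SS."""

    def __init__(self, dim):
        self.m = 0               # zeroth moment: number of points
        self.LS = np.zeros(dim)  # first moment: per-dimension linear sum
        self.SS = np.zeros(dim)  # second moment: per-dimension sum of squares

    def insert(self, x):
        """Absorbing a point requires only additive updates to the moments."""
        x = np.asarray(x, dtype=float)
        self.m += 1
        self.LS += x
        self.SS += x ** 2

    def centroid(self):
        return self.LS / self.m

    def variance(self):
        """Total cluster variance: sum over dimensions of SS_i/m - (LS_i/m)^2."""
        return float(np.sum(self.SS / self.m - (self.LS / self.m) ** 2))
```

Because all three statistics are additive, the CF of a merged cluster is simply the componentwise sum of the CFs being merged, a property that the insertion and rebuilding steps described below rely on.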

Each leaf node in the CF-Tree has a diameter threshold T. The diameter [2] can be any spread measure of the cluster, such as its radius or variance, as long as it can be computed directly from the cluster feature vector. The value of T regulates the granularity of the clustering, the height of the tree, and the aggregate number of clusters at the leaf nodes. Lower values of T result in a larger number of fine-grained clusters. Because the CF-Tree is always assumed to be main-memory resident, the size of the data set typically has a critical impact on the value of T. Smaller data sets allow the use of a small threshold T, whereas larger data sets require a larger value of the threshold T. Therefore, an incremental approach such as BIRCH gradually increases the value of T to balance the greater need for memory with increasing data size. In other words, the value of T may need to be increased whenever the tree can no longer be kept within main-memory availability.

The incremental insertion of a data point into the tree is performed with a top-down approach. Specifically, the closest centroid is selected at each level for insertion into the tree structure. This approach is similar to the insertion process in a traditional database index such as a B-Tree. The cluster feature vectors are updated along the corresponding path of the tree by simple addition. At the leaf node, the data point is inserted into its closest cluster only if the insertion does not increase the cluster diameter beyond the threshold T . Otherwise, a new cluster must be created containing only that data point. This new cluster is added to the leaf node, if it is not already full. If the leaf node is already full, then it needs to be split into two nodes. Therefore, the cluster feature entries in the old leaf node need to be assigned to one of the two new nodes. The two cluster features in the leaf node, whose centroids are the furthest apart, can serve as the seeds for the split. The remaining entries are assigned to the seed node to which they are closest. As a result, the branching factor of the parent node of the leaf increases by 1. Therefore, the split might result in the branching factor of the parent increasing beyond B. If this is the case, then the parent would need to be split as well in a similar way. Thus, the split may be propagated upward until the branching factors of all nodes are below B. If the split propagates all the way to the root node, then the height of the CF-Tree increases by 1.
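The leaf-level logic of this insertion can be sketched as follows, reusing the ClusterFeature class above. This is a simplified sketch, not the full algorithm: the top-down descent, the updates along the path, and the split itself are assumed to be handled by the caller, and diameter() stands for whatever spread measure has been chosen (one concrete choice appears in the footnote at the end of this section).

```python
import copy
import numpy as np

# diameter(cf): assumed helper returning the chosen spread measure of a CF,
# e.g., the rms_diameter() sketch given with the footnote below.

def insert_at_leaf(leaf_entries, x, T, max_entries):
    """Try to absorb point x into the closest CF entry at a leaf.

    Returns True if appending a new entry overflowed the leaf, i.e., the
    caller must split it (possibly propagating splits up the tree).
    """
    x = np.asarray(x, dtype=float)
    if leaf_entries:
        # Choose the entry whose centroid is closest to x.
        closest = min(leaf_entries,
                      key=lambda cf: np.linalg.norm(cf.centroid() - x))
        # Tentative insertion: accept only if the diameter stays within T.
        trial = copy.deepcopy(closest)
        trial.insert(x)
        if diameter(trial) <= T:
            closest.insert(x)
            return False
    # Otherwise, open a new singleton cluster for x at this leaf.
    singleton = ClusterFeature(len(x))
    singleton.insert(x)
    leaf_entries.append(singleton)
    return len(leaf_entries) > max_entries
```

When the overflow flag is returned, the two entries with the farthest-apart centroids seed the split, and the remaining entries are assigned to the closer seed, exactly as described above.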


These repeated splits may sometimes result in the tree running out of main memory. In such cases, the CF-Tree needs to be rebuilt by increasing the threshold T and reinserting the old leaf nodes into a new tree with the higher threshold. Typically, this reinsertion results in the merging of some older clusters into larger clusters that meet the new, modified threshold. Therefore, the memory requirement of the new tree is lower. Because the old leaf nodes are reinserted with the use of cluster feature vectors, this step can be accomplished without reading the original database from disk. Note that the cluster feature vector of a merged cluster can be computed by adding the cluster feature vectors of its constituent clusters.
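A minimal sketch of this additivity, again in terms of the ClusterFeature class above: merging two clusters during reinsertion amounts to adding their moments componentwise, with no access to the original points.

```python
def merge_cf(cf1, cf2):
    """CF of the union of two clusters: componentwise sum of the moments."""
    merged = ClusterFeature(len(cf1.LS))
    merged.m = cf1.m + cf2.m
    merged.LS = cf1.LS + cf2.LS
    merged.SS = cf1.SS + cf2.SS
    return merged
```

During a rebuild, each old leaf entry is reinserted as such a pre-aggregated CF, and entries that now fall within the larger threshold T are merged in exactly this way.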





[2] The original BIRCH algorithm proposes to use the pairwise root mean square (RMS) distance between the cluster's data points as the diameter. This is one possible measure of the intracluster distance. This value can also be shown to be computable from the CF vector as $\sqrt{\sum_{i=1}^{d} \left( 2 \cdot m \cdot SS_i - 2 \cdot LS_i^2 \right) / \left( m \cdot (m-1) \right)}$.
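As a sketch, this diameter can be computed from the CF triple alone, using the ClusterFeature class above. The $m \cdot (m-1)$ pair normalization follows the original BIRCH definition; this is an assumption here, since other normalizations also appear in the literature.

```python
import math
import numpy as np

def rms_diameter(cf):
    """Pairwise RMS distance between cluster points, from the CF alone."""
    if cf.m < 2:
        return 0.0
    # Sum over dimensions of (2*m*SS_i - 2*LS_i^2) gives the total squared
    # pairwise distance over all ordered point pairs.
    numerator = float(np.sum(2.0 * cf.m * cf.SS - 2.0 * cf.LS ** 2))
    return math.sqrt(numerator / (cf.m * (cf.m - 1)))
```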


