tailored to determining arbitrarily-shaped clusters from a small number of (pseudo)-data points. Many variations of this broader principle exist, depending on the particular type of building blocks that are chosen. For example, in grid-based methods, the fine-grained clus-ters are grid-like regions in the data space. When pre-selected data points in dense regions are clustered with a single-linkage method, the approach is referred to as DBSCAN. Other more sophisticated density- based methods, such as DENCLUE, use gradient ascent on the kernel-density estimates to create the building blocks.
6.6.1 Grid-Based Methods
In this technique, the data is discretized into p intervals that are typically equi-width inter-vals. Other variations such as equi-depth intervals are possible, though they are often not used in order to retain the intuitive notion of density. For a d-dimensional data set, this leads to a total of pd hyper-cubes in the underlying data. Examples of grids of different granularity with p = 3, 25, and 80 are illustrated in Figures 6.11b, c, and d, respectively. The resulting hyper-cubes (rectangles in Fig. 6.11) are the building blocks in terms of which the clustering is defined. A density threshold τ is used to determine the subset of the pd hyper-cubes that are dense. In most real data sets, an arbitrarily shaped cluster will result in multiple dense regions that are connected together by a side or at least a corner. There-fore, two grid regions are said to be adjacently connected, if they share a side in common. A weaker version of this definition considers two regions to be adjacently connected if they share a corner in common. Many grid-clustering algorithms use the strong definition of adjacent connectivity, where a side is used instead of a corner. In general, for data points in k-dimensional space, two k-dimensional cubes may be defined as adjacent, if they have share a surface of dimensionality at least r, for some user-defined parameter r < k.
This directly adjacent connectivity can be generalized to indirect density connectivity between grid regions that are not immediately adjacent to one another. Two grid regions are density connected, if a path can be found from one grid to the other containing only a sequence of adjacently connected grid regions. The goal of grid-based clustering is to determine the connected regions created by such grid cells. It is easy to determine such connected grid regions by using a graph-based model on the grids. Each dense grid node is associated with a node in the graph, and each edge represents adjacent connectivity. The connected components in the graph may be determined by using breadth-first or depth-first traversal on the graph, starting from nodes in different components. The data points in these connected components are reported as the final clusters. An example of the construction of the clusters of arbitrary shape from the building blocks is illustrated in Fig. 6.13. Note that the corners of the clusters found are artificially rectangular, which is one of the limitations of grid-based methods. The generic pseudocode for the grid-based approach is discussed in Fig. 6.12.
One desirable property of grid-based (and most other density-based) algorithms is that the number of data clusters is not pre-defined in advance, as in k- means algorithms. Rather, the goal is to return the natural clusters in the data together with their corresponding shapes. On the other hand, two different parameters need to be defined corresponding to the number of grid ranges p and the density threshold τ . The correct choice of these parameters is often difficult and semantically un-intuitive to guess. An inaccurate choice can lead to unintended consequences:
When the number of grid ranges selected is too small, as in Fig. 6.11b, the data points from multiple clusters will be present in the same grid region. This will result in the
180 CHAPTER 6. CLUSTER ANALYSIS
Algorithm GenericGrid(Data: D, Ranges: p, Density: τ )
begin
Discretize each dimension of data D into p ranges; Determine dense grid cells at density level τ ;
Create graph in which dense grids are connected if they are adjacent; Determine connected components of graph;
return points in each connected component as a cluster; end
Figure 6.12: Generic grid-based algorithm
(a) Data points and grid (b) Agglomerating adjacent grids
Figure 6.13: Agglomerating adjacent grids
undesirable merging of clusters. When the number of grid ranges selected is too large, as in Fig. 6.11d, this will result in many empty grid cells even within the clusters. As a result, natural clusters in the data may be disconnected by the algorithm. A larger number of grid ranges also leads to computational challenges because of the increasing number of grid cells.
The choice of the density threshold has a similar effect on the clustering. For example, when the density threshold τ is too low, all clusters, including the ambient noise, will be merged into a single large cluster. On the other hand, an unnecessarily high density can partially or entirely miss a cluster.
The two drawbacks discussed above are serious ones, especially when there are significant variations in the cluster size and density over different local regions.
Practical Issues
Grid-based methods do not require the specification of the number of clusters, and also do not assume any specific shape for the clusters. However, this comes at the expense of having to specify a density parameter τ , which is not always intuitive from an analytical perspective. The choice of grid resolution is also challenging because it is not clear how it can be related to the density τ . As will be evident later, this is much easier with DBSCAN,
Dostları ilə paylaş: |