6.6. GRID-BASED AND DENSITY-BASED ALGORITHMS
|
181
|
Figure 6.14: Impact of local distributions on density-based methods
where the resolution of the density-based approach is more easily related to the specified density threshold.
A major challenge with many density-based methods, including grid-based methods, is that they use a single density parameter τ globally. However, the clusters in the underlying data may have varying density, as illustrated in Fig. 6.14. In this particular case, if the density threshold is selected to be too high, then cluster C may be missed. On the other hand, if the density threshold is selected to be too low, then clusters A and B may be merged artificially. In such cases, distance-based algorithms, such as k-means, may be more effective than a density-based approach. This problem is not specific to the grid-based method but is generally encountered by all density-based methods.
The use of rectangular grid regions is an approximation of this class of methods. This approximation degrades with increasing dimensionality because high-dimensional rectan-gular regions are poor approximations of the underlying clusters. Furthermore, grid-based methods become computationally infeasible in high dimensions because the number of grid cells increase exponentially with the underlying data dimensionality.
6.6.2 DBSCAN
The DBSCAN approach works on a very similar principle as grid-based methods. However, unlike grid-based methods, the density characteristics of data points are used to merge them into clusters. Therefore, the individual data points in dense regions are used as building blocks after classifying them on the basis of their density.
The density of a data point is defined by the number of points that lie within a radius Eps of that point (including the point itself). The densities of these spherical regions are used to classify the data points into core, border, or noise points. These notions are defined as follows:
Core point: A data point is defined as a core point, if it contains4 at least τ data points.
Border point: A data point is defined as a border point, if it contains less than τ points, but it also contains at least one core point within a radius Eps.
The parameter M inP ts is used in the original DBSCAN description. However, the notation τ is used here to retain consistency with the grid-clustering description.
182 CHAPTER 6. CLUSTER ANALYSIS
Algorithm DBSCAN(Data: D, Radius: Eps, Density: τ )
begin
Determine core, border and noise points of D at level (Eps, τ ); Create graph in which core points are connected
if they are within Eps of one another; Determine connected components in graph; Assign each border point to connected component
with which it is best connected;
return points in each connected component as a cluster;
end
Figure 6.15: Basic DBSCAN algorithm
Noise point: A data point that is neither a core point nor a border point is defined as a noise point.
Examples of core points, border points, and noise points are illustrated in Fig. 6.16 for
= 10. The data point A is a core point because it contains 10 data points within the illustrated radius Eps. On the other hand, data point B contains only 6 points within a radius of Eps, but it contains the core point A. Therefore, it is a border point. The data point C is a noise point because it contains only 4 points within a radius of Eps, and it does not contain any core point.
After the core, border, and noise points have been determined, the DBSCAN clustering algorithm proceeds as follows. First, a connectivity graph is constructed with respect to the core points, in which each node corresponds to a core point, and an edge is added between a pair of core points, if and only if they are within a distance of Eps from one another. Note that the graph is constructed on the data points rather than on partitioned regions, as in grid-based algorithms. All connected components of this graph are identified. These corre-spond to the clusters constructed on the core points. The border points are then assigned to the cluster with which they have the highest level of connectivity. The resulting groups are reported as clusters and noise points are reported as outliers. The basic DBSCAN algorithm is illustrated in Fig. 6.15. It is noteworthy that the first step of graph-based clustering is identical to a single-linkage agglomerative clustering algorithm with termination-criterion of Eps-distance, which is applied only to the core points. Therefore, the DBSCAN algorithm may be viewed as an enhancement of single-linkage agglomerative clustering algorithms by treating marginal (border) and noisy points specially. This special treatment can reduce the outlier-sensitive chaining characteristics of single-linkage algorithms without losing the abil-ity to create clusters of arbitrary shape. For example, in the pathological case of Fig. 6.9(b), the bridge of noisy data points will not be used in the agglomerative process if Eps and τ are selected appropriately. In such cases, DBSCAN will discover the correct clusters in spite of the noise in the data.
Practical Issues
The DBSCAN approach is very similar to grid-based methods, except that it uses circular regions as building blocks. The use of circular regions generally provides a smoother contour to the discovered clusters. Nevertheless, at more detailed levels of granularity, the two methods will tend to become similar. The strengths and weaknesses of DBSCAN are also
Dostları ilə paylaş: |