Figure 6.16: Examples of core, border, and noise points
The strengths and weaknesses of DBSCAN are similar to those of grid-based methods. The DBSCAN method can discover clusters of arbitrary shape, and it does not require the number of clusters as an input parameter. As in the case of grid-based methods, it is susceptible to variations in the local cluster density. For example, in Figs. 6.4b and 6.14, DBSCAN will either fail to discover the sparse cluster, or it might merge the two dense clusters. In such cases, algorithms such as Mahalanobis k-means are more effective because of their ability to normalize distances with the local density. On the other hand, DBSCAN can effectively discover the clusters of Fig. 6.4a, which is not possible with the Mahalanobis k-means method.
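As an illustration, the following minimal sketch runs DBSCAN on two dense blobs with background noise. It assumes the scikit-learn implementation (any DBSCAN implementation would do); its min_samples parameter plays the role of τ and eps plays the role of Eps, and the specific values shown are illustrative choices:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two dense blobs plus uniform background noise
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.5, random_state=0)
noise = np.random.RandomState(0).uniform(-6, 6, size=(30, 2))
X = np.vstack([X, noise])

# eps corresponds to Eps, min_samples to the threshold tau
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print("clusters found:", len(set(labels) - {-1}))  # -1 marks noise points
print("noise points:", np.sum(labels == -1))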
The main computational bottleneck of DBSCAN is finding the neighbors of each data point within a distance of Eps. For a database of size n, the time complexity can be O(n^2) in the worst case. However, in some special cases, the use of a spatial index for finding the nearest neighbors can reduce this to approximately O(n · log(n)) distance computations. The O(log(n)) query performance is realized only for low-dimensional data, in which nearest-neighbor indexes work well. In general, grid-based methods are more efficient because they partition the space, rather than opting for the more computationally intensive approach of finding the nearest neighbors.
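A minimal sketch of this speedup, assuming SciPy is available (the KD-tree here stands in for any spatial index; brute force is shown only as a comment for contrast):

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
X = rng.rand(10000, 2)   # low-dimensional data, where spatial indexes work well
eps = 0.05

# Brute force would cost O(n^2) distance computations:
# neighbors = [np.where(np.linalg.norm(X - p, axis=1) <= eps)[0] for p in X]

# With a KD-tree, each radius query costs roughly O(log n) in low dimensions
tree = cKDTree(X)
neighbors = tree.query_ball_point(X, r=eps)
print("average neighborhood size:", np.mean([len(nb) for nb in neighbors]))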
The parameters τ and Eps are related to one another in an intuitive way, which is useful for parameter setting. In particular, after the value of τ has been set by the user, the value of Eps can be determined in a data-driven way. The idea is to use a value of Eps that can capture most of the data points in clusters as core points. This can be achieved as follows. For each data point, its τ-nearest neighbor distance is determined. Typically, the vast majority of the data points inside clusters will have a small value of the τ-nearest neighbor distance. However, this value often increases suddenly for a small number of noisy points (or points at the fringes of clusters). Therefore, the key is to identify the tail of the distribution of τ-nearest neighbor distances. Statistical tests, such as the Z-value test, can be used to determine the value at which the τ-nearest neighbor distance starts increasing abruptly. The τ-nearest neighbor distance at this cutoff point provides a suitable value of Eps.
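The following sketch illustrates this data-driven choice of Eps. It assumes scikit-learn for the nearest-neighbor computation, and the Z-value threshold of 3 is an illustrative choice rather than one prescribed by the text:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_eps(X, tau, z_cutoff=3.0):
    # Distance of each point to its tau-th nearest neighbor
    # (tau + 1 because each point is its own nearest neighbor)
    nn = NearestNeighbors(n_neighbors=tau + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    knn_dist = dist[:, -1]
    # Use a Z-value test to flag the tail of the distribution
    mu, sigma = knn_dist.mean(), knn_dist.std()
    in_body = knn_dist[(knn_dist - mu) / sigma < z_cutoff]
    # Largest tau-nearest neighbor distance before the abrupt increase
    return in_body.max()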
Algorithm DENCLUE(Data: D, Density: τ )
begin
Determine the density attractor of each data point in D with gradient ascent of Equation 6.20;
Create clusters of data points that converge to the same density attractor;
Discard clusters whose density attractors have density less than τ, and report their points as outliers;
Merge clusters whose density attractors are connected by a path of density at least τ;
return points in each cluster;
end
Figure 6.17: Basic DENCLUE algorithm
6.6.3 DENCLUE
The DENCLUE algorithm is based on firm statistical foundations rooted in kernel-density estimation. Kernel-density estimation can be used to create a smooth profile of the density distribution. In kernel-density estimation, the density f(X) at coordinate X is defined as a sum of the influence (kernel) functions K(·) over the n different data points in the database D:
f(X) = (1/n) · Σ_{i=1}^{n} K(X − X_i)
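As an illustration, the following minimal sketch evaluates this density with a Gaussian kernel and performs a simple gradient ascent toward a density attractor. The Gaussian kernel, the bandwidth h, and the fixed step size are illustrative assumptions; the update is a stand-in for the gradient ascent of Equation 6.20, not a full DENCLUE implementation:

import numpy as np

def kde_density(x, data, h=1.0):
    # Gaussian kernel density estimate at point x
    sq = np.sum((data - x) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * h ** 2)))

def density_attractor(x, data, h=1.0, step=0.1, iters=200):
    # Gradient ascent on the kernel density, starting from x
    for _ in range(iters):
        diffs = data - x
        w = np.exp(-np.sum(diffs ** 2, axis=1) / (2 * h ** 2))
        grad = (w[:, None] * diffs).sum(axis=0) / (len(data) * h ** 2)
        x = x + step * grad
    return x

# Points whose ascent ends near the same attractor form one cluster;
# attractors with kde_density below tau would be discarded as outliers.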