Data Mining: The Textbook




$$
f(X) = \frac{1}{n} \sum_{i=1}^{n} K(X - X_i). \qquad (6.18)
$$
A wide variety of kernel functions may be used, and a common choice is the Gaussian kernel.


For a d-dimensional data set, the Gaussian kernel is defined as follows:




















$$
K(X - X_i) = \left(\frac{1}{h\sqrt{2\pi}}\right)^{d} e^{-\frac{\|X - X_i\|^2}{2\,h^2}}. \qquad (6.19)
$$
The term ||X − Xi|| represents the Euclidean distance between these d-dimensional data points. Intuitively, the effect of kernel-density estimation is to replace each discrete data point with a smooth “bump,” and the density at a point is the sum of these “bumps.” This yields a smooth profile of the data in which random artifacts are suppressed. Here, h represents the bandwidth of the estimate and regulates its smoothness. Large values of the bandwidth h smooth out the noisy artifacts but may also lose some detail about the distribution. In practice, the value of h is chosen heuristically in a data-driven manner. An example of a kernel-density estimate in a data set with three natural clusters is illustrated in Fig. 6.18.
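As a concrete illustration of Eqs. 6.18 and 6.19, the short Python sketch below evaluates the Gaussian kernel-density estimate at a query point. The function name, the toy data, and the bandwidth value are illustrative choices rather than part of any standard implementation.

```python
import numpy as np

def gaussian_kde_density(query, points, h):
    """Kernel-density estimate of Eq. 6.18 with the Gaussian kernel of Eq. 6.19.

    query  : d-dimensional query point
    points : (n, d) array of data points X_1 ... X_n
    h      : bandwidth regulating the smoothness of the estimate
    """
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    sq_dist = np.sum((points - query) ** 2, axis=1)            # ||X - X_i||^2
    norm = (1.0 / (h * np.sqrt(2.0 * np.pi))) ** d             # Gaussian normalizer
    return norm * np.mean(np.exp(-sq_dist / (2.0 * h ** 2)))   # average of n "bumps"

# Toy example: the density near a small cluster is much higher than far away from it
data = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2], [5.0, 5.0]])
print(gaussian_kde_density(np.array([0.0, 0.1]), data, h=0.5))
print(gaussian_kde_density(np.array([10.0, 10.0]), data, h=0.5))
```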

The goal is to determine clusters by using a density threshold τ that intersects this smooth density profile. Examples are illustrated in Figs. 6.18 and 6.19. The data points that lie in each (arbitrarily shaped) connected contour of this intersection will belong to the corresponding cluster. Some of the border data points of a cluster that lie just outside this contour may also be included because of the way in which data points are associated with clusters with the use of a hill-climbing approach. The choice of the density threshold will impact the number of clusters in the data. For example, in Fig. 6.18, a low-density threshold is used, and therefore two distinct clusters are merged. As a result, the approach will report
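One simple way to realize this thresholding idea, assuming 2-dimensional data, is sketched below: the density is evaluated on a regular grid, cells with density at least τ are retained, and connected regions of retained cells are labeled as clusters. The grid resolution and the use of scipy.ndimage.label for connected-component labeling are implementation choices made purely for illustration.

```python
import numpy as np
from scipy.ndimage import label

def threshold_clusters_2d(points, h, tau, grid_size=100):
    """Label connected regions of a 2-d kernel-density surface that exceed tau."""
    points = np.asarray(points, dtype=float)
    xs = np.linspace(points[:, 0].min() - 3 * h, points[:, 0].max() + 3 * h, grid_size)
    ys = np.linspace(points[:, 1].min() - 3 * h, points[:, 1].max() + 3 * h, grid_size)
    gx, gy = np.meshgrid(xs, ys)
    grid = np.stack([gx.ravel(), gy.ravel()], axis=1)

    # Density of Eqs. 6.18/6.19 at every grid vertex (d = 2)
    sq = np.sum((grid[:, None, :] - points[None, :, :]) ** 2, axis=2)
    dens = np.mean(np.exp(-sq / (2.0 * h ** 2)), axis=1) / (2.0 * np.pi * h ** 2)

    mask = (dens >= tau).reshape(grid_size, grid_size)
    labels, num_clusters = label(mask)     # arbitrarily shaped connected contours
    return labels, num_clusters
```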



Figure 6.18: Density-based profile with lower density threshold


Figure 6.19: Density-based profile with higher density threshold



only two clusters. In Fig. 6.19, a higher density threshold is used, and therefore the approach will report three clusters. Note that, if the density threshold is increased further, one or more of the clusters will be completely missed. Such a cluster, whose peak density is lower than the user-specified threshold, is considered a noise cluster, and not reported by the DENCLUE algorithm.


The DENCLUE algorithm uses the notion of density attractors to partition data points into clusters. The idea is to treat each local peak of the density distribution as a density attractor, and associate each data point with its relevant peak by hill climbing toward it. The different peaks that are connected by a path of density at least τ are then merged. For example, in each of Figs. 6.18 and 6.19, there are three density attractors. However, for the density threshold of Fig. 6.18, only two clusters will be discovered because of the merging of a pair of peaks.
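A simple heuristic for this merging condition is sketched below: the straight segment between two attractors is sampled, and the pair qualifies for merging only if the kernel density of Eq. 6.18 stays at or above τ along the entire segment. Sampling a straight segment is a simplification of the general path condition, and the function and parameter names are illustrative.

```python
import numpy as np

def connected_above_tau(peak_a, peak_b, points, h, tau, num_samples=50):
    """Heuristic merge test for two density attractors: sample the segment joining
    them and require the kernel density (Eq. 6.18) to stay at or above tau."""
    points = np.asarray(points, dtype=float)
    d = points.shape[1]
    norm = (1.0 / (h * np.sqrt(2.0 * np.pi))) ** d
    for t in np.linspace(0.0, 1.0, num_samples):
        x = (1.0 - t) * peak_a + t * peak_b
        sq = np.sum((points - x) ** 2, axis=1)
        if norm * np.mean(np.exp(-sq / (2.0 * h ** 2))) < tau:
            return False
    return True
```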


The DENCLUE algorithm uses an iterative gradient ascent approach in which each data point X ∈ D is iteratively updated by using the gradient of the density profile with respect to X. Let X(t) be the updated value of X in the tth iteration. The value of X(t) is updated as follows:





$$
X^{(t+1)} = X^{(t)} + \alpha \nabla f(X^{(t)}). \qquad (6.20)
$$

Here, ∇f (X(t)) denotes the d-dimensional vector of partial derivatives of the kernel density with respect to each coordinate, and α is the step size. The data points are continually updated using the aforementioned rule, until they converge to a local optimum, which will always be one of the density attractors. Therefore, multiple data points may converge to the same density attractor. This creates an implicit clustering of the points, corresponding to the different density attractors (or local peaks). The density at each attractor is computed according to Eq. 6.18. Those attractors whose density does not meet the user-specified threshold τ are excluded because they are deemed to be small “noise” clusters. Furthermore, any pair of clusters whose density attractors are connected to each other by a path of density at least τ will be merged. This step addresses the merging of multiple density peaks, as illustrated in Fig. 6.18, and is analogous to the postprocessing step used in grid-based methods and DBSCAN. The overall DENCLUE algorithm is illustrated in Fig. 6.17.
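A minimal sketch of this hill-climbing procedure is shown below, assuming the Gaussian kernel: each point ascends the gradient of the density (Eq. 6.20), converged points are grouped by the attractor they reach, and attractors with density below τ are discarded as noise. The step size, tolerance, and grouping radius are illustrative choices, and the merging of attractors connected by dense paths is omitted for brevity.

```python
import numpy as np

def denclue_attractors(points, h, tau, alpha=0.1, tol=1e-4, max_iter=200):
    """Assign each point to a density attractor by gradient ascent (Eq. 6.20).

    Returns the attractor coordinates and, for every point, the index of its
    attractor (-1 marks points whose attractor falls below the threshold tau).
    """
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    norm = (1.0 / (h * np.sqrt(2.0 * np.pi))) ** d

    def density_and_gradient(x):
        diff = points - x                                          # (X_i - X)
        k = np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * h ** 2))
        f = norm * np.mean(k)                                      # Eq. 6.18
        grad = norm * np.mean(diff * k[:, None], axis=0) / h ** 2  # Eq. 6.22 up to constants
        return f, grad

    attractors, assignment = [], np.full(n, -1)
    for idx in range(n):
        x = points[idx].copy()
        for _ in range(max_iter):                                  # hill climbing
            _, grad = density_and_gradient(x)
            step = alpha * grad
            x = x + step                                           # Eq. 6.20
            if np.linalg.norm(step) < tol:
                break
        f, _ = density_and_gradient(x)
        if f < tau:                                                # "noise" attractor
            continue
        for a_id, a in enumerate(attractors):                      # group nearby attractors
            if np.linalg.norm(a - x) < h:
                assignment[idx] = a_id
                break
        else:
            attractors.append(x)
            assignment[idx] = len(attractors) - 1
    return np.array(attractors), assignment
```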




One advantage of kernel-density estimation is that the gradient values ∇f (X) can be computed easily using the gradient of the constituent kernel-density values:








$$
\nabla f(X) = \frac{1}{n} \sum_{i=1}^{n} \nabla K(X - X_i). \qquad (6.21)
$$

The precise value of the gradient will depend on the choice of kernel function, though the differences across different choices are often not significant when the number of data points is large. In the particular case of the Gaussian kernel, the gradient can be shown to take on the following special form because of the presence of the negative squared distance in the exponent:





$$
\nabla K(X - X_i) \propto (X_i - X)\, K(X - X_i). \qquad (6.22)
$$




This is because the derivative of an exponential function is itself, and the gradient of the negative squared distance is proportional to (Xi − X). The gradient of the kernel is the product of these two terms. Note that the constant of proportionality in Eq. 6.22 is irrelevant because it is indirectly included in the step size α of the gradient-ascent method.

A different way of determining the local optimum is by setting the gradient ∇f (X ) to 0 as the optimization condition for f (X), and solving the resulting system of equations using an iterative method, but using different starting points corresponding to the various data points. For example, by setting the gradient in Eq. 6.21 for the Gaussian kernel to 0, we obtain the following by substituting Eq. 6.22 in Eq. 6.21:





$$
\sum_{i=1}^{n} X\, K(X - X_i) = \sum_{i=1}^{n} X_i\, K(X - X_i). \qquad (6.23)
$$







This is a nonlinear system of equations in terms of the d coordinates of X and it will have multiple solutions corresponding to different density peaks (or local optima). Such systems of equations can be solved numerically using iterative update methods and the choice of the starting point will yield different peaks. When a particular data point is used as the starting point in the iterations, it will always reach its density attractor. Therefore, one obtains the following modified update rule instead of the gradient ascent method:











$$
X^{(t+1)} = \frac{\sum_{i=1}^{n} X_i\, K(X^{(t)} - X_i)}{\sum_{i=1}^{n} K(X^{(t)} - X_i)}. \qquad (6.24)
$$







This update rule replaces Eq. 6.20 and has a much faster rate of convergence. It is widely known as the mean-shift method; thus, there are close connections between DENCLUE and the mean-shift method. The bibliographic notes contain pointers to this optimized method and the mean-shift method.
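A sketch of this fixed-point iteration (Eq. 6.24) is shown below; it can be substituted for the explicit gradient step in the earlier hill-climbing sketch. The convergence tolerance and iteration cap are illustrative.

```python
import numpy as np

def mean_shift_attractor(x0, points, h, tol=1e-5, max_iter=100):
    """Iterate the fixed-point update of Eq. 6.24 starting from x0."""
    points = np.asarray(points, dtype=float)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        k = np.exp(-np.sum((points - x) ** 2, axis=1) / (2.0 * h ** 2))  # kernel weights
        x_new = (k[:, None] * points).sum(axis=0) / k.sum()              # Eq. 6.24
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```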


The approach requires the computation of the density at each data point, which is O(n). Therefore, the overall computational complexity is O(n2). This computational complexity can be reduced by observing that the density of a data point is mostly influenced only by its neighboring data points, and that the influence of distant data points is relatively small for exponential kernels such as the Gaussian kernel. In such cases, the data is discretized into grids, and the density of a point is computed only on the basis of the data points inside its grid and immediately neighboring grids. Because the grids can be efficiently accessed with the use of an index structure, this implementation is more efficient. Interestingly, the clustering of the DBSCAN method can be shown to be a special case of DENCLUE by using a binary kernel function that takes on the value of 1 within a radius of Eps of a point, and 0 otherwise.
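The grid-based speedup can be approximated as sketched below: points are hashed into cells whose side equals the bandwidth h, and the density at a point is computed only from points in its own cell and the immediately adjacent cells. This is a simplified illustration of the idea rather than the exact indexing scheme used in DENCLUE implementations.

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_grid(points, cell_size):
    """Hash every point index into an integer grid cell of the given side length."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(int(v) for v in np.floor(p / cell_size))].append(i)
    return grid

def local_density(x, points, grid, h):
    """Approximate the KDE of Eq. 6.18 using only the points located in the
    grid cell of x and in its immediately neighboring cells."""
    x = np.asarray(x, dtype=float)
    cell = tuple(int(v) for v in np.floor(x / h))
    d = len(cell)
    idx = []
    for offset in product((-1, 0, 1), repeat=d):          # this cell and its neighbors
        idx.extend(grid.get(tuple(c + o for c, o in zip(cell, offset)), []))
    if not idx:
        return 0.0
    sq = np.sum((points[idx] - x) ** 2, axis=1)
    norm = (1.0 / (h * np.sqrt(2.0 * np.pi))) ** d
    return norm * np.sum(np.exp(-sq / (2.0 * h ** 2))) / len(points)
```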


Practical Issues


The DENCLUE method can be more effective than other density-based methods, when the number of data points is relatively small, and, therefore, a smooth estimate provides a more accurate understanding of the density distribution. DENCLUE is also able to handle data points at the borders of clusters in a more elegant way by using density attractors to attract relevant data points from the fringes of the cluster, even if they have density less than τ . Small groups of noisy data points will be appropriately discarded if their density attractor does not meet the user-specified density threshold τ . The approach also shares many advantages of other density-based algorithms. For example, the approach is able to discover arbitrarily shaped clusters, and it does not require the specification of the number of clusters. On the other hand, as in all density-based methods, it requires the specification of density threshold τ , which is difficult to determine in many real applications. As discussed earlier in the context of Fig. 6.14, local variations of density can be a significant challenge for any density-based algorithm. However, by varying the density threshold τ , it is possible to create a hierarchical dendrogram of clusters. For example, the two different values of τ in Figs. 6.18 and 6.19 will create a natural hierarchical arrangement of the clusters.


6.7 Graph-Based Algorithms


Graph-based methods provide a general meta-framework, in which data of virtually any type can be clustered. As discussed in Chap. 2, data of virtually any type can be converted to similarity graphs for analysis. This transformation is the key that allows the implicit clustering of any data type by performing the clustering on the corresponding transformed graph.


This transformation will be revisited in the following discussion. The notion of pairwise similarity is defined with the use of a neighborhood graph. Consider a set of data objects O = {O1 . . . On}, on which a neighborhood graph can be defined. Note that these objects can be of any type, such as time series or discrete sequences. The main constraint is that it should be possible to define a distance function on these objects. The neighborhood graph is constructed as follows:





  1. A single node is defined for each object in O. This is defined by the node set N , containing n nodes, where the node i corresponds to the object Oi.




  2. An edge exists between Oi and Oj if the distance d(Oi, Oj) is less than a particular threshold ε. A better approach is to compute the k-nearest neighbors of both Oi and Oj, and add an edge when either one is a k-nearest neighbor of the other. The weight wij of the edge (i, j) is equal to a kernelized function of the distance between the objects Oi and Oj, so that larger weights indicate greater similarity. An example is the heat kernel, which is defined in terms of a parameter t:



$$
w_{ij} = e^{-d(O_i, O_j)^2 / t^2}. \qquad (6.25)
$$

For multidimensional data, the Euclidean distance is typically used to instantiate d(Oi, Oj ).


3. (Optional step) This step can be helpful for reducing the impact of local density variations such as those discussed in Fig. 6.14. Note that the quantity deg(i) = Σ_{r=1}^{n} wir can be viewed as a proxy for the local kernel-density estimate near object Oi. Each



Algorithm GraphMetaFramework(Data: D)
begin
    Construct the neighborhood graph G on D;
    Determine clusters (communities) on the nodes in G;
    return clusters corresponding to the node partitions;
end

Figure 6.20: Generic graph-based meta-algorithm


edge weight wij is normalized by dividing it with deg(i) · deg(j). Such an approach ensures that the clustering is performed after normalization of the similarity values with local densities. This step is not essential when algorithms such as normalized spectral clustering are used for finally clustering nodes in the neighborhood graph. This is because spectral clustering methods perform a similar normalization under the covers.


After the neighborhood graph has been constructed, any network clustering or community detection algorithm (cf. Sect. 19.3 of Chap. 19) can be used to cluster the nodes in the neighborhood graph. The clusters on the nodes can be used to map back to clusters on the original data objects. The spectral clustering method, which is a specific instantiation of the final node clustering step, is discussed in some detail below. However, the graph-based approach should be treated as a generic meta-algorithm that can use any community detection algorithm in the final node clustering step. The overall meta-algorithm for graph-based clustering is provided in Fig. 6.20.
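A sketch of this meta-framework is given below: a k-nearest-neighbor graph with heat-kernel weights (Eq. 6.25) is built, and a community detection routine is then applied to its nodes. The use of scikit-learn for nearest-neighbor search and of NetworkX's greedy modularity heuristic for community detection are illustrative choices; any other graph library or node-clustering algorithm could be substituted.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors
from networkx.algorithms.community import greedy_modularity_communities

def neighborhood_graph(X, k=10, t=1.0):
    """Build a k-nearest-neighbor graph with heat-kernel edge weights (Eq. 6.25)."""
    X = np.asarray(X, dtype=float)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)               # position 0 is the point itself
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i in range(len(X)):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):
            G.add_edge(i, int(j), weight=float(np.exp(-d ** 2 / t ** 2)))
    return G

def graph_meta_clustering(X, k=10, t=1.0):
    """Generic meta-algorithm of Fig. 6.20: build the graph, then detect communities."""
    G = neighborhood_graph(X, k, t)
    communities = greedy_modularity_communities(G, weight="weight")
    labels = np.empty(len(X), dtype=int)
    for c_id, nodes in enumerate(communities):
        for node in nodes:
            labels[node] = c_id
    return labels
```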


Let G = (N, A) be the undirected graph with node set N and edge set A, which is created by the aforementioned neighborhood-based transformation. A symmetric n × n weight matrix W defines the corresponding node similarities, based on the specific choice of neighborhood transformation, as in Eq. 6.25. All entries in this matrix are assumed to be non-negative, and higher values indicate greater similarity. If an edge does not exist between a pair of nodes, then the corresponding entry is assumed to be 0. It is desired to embed the nodes of this graph into a k -dimensional space, so that the similarity structure of the data is approximately preserved for the clustering process. This embedding is then used for a second phase of clustering.


First, let us discuss the much simpler problem of mapping the nodes into a 1-dimensional space. The generalization to the k -dimensional case is relatively straightforward. We would like to map the nodes in N into a set of 1-dimensional real values y1 . . . yn on a line, so that the distances between these points reflect the connectivity among the nodes. It is undesirable for nodes that are connected with high-weight edges to be mapped onto distant points on this line. Therefore, we would like to determine values of yi that minimize the following objective function O:





$$
O = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot (y_i - y_j)^2. \qquad (6.26)
$$

This objective function penalizes the distances between yi and yj with weight proportional to wij . Therefore, when wij is very large (more similar nodes), the data points yi and yj will be more likely to be closer to one another in the embedded space. The objective function O can be rewritten in terms of the Laplacian matrix L of the weight matrix W = [wij ].




The Laplacian matrix L is defined as Λ − W, where Λ is a diagonal matrix satisfying Λii = Σ_{j=1}^{n} wij. Let the n-dimensional column vector of embedded values be denoted by y = (y1 . . . yn)^T. It can be shown after some algebraic simplification that the objective function O can be rewritten in terms of the Laplacian matrix:




$$
O = 2\, y^T L\, y. \qquad (6.27)
$$




The Laplacian matrix L is positive semi-definite with non-negative eigenvalues because the sum-of-squares objective function O is always non-negative. We need to incorporate a scaling constraint to ensure that the trivial value of yi = 0 for all i is not selected by the optimization solution. A possible scaling constraint is as follows:








$$
y^T \Lambda\, y = 1. \qquad (6.28)
$$




The presence of Λ in the constraint ensures better local normalization of the embedding. It can be shown using constrained optimization techniques that the optimal solution for y that minimizes the objective function O is equal to the smallest eigenvector of Λ⁻¹L, satisfying the relationship Λ⁻¹Ly = λy. Here, λ is an eigenvalue. However, the smallest eigenvalue of Λ⁻¹L is always 0, and it corresponds to the trivial solution where y is proportional to the vector containing only 1s. This trivial eigenvector is non-informative because it embeds every node to the same point on the line. Therefore, it can be discarded, and it is not used in the analysis. The second-smallest eigenvector then provides an optimal solution that is more informative.


This optimization formulation and the corresponding solution can be generalized to finding an optimal k-dimensional embedding. This is achieved by determining eigenvectors of Λ⁻¹L with successively increasing eigenvalues. After discarding the first trivial eigenvector e1 with eigenvalue λ1 = 0, this results in a set of k eigenvectors e2, e3 . . . ek+1, with corresponding eigenvalues λ2 ≤ λ3 ≤ . . . ≤ λk+1. Each eigenvector is an n-dimensional vector and is scaled to unit norm. The ith component of the jth eigenvector represents the jth coordinate of the ith data point. Because a total of k eigenvectors were selected, this approach creates an n × k matrix, corresponding to a new k-dimensional representation of each of the n data points. A k-means clustering algorithm can then be applied to the transformed representation.
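A sketch of this embedding step is shown below: the Laplacian L = Λ − W is formed from the weight matrix, eigenvectors of Λ⁻¹L with the smallest nonzero eigenvalues are obtained by solving the equivalent generalized problem Ly = λΛy, and k-means is run on the resulting k-dimensional representation. The reliance on scipy and scikit-learn, and the choice to set the embedding dimensionality equal to the number of clusters, are illustrative.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_embedding_clusters(W, k):
    """Embed nodes via eigenvectors of Lambda^{-1} L and cluster them with k-means.

    W : symmetric (n, n) non-negative similarity matrix (e.g., built via Eq. 6.25)
    k : number of clusters, here also the embedding dimensionality
    """
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)
    Lam = np.diag(degrees)                     # the diagonal matrix Lambda
    L = Lam - W                                # the Laplacian L = Lambda - W
    # Eigenvectors of Lambda^{-1} L solve the generalized problem L v = lambda * Lambda v
    eigvals, eigvecs = eigh(L, Lam)            # eigenvalues in ascending order
    embedding = eigvecs[:, 1:k + 1]            # drop the trivial eigenvector, keep the next k
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```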


Why is the transformed representation more suitable for an off-the-shelf k-means algorithm than the original data? It is important to note that the spherical clusters naturally found by the Euclidean-based k-means in the new embedded space may correspond to arbitrarily shaped clusters in the original space. As discussed in the next section, this behavior is a direct result of the way in which the similarity graph and objective function O are defined. This is also one of the main advantages of using a transformation to similarity graphs. For example, if the approach is applied to the arbitrarily shaped clusters in Fig. 6.11, the similarity graph will be such that a k-means algorithm on the transformed data (or a community detection algorithm on the similarity graph) will typically result in the correct arbitrarily-shaped clusters in the original space. Many variations of the spectral approach are discussed in detail in Sect. 19.3.4 of Chap. 19.


6.7.1 Properties of Graph-Based Algorithms



One interesting property of graph-based algorithms is that clusters of arbitrary shape can be discovered with the approach. This is because the neighborhood graph encodes the relevant local distances (or k-nearest neighbors), and therefore the communities in the induced







