CONTENTS
|
|
|
xi
|
|
|
6.2.1.3
|
Entropy . . . . . . . . . . . . . . . . . . . . . . . . . .
|
156
|
|
|
6.2.1.4
|
Hopkins Statistic . . . . . . . . . . . . . . . . . . . . .
|
157
|
|
6.2.2
|
Wrapper Models . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
158
|
6.3
|
Representative-Based Algorithms . . . . . . . . . . . . . . . . . . . . . . .
|
159
|
|
6.3.1
|
The k-Means Algorithm . . . . . . . . . . . . . . . . . . . . . . .
|
162
|
|
6.3.2
|
The Kernel k-Means Algorithm . . . . . . . . . . . . . . . . . . .
|
163
|
|
6.3.3
|
The k-Medians Algorithm . . . . . . . . . . . . . . . . . . . . . .
|
164
|
|
6.3.4
|
The k-Medoids Algorithm . . . . . . . . . . . . . . . . . . . . . .
|
164
|
6.4
|
Hierarchical Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . .
|
166
|
|
6.4.1
|
Bottom-Up Agglomerative Methods . . . . . . . . . . . . . . . . .
|
167
|
|
|
6.4.1.1
|
Group-Based Statistics . . . . . . . . . . . . . . . . . .
|
169
|
|
6.4.2
|
Top-Down Divisive Methods . . . . . . . . . . . . . . . . . . . . .
|
172
|
|
|
6.4.2.1
|
Bisecting k-Means . . . . . . . . . . . . . . . . . . . . .
|
173
|
6.5
|
Probabilistic Model-Based Algorithms . . . . . . . . . . . . . . . . . . . .
|
173
|
|
6.5.1
|
Relationship of EM to k-means and Other Representative
|
|
|
|
Methods
|
................................
|
176
|
6.6
|
Grid-Based and Density-Based Algorithms . . . . . . . . . . . . . . . . . .
|
178
|
|
6.6.1
|
Grid-Based Methods . . . . . . . . . . . . . . . . . . . . . . . . .
|
179
|
|
6.6.2
|
DBSCAN
|
...............................
|
181
|
|
6.6.3
|
DENCLUE...............................
|
184
|
6.7
|
Graph-Based Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
187
|
|
6.7.1
|
Properties of Graph-Based Algorithms . . . . . . . . . . . . . . .
|
189
|
6.8
|
Non-negative Matrix Factorization . . . . . . . . . . . . . . . . . . . . . .
|
191
|
|
6.8.1
|
Comparison with Singular Value Decomposition . . . . . . . . . .
|
194
|
6.9
|
Cluster Validation
|
................................
|
195
|
|
6.9.1
|
Internal Validation Criteria . . . . . . . . . . . . . . . . . . . . . .
|
196
|
|
|
6.9.1.1
|
Parameter Tuning with Internal Measures . . . . . . .
|
198
|
|
6.9.2
|
External Validation Criteria . . . . . . . . . . . . . . . . . . . . .
|
198
|
|
6.9.3
|
General Comments . . . . . . . . . . . . . . . . . . . . . . . . . .
|
201
|
6.10
|
Summary . . . .
|
................................
|
201
|
6.11
|
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
201
|
6.12
|
Exercises . . . . .
|
................................
|
202
|
7 Cluster Analysis: Advanced Concepts
|
205
|
7.1
|
Introduction . . .
|
................................
|
205
|
7.2
|
Clustering Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . .
|
206
|
|
7.2.1
|
Representative-Based Algorithms . . . . . . . . . . . . . . . . . .
|
207
|
|
|
7.2.1.1
|
k-Modes Clustering . . . . . . . . . . . . . . . . . . . .
|
208
|
|
|
7.2.1.2
|
k-Medoids Clustering . . . . . . . . . . . . . . . . . . .
|
209
|
|
7.2.2
|
Hierarchical Algorithms . . . . . . . . . . . . . . . . . . . . . . . .
|
209
|
|
|
7.2.2.1
|
ROCK ...........................
|
209
|
|
7.2.3
|
Probabilistic Algorithms . . . . . . . . . . . . . . . . . . . . . . .
|
211
|
|
7.2.4
|
Graph-Based Algorithms . . . . . . . . . . . . . . . . . . . . . . .
|
212
|
7.3
|
Scalable Data Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
212
|
|
7.3.1
|
CLARANS...............................
|
213
|
|
7.3.2
|
BIRCH .
|
................................
|
214
|
|
7.3.3
|
CURE .
|
................................
|
216
|
7.4
|
High-Dimensional Clustering . . . . . . . . . . . . . . . . . . . . . . . . . .
|
217
|
|
7.4.1
|
CLIQUE
|
................................
|
219
|
|
7.4.2
|
PROCLUS...............................
|
220
|
xii
|
|
|
|
CONTENTS
|
|
7.4.3
|
ORCLUS
|
...........................
|
..... 222
|
7.5
|
Semisupervised Clustering . . . . . . . . . . . . . . . . . . . . . .
|
..... 224
|
|
7.5.1
|
Pointwise Supervision . . . . . . . . . . . . . . . . . . . .
|
..... 225
|
|
7.5.2
|
Pairwise Supervision . . . . . . . . . . . . . . . . . . . .
|
..... 226
|
7.6
|
Human and Visually Supervised Clustering . . . . . . . . . . . .
|
..... 227
|
|
7.6.1
|
Modifications of Existing Clustering Algorithms . . . . .
|
..... 228
|
|
7.6.2
|
Visual Clustering . . . . . . . . . . . . . . . . . . . . . .
|
..... 228
|
7.7
|
Cluster Ensembles
|
..........................
|
..... 231
|
|
7.7.1
|
Selecting Different Ensemble Components . . . . . . . .
|
..... 231
|
|
7.7.2
|
Combining Different Ensemble Components . . . . . . .
|
..... 232
|
|
|
7.7.2.1
|
Hypergraph Partitioning Algorithm . . . . . .
|
..... 232
|
|
|
7.7.2.2
|
Meta-clustering Algorithm . . . . . . . . . . .
|
..... 232
|
7.8
|
Putting Clustering to Work: Applications . . . . . . . . . . . . .
|
..... 233
|
|
7.8.1
|
Applications to Other Data Mining Problems . . . . . .
|
..... 233
|
|
|
7.8.1.1
|
Data Summarization . . . . . . . . . . . . . .
|
..... 233
|
|
|
7.8.1.2
|
Outlier Analysis . . . . . . . . . . . . . . . . .
|
..... 233
|
|
|
7.8.1.3
|
Classification . . . . . . . . . . . . . . . . . . .
|
..... 233
|
|
|
7.8.1.4
|
Dimensionality Reduction . . . . . . . . . . .
|
..... 234
|
|
|
7.8.1.5
|
Similarity Search and Indexing . . . . . . . . .
|
..... 234
|
|
7.8.2
|
Customer Segmentation and Collaborative Filtering . . .
|
..... 234
|
|
7.8.3
|
Text Applications . . . . . . . . . . . . . . . . . . . . . .
|
..... 234
|
|
7.8.4
|
Multimedia Applications . . . . . . . . . . . . . . . . . .
|
..... 234
|
|
7.8.5
|
Temporal and Sequence Applications . . . . . . . . . . .
|
..... 234
|
|
7.8.6
|
Social Network Analysis . . . . . . . . . . . . . . . . . .
|
..... 235
|
7.9
|
Summary . . . .
|
...........................
|
..... 235
|
7.10
|
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 235
|
7.11
|
Exercises . . . . .
|
...........................
|
..... 236
|
8 Outlier Analysis
|
|
237
|
8.1
|
Introduction . . .
|
...........................
|
..... 237
|
8.2
|
Extreme Value Analysis . . . . . . . . . . . . . . . . . . . . . . .
|
..... 239
|
|
8.2.1
|
Univariate Extreme Value Analysis . . . . . . . . . . . .
|
..... 240
|
|
8.2.2
|
Multivariate Extreme Values . . . . . . . . . . . . . . . .
|
..... 242
|
|
8.2.3
|
Depth-Based Methods . . . . . . . . . . . . . . . . . . .
|
..... 243
|
8.3
|
Probabilistic Models . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 244
|
8.4
|
Clustering for Outlier Detection . . . . . . . . . . . . . . . . . . .
|
..... 246
|
8.5
|
Distance-Based Outlier Detection . . . . . . . . . . . . . . . . . .
|
..... 248
|
|
8.5.1
|
Pruning Methods . . . . . . . . . . . . . . . . . . . . . .
|
..... 249
|
|
|
8.5.1.1
|
Sampling Methods . . . . . . . . . . . . . . . .
|
..... 249
|
|
|
8.5.1.2
|
Early Termination Trick with Nested Loops .
|
..... 250
|
|
8.5.2
|
Local Distance Correction Methods . . . . . . . . . . . .
|
..... 251
|
|
|
8.5.2.1
|
Local Outlier Factor (LOF) . . . . . . . . . .
|
..... 252
|
|
|
8.5.2.2
|
Instance-Specific Mahalanobis Distance . . . .
|
..... 254
|
8.6
|
Density-Based Methods . . . . . . . . . . . . . . . . . . . . . . .
|
..... 255
|
|
8.6.1
|
Histogram- and Grid-Based Techniques . . . . . . . . . .
|
..... 255
|
|
8.6.2
|
Kernel Density Estimation . . . . . . . . . . . . . . . . .
|
..... 256
|
8.7
|
Information-Theoretic Models . . . . . . . . . . . . . . . . . . . .
|
..... 256
|
8.8
|
Outlier Validity .
|
...........................
|
..... 258
|
|
8.8.1
|
Methodological Challenges . . . . . . . . . . . . . . . . .
|
..... 258
|