CONTENTS
|
|
|
ix
|
|
|
3.2.1.7
|
Nonlinear Distributions: ISOMAP . . . . . . . . . . . .
|
70
|
|
|
3.2.1.8
|
Impact of Local Data Distribution . . . . . . . . . . . .
|
72
|
|
|
3.2.1.9
|
Computational Considerations . . . . . . . . . . . . . .
|
73
|
|
3.2.2
|
Categorical Data . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
74
|
|
3.2.3
|
Mixed Quantitative and Categorical Data . . . . . . . . . . . . .
|
75
|
3.3
|
Text Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
75
|
|
3.3.1
|
Binary and Set Data . . . . . . . . . . . . . . . . . . . . . . . . .
|
77
|
3.4
|
Temporal Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . . .
|
77
|
|
3.4.1
|
Time-Series Similarity Measures . . . . . . . . . . . . . . . . . . .
|
77
|
|
|
3.4.1.1
|
Impact of Behavioral Attribute Normalization . . . . .
|
78
|
|
|
3.4.1.2
|
Lp-Norm . . . . . . . . . . . . . . . . . . . . . . . . . .
|
79
|
|
|
3.4.1.3
|
Dynamic Time Warping Distance . . . . . . . . . . . .
|
79
|
|
|
3.4.1.4
|
Window-Based Methods . . . . . . . . . . . . . . . . .
|
82
|
|
3.4.2
|
Discrete Sequence Similarity Measures . . . . . . . . . . . . . . .
|
82
|
|
|
3.4.2.1
|
Edit Distance . . . . . . . . . . . . . . . . . . . . . . .
|
82
|
|
|
3.4.2.2
|
Longest Common Subsequence . . . . . . . . . . . . . .
|
84
|
3.5
|
Graph Similarity Measures . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
85
|
|
3.5.1
|
Similarity between Two Nodes in a Single Graph . . . . . . . . .
|
85
|
|
|
3.5.1.1
|
Structural Distance-Based Measure . . . . . . . . . . .
|
85
|
|
|
3.5.1.2
|
Random Walk-Based Similarity . . . . . . . . . . . . .
|
86
|
|
3.5.2
|
Similarity Between Two Graphs . . . . . . . . . . . . . . . . . . .
|
86
|
3.6
|
Supervised Similarity Functions . . . . . . . . . . . . . . . . . . . . . . . .
|
87
|
3.7
|
Summary . . . .
|
................................
|
88
|
3.8
|
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
89
|
3.9
|
Exercises . . . . .
|
................................
|
90
|
4 Association Pattern Mining
|
93
|
4.1
|
Introduction . . .
|
................................
|
93
|
4.2
|
The Frequent Pattern Mining Model . . . . . . . . . . . . . . . . . . . . .
|
94
|
4.3
|
Association Rule Generation Framework . . . . . . . . . . . . . . . . . . .
|
97
|
4.4
|
Frequent Itemset Mining Algorithms . . . . . . . . . . . . . . . . . . . . .
|
99
|
|
4.4.1
|
Brute Force Algorithms . . . . . . . . . . . . . . . . . . . . . . . .
|
99
|
|
4.4.2
|
The Apriori Algorithm . . . . . . . . . . . . . . . . . . . . . . . .
|
100
|
|
|
4.4.2.1
|
Efficient Support Counting . . . . . . . . . . . . . . . .
|
102
|
|
4.4.3
|
Enumeration-Tree Algorithms . . . . . . . . . . . . . . . . . . . .
|
103
|
|
|
4.4.3.1
|
Enumeration-Tree-Based Interpretation of Apriori . . .
|
105
|
|
|
4.4.3.2
|
TreeProjection and DepthProject . . . . . . . . . . . .
|
106
|
|
|
4.4.3.3
|
Vertical Counting Methods . . . . . . . . . . . . . . . .
|
110
|
|
4.4.4
|
Recursive Suffix-Based Pattern Growth Methods . . . . . . . . . .
|
112
|
|
|
4.4.4.1
|
Implementation with Arrays but No Pointers . . . . . .
|
114
|
|
|
4.4.4.2
|
Implementation with Pointers but No FP-Tree . . . . .
|
114
|
|
|
4.4.4.3
|
Implementation with Pointers and FP-Tree . . . . . . .
|
116
|
|
|
4.4.4.4
|
Trade-offs with Different Data Structures . . . . . . . .
|
118
|
|
|
4.4.4.5
|
Relationship Between FP-Growth and Enumeration-
|
|
|
|
|
Tree Methods . . . . . . . . . . . . . . . . . . . . . . .
|
119
|
4.5
|
Alternative Models: Interesting Patterns . . . . . . . . . . . . . . . . . . .
|
122
|
|
4.5.1
|
Statistical Coefficient of Correlation . . . . . . . . . . . . . . . . .
|
123
|
|
4.5.2
|
χ2 Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
123
|
|
4.5.3
|
Interest Ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
124
|
x
|
|
|
|
CONTENTS
|
|
4.5.4
|
Symmetric Confidence Measures . . . . . . . . . . . . . .
|
..... 124
|
|
4.5.5
|
Cosine Coefficient on Columns . . . . . . . . . . . . . . .
|
..... 125
|
|
4.5.6
|
Jaccard Coefficient and the Min-hash Trick . . . . . . . .
|
..... 125
|
|
4.5.7
|
Collective Strength . . . . . . . . . . . . . . . . . . . . .
|
..... 126
|
|
4.5.8
|
Relationship to Negative Pattern Mining . . . . . . . . .
|
..... 127
|
4.6
|
Useful Meta-algorithms . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 127
|
|
4.6.1
|
Sampling Methods . . . . . . . . . . . . . . . . . . . . .
|
..... 128
|
|
4.6.2
|
Data Partitioned Ensembles . . . . . . . . . . . . . . . .
|
..... 128
|
|
4.6.3
|
Generalization to Other Data Types . . . . . . . . . . .
|
..... 129
|
|
|
4.6.3.1
|
Quantitative Data . . . . . . . . . . . . . . . .
|
..... 129
|
|
|
4.6.3.2
|
Categorical Data . . . . . . . . . . . . . . . .
|
..... 129
|
4.7
|
Summary . . . .
|
...........................
|
..... 129
|
4.8
|
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 130
|
4.9
|
Exercises . . . . .
|
...........................
|
..... 132
|
5 Association Pattern Mining: Advanced Concepts
|
135
|
5.1
|
Introduction . . .
|
...........................
|
..... 135
|
5.2
|
Pattern Summarization . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 136
|
|
5.2.1
|
Maximal Patterns . . . . . . . . . . . . . . . . . . . . . .
|
..... 136
|
|
5.2.2
|
Closed Patterns . . . . . . . . . . . . . . . . . . . . . . .
|
..... 137
|
|
5.2.3
|
Approximate Frequent Patterns . . . . . . . . . . . . . .
|
..... 139
|
|
|
5.2.3.1
|
Approximation in Terms of Transactions . . .
|
..... 139
|
|
|
5.2.3.2
|
Approximation in Terms of Itemsets . . . . . .
|
..... 140
|
5.3
|
Pattern Querying
|
...........................
|
..... 141
|
|
5.3.1
|
Preprocess-once Query-many Paradigm . . . . . . . . . .
|
..... 141
|
|
|
5.3.1.1
|
Leveraging the Itemset Lattice . . . . . . . . .
|
..... 142
|
|
|
5.3.1.2
|
Leveraging Data Structures for Querying . . .
|
..... 143
|
|
5.3.2
|
Pushing Constraints into Pattern Mining . . . . . . . . .
|
..... 146
|
5.4
|
Putting Associations to Work: Applications . . . . . . . . . . . .
|
..... 147
|
|
5.4.1
|
Relationship to Other Data Mining Problems . . . . . .
|
..... 147
|
|
|
5.4.1.1
|
Application to Classification . . . . . . . . . .
|
..... 147
|
|
|
5.4.1.2
|
Application to Clustering . . . . . . . . . . . .
|
..... 148
|
|
|
5.4.1.3
|
Applications to Outlier Detection . . . . . . .
|
..... 148
|
|
5.4.2
|
Market Basket Analysis . . . . . . . . . . . . . . . . . . .
|
..... 148
|
|
5.4.3
|
Demographic and Profile Analysis . . . . . . . . . . . . .
|
..... 148
|
|
5.4.4
|
Recommendations and Collaborative Filtering . . . . . .
|
..... 149
|
|
5.4.5
|
Web Log Analysis . . . . . . . . . . . . . . . . . . . . . .
|
..... 149
|
|
5.4.6
|
Bioinformatics . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 149
|
|
5.4.7
|
Other Applications for Complex Data Types . . . . . . .
|
..... 150
|
5.5
|
Summary . . . .
|
...........................
|
..... 150
|
5.6
|
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 151
|
5.7
|
Exercises . . . . .
|
...........................
|
..... 152
|
6 Cluster Analysis
|
|
153
|
6.1
|
Introduction . . .
|
...........................
|
..... 153
|
6.2
|
Feature Selection for Clustering . . . . . . . . . . . . . . . . . . .
|
..... 154
|
|
6.2.1
|
Filter Models . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 155
|
|
|
6.2.1.1
|
Term Strength . . . . . . . . . . . . . . . . . .
|
..... 155
|
|
|
6.2.1.2
|
Predictive Attribute Dependence . . . . . . .
|
..... 155
|