CONTENTS
|
|
|
xv
|
|
|
11.3.2.1 Relationship Between Weighting and Sampling . . . . .
|
350
|
|
|
11.3.2.2
|
Synthetic Oversampling: SMOTE . . . . . . . . . . . .
|
350
|
11.4
|
Scalable Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
350
|
|
11.4.1
|
Scalable Decision Trees . . . . . . . . . . . . . . . . . . . . . . . .
|
351
|
|
|
11.4.1.1
|
RainForest . . . . . . . . . . . . . . . . . . . . . . . . .
|
351
|
|
|
11.4.1.2
|
BOAT ...........................
|
351
|
|
11.4.2
|
Scalable Support Vector Machines . . . . . . . . . . . . . . . . . .
|
352
|
11.5
|
Regression Modeling with Numeric Classes . . . . . . . . . . . . . . . . . .
|
353
|
|
11.5.1
|
Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
353
|
|
|
11.5.1.1 Relationship with Fisher’s Linear Discriminant . . . . .
|
356
|
|
11.5.2
|
Principal Component Regression . . . . . . . . . . . . . . . . . . .
|
356
|
|
11.5.3
|
Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . .
|
357
|
|
11.5.4
|
Nonlinear and Polynomial Regression . . . . . . . . . . . . . . . .
|
359
|
|
11.5.5
|
From Decision Trees to Regression Trees . . . . . . . . . . . . . .
|
360
|
|
11.5.6
|
Assessing Model Effectiveness . . . . . . . . . . . . . . . . . . . .
|
361
|
11.6
|
Semisupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
361
|
|
11.6.1
|
Generic Meta-algorithms . . . . . . . . . . . . . . . . . . . . . . .
|
363
|
|
|
11.6.1.1 Self-Training . . . . . . . . . . . . . . . . . . . . . . . .
|
363
|
|
|
11.6.1.2 Co-training . . . . . . . . . . . . . . . . . . . . . . . . .
|
363
|
|
11.6.2
|
Specific Variations of Classification Algorithms . . . . . . . . . . .
|
364
|
|
|
11.6.2.1 Semisupervised Bayes Classification with EM . . . . . .
|
364
|
|
|
11.6.2.2 Transductive Support Vector Machines . . . . . . . . .
|
366
|
|
11.6.3
|
Graph-Based Semisupervised Learning . . . . . . . . . . . . . . .
|
367
|
|
11.6.4
|
Discussion of Semisupervised Learning . . . . . . . . . . . . . . .
|
367
|
11.7
|
Active Learning .
|
................................
|
368
|
|
11.7.1
|
Heterogeneity-Based Models . . . . . . . . . . . . . . . . . . . . .
|
370
|
|
|
11.7.1.1
|
Uncertainty Sampling . . . . . . . . . . . . . . . . . . .
|
370
|
|
|
11.7.1.2 Query-by-Committee . . . . . . . . . . . . . . . . . . .
|
371
|
|
|
11.7.1.3
|
Expected Model Change . . . . . . . . . . . . . . . . .
|
371
|
|
11.7.2
|
Performance-Based Models . . . . . . . . . . . . . . . . . . . . . .
|
372
|
|
|
11.7.2.1
|
Expected Error Reduction . . . . . . . . . . . . . . . .
|
372
|
|
|
11.7.2.2
|
Expected Variance Reduction . . . . . . . . . . . . . .
|
373
|
|
11.7.3
|
Representativeness-Based Models . . . . . . . . . . . . . . . . . .
|
373
|
11.8
|
Ensemble Methods
|
...............................
|
373
|
|
11.8.1
|
Why Does Ensemble Analysis Work? . . . . . . . . . . . . . . . .
|
375
|
|
11.8.2
|
Formal Statement of Bias-Variance Trade-off . . . . . . . . . . . .
|
377
|
|
11.8.3
|
Specific Instantiations of Ensemble Learning . . . . . . . . . . . .
|
379
|
|
|
11.8.3.1
|
Bagging . . . . . . . . . . . . . . . . . . . . . . . . . .
|
379
|
|
|
11.8.3.2
|
Random Forests . . . . . . . . . . . . . . . . . . . . . .
|
380
|
|
|
11.8.3.3
|
Boosting . . . . . . . . . . . . . . . . . . . . . . . . . .
|
381
|
|
|
11.8.3.4
|
Bucket of Models . . . . . . . . . . . . . . . . . . . . .
|
383
|
|
|
11.8.3.5
|
Stacking . . . . . . . . . . . . . . . . . . . . . . . . . .
|
384
|
11.9
|
Summary . . . .
|
................................
|
384
|
11.10
|
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
385
|
11.11
|
Exercises . . . . .
|
................................
|
386
|
xvi
|
|
|
|
CONTENTS
|
12 Mining Data Streams
|
|
389
|
12.1
|
Introduction . . .
|
............................
|
.... 389
|
12.2
|
Synopsis Data Structures for Streams . . . . . . . . . . . . . . . . .
|
.... 391
|
|
12.2.1
|
Reservoir Sampling . . . . . . . . . . . . . . . . . . . . . .
|
.... 391
|
|
|
12.2.1.1
|
Handling Concept Drift . . . . . . . . . . . . . .
|
.... 393
|
|
|
12.2.1.2 Useful Theoretical Bounds for Sampling . . . . .
|
.... 394
|
|
12.2.2
|
Synopsis Structures for the Massive-Domain Scenario . . .
|
.... 398
|
|
|
12.2.2.1
|
Bloom Filter . . . . . . . . . . . . . . . . . . . .
|
.... 399
|
|
|
12.2.2.2
|
Count-Min Sketch . . . . . . . . . . . . . . . . .
|
.... 403
|
|
|
12.2.2.3
|
AMS Sketch . . . . . . . . . . . . . . . . . . . .
|
.... 406
|
|
|
12.2.2.4 Flajolet–Martin Algorithm for Distinct Element
|
|
|
|
|
Counting . . . . . . . . . . . . . . . . . . . . . .
|
.... 408
|
12.3
|
Frequent Pattern Mining in Data Streams . . . . . . . . . . . . . .
|
.... 409
|
|
12.3.1
|
Leveraging Synopsis Structures . . . . . . . . . . . . . . .
|
.... 409
|
|
|
12.3.1.1
|
Reservoir Sampling . . . . . . . . . . . . . . . .
|
.... 410
|
|
|
12.3.1.2
|
Sketches . . . . . . . . . . . . . . . . . . . . . .
|
.... 410
|
|
12.3.2
|
Lossy Counting Algorithm . . . . . . . . . . . . . . . . . .
|
.... 410
|
12.4
|
Clustering Data Streams . . . . . . . . . . . . . . . . . . . . . . . .
|
.... 411
|
|
12.4.1
|
STREAM Algorithm . . . . . . . . . . . . . . . . . . . . .
|
.... 411
|
|
12.4.2
|
CluStream Algorithm . . . . . . . . . . . . . . . . . . . . .
|
.... 413
|
|
|
12.4.2.1
|
Microcluster Definition . . . . . . . . . . . . . .
|
.... 413
|
|
|
12.4.2.2
|
Microclustering Algorithm . . . . . . . . . . . .
|
.... 414
|
|
|
12.4.2.3
|
Pyramidal Time Frame . . . . . . . . . . . . . .
|
.... 415
|
|
12.4.3
|
Massive-Domain Stream Clustering . . . . . . . . . . . . .
|
.... 417
|
12.5
|
Streaming Outlier Detection . . . . . . . . . . . . . . . . . . . . . .
|
.... 417
|
|
12.5.1
|
Individual Data Points as Outliers . . . . . . . . . . . . . .
|
.... 418
|
|
12.5.2
|
Aggregate Change Points as Outliers . . . . . . . . . . . .
|
.... 419
|
12.6
|
Streaming Classification . . . . . . . . . . . . . . . . . . . . . . . .
|
.... 421
|
|
12.6.1
|
VFDT Family . . . . . . . . . . . . . . . . . . . . . . . . .
|
.... 421
|
|
12.6.2
|
Supervised Microcluster Approach . . . . . . . . . . . . . .
|
.... 424
|
|
12.6.3
|
Ensemble Method . . . . . . . . . . . . . . . . . . . . . . .
|
.... 424
|
|
12.6.4
|
Massive-Domain Streaming Classification . . . . . . . . . .
|
.... 425
|
12.7
|
Summary . . . .
|
............................
|
.... 425
|
12.8
|
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
.... 425
|
12.9
|
Exercises . . . . .
|
............................
|
.... 426
|
13 Mining Text Data
|
|
429
|
13.1
|
Introduction . . .
|
............................
|
.... 429
|
13.2
|
Document Preparation and Similarity
|
|
|
Computation . .
|
............................
|
.... 431
|
|
13.2.1
|
Document Normalization and Similarity Computation . . .
|
.... 432
|
|
13.2.2
|
Specialized Preprocessing for Web Documents . . . . . . .
|
.... 433
|
13.3
|
Specialized Clustering Methods for Text . . . . . . . . . . . . . . .
|
.... 434
|
|
13.3.1
|
Representative-Based Algorithms . . . . . . . . . . . . . .
|
.... 434
|
|
|
13.3.1.1
|
Scatter/Gather Approach . . . . . . . . . . . . .
|
.... 434
|
|
13.3.2
|
Probabilistic Algorithms . . . . . . . . . . . . . . . . . . .
|
.... 436
|
|
13.3.3
|
Simultaneous Document and Word Cluster Discovery . . .
|
.... 438
|
|
|
13.3.3.1 Co-clustering . . . . . . . . . . . . . . . . . . . .
|
.... 438
|
13.4
|
Topic Modeling .
|
............................
|
.... 440
|