Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə5/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   2   3   4   5   6   7   8   9   ...   423
1-Data Mining tarjima

CONTENTS







xv







11.3.2.1 Relationship Between Weighting and Sampling . . . . .

350







11.3.2.2

Synthetic Oversampling: SMOTE . . . . . . . . . . . .

350

11.4

Scalable Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

350




11.4.1

Scalable Decision Trees . . . . . . . . . . . . . . . . . . . . . . . .

351







11.4.1.1

RainForest . . . . . . . . . . . . . . . . . . . . . . . . .

351







11.4.1.2

BOAT ...........................

351




11.4.2

Scalable Support Vector Machines . . . . . . . . . . . . . . . . . .

352

11.5

Regression Modeling with Numeric Classes . . . . . . . . . . . . . . . . . .

353




11.5.1

Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . .

353







11.5.1.1 Relationship with Fisher’s Linear Discriminant . . . . .

356




11.5.2

Principal Component Regression . . . . . . . . . . . . . . . . . . .

356




11.5.3

Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . .

357




11.5.4

Nonlinear and Polynomial Regression . . . . . . . . . . . . . . . .

359




11.5.5

From Decision Trees to Regression Trees . . . . . . . . . . . . . .

360




11.5.6

Assessing Model Effectiveness . . . . . . . . . . . . . . . . . . . .

361

11.6

Semisupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . .

361




11.6.1

Generic Meta-algorithms . . . . . . . . . . . . . . . . . . . . . . .

363







11.6.1.1 Self-Training . . . . . . . . . . . . . . . . . . . . . . . .

363







11.6.1.2 Co-training . . . . . . . . . . . . . . . . . . . . . . . . .

363




11.6.2

Specific Variations of Classification Algorithms . . . . . . . . . . .

364







11.6.2.1 Semisupervised Bayes Classification with EM . . . . . .

364







11.6.2.2 Transductive Support Vector Machines . . . . . . . . .

366




11.6.3

Graph-Based Semisupervised Learning . . . . . . . . . . . . . . .

367




11.6.4

Discussion of Semisupervised Learning . . . . . . . . . . . . . . .

367

11.7

Active Learning .

................................

368




11.7.1

Heterogeneity-Based Models . . . . . . . . . . . . . . . . . . . . .

370







11.7.1.1

Uncertainty Sampling . . . . . . . . . . . . . . . . . . .

370







11.7.1.2 Query-by-Committee . . . . . . . . . . . . . . . . . . .

371







11.7.1.3

Expected Model Change . . . . . . . . . . . . . . . . .

371




11.7.2

Performance-Based Models . . . . . . . . . . . . . . . . . . . . . .

372







11.7.2.1

Expected Error Reduction . . . . . . . . . . . . . . . .

372







11.7.2.2

Expected Variance Reduction . . . . . . . . . . . . . .

373




11.7.3

Representativeness-Based Models . . . . . . . . . . . . . . . . . .

373

11.8

Ensemble Methods

...............................

373




11.8.1

Why Does Ensemble Analysis Work? . . . . . . . . . . . . . . . .

375




11.8.2

Formal Statement of Bias-Variance Trade-off . . . . . . . . . . . .

377




11.8.3

Specific Instantiations of Ensemble Learning . . . . . . . . . . . .

379







11.8.3.1

Bagging . . . . . . . . . . . . . . . . . . . . . . . . . .

379







11.8.3.2

Random Forests . . . . . . . . . . . . . . . . . . . . . .

380







11.8.3.3

Boosting . . . . . . . . . . . . . . . . . . . . . . . . . .

381







11.8.3.4

Bucket of Models . . . . . . . . . . . . . . . . . . . . .

383







11.8.3.5

Stacking . . . . . . . . . . . . . . . . . . . . . . . . . .

384

11.9

Summary . . . .

................................

384

11.10

Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

385

11.11

Exercises . . . . .

................................

386




xvi










CONTENTS

12 Mining Data Streams




389

12.1

Introduction . . .

............................

.... 389

12.2

Synopsis Data Structures for Streams . . . . . . . . . . . . . . . . .

.... 391




12.2.1

Reservoir Sampling . . . . . . . . . . . . . . . . . . . . . .

.... 391







12.2.1.1

Handling Concept Drift . . . . . . . . . . . . . .

.... 393







12.2.1.2 Useful Theoretical Bounds for Sampling . . . . .

.... 394




12.2.2

Synopsis Structures for the Massive-Domain Scenario . . .

.... 398







12.2.2.1

Bloom Filter . . . . . . . . . . . . . . . . . . . .

.... 399







12.2.2.2

Count-Min Sketch . . . . . . . . . . . . . . . . .

.... 403







12.2.2.3

AMS Sketch . . . . . . . . . . . . . . . . . . . .

.... 406







12.2.2.4 Flajolet–Martin Algorithm for Distinct Element













Counting . . . . . . . . . . . . . . . . . . . . . .

.... 408

12.3

Frequent Pattern Mining in Data Streams . . . . . . . . . . . . . .

.... 409




12.3.1

Leveraging Synopsis Structures . . . . . . . . . . . . . . .

.... 409







12.3.1.1

Reservoir Sampling . . . . . . . . . . . . . . . .

.... 410







12.3.1.2

Sketches . . . . . . . . . . . . . . . . . . . . . .

.... 410




12.3.2

Lossy Counting Algorithm . . . . . . . . . . . . . . . . . .

.... 410

12.4

Clustering Data Streams . . . . . . . . . . . . . . . . . . . . . . . .

.... 411




12.4.1

STREAM Algorithm . . . . . . . . . . . . . . . . . . . . .

.... 411




12.4.2

CluStream Algorithm . . . . . . . . . . . . . . . . . . . . .

.... 413







12.4.2.1

Microcluster Definition . . . . . . . . . . . . . .

.... 413







12.4.2.2

Microclustering Algorithm . . . . . . . . . . . .

.... 414







12.4.2.3

Pyramidal Time Frame . . . . . . . . . . . . . .

.... 415




12.4.3

Massive-Domain Stream Clustering . . . . . . . . . . . . .

.... 417

12.5

Streaming Outlier Detection . . . . . . . . . . . . . . . . . . . . . .

.... 417




12.5.1

Individual Data Points as Outliers . . . . . . . . . . . . . .

.... 418




12.5.2

Aggregate Change Points as Outliers . . . . . . . . . . . .

.... 419

12.6

Streaming Classification . . . . . . . . . . . . . . . . . . . . . . . .

.... 421




12.6.1

VFDT Family . . . . . . . . . . . . . . . . . . . . . . . . .

.... 421




12.6.2

Supervised Microcluster Approach . . . . . . . . . . . . . .

.... 424




12.6.3

Ensemble Method . . . . . . . . . . . . . . . . . . . . . . .

.... 424




12.6.4

Massive-Domain Streaming Classification . . . . . . . . . .

.... 425

12.7

Summary . . . .

............................

.... 425

12.8

Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . .

.... 425

12.9

Exercises . . . . .

............................

.... 426

13 Mining Text Data




429

13.1

Introduction . . .

............................

.... 429

13.2

Document Preparation and Similarity







Computation . .

............................

.... 431




13.2.1

Document Normalization and Similarity Computation . . .

.... 432




13.2.2

Specialized Preprocessing for Web Documents . . . . . . .

.... 433

13.3

Specialized Clustering Methods for Text . . . . . . . . . . . . . . .

.... 434




13.3.1

Representative-Based Algorithms . . . . . . . . . . . . . .

.... 434







13.3.1.1

Scatter/Gather Approach . . . . . . . . . . . . .

.... 434




13.3.2

Probabilistic Algorithms . . . . . . . . . . . . . . . . . . .

.... 436




13.3.3

Simultaneous Document and Word Cluster Discovery . . .

.... 438







13.3.3.1 Co-clustering . . . . . . . . . . . . . . . . . . . .

.... 438

13.4

Topic Modeling .

............................

.... 440





Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   2   3   4   5   6   7   8   9   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin