Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	5/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 2 3 4 5 6 7 8 9 ... 423

1-Data Mining tarjima

CONTENTS
xv
		11.3.2.1 Relationship Between Weighting and Sampling . . . . .		350
		11.3.2.2	Synthetic Oversampling: SMOTE . . . . . . . . . . . .	350
11.4	Scalable Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . .			350
	11.4.1	Scalable Decision Trees . . . . . . . . . . . . . . . . . . . . . . . .		351
		11.4.1.1	RainForest . . . . . . . . . . . . . . . . . . . . . . . . .	351
		11.4.1.2	BOAT ...........................	351
	11.4.2	Scalable Support Vector Machines . . . . . . . . . . . . . . . . . .		352
11.5	Regression Modeling with Numeric Classes . . . . . . . . . . . . . . . . . .			353
	11.5.1	Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . .		353
		11.5.1.1 Relationship with Fisher’s Linear Discriminant . . . . .		356
	11.5.2	Principal Component Regression . . . . . . . . . . . . . . . . . . .		356
	11.5.3	Generalized Linear Models . . . . . . . . . . . . . . . . . . . . . .		357
	11.5.4	Nonlinear and Polynomial Regression . . . . . . . . . . . . . . . .		359
	11.5.5	From Decision Trees to Regression Trees . . . . . . . . . . . . . .		360
	11.5.6	Assessing Model Eﬀectiveness . . . . . . . . . . . . . . . . . . . .		361
11.6	Semisupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . .			361
	11.6.1	Generic Meta-algorithms . . . . . . . . . . . . . . . . . . . . . . .		363
		11.6.1.1 Self-Training . . . . . . . . . . . . . . . . . . . . . . . .		363
		11.6.1.2 Co-training . . . . . . . . . . . . . . . . . . . . . . . . .		363
	11.6.2	Specific Variations of Classification Algorithms . . . . . . . . . . .		364
		11.6.2.1 Semisupervised Bayes Classification with EM . . . . . .		364
		11.6.2.2 Transductive Support Vector Machines . . . . . . . . .		366
	11.6.3	Graph-Based Semisupervised Learning . . . . . . . . . . . . . . .		367
	11.6.4	Discussion of Semisupervised Learning . . . . . . . . . . . . . . .		367
11.7	Active Learning .		................................	368
	11.7.1	Heterogeneity-Based Models . . . . . . . . . . . . . . . . . . . . .		370
		11.7.1.1	Uncertainty Sampling . . . . . . . . . . . . . . . . . . .	370
		11.7.1.2 Query-by-Committee . . . . . . . . . . . . . . . . . . .		371
		11.7.1.3	Expected Model Change . . . . . . . . . . . . . . . . .	371
	11.7.2	Performance-Based Models . . . . . . . . . . . . . . . . . . . . . .		372
		11.7.2.1	Expected Error Reduction . . . . . . . . . . . . . . . .	372
		11.7.2.2	Expected Variance Reduction . . . . . . . . . . . . . .	373
	11.7.3	Representativeness-Based Models . . . . . . . . . . . . . . . . . .		373
11.8	Ensemble Methods		...............................	373
	11.8.1	Why Does Ensemble Analysis Work? . . . . . . . . . . . . . . . .		375
	11.8.2	Formal Statement of Bias-Variance Trade-oﬀ . . . . . . . . . . . .		377
	11.8.3	Specific Instantiations of Ensemble Learning . . . . . . . . . . . .		379
		11.8.3.1	Bagging . . . . . . . . . . . . . . . . . . . . . . . . . .	379
		11.8.3.2	Random Forests . . . . . . . . . . . . . . . . . . . . . .	380
		11.8.3.3	Boosting . . . . . . . . . . . . . . . . . . . . . . . . . .	381
		11.8.3.4	Bucket of Models . . . . . . . . . . . . . . . . . . . . .	383
		11.8.3.5	Stacking . . . . . . . . . . . . . . . . . . . . . . . . . .	384
11.9	Summary . . . .		................................	384
11.10	Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .			385
11.11	Exercises . . . . .		................................	386

xvi				CONTENTS
12 Mining Data Streams				389
12.1	Introduction . . .		............................	.... 389
12.2	Synopsis Data Structures for Streams . . . . . . . . . . . . . . . . .			.... 391
	12.2.1	Reservoir Sampling . . . . . . . . . . . . . . . . . . . . . .		.... 391
		12.2.1.1	Handling Concept Drift . . . . . . . . . . . . . .	.... 393
		12.2.1.2 Useful Theoretical Bounds for Sampling . . . . .		.... 394
	12.2.2	Synopsis Structures for the Massive-Domain Scenario . . .		.... 398
		12.2.2.1	Bloom Filter . . . . . . . . . . . . . . . . . . . .	.... 399
		12.2.2.2	Count-Min Sketch . . . . . . . . . . . . . . . . .	.... 403
		12.2.2.3	AMS Sketch . . . . . . . . . . . . . . . . . . . .	.... 406
		12.2.2.4 Flajolet–Martin Algorithm for Distinct Element
			Counting . . . . . . . . . . . . . . . . . . . . . .	.... 408
12.3	Frequent Pattern Mining in Data Streams . . . . . . . . . . . . . .			.... 409
	12.3.1	Leveraging Synopsis Structures . . . . . . . . . . . . . . .		.... 409
		12.3.1.1	Reservoir Sampling . . . . . . . . . . . . . . . .	.... 410
		12.3.1.2	Sketches . . . . . . . . . . . . . . . . . . . . . .	.... 410
	12.3.2	Lossy Counting Algorithm . . . . . . . . . . . . . . . . . .		.... 410
12.4	Clustering Data Streams . . . . . . . . . . . . . . . . . . . . . . . .			.... 411
	12.4.1	STREAM Algorithm . . . . . . . . . . . . . . . . . . . . .		.... 411
	12.4.2	CluStream Algorithm . . . . . . . . . . . . . . . . . . . . .		.... 413
		12.4.2.1	Microcluster Definition . . . . . . . . . . . . . .	.... 413
		12.4.2.2	Microclustering Algorithm . . . . . . . . . . . .	.... 414
		12.4.2.3	Pyramidal Time Frame . . . . . . . . . . . . . .	.... 415
	12.4.3	Massive-Domain Stream Clustering . . . . . . . . . . . . .		.... 417
12.5	Streaming Outlier Detection . . . . . . . . . . . . . . . . . . . . . .			.... 417
	12.5.1	Individual Data Points as Outliers . . . . . . . . . . . . . .		.... 418
	12.5.2	Aggregate Change Points as Outliers . . . . . . . . . . . .		.... 419
12.6	Streaming Classification . . . . . . . . . . . . . . . . . . . . . . . .			.... 421
	12.6.1	VFDT Family . . . . . . . . . . . . . . . . . . . . . . . . .		.... 421
	12.6.2	Supervised Microcluster Approach . . . . . . . . . . . . . .		.... 424
	12.6.3	Ensemble Method . . . . . . . . . . . . . . . . . . . . . . .		.... 424
	12.6.4	Massive-Domain Streaming Classification . . . . . . . . . .		.... 425
12.7	Summary . . . .		............................	.... 425
12.8	Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . . .			.... 425
12.9	Exercises . . . . .		............................	.... 426
13 Mining Text Data				429
13.1	Introduction . . .		............................	.... 429
13.2	Document Preparation and Similarity
	Computation . .		............................	.... 431
	13.2.1	Document Normalization and Similarity Computation . . .		.... 432
	13.2.2	Specialized Preprocessing for Web Documents . . . . . . .		.... 433
13.3	Specialized Clustering Methods for Text . . . . . . . . . . . . . . .			.... 434
	13.3.1	Representative-Based Algorithms . . . . . . . . . . . . . .		.... 434
		13.3.1.1	Scatter/Gather Approach . . . . . . . . . . . . .	.... 434
	13.3.2	Probabilistic Algorithms . . . . . . . . . . . . . . . . . . .		.... 436
	13.3.3	Simultaneous Document and Word Cluster Discovery . . .		.... 438
		13.3.3.1 Co-clustering . . . . . . . . . . . . . . . . . . . .		.... 438
13.4	Topic Modeling .		............................	.... 440

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 2 3 4 5 6 7 8 9 ... 423