Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	8/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 4 5 6 7 8 9 10 11 ... 423

1-Data Mining tarjima

CONTENTS
xxi
19 Social Network Analysis				619
19.1	Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .			..... 619
19.2	Social Networks: Preliminaries and Properties . . . . . . . . . . .			..... 620
	19.2.1	Homophily . . . . . . . . . . . . . . . . . . . . . . . . . .		..... 621
	19.2.2	Triadic Closure and Clustering Coeﬃcient . . . . . . . .		..... 621
	19.2.3	Dynamics of Network Formation . . . . . . . . . . . . . .		..... 622
	19.2.4	Power-Law Degree Distributions . . . . . . . . . . . . . .		..... 623
	19.2.5	Measures of Centrality and Prestige . . . . . . . . . . . .		..... 623
		19.2.5.1 Degree Centrality and Prestige . . . . . . . . .		..... 624
		19.2.5.2 Closeness Centrality and Proximity Prestige .		..... 624
		19.2.5.3	Betweenness Centrality . . . . . . . . . . . . .	..... 626
		19.2.5.4 Rank Centrality and Prestige . . . . . . . . .		..... 627
19.3	Community Detection . . . . . . . . . . . . . . . . . . . . . . . .			..... 627
	19.3.1	Kernighan–Lin Algorithm . . . . . . . . . . . . . . . . .		..... 629
		19.3.1.1	Speeding Up Kernighan–Lin . . . . . . . . . .	..... 630
	19.3.2	Girvan–Newman Algorithm . . . . . . . . . . . . . . . .		..... 631
	19.3.3	Multilevel Graph Partitioning: METIS . . . . . . . . . .		..... 634
	19.3.4	Spectral Clustering . . . . . . . . . . . . . . . . . . . . .		..... 637
		19.3.4.1 Important Observations and Intuitions . . . .		..... 640
19.4	Collective Classification . . . . . . . . . . . . . . . . . . . . . . .			..... 641
	19.4.1	Iterative Classification Algorithm . . . . . . . . . . . . .		..... 641
	19.4.2	Label Propagation with Random Walks . . . . . . . . . .		..... 643
		19.4.2.1 Iterative Label Propagation: The Spectral
			Interpretation . . . . . . . . . . . . . . . . . .	..... 646
	19.4.3	Supervised Spectral Methods . . . . . . . . . . . . . . . .		..... 646
		19.4.3.1 Supervised Feature Generation with Spectral
			Embedding . . . . . . . . . . . . . . . . . . . .	..... 647
		19.4.3.2	Graph Regularization Approach . . . . . . . .	..... 647
		19.4.3.3 Connections with Random Walk Methods . .		..... 649
19.5	Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . .			..... 650
	19.5.1	Neighborhood-Based Measures . . . . . . . . . . . . . . .		..... 650
	19.5.2	Katz Measure . . . . . . . . . . . . . . . . . . . . . . . .		..... 652
	19.5.3	Random Walk-Based Measures . . . . . . . . . . . . . . .		..... 653
	19.5.4	Link Prediction as a Classification Problem . . . . . . .		..... 653
	19.5.5	Link Prediction as a Missing-Value Estimation Problem		..... 654
	19.5.6	Discussion	..........................	..... 654
19.6	Social Influence Analysis . . . . . . . . . . . . . . . . . . . . . . .			..... 655
	19.6.1	Linear Threshold Model . . . . . . . . . . . . . . . . . .		..... 656
	19.6.2	Independent Cascade Model . . . . . . . . . . . . . . . .		..... 657
	19.6.3	Influence Function Evaluation . . . . . . . . . . . . . . .		..... 657
19.7	Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .			..... 658
19.8	Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . .			..... 659
19.9	Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .			..... 660
20 Privacy-Preserving Data Mining				663
20.1	Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .			..... 663
20.2	Privacy During Data Collection . . . . . . . . . . . . . . . . . . .			..... 664
	20.2.1	Reconstructing Aggregate Distributions . . . . . . . . . .		..... 665
	20.2.2	Leveraging Aggregate Distributions for Data Mining . .		..... 667
20.3	Privacy-Preserving Data Publishing . . . . . . . . . . . . . . . . .			..... 667
	20.3.1	The k-Anonymity Model . . . . . . . . . . . . . . . . . .		..... 670

xxii					CONTENTS
		20.3.1.1		Samarati’s Algorithm . . . . . . . . . . . . . .	..... 673
		20.3.1.2		Incognito . . . . . . . . . . . . . . . . . . . . .	..... 675
		20.3.1.3		Mondrian Multidimensional k-Anonymity . . .	..... 678
			20.3.1.4 Synthetic Data Generation: Condensation-Based
				Approach . . . . . . . . . . . . . . . . . . . . .	..... 680
	20.3.2	The -Diversity Model . . . . . . . . . . . . . . . . . . .			..... 682
	20.3.3	The t-closeness Model . . . . . . . . . . . . . . . . . . . .			..... 684
	20.3.4 The Curse of Dimensionality . . . . . . . . . . . . . . . .				..... 687
20.4	Output Privacy .			...........................	..... 688
20.5	Distributed Privacy . . . . . . . . . . . . . . . . . . . . . . . . . .				..... 689
20.6	Summary . . . .			...........................	..... 690
20.7	Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . .				..... 691
20.8	Exercises . . . . .			...........................	..... 692
Bibliography					695
Index					727

Preface

“Data is the new oil.”– Clive Humby
The field of data mining has seen rapid strides over the past two decades, especially from the perspective of the computer science community. While data analysis has been studied extensively in the conventional field of probability and statistics, data mining is a term coined by the computer science-oriented community. For computer scientists, issues such as scalability, usability, and computational implementation are extremely important.

The emergence of data science as a discipline requires the development of a book that goes beyond the traditional focus of books on only the fundamental data mining courses. Recent years have seen the emergence of the job description of “data scientists,” who try to glean knowledge from vast amounts of data. In typical applications, the data types are so heterogeneous and diverse that the fundamental methods discussed for a multidimensional data type may not be eﬀective. Therefore, more emphasis needs to be placed on the diﬀerent data types and the applications that arise in the context of these diﬀerent data types. A comprehensive data mining book must explore the diﬀerent aspects of data mining, starting from the fundamentals, and then explore the complex data types, and their relationships with the fundamental techniques. While fundamental techniques form an excellent basis for the further study of data mining, they do not provide a complete picture of the true complexity of data analysis. This book studies these advanced topics without compromis-ing the presentation of fundamental methods. Therefore, this book may be used for both introductory and advanced data mining courses. Until now, no single book has addressed all these topics in a comprehensive and integrated way.

The textbook assumes a basic knowledge of probability, statistics, and linear algebra, which is taught in most undergraduate curricula of science and engineering disciplines. Therefore, the book can also be used by industrial practitioners, who have a working knowl-edge of these basic skills. While stronger mathematical background is helpful for the more advanced chapters, it is not a prerequisite. Special chapters are also devoted to diﬀerent aspects of data mining, such as text data, time- series data, discrete sequences, and graphs. This kind of specialized treatment is intended to capture the wide diversity of problem domains in which a data mining problem might arise.

The chapters of this book fall into one of three categories:

The fundamental chapters: Data mining has four main “super problems,” which correspond to clustering, classification, association pattern mining, and outlier anal-

xxiii

xxiv PREFACE

ysis. These problems are so important because they are used repeatedly as building blocks in the context of a wide variety of data mining applications. As a result, a large amount of emphasis has been placed by data mining researchers and practitioners to design eﬀective and eﬃcient methods for these problems. These chapters comprehen-sively discuss the vast diversity of methods used by the data mining community in the context of these super problems.

Domain chapters: These chapters discuss the specific methods used for diﬀerent domains of data such as text data, time-series data, sequence data, graph data, and spatial data. Many of these chapters can also be considered application chapters, because they explore the specific characteristics of the problem in a particular domain.

Application chapters: Advancements in hardware technology and software plat-forms have lead to a number of data-intensive applications such as streaming systems, Web mining, social networks, and privacy preservation. These topics are studied in detail in these chapters. The domain chapters are also focused on many diﬀerent kinds of applications that arise in the context of those data types.

Suggestions for the Instructor

The book was specifically written to enable the teaching of both the basic data mining and advanced data mining courses from a single book. It can be used to oﬀer various types of data mining courses with diﬀerent emphases. Specifically, the courses that could be oﬀered with various chapters are as follows:

Basic data mining course and fundamentals: The basic data mining course should focus on the fundamentals of data mining. Chapters 1, 2, 3, 4, 6, 8, and 10 can be covered. In fact, the material in these chapters is more than what is possible to teach in a single course. Therefore, instructors may need to select topics of their interest from these chapters. Some portions of Chaps. 5, 7, 9, and 11 can also be covered, although these chapters are really meant for an advanced course.

Advanced course (fundamentals): Such a course would cover advanced topics on the fundamentals of data mining and assume that the student is already familiar with Chaps. 1–3, and parts of Chaps. 4, 6, 8, and 10. The course can then focus on Chaps. 5, 7, 9, and 11. Topics such as ensemble analysis are useful for the advanced course. Furthermore, some topics from Chaps. 4, 6, 8, and 10, which were not covered in the basic course, can be used. In addition, Chap. 20 on privacy can be oﬀered.

Advanced course (data types): Advanced topics such as text mining, time series, sequences, graphs, and spatial data may be covered. The material should focus on Chaps. 13, 14, 15, 16, and 17. Some parts of Chap. 19 (e.g., graph clustering) and Chap. 12 (data streaming) can also be used.

Advanced course (applications): An application course overlaps with a data type course but has a diﬀerent focus. For example, the focus in an application-centered course would be more on the modeling aspect than the algorithmic aspect. Therefore, the same materials in Chaps. 13, 14, 15, 16, and 17 can be used while skipping specific details of algorithms. With less focus on specific algorithms, these chapters can be covered fairly quickly. The remaining time should be allocated to three very important chapters on data streams (Chap. 12), Web mining (Chap. 18), and social network analysis (Chap. 19).

PREFACE xxv

The book is written in a simple style to make it accessible to undergraduate students and industrial practitioners with a limited mathematical background. Thus, the book will serve both as an introductory text and as an advanced text for students, industrial practitioners, and researchers.

Throughout this book, a vector or a multidimensional data point (including categorical attributes), is annotated with a bar, such as X or y. A vector or multidimensional point may be denoted by either small letters or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots, such as X · Y . A matrix is denoted in capital letters without a bar, such as R. Throughout the book, the n×d data matrix is denoted by D, with n points and d dimensions. The individual data points in D are therefore d-dimensional row vectors. On the other hand, vectors with one component for each data point are usually n-dimensional column vectors. An example is the n-dimensional column vector y of class variables of n data points.

Acknowledgments

I would like to thank my wife and daughter for their love and support during the writing of this book. The writing of a book requires significant time, which is taken away from family members. This book is the result of their patience with me during this time.

I would also like to thank my manager Nagui Halim for providing the tremendous support necessary for the writing of this book. His professional support has been instrumental for my many book eﬀorts in the past and present.

During the writing of this book, I received feedback from many colleagues. In partic-ular, I received feedback from Kanishka Bhaduri, Alain Biem, Graham Cormode, Hongbo Deng, Amit Dhurandhar, Bart Goethals, Alexander Hinneburg, Ramakrishnan Kannan, George Karypis, Dominique LaSalle, Abdullah Mueen, Guojun Qi, Pierangela Samarati, Saket Sathe, Karthik Subbian, Jiliang Tang, Deepak Turaga, Jilles Vreeken, Jieping Ye, and Peixiang Zhao. I would like to thank them for their constructive feedback and sugges-tions. Over the years, I have benefited from the insights of numerous collaborators. These insights have influenced this book directly or indirectly. I would first like to thank my long-term collaborator Philip S. Yu for my years of collaboration with him. Other researchers with whom I have had significant collaborations include Tarek F. Abdelzaher, Jing Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang, Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M. Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Jaideep Srivastava, Karthik Sub-bian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jianyong Wang, Min Wang, Joel Wolf, Xifeng Yan, Mohammed Zaki, ChengXiang Zhai, and Peixiang Zhao.

I would also like to thank my advisor James B. Orlin for his guidance during my early years as a researcher. While I no longer work in the same area, the legacy of what I learned from him is a crucial part of my approach to research. In particular, he taught me the importance of intuition and simplicity of thought in the research process. These are more important aspects of research than is generally recognized. This book is written in a simple and intuitive style, and is meant to improve accessibility of this area to both researchers and practitioners.

I would also like to thank Lata Aggarwal for helping me with some of the figures drawn using Microsoft Powerpoint.

xxvii

Author Biography

Charu C. Aggarwal is a Distinguished Research Staﬀ Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996.

He has worked extensively in the field of data mining. He has pub-lished more than 250 papers in refereed conferences and journals and authored over 80 patents. He is author or editor of 14 books, including the first comprehensive book on outlier analysis, which is written from a computer science point of view. Because of the commercial value of his patents, he has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, a recip-ient of the IBM Outstanding Technical Achievement Award (2009) for his work on data streams, and a recipient of an IBM Research

Division Award (2008) for his contributions to System S. He also received the EDBT 2014 Test of Time Award for his work on condensation-based privacy-preserving data mining.

He has served as the general co-chair of the IEEE Big Data Conference, 2014, and as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery from Data, an action editor of the Data Mining and Knowledge Discovery Journal, editor-in-chief of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He serves on the advisory board of the Lecture Notes on Social Networks, a publication by Springer. He has served as the vice-president of the SIAM Activity Group on Data Mining. He is a fellow of the ACM and the IEEE, for “contributions to knowledge discovery and data mining algorithms.”

xxix

Chapter 1

An Introduction to Data Mining

“Education is not the piling on of learning, information, data, facts, skills,

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 4 5 6 7 8 9 10 11 ... 423