CONTENTS
|
|
|
xxi
|
19 Social Network Analysis
|
619
|
19.1
|
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 619
|
19.2
|
Social Networks: Preliminaries and Properties . . . . . . . . . . .
|
..... 620
|
|
19.2.1
|
Homophily . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 621
|
|
19.2.2
|
Triadic Closure and Clustering Coefficient . . . . . . . .
|
..... 621
|
|
19.2.3
|
Dynamics of Network Formation . . . . . . . . . . . . . .
|
..... 622
|
|
19.2.4
|
Power-Law Degree Distributions . . . . . . . . . . . . . .
|
..... 623
|
|
19.2.5
|
Measures of Centrality and Prestige . . . . . . . . . . . .
|
..... 623
|
|
|
19.2.5.1 Degree Centrality and Prestige . . . . . . . . .
|
..... 624
|
|
|
19.2.5.2 Closeness Centrality and Proximity Prestige .
|
..... 624
|
|
|
19.2.5.3
|
Betweenness Centrality . . . . . . . . . . . . .
|
..... 626
|
|
|
19.2.5.4 Rank Centrality and Prestige . . . . . . . . .
|
..... 627
|
19.3
|
Community Detection . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 627
|
|
19.3.1
|
Kernighan–Lin Algorithm . . . . . . . . . . . . . . . . .
|
..... 629
|
|
|
19.3.1.1
|
Speeding Up Kernighan–Lin . . . . . . . . . .
|
..... 630
|
|
19.3.2
|
Girvan–Newman Algorithm . . . . . . . . . . . . . . . .
|
..... 631
|
|
19.3.3
|
Multilevel Graph Partitioning: METIS . . . . . . . . . .
|
..... 634
|
|
19.3.4
|
Spectral Clustering . . . . . . . . . . . . . . . . . . . . .
|
..... 637
|
|
|
19.3.4.1 Important Observations and Intuitions . . . .
|
..... 640
|
19.4
|
Collective Classification . . . . . . . . . . . . . . . . . . . . . . .
|
..... 641
|
|
19.4.1
|
Iterative Classification Algorithm . . . . . . . . . . . . .
|
..... 641
|
|
19.4.2
|
Label Propagation with Random Walks . . . . . . . . . .
|
..... 643
|
|
|
19.4.2.1 Iterative Label Propagation: The Spectral
|
|
|
|
|
Interpretation . . . . . . . . . . . . . . . . . .
|
..... 646
|
|
19.4.3
|
Supervised Spectral Methods . . . . . . . . . . . . . . . .
|
..... 646
|
|
|
19.4.3.1 Supervised Feature Generation with Spectral
|
|
|
|
|
Embedding . . . . . . . . . . . . . . . . . . . .
|
..... 647
|
|
|
19.4.3.2
|
Graph Regularization Approach . . . . . . . .
|
..... 647
|
|
|
19.4.3.3 Connections with Random Walk Methods . .
|
..... 649
|
19.5
|
Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 650
|
|
19.5.1
|
Neighborhood-Based Measures . . . . . . . . . . . . . . .
|
..... 650
|
|
19.5.2
|
Katz Measure . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 652
|
|
19.5.3
|
Random Walk-Based Measures . . . . . . . . . . . . . . .
|
..... 653
|
|
19.5.4
|
Link Prediction as a Classification Problem . . . . . . .
|
..... 653
|
|
19.5.5
|
Link Prediction as a Missing-Value Estimation Problem
|
..... 654
|
|
19.5.6
|
Discussion
|
..........................
|
..... 654
|
19.6
|
Social Influence Analysis . . . . . . . . . . . . . . . . . . . . . . .
|
..... 655
|
|
19.6.1
|
Linear Threshold Model . . . . . . . . . . . . . . . . . .
|
..... 656
|
|
19.6.2
|
Independent Cascade Model . . . . . . . . . . . . . . . .
|
..... 657
|
|
19.6.3
|
Influence Function Evaluation . . . . . . . . . . . . . . .
|
..... 657
|
19.7
|
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 658
|
19.8
|
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 659
|
19.9
|
Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 660
|
20 Privacy-Preserving Data Mining
|
663
|
20.1
|
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 663
|
20.2
|
Privacy During Data Collection . . . . . . . . . . . . . . . . . . .
|
..... 664
|
|
20.2.1
|
Reconstructing Aggregate Distributions . . . . . . . . . .
|
..... 665
|
|
20.2.2
|
Leveraging Aggregate Distributions for Data Mining . .
|
..... 667
|
20.3
|
Privacy-Preserving Data Publishing . . . . . . . . . . . . . . . . .
|
..... 667
|
|
20.3.1
|
The k-Anonymity Model . . . . . . . . . . . . . . . . . .
|
..... 670
|
xxii
|
|
|
|
CONTENTS
|
|
|
20.3.1.1
|
Samarati’s Algorithm . . . . . . . . . . . . . .
|
..... 673
|
|
|
20.3.1.2
|
Incognito . . . . . . . . . . . . . . . . . . . . .
|
..... 675
|
|
|
20.3.1.3
|
Mondrian Multidimensional k-Anonymity . . .
|
..... 678
|
|
|
20.3.1.4 Synthetic Data Generation: Condensation-Based
|
|
|
|
Approach . . . . . . . . . . . . . . . . . . . . .
|
..... 680
|
|
20.3.2
|
The -Diversity Model . . . . . . . . . . . . . . . . . . .
|
..... 682
|
|
20.3.3
|
The t-closeness Model . . . . . . . . . . . . . . . . . . . .
|
..... 684
|
|
20.3.4 The Curse of Dimensionality . . . . . . . . . . . . . . . .
|
..... 687
|
20.4
|
Output Privacy .
|
...........................
|
..... 688
|
20.5
|
Distributed Privacy . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 689
|
20.6
|
Summary . . . .
|
...........................
|
..... 690
|
20.7
|
Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . .
|
..... 691
|
20.8
|
Exercises . . . . .
|
...........................
|
..... 692
|
Bibliography
|
|
|
695
|
Index
|
|
|
|
727
|
Preface
“Data is the new oil.”– Clive Humby
The field of data mining has seen rapid strides over the past two decades, especially from the perspective of the computer science community. While data analysis has been studied extensively in the conventional field of probability and statistics, data mining is a term coined by the computer science-oriented community. For computer scientists, issues such as scalability, usability, and computational implementation are extremely important.
The emergence of data science as a discipline requires the development of a book that goes beyond the traditional focus of books on only the fundamental data mining courses. Recent years have seen the emergence of the job description of “data scientists,” who try to glean knowledge from vast amounts of data. In typical applications, the data types are so heterogeneous and diverse that the fundamental methods discussed for a multidimensional data type may not be effective. Therefore, more emphasis needs to be placed on the different data types and the applications that arise in the context of these different data types. A comprehensive data mining book must explore the different aspects of data mining, starting from the fundamentals, and then explore the complex data types, and their relationships with the fundamental techniques. While fundamental techniques form an excellent basis for the further study of data mining, they do not provide a complete picture of the true complexity of data analysis. This book studies these advanced topics without compromis-ing the presentation of fundamental methods. Therefore, this book may be used for both introductory and advanced data mining courses. Until now, no single book has addressed all these topics in a comprehensive and integrated way.
The textbook assumes a basic knowledge of probability, statistics, and linear algebra, which is taught in most undergraduate curricula of science and engineering disciplines. Therefore, the book can also be used by industrial practitioners, who have a working knowl-edge of these basic skills. While stronger mathematical background is helpful for the more advanced chapters, it is not a prerequisite. Special chapters are also devoted to different aspects of data mining, such as text data, time- series data, discrete sequences, and graphs. This kind of specialized treatment is intended to capture the wide diversity of problem domains in which a data mining problem might arise.
The chapters of this book fall into one of three categories:
The fundamental chapters: Data mining has four main “super problems,” which correspond to clustering, classification, association pattern mining, and outlier anal-
xxiv PREFACE
ysis. These problems are so important because they are used repeatedly as building blocks in the context of a wide variety of data mining applications. As a result, a large amount of emphasis has been placed by data mining researchers and practitioners to design effective and efficient methods for these problems. These chapters comprehen-sively discuss the vast diversity of methods used by the data mining community in the context of these super problems.
Domain chapters: These chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data. Many of these chapters can also be considered application chapters, because they explore the specific characteristics of the problem in a particular domain.
Application chapters: Advancements in hardware technology and software plat-forms have lead to a number of data-intensive applications such as streaming systems, Web mining, social networks, and privacy preservation. These topics are studied in detail in these chapters. The domain chapters are also focused on many different kinds of applications that arise in the context of those data types.
Suggestions for the Instructor
The book was specifically written to enable the teaching of both the basic data mining and advanced data mining courses from a single book. It can be used to offer various types of data mining courses with different emphases. Specifically, the courses that could be offered with various chapters are as follows:
Basic data mining course and fundamentals: The basic data mining course should focus on the fundamentals of data mining. Chapters 1, 2, 3, 4, 6, 8, and 10 can be covered. In fact, the material in these chapters is more than what is possible to teach in a single course. Therefore, instructors may need to select topics of their interest from these chapters. Some portions of Chaps. 5, 7, 9, and 11 can also be covered, although these chapters are really meant for an advanced course.
Advanced course (fundamentals): Such a course would cover advanced topics on the fundamentals of data mining and assume that the student is already familiar with Chaps. 1–3, and parts of Chaps. 4, 6, 8, and 10. The course can then focus on Chaps. 5, 7, 9, and 11. Topics such as ensemble analysis are useful for the advanced course. Furthermore, some topics from Chaps. 4, 6, 8, and 10, which were not covered in the basic course, can be used. In addition, Chap. 20 on privacy can be offered.
Advanced course (data types): Advanced topics such as text mining, time series, sequences, graphs, and spatial data may be covered. The material should focus on Chaps. 13, 14, 15, 16, and 17. Some parts of Chap. 19 (e.g., graph clustering) and Chap. 12 (data streaming) can also be used.
Advanced course (applications): An application course overlaps with a data type course but has a different focus. For example, the focus in an application-centered course would be more on the modeling aspect than the algorithmic aspect. Therefore, the same materials in Chaps. 13, 14, 15, 16, and 17 can be used while skipping specific details of algorithms. With less focus on specific algorithms, these chapters can be covered fairly quickly. The remaining time should be allocated to three very important chapters on data streams (Chap. 12), Web mining (Chap. 18), and social network analysis (Chap. 19).
PREFACE xxv
The book is written in a simple style to make it accessible to undergraduate students and industrial practitioners with a limited mathematical background. Thus, the book will serve both as an introductory text and as an advanced text for students, industrial practitioners, and researchers.
Throughout this book, a vector or a multidimensional data point (including categorical attributes), is annotated with a bar, such as X or y. A vector or multidimensional point may be denoted by either small letters or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots, such as X · Y . A matrix is denoted in capital letters without a bar, such as R. Throughout the book, the n×d data matrix is denoted by D, with n points and d dimensions. The individual data points in D are therefore d-dimensional row vectors. On the other hand, vectors with one component for each data point are usually n-dimensional column vectors. An example is the n-dimensional column vector y of class variables of n data points.
Acknowledgments
I would like to thank my wife and daughter for their love and support during the writing of this book. The writing of a book requires significant time, which is taken away from family members. This book is the result of their patience with me during this time.
I would also like to thank my manager Nagui Halim for providing the tremendous support necessary for the writing of this book. His professional support has been instrumental for my many book efforts in the past and present.
During the writing of this book, I received feedback from many colleagues. In partic-ular, I received feedback from Kanishka Bhaduri, Alain Biem, Graham Cormode, Hongbo Deng, Amit Dhurandhar, Bart Goethals, Alexander Hinneburg, Ramakrishnan Kannan, George Karypis, Dominique LaSalle, Abdullah Mueen, Guojun Qi, Pierangela Samarati, Saket Sathe, Karthik Subbian, Jiliang Tang, Deepak Turaga, Jilles Vreeken, Jieping Ye, and Peixiang Zhao. I would like to thank them for their constructive feedback and sugges-tions. Over the years, I have benefited from the insights of numerous collaborators. These insights have influenced this book directly or indirectly. I would first like to thank my long-term collaborator Philip S. Yu for my years of collaboration with him. Other researchers with whom I have had significant collaborations include Tarek F. Abdelzaher, Jing Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang, Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M. Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Jaideep Srivastava, Karthik Sub-bian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jianyong Wang, Min Wang, Joel Wolf, Xifeng Yan, Mohammed Zaki, ChengXiang Zhai, and Peixiang Zhao.
I would also like to thank my advisor James B. Orlin for his guidance during my early years as a researcher. While I no longer work in the same area, the legacy of what I learned from him is a crucial part of my approach to research. In particular, he taught me the importance of intuition and simplicity of thought in the research process. These are more important aspects of research than is generally recognized. This book is written in a simple and intuitive style, and is meant to improve accessibility of this area to both researchers and practitioners.
I would also like to thank Lata Aggarwal for helping me with some of the figures drawn using Microsoft Powerpoint.
xxvii
Author Biography
Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996.
He has worked extensively in the field of data mining. He has pub-lished more than 250 papers in refereed conferences and journals and authored over 80 patents. He is author or editor of 14 books, including the first comprehensive book on outlier analysis, which is written from a computer science point of view. Because of the commercial value of his patents, he has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, a recip-ient of the IBM Outstanding Technical Achievement Award (2009) for his work on data streams, and a recipient of an IBM Research
Division Award (2008) for his contributions to System S. He also received the EDBT 2014 Test of Time Award for his work on condensation-based privacy-preserving data mining.
He has served as the general co-chair of the IEEE Big Data Conference, 2014, and as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery from Data, an action editor of the Data Mining and Knowledge Discovery Journal, editor-in-chief of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He serves on the advisory board of the Lecture Notes on Social Networks, a publication by Springer. He has served as the vice-president of the SIAM Activity Group on Data Mining. He is a fellow of the ACM and the IEEE, for “contributions to knowledge discovery and data mining algorithms.”
Chapter 1
An Introduction to Data Mining
“Education is not the piling on of learning, information, data, facts, skills,
Dostları ilə paylaş: |