Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə8/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   4   5   6   7   8   9   10   11   ...   423
1-Data Mining tarjima

CONTENTS







xxi

19 Social Network Analysis

619

19.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

..... 619

19.2

Social Networks: Preliminaries and Properties . . . . . . . . . . .

..... 620




19.2.1

Homophily . . . . . . . . . . . . . . . . . . . . . . . . . .

..... 621




19.2.2

Triadic Closure and Clustering Coefficient . . . . . . . .

..... 621




19.2.3

Dynamics of Network Formation . . . . . . . . . . . . . .

..... 622




19.2.4

Power-Law Degree Distributions . . . . . . . . . . . . . .

..... 623




19.2.5

Measures of Centrality and Prestige . . . . . . . . . . . .

..... 623







19.2.5.1 Degree Centrality and Prestige . . . . . . . . .

..... 624







19.2.5.2 Closeness Centrality and Proximity Prestige .

..... 624







19.2.5.3

Betweenness Centrality . . . . . . . . . . . . .

..... 626







19.2.5.4 Rank Centrality and Prestige . . . . . . . . .

..... 627

19.3

Community Detection . . . . . . . . . . . . . . . . . . . . . . . .

..... 627




19.3.1

Kernighan–Lin Algorithm . . . . . . . . . . . . . . . . .

..... 629







19.3.1.1

Speeding Up Kernighan–Lin . . . . . . . . . .

..... 630




19.3.2

Girvan–Newman Algorithm . . . . . . . . . . . . . . . .

..... 631




19.3.3

Multilevel Graph Partitioning: METIS . . . . . . . . . .

..... 634




19.3.4

Spectral Clustering . . . . . . . . . . . . . . . . . . . . .

..... 637







19.3.4.1 Important Observations and Intuitions . . . .

..... 640

19.4

Collective Classification . . . . . . . . . . . . . . . . . . . . . . .

..... 641




19.4.1

Iterative Classification Algorithm . . . . . . . . . . . . .

..... 641




19.4.2

Label Propagation with Random Walks . . . . . . . . . .

..... 643







19.4.2.1 Iterative Label Propagation: The Spectral













Interpretation . . . . . . . . . . . . . . . . . .

..... 646




19.4.3

Supervised Spectral Methods . . . . . . . . . . . . . . . .

..... 646







19.4.3.1 Supervised Feature Generation with Spectral













Embedding . . . . . . . . . . . . . . . . . . . .

..... 647







19.4.3.2

Graph Regularization Approach . . . . . . . .

..... 647







19.4.3.3 Connections with Random Walk Methods . .

..... 649

19.5

Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . .

..... 650




19.5.1

Neighborhood-Based Measures . . . . . . . . . . . . . . .

..... 650




19.5.2

Katz Measure . . . . . . . . . . . . . . . . . . . . . . . .

..... 652




19.5.3

Random Walk-Based Measures . . . . . . . . . . . . . . .

..... 653




19.5.4

Link Prediction as a Classification Problem . . . . . . .

..... 653




19.5.5

Link Prediction as a Missing-Value Estimation Problem

..... 654




19.5.6

Discussion

..........................

..... 654

19.6

Social Influence Analysis . . . . . . . . . . . . . . . . . . . . . . .

..... 655




19.6.1

Linear Threshold Model . . . . . . . . . . . . . . . . . .

..... 656




19.6.2

Independent Cascade Model . . . . . . . . . . . . . . . .

..... 657




19.6.3

Influence Function Evaluation . . . . . . . . . . . . . . .

..... 657

19.7

Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

..... 658

19.8

Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . .

..... 659

19.9

Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

..... 660

20 Privacy-Preserving Data Mining

663

20.1

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

..... 663

20.2

Privacy During Data Collection . . . . . . . . . . . . . . . . . . .

..... 664




20.2.1

Reconstructing Aggregate Distributions . . . . . . . . . .

..... 665




20.2.2

Leveraging Aggregate Distributions for Data Mining . .

..... 667

20.3

Privacy-Preserving Data Publishing . . . . . . . . . . . . . . . . .

..... 667




20.3.1

The k-Anonymity Model . . . . . . . . . . . . . . . . . .

..... 670




xxii










CONTENTS







20.3.1.1

Samarati’s Algorithm . . . . . . . . . . . . . .

..... 673







20.3.1.2

Incognito . . . . . . . . . . . . . . . . . . . . .

..... 675







20.3.1.3

Mondrian Multidimensional k-Anonymity . . .

..... 678







20.3.1.4 Synthetic Data Generation: Condensation-Based










Approach . . . . . . . . . . . . . . . . . . . . .

..... 680




20.3.2

The -Diversity Model . . . . . . . . . . . . . . . . . . .

..... 682




20.3.3

The t-closeness Model . . . . . . . . . . . . . . . . . . . .

..... 684




20.3.4 The Curse of Dimensionality . . . . . . . . . . . . . . . .

..... 687

20.4

Output Privacy .

...........................

..... 688

20.5

Distributed Privacy . . . . . . . . . . . . . . . . . . . . . . . . . .

..... 689

20.6

Summary . . . .

...........................

..... 690

20.7

Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . .

..... 691

20.8

Exercises . . . . .

...........................

..... 692

Bibliography







695

Index










727

Preface

Data is the new oil.”– Clive Humby
The field of data mining has seen rapid strides over the past two decades, especially from the perspective of the computer science community. While data analysis has been studied extensively in the conventional field of probability and statistics, data mining is a term coined by the computer science-oriented community. For computer scientists, issues such as scalability, usability, and computational implementation are extremely important.

The emergence of data science as a discipline requires the development of a book that goes beyond the traditional focus of books on only the fundamental data mining courses. Recent years have seen the emergence of the job description of “data scientists,” who try to glean knowledge from vast amounts of data. In typical applications, the data types are so heterogeneous and diverse that the fundamental methods discussed for a multidimensional data type may not be effective. Therefore, more emphasis needs to be placed on the different data types and the applications that arise in the context of these different data types. A comprehensive data mining book must explore the different aspects of data mining, starting from the fundamentals, and then explore the complex data types, and their relationships with the fundamental techniques. While fundamental techniques form an excellent basis for the further study of data mining, they do not provide a complete picture of the true complexity of data analysis. This book studies these advanced topics without compromis-ing the presentation of fundamental methods. Therefore, this book may be used for both introductory and advanced data mining courses. Until now, no single book has addressed all these topics in a comprehensive and integrated way.


The textbook assumes a basic knowledge of probability, statistics, and linear algebra, which is taught in most undergraduate curricula of science and engineering disciplines. Therefore, the book can also be used by industrial practitioners, who have a working knowl-edge of these basic skills. While stronger mathematical background is helpful for the more advanced chapters, it is not a prerequisite. Special chapters are also devoted to different aspects of data mining, such as text data, time- series data, discrete sequences, and graphs. This kind of specialized treatment is intended to capture the wide diversity of problem domains in which a data mining problem might arise.


The chapters of this book fall into one of three categories:





  • The fundamental chapters: Data mining has four main “super problems,” which correspond to clustering, classification, association pattern mining, and outlier anal-

xxiii


xxiv PREFACE

ysis. These problems are so important because they are used repeatedly as building blocks in the context of a wide variety of data mining applications. As a result, a large amount of emphasis has been placed by data mining researchers and practitioners to design effective and efficient methods for these problems. These chapters comprehen-sively discuss the vast diversity of methods used by the data mining community in the context of these super problems.





  • Domain chapters: These chapters discuss the specific methods used for different domains of data such as text data, time-series data, sequence data, graph data, and spatial data. Many of these chapters can also be considered application chapters, because they explore the specific characteristics of the problem in a particular domain.




  • Application chapters: Advancements in hardware technology and software plat-forms have lead to a number of data-intensive applications such as streaming systems, Web mining, social networks, and privacy preservation. These topics are studied in detail in these chapters. The domain chapters are also focused on many different kinds of applications that arise in the context of those data types.

Suggestions for the Instructor


The book was specifically written to enable the teaching of both the basic data mining and advanced data mining courses from a single book. It can be used to offer various types of data mining courses with different emphases. Specifically, the courses that could be offered with various chapters are as follows:





  • Basic data mining course and fundamentals: The basic data mining course should focus on the fundamentals of data mining. Chapters 1, 2, 3, 4, 6, 8, and 10 can be covered. In fact, the material in these chapters is more than what is possible to teach in a single course. Therefore, instructors may need to select topics of their interest from these chapters. Some portions of Chaps. 5, 7, 9, and 11 can also be covered, although these chapters are really meant for an advanced course.




  • Advanced course (fundamentals): Such a course would cover advanced topics on the fundamentals of data mining and assume that the student is already familiar with Chaps. 1–3, and parts of Chaps. 4, 6, 8, and 10. The course can then focus on Chaps. 5, 7, 9, and 11. Topics such as ensemble analysis are useful for the advanced course. Furthermore, some topics from Chaps. 4, 6, 8, and 10, which were not covered in the basic course, can be used. In addition, Chap. 20 on privacy can be offered.




  • Advanced course (data types): Advanced topics such as text mining, time series, sequences, graphs, and spatial data may be covered. The material should focus on Chaps. 13, 14, 15, 16, and 17. Some parts of Chap. 19 (e.g., graph clustering) and Chap. 12 (data streaming) can also be used.




  • Advanced course (applications): An application course overlaps with a data type course but has a different focus. For example, the focus in an application-centered course would be more on the modeling aspect than the algorithmic aspect. Therefore, the same materials in Chaps. 13, 14, 15, 16, and 17 can be used while skipping specific details of algorithms. With less focus on specific algorithms, these chapters can be covered fairly quickly. The remaining time should be allocated to three very important chapters on data streams (Chap. 12), Web mining (Chap. 18), and social network analysis (Chap. 19).

PREFACE xxv

The book is written in a simple style to make it accessible to undergraduate students and industrial practitioners with a limited mathematical background. Thus, the book will serve both as an introductory text and as an advanced text for students, industrial practitioners, and researchers.


Throughout this book, a vector or a multidimensional data point (including categorical attributes), is annotated with a bar, such as X or y. A vector or multidimensional point may be denoted by either small letters or capital letters, as long as it has a bar. Vector dot products are denoted by centered dots, such as X · Y . A matrix is denoted in capital letters without a bar, such as R. Throughout the book, the n×d data matrix is denoted by D, with n points and d dimensions. The individual data points in D are therefore d-dimensional row vectors. On the other hand, vectors with one component for each data point are usually n-dimensional column vectors. An example is the n-dimensional column vector y of class variables of n data points.




Acknowledgments


I would like to thank my wife and daughter for their love and support during the writing of this book. The writing of a book requires significant time, which is taken away from family members. This book is the result of their patience with me during this time.


I would also like to thank my manager Nagui Halim for providing the tremendous support necessary for the writing of this book. His professional support has been instrumental for my many book efforts in the past and present.


During the writing of this book, I received feedback from many colleagues. In partic-ular, I received feedback from Kanishka Bhaduri, Alain Biem, Graham Cormode, Hongbo Deng, Amit Dhurandhar, Bart Goethals, Alexander Hinneburg, Ramakrishnan Kannan, George Karypis, Dominique LaSalle, Abdullah Mueen, Guojun Qi, Pierangela Samarati, Saket Sathe, Karthik Subbian, Jiliang Tang, Deepak Turaga, Jilles Vreeken, Jieping Ye, and Peixiang Zhao. I would like to thank them for their constructive feedback and sugges-tions. Over the years, I have benefited from the insights of numerous collaborators. These insights have influenced this book directly or indirectly. I would first like to thank my long-term collaborator Philip S. Yu for my years of collaboration with him. Other researchers with whom I have had significant collaborations include Tarek F. Abdelzaher, Jing Gao, Quanquan Gu, Manish Gupta, Jiawei Han, Alexander Hinneburg, Thomas Huang, Nan Li, Huan Liu, Ruoming Jin, Daniel Keim, Arijit Khan, Latifur Khan, Mohammad M. Masud, Jian Pei, Magda Procopiuc, Guojun Qi, Chandan Reddy, Jaideep Srivastava, Karthik Sub-bian, Yizhou Sun, Jiliang Tang, Min-Hsuan Tsai, Haixun Wang, Jianyong Wang, Min Wang, Joel Wolf, Xifeng Yan, Mohammed Zaki, ChengXiang Zhai, and Peixiang Zhao.


I would also like to thank my advisor James B. Orlin for his guidance during my early years as a researcher. While I no longer work in the same area, the legacy of what I learned from him is a crucial part of my approach to research. In particular, he taught me the importance of intuition and simplicity of thought in the research process. These are more important aspects of research than is generally recognized. This book is written in a simple and intuitive style, and is meant to improve accessibility of this area to both researchers and practitioners.


I would also like to thank Lata Aggarwal for helping me with some of the figures drawn using Microsoft Powerpoint.



xxvii

Author Biography


Charu C. Aggarwal is a Distinguished Research Staff Member (DRSM) at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996.


He has worked extensively in the field of data mining. He has pub-lished more than 250 papers in refereed conferences and journals and authored over 80 patents. He is author or editor of 14 books, including the first comprehensive book on outlier analysis, which is written from a computer science point of view. Because of the commercial value of his patents, he has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, a recip-ient of the IBM Outstanding Technical Achievement Award (2009) for his work on data streams, and a recipient of an IBM Research


Division Award (2008) for his contributions to System S. He also received the EDBT 2014 Test of Time Award for his work on condensation-based privacy-preserving data mining.


He has served as the general co-chair of the IEEE Big Data Conference, 2014, and as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery from Data, an action editor of the Data Mining and Knowledge Discovery Journal, editor-in-chief of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He serves on the advisory board of the Lecture Notes on Social Networks, a publication by Springer. He has served as the vice-president of the SIAM Activity Group on Data Mining. He is a fellow of the ACM and the IEEE, for “contributions to knowledge discovery and data mining algorithms.”



xxix


Chapter 1


An Introduction to Data Mining

Education is not the piling on of learning, information, data, facts, skills,





Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   4   5   6   7   8   9   10   11   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin