some cases, an attribute set Xi may be associated with node i, or an attribute set Yij may be associated with edge (i, j).
The edge (i, j) may be directed or undirected, depending on the application at hand. For example, the Web graph may contain directed edges corresponding to directions of hyper-links between pages, whereas friendships in the Facebook social network are undirected.
A second class of graph mining problems is that of a database containing many small graphs such as chemical compounds. The challenges in these two classes of problems are very different. Some examples of data that are represented as graphs are as follows:
Web graph: The nodes correspond to the Web pages, and the edges correspond to hyperlinks. The nodes have text attributes corresponding to the content in the page.
Social networks: In this case, the nodes correspond to social network actors, whereas the edges correspond to friendship links. The nodes may have attributes corresponding to social page content. In some specialized forms of social networks, such as email or
14 CHAPTER 1. AN INTRODUCTION TO DATA MINING
chat-messenger networks, the edges may have content associated with them. This content corresponds to the communication between the different nodes.
Chemical compound databases: In this case, the nodes correspond to the elements and the edges correspond to the chemical bonds between the elements. The structures in these chemical compounds are very useful for identifying important reactive and pharmacological properties of these compounds.
Network data are a very general representation and can be used for solving many similarity-based applications on other data types. For example, multidimensional data may be con-verted to network data by creating a node for each record in the database, and representing similarities between nodes by edges. Such a representation is used quite often for many similarity-based data mining applications, such as clustering. It is possible to use commu-nity detection algorithms to determine clusters in the network data and then map them back to multidimensional data. Some spectral clustering methods, discussed in Chap. 19, are based on this principle. This generality of network data comes at a price. The develop-ment of mining algorithms for network data is generally more difficult. Methods for mining network data are discussed in Chaps. 17, 18, and 19.
1.4 The Major Building Blocks: A Bird’s Eye View
As discussed in the introduction Sect. 1.1, four problems in data mining are considered fundamental to the mining process. These problems correspond to clustering, classification, association pattern mining, and outlier detection, and they are encountered repeatedly in the context of many data mining applications. What makes these problems so special? Why are they encountered repeatedly? To answer these questions, one must understand the nature of the typical relationships that data scientists often try to extract from the data.
Consider a multidimensional database D with n records, and d attributes. Such a database D may be represented as an n × d matrix D, in which each row corresponds to one record and each column corresponds to a dimension. We generally refer to this matrix as the data matrix . This book will use the notation of a data matrix D, and a database D interchangeably. Broadly speaking, data mining is all about finding summary relation-ships between the entries in the data matrix that are either unusually frequent or unusually infrequent. Relationships between data items are one of two kinds:
|