The support of the item set A is at least s.
The confidence of A ⇒ B is at least c.
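The two conditions above can be checked directly on a small transaction database. The following is a minimal sketch, assuming the standard definitions: the support of an item set is the fraction of transactions that contain it, and the confidence of A ⇒ B is the support of A ∪ B divided by the support of A. The function names and the toy transactions are hypothetical and serve only as an illustration.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(a: Iterable[str], b: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Assumed definition: support(A union B) / support(A)."""
    sup_a = support(a, transactions)
    if sup_a == 0.0:
        return 0.0
    return support(frozenset(a) | frozenset(b), transactions) / sup_a


def rule_is_valid(a: Iterable[str], b: Iterable[str],
                  transactions: List[Transaction],
                  s: float, c: float) -> bool:
    """Check the two conditions as stated in the text:
    support of A is at least s, and confidence of A => B is at least c."""
    return (support(a, transactions) >= s
            and confidence(a, b, transactions) >= c)


# Hypothetical toy database of four market-basket transactions.
transactions = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "jam"}),
    frozenset({"milk", "eggs"}),
]
```

For example, `rule_is_valid({"bread"}, {"butter"}, transactions, s=0.5, c=0.6)` evaluates both conditions for the rule {bread} ⇒ {butter} on this toy database.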
By incorporating supervision in association rule mining algorithms, it is possible to provide solutions for the classification problem. Many variations of association pattern mining are also related to clustering and outlier analysis. This is a natural consequence of the fact that horizontal and vertical analysis of the data matrix are often related to one another. In fact, many variations of the association pattern mining problem are used as a subroutine to solve the clustering, outlier analysis, and classification problems. These issues will be discussed in Chaps. 4 and 5.
1.4.2 Data Clustering
A rather broad and informal definition of the clustering problem is as follows:
Definition 1.4.3 (Data Clustering) Given a data matrix D (database D), partition its rows (records) into sets C1, ..., Ck, such that the rows (records) in each cluster are “similar” to one another.
We have intentionally provided an informal definition here because clustering allows a wide variety of definitions of similarity, some of which are not cleanly defined in closed form by a similarity function. A clustering problem can often be defined as an optimization problem, in which the variables of the optimization problem represent cluster memberships of data points, and the objective function maximizes a concrete mathematical quantification of intragroup similarity in terms of these variables.
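The optimization view can be illustrated with one possible objective. The sketch below assumes, purely for illustration, that intragroup similarity is quantified as the negative sum of squared Euclidean distances from each point to the mean of its cluster; the text does not prescribe this choice, and the function names are hypothetical. The cluster memberships (`labels`) play the role of the optimization variables.

```python
from typing import Dict, List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def intragroup_similarity(data: List[Point], labels: List[int]) -> float:
    """Assumed objective: negative within-cluster sum of squared distances.

    Larger (closer to zero) values mean that points sharing a label
    lie closer to their cluster mean.
    """
    clusters: Dict[int, List[Point]] = {}
    for point, label in zip(data, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        center = centroid(members)
        for p in members:
            total += sum((x - m) ** 2 for x, m in zip(p, center))
    return -total


# Hypothetical data: two well-separated pairs of points.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_labels = [0, 0, 1, 1]  # groups nearby points together
bad_labels = [0, 1, 0, 1]   # mixes distant points in each cluster

good_score = intragroup_similarity(data, good_labels)
bad_score = intragroup_similarity(data, bad_labels)
```

Under this objective, the assignment that groups nearby points together receives the higher score, so a clustering algorithm that maximizes it would prefer `good_labels` over `bad_labels`.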
An important part of the clustering process is the design of an appropriate similarity function. Clearly, the computation of similarity depends heavily on the underlying data type. The issue of similarity computation will be discussed in detail in Chap. 3. Some examples of relevant applications are as follows:
Customer segmentation: In many applications, it is desirable to determine customers that are similar to one another in the context of a variety of product promotion tasks. The segmentation phase plays an important role in this process.
Data summarization: Because clusters can be considered similar groups of records, these similar groups can be used to create a summary of the data.
Application to other data mining problems: Because clustering is considered an unsupervised version of classification, it is often used as a building block to solve the latter. Furthermore, clustering is also used in the context of the outlier analysis problem, as discussed below.
The data clustering problem is discussed in detail in Chaps. 6 and 7.
1.4.3 Outlier Detection
An outlier is a data point that is significantly different from the remaining data. Hawkins [259] formally defined the concept of an outlier as follows:
“An outlier is an observation that deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism.”
Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. In most applications, the data are created by one or more generating processes that can either reflect activity in the system or observations collected about entities. When the generating process behaves in an unusual way, it results in the creation of outliers. Therefore, an outlier often contains useful information about abnormal characteristics of the systems and entities that impact the data-generation process. The recognition of such unusual characteristics provides useful application-specific insights. The outlier detection problem is informally defined in terms of the data matrix as follows:
Definition 1.4.4 (Outlier Detection) Given a data matrix D, determine the rows of the data matrix that are very different from the remaining rows in the matrix.
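Because the definition leaves "very different" open, any concrete detector has to choose a quantification. The following is a minimal sketch under one such assumption: each row is scored by its mean Euclidean distance to all other rows, and the highest-scoring rows are reported. This scoring rule and the function names are illustrative choices, not a method prescribed by the text.

```python
import math
from typing import List, Sequence

Row = Sequence[float]


def outlier_scores(rows: List[Row]) -> List[float]:
    """Assumed score: mean Euclidean distance from each row to all others."""
    scores = []
    for i, row in enumerate(rows):
        dists = [math.dist(row, other)
                 for j, other in enumerate(rows) if j != i]
        scores.append(sum(dists) / len(dists) if dists else 0.0)
    return scores


def top_outliers(rows: List[Row], k: int = 1) -> List[int]:
    """Indices of the k rows with the largest outlier scores."""
    scores = outlier_scores(rows)
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical data matrix: three similar rows and one distant row.
rows = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (8.0, 8.0)]
flagged = top_outliers(rows, k=1)
```

On this toy matrix the distant row at index 3 receives the largest score. The quadratic cost of comparing all pairs of rows is acceptable for a sketch but would need to be reduced for large data sets.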
The outlier detection problem is complementary to the clustering problem: outliers are data points that are dissimilar from the main groups in the data, whereas the main groups themselves are the clusters. In fact, a simple methodology to determine outliers uses clustering as an intermediate step. Some examples of relevant applications are as follows: