1.4. THE MAJOR BUILDING BLOCKS: A BIRD’S EYE VIEW
notion of a group of “similar” records. Therefore, from a learning perspective, clustering is often referred to as unsupervised learning (because of the lack of a special training database to “teach” the model about the notion of an appropriate grouping), whereas the classification problem is referred to as supervised learning.
The classification problem is related to association pattern mining, in the sense that the latter problem is often used to solve the former. This is because if the entire training database (including the class label) is treated as an n × (d+1) matrix, then frequent patterns containing the class label in this matrix provide useful hints about the correlations of other features to the class label. In fact, many forms of classifiers, known as rule-based classifiers, are based on this broader principle.
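As a rough illustration of this idea, the following Python sketch counts feature patterns that co-occur with a class label and keeps those with sufficient support and confidence. The toy records, the feature names, the function name `class_rules`, and the thresholds are all assumptions made for illustration; this is a brute-force enumeration, not the method of any particular rule-based classifier.

```python
from collections import Counter
from itertools import combinations

# Hypothetical training database: each record holds d = 3 categorical
# features plus a class label, i.e. one row of the n x (d+1) matrix.
records = [
    ({"age=young", "income=high", "student=no"}, "buy=yes"),
    ({"age=young", "income=high", "student=yes"}, "buy=yes"),
    ({"age=young", "income=low", "student=yes"}, "buy=no"),
    ({"age=old", "income=high", "student=no"}, "buy=yes"),
    ({"age=old", "income=low", "student=no"}, "buy=no"),
    ({"age=old", "income=low", "student=yes"}, "buy=no"),
]

def class_rules(records, min_support=2, min_confidence=0.75, max_len=2):
    """Return (feature_pattern, class_label, support, confidence) tuples."""
    pattern_counts = Counter()  # records containing the feature pattern
    joint_counts = Counter()    # records containing the pattern and the label
    for features, label in records:
        for k in range(1, max_len + 1):
            for pattern in combinations(sorted(features), k):
                pattern_counts[pattern] += 1
                joint_counts[(pattern, label)] += 1

    rules = []
    for (pattern, label), support in joint_counts.items():
        confidence = support / pattern_counts[pattern]
        if support >= min_support and confidence >= min_confidence:
            rules.append((pattern, label, support, confidence))
    # Most confident rules first, then by support, then alphabetically.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0]))

rules = class_rules(records)
for pattern, label, support, confidence in rules:
    print(f"{set(pattern)} => {label} (support={support}, confidence={confidence:.2f})")
```

In this toy data, the pattern containing only `income=high` always co-occurs with `buy=yes`, so it surfaces as a high-confidence rule, whereas `age=young` alone does not meet the confidence threshold.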
The classification problem can be mapped to a specific version of the outlier detection problem, by incorporating supervision in the latter. While the outlier detection problem is assumed to be unsupervised by default, many variations of the problem are either partially or fully supervised. In supervised outlier detection, some examples of outliers are available. Such data records are tagged as belonging to a rare class, whereas the remaining data records belong to the normal class. Thus, the supervised outlier detection problem maps to a binary classification problem, with the caveat that the class labels are highly imbalanced.
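One reason the imbalance matters can be seen in a small Python sketch. The label counts below (5 outliers among 1,000 records) and the helper `evaluate` are hypothetical, chosen only to show that a trivial classifier that always predicts the normal class scores high accuracy while detecting no outliers.

```python
# Hypothetical labels: 1 = outlier (rare class), 0 = normal class.
labels = [1] * 5 + [0] * 995

def evaluate(predictions, labels):
    """Return (accuracy, precision, recall), treating the rare class as positive."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    accuracy = sum(1 for p, y in pairs if p == y) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# A trivial "classifier" that labels every record as normal.
always_normal = [0] * len(labels)
accuracy, precision, recall = evaluate(always_normal, labels)
print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
```

For this reason, measures that focus on the rare class, such as precision and recall, are usually more informative than raw accuracy in this setting.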
The incorporation of supervision makes the classification problem unique in its direct application specificity, because it uses application-specific class labels. Compared to the other major data mining problems, the classification problem is relatively self-contained. For example, the clustering and frequent pattern mining problems are more often used as intermediate steps in larger application frameworks. Even the outlier analysis problem is sometimes used in an exploratory way. On the other hand, the classification problem is often used directly as a stand-alone tool in many applications. Some examples of applications where the classification problem is used are as follows:
Target marketing: Features about customers are related to their buying behavior with the use of a training model.
Intrusion detection: The sequences of customer activity in a computer system may be used to predict the possibility of intrusions.
Supervised anomaly detection: The rare class may be differentiated from the normal class when previous examples of outliers are available.
The data classification problem is discussed in detail in Chaps. 10 and 11.
1.4.5 Impact of Complex Data Types on Problem Definitions
The specific data type has a profound impact on the kinds of problems that may be defined. In particular, in dependency-oriented data types, the dependencies often play a critical role in the problem definition, the solution, or both. This is because the contextual attributes and dependencies are often fundamental to how the data may be evaluated. Furthermore, because complex data types are much richer, they allow the formulation of novel problem definitions that may not even exist in the context of multidimensional data. A tabular summary of the different variations of data mining problems for dependency-oriented data types is provided in Table 1.2. In the following, a brief review will be provided as to how the different problem definitions are affected by data type.
CHAPTER 1. AN INTRODUCTION TO DATA MINING
Table 1.2: Some examples of variation in problem definition with data type
| Problem | Time series | Spatial | Sequence | Networks |
|---|---|---|---|---|
| Patterns | Motif-mining; Periodic pattern | Colocation patterns | Sequential patterns; Periodic sequence | Structural patterns |
| | Trajectory patterns (spans time series and spatial) | | | |
| Clustering | Shape clusters | Spatial clusters | Sequence clusters | Community detection |
| | Trajectory clusters (spans time series and spatial) | | | |
| Outliers | Position outlier; Shape outlier | Position outlier; Shape outlier | Position outlier; Combination outlier | Node outlier; Linkage outlier; Community outliers |
| | Trajectory outliers (spans time series and spatial) | | | |
| Classification | Position classification; Shape classification | Position classification; Shape classification | Position classification; Sequence classification | Collective classification; Graph classification |
| | Trajectory classification (spans time series and spatial) | | | |
1.4.5.1 Pattern Mining with Complex Data Types
The association pattern mining problem generally determines patterns from the underlying data in the form of sets; however, this is not the case when dependencies are present in the data. This is because the dependencies and relationships often impose an ordering among data items, and the direct use of frequent pattern mining methods fails to recognize the relationships among the different data values. For example, when a large number of time series are available, they can be used to determine different kinds of temporally frequent patterns, in which a temporal ordering is imposed on the items in the pattern. Furthermore, because of the presence of the additional contextual attribute representing time, temporal patterns may be defined in a much richer way than the set-based patterns of association pattern mining. The patterns may be temporally contiguous, as in time-series motifs, or they may be periodic, as in periodic patterns. Some of these methods for temporal pattern mining will be discussed in Chap. 14. An analogous situation exists for discrete sequence mining, except that the individual pattern constituents are categorical, as opposed to continuous. It is also possible to define 2-dimensional motifs for the spatial scenario, and such a formulation is useful for image processing. Finally, structural patterns are commonly defined in networks, where they correspond to frequent subgraphs in the data. Thus, the dependencies between the nodes are included within the definition of the patterns.
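The difference between set-based and order-aware patterns can be sketched in a few lines of Python. The toy sequence database, the event names, and the helper functions below are hypothetical, and the subsequence test allows gaps between items, which is only one of several possible definitions of a sequential pattern.

```python
def contains_itemset(sequence, items):
    """Set-based match: every item occurs somewhere, order ignored."""
    return set(items) <= set(sequence)

def contains_subsequence(sequence, pattern):
    """Order-aware match: items occur in the given order, gaps allowed."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later items must appear after earlier ones.
    return all(item in remaining for item in pattern)

def support(database, pattern, matcher):
    """Number of sequences in the database that match the pattern."""
    return sum(1 for sequence in database if matcher(sequence, pattern))

# Hypothetical database of discrete event sequences.
database = [
    ["login", "browse", "buy"],
    ["browse", "login", "buy"],
    ["login", "buy", "browse"],
    ["buy", "login"],
]
pattern = ["login", "buy"]

set_support = support(database, pattern, contains_itemset)
ordered_support = support(database, pattern, contains_subsequence)
print(set_support, ordered_support)  # prints "4 3"
```

The last sequence contains both events but in the reverse order, so it counts toward the set-based support and not toward the ordered one.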
1.4.5.2 Clustering with Complex Data Types
The techniques used for clustering are also affected significantly by the underlying data type. Most importantly, the similarity function depends heavily on the data type. For example, in the case of time series, sequential, or graph data, the similarity between a pair of objects cannot be easily defined by using straightforward metrics such as the Euclidean metric. Rather, it is necessary to use other kinds of measures, such as the edit distance or structural similarity. In the context of spatial data, trajectory clustering is particularly useful in finding the relevant patterns for mobile data, or for multivariate