Data Mining: The Textbook


1.4. THE MAJOR BUILDING BLOCKS: A BIRD’S EYE VIEW

notion of a group of “similar” records. Therefore, from a learning perspective, clustering is often referred to as unsupervised learning (because of the lack of a special training database to “teach” the model about the notion of an appropriate grouping), whereas the classification problem is referred to as supervised learning.

The classification problem is related to association pattern mining, in the sense that the latter problem is often used to solve the former. This is because if the entire training database (including the class label) is treated as an n × (d + 1) matrix, then frequent patterns containing the class label in this matrix provide useful hints about the correlations of other features to the class label. In fact, many forms of classifiers, known as rule-based classifiers, are based on this broader principle.
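As a minimal sketch of this idea, the following Python snippet treats each training record, together with its class label, as a set of attribute-value items and reports the patterns containing the class label whose support and confidence exceed given thresholds. The toy records, the attribute names, the thresholds, and the function name `class_rules` are hypothetical. Only single-feature antecedents are enumerated to keep the example short; practical rule-based classifiers mine longer patterns and add rule-ordering and pruning steps.

```python
from collections import Counter

def class_rules(records, labels, min_support=0.3, min_confidence=0.8):
    """Return (item, label, support, confidence) for rules item -> label.

    Each record is a dict of categorical features; only single-item
    antecedents are considered in this sketch.
    """
    n = len(records)
    item_counts = Counter()   # occurrences of each (feature, value) item
    joint_counts = Counter()  # occurrences of an item together with a label
    for record, label in zip(records, labels):
        for item in record.items():
            item_counts[item] += 1
            joint_counts[(item, label)] += 1

    rules = []
    for (item, label), joint in joint_counts.items():
        support = joint / n                     # fraction of all records
        confidence = joint / item_counts[item]  # fraction of records with item
        if support >= min_support and confidence >= min_confidence:
            rules.append((item, label, support, confidence))
    # Most confident rules first, ties broken by higher support.
    return sorted(rules, key=lambda r: (-r[3], -r[2]))

# Hypothetical toy training data: n = 5 records, d = 2 features.
records = [
    {"age": "young", "income": "high"},
    {"age": "young", "income": "low"},
    {"age": "old", "income": "high"},
    {"age": "old", "income": "high"},
    {"age": "young", "income": "low"},
]
labels = ["buys", "skips", "buys", "buys", "skips"]

rules = class_rules(records, labels)
for item, label, support, confidence in rules:
    print(f"{item} -> {label}: support={support:.2f}, confidence={confidence:.2f}")
```

On this toy data, the item ("age", "young") yields no rule because neither class label co-occurs with it often enough to meet both thresholds.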

The classification problem can be mapped to a specific version of the outlier detection problem, by incorporating supervision in the latter. While the outlier detection problem is assumed to be unsupervised by default, many variations of the problem are either partially or fully supervised. In supervised outlier detection, some examples of outliers are available. Such data records are tagged as belonging to a rare class, whereas the remaining data records belong to the normal class. Thus, the supervised outlier detection problem maps to a binary classification problem, with the caveat that the class labels are highly imbalanced.
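The imbalance caveat can be illustrated with a short Python sketch. It compares a trivial classifier that labels every record as normal against a simple threshold rule, and measures precision and recall on the rare class in addition to accuracy. The one-dimensional values, the 95/5 class split, the cutoff of 5.0, and the function name `evaluate` are all hypothetical choices made for illustration.

```python
def evaluate(y_true, y_pred, rare=1):
    """Return (accuracy, precision, recall), with the last two on the rare class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == rare and p == rare)
    fp = sum(1 for t, p in pairs if t != rare and p == rare)
    fn = sum(1 for t, p in pairs if t == rare and p != rare)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Hypothetical data: 95 normal records with small values, 5 tagged outliers.
# The last outlier has a small value, so a threshold rule will miss it.
values = [i % 5 * 0.5 for i in range(95)] + [9.0, 9.5, 10.0, 10.5, 1.0]
y_true = [0] * 95 + [1] * 5

always_normal = [0] * len(values)                    # ignores the rare class
thresholded = [1 if v > 5.0 else 0 for v in values]  # hypothetical cutoff

baseline = evaluate(y_true, always_normal)
detector = evaluate(y_true, thresholded)
print("always normal:", baseline)
print("thresholded:  ", detector)
```

The all-normal baseline reaches 95% accuracy while finding none of the tagged outliers, which is why accuracy alone is a poor yardstick when the rare class is the one of interest.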

The incorporation of supervision, in the form of application-specific class labels, makes the classification problem unique in how directly it is tied to a specific application. Compared to the other major data mining problems, the classification problem is relatively self-contained. For example, the clustering and frequent pattern mining problems are more often used as intermediate steps in larger application frameworks. Even the outlier analysis problem is sometimes used in an exploratory way. On the other hand, the classification problem is often used directly as a stand-alone tool in many applications. Some examples of applications where the classification problem is used are as follows:

Target marketing: Features about customers are related to their buying behavior with the use of a training model.

Intrusion detection: The sequences of customer activity in a computer system may be used to predict the possibility of intrusions.

Supervised anomaly detection: The rare class may be differentiated from the normal class when previous examples of outliers are available.

The data classification problem is discussed in detail in Chaps. 10 and 11.

1.4.5 Impact of Complex Data Types on Problem Definitions

The specific data type has a profound impact on the kinds of problems that may be defined. In particular, in dependency-oriented data types, the dependencies often play a critical role in the problem definition, the solution, or both. This is because the contextual attributes and dependencies are often fundamental to how the data may be evaluated. Furthermore, because complex data types are much richer, they allow the formulation of novel problem definitions that may not even exist in the context of multidimensional data. A tabular summary of the different variations of data mining problems for dependency-oriented data types is provided in Table 1.2. In the following, a brief review will be provided as to how the different problem definitions are affected by data type.

20 CHAPTER 1. AN INTRODUCTION TO DATA MINING

Table 1.2: Some examples of variation in problem definition with data type

Problem	Time series	Spatial	Sequence	Networks
Patterns	Motif mining; periodic pattern	Colocation patterns	Sequential patterns; periodic sequence	Structural patterns
	Trajectory patterns
Clustering	Shape clusters	Spatial clusters	Sequence clusters	Community detection
	Trajectory clusters
Outliers	Position outlier; shape outlier	Position outlier; shape outlier	Position outlier; combination outlier	Node outlier; linkage outlier; community outliers
	Trajectory outliers
Classification	Position classification; shape classification	Position classification; shape classification	Position classification; sequence classification	Collective classification; graph classification
	Trajectory classification

1.4.5.1 Pattern Mining with Complex Data Types

The association pattern mining problem generally determines the patterns from the underlying data in the form of sets; however, this is not the case when dependencies are present in the data. This is because the dependencies and relationships often impose ordering among data items, and the direct use of frequent pattern mining methods fails to recognize the relationships among the different data values. For example, when a large number of time series are made available, they can be used to determine different kinds of temporally frequent patterns, in which a temporal ordering is imposed on the items in the pattern. Furthermore, because of the presence of the additional contextual attribute representing time, temporal patterns may be defined in a much richer way than a set-based pattern as in association pattern mining. The patterns may be temporally contiguous, as in time-series motifs, or they may be periodic, as in periodic patterns. Some of these methods for temporal pattern mining will be discussed in Chap. 14. A similar analogy exists for the case of discrete sequence mining, except that the individual pattern constituents are categorical, as opposed to continuous. It is also possible to define 2-dimensional motifs for the spatial scenario, and such a formulation is useful for image processing. Finally, structural patterns are commonly defined in networks that correspond to frequent subgraphs in the data. Thus, the dependencies between the nodes are included within the definition of the patterns.
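The difference between a set-based pattern and an order-aware pattern can be sketched in a few lines of Python. The snippet computes the support of the same items twice: once as an itemset, where order is ignored, and once as a sequential pattern, where the items must appear in the given order with gaps allowed. The toy event sequences and the function names are hypothetical, and the subsequence definition used here is only one of several ways a sequential pattern can be defined.

```python
def itemset_support(sequences, pattern):
    """Fraction of sequences containing every pattern item, in any order."""
    needed = set(pattern)
    return sum(1 for s in sequences if needed <= set(s)) / len(sequences)

def is_subsequence(pattern, sequence):
    """True if the pattern items occur in the sequence in order (gaps allowed)."""
    it = iter(sequence)
    # Each membership test consumes the iterator up to the matched item,
    # so later pattern items can only match later sequence positions.
    return all(item in it for item in pattern)

def sequential_support(sequences, pattern):
    """Fraction of sequences containing the pattern as an ordered subsequence."""
    return sum(1 for s in sequences if is_subsequence(pattern, s)) / len(sequences)

# Hypothetical discrete event sequences.
sequences = [
    ["login", "browse", "buy"],
    ["browse", "login", "buy"],
    ["login", "buy", "browse"],
    ["buy", "login"],
]
pattern = ["login", "buy"]

print(itemset_support(sequences, pattern))             # order ignored
print(sequential_support(sequences, pattern))          # login before buy
print(sequential_support(sequences, ["buy", "login"])) # buy before login
```

The itemset view assigns the same support to both orderings, whereas the sequential view separates them, which is the information a direct use of frequent itemset mining would lose.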

1.4.5.2 Clustering with Complex Data Types

The techniques used for clustering are also affected significantly by the underlying data type. Most importantly, the similarity function depends heavily on the data type. For example, in the case of time series, sequential, or graph data, the similarity between a pair of data objects cannot be easily defined by using straightforward metrics such as the Euclidean metric. Rather, it is necessary to use other kinds of measures, such as the edit distance or structural similarity. In the context of spatial data, trajectory clustering is particularly useful in finding the relevant patterns for mobile data, or for multivariate
