Data Mining: The Textbook



Date: 07.01.2024

1.4. THE MAJOR BUILDING BLOCKS: A BIRD’S EYE VIEW


notion of a group of “similar” records. Therefore, from a learning perspective, clustering is often referred to as unsupervised learning (because of the lack of a special training database to “teach” the model about the notion of an appropriate grouping), whereas the classification problem is referred to as supervised learning.


The classification problem is related to association pattern mining, in the sense that the latter problem is often used to solve the former. This is because if the entire training database (including the class label) is treated as an n × (d + 1) matrix, then frequent patterns containing the class label in this matrix provide useful hints about the correlations of other features to the class label. In fact, many forms of classifiers, known as rule-based classifiers, are based on this broader principle.
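The idea of reading classification rules off frequent patterns that contain the class label can be sketched as follows. This is only a minimal brute-force illustration, not the method of any particular rule-based classifier: the function name `class_rules`, the thresholds, and the toy records are all assumptions made for the example.

```python
from itertools import combinations

def class_rules(records, labels, min_support=0.4, min_confidence=0.8, max_len=2):
    """Enumerate rules `feature set -> label` whose combined pattern is frequent.

    Hypothetical helper: a brute-force sketch, not an efficient miner.
    """
    n = len(records)
    items = sorted(set().union(*records))
    rules = []
    for k in range(1, max_len + 1):
        for antecedent in combinations(items, k):
            wanted = set(antecedent)
            # Labels of all records that contain every item of the antecedent.
            covered = [lab for rec, lab in zip(records, labels) if wanted <= rec]
            if not covered:
                continue
            for label in sorted(set(covered)):
                joint = covered.count(label)
                support = joint / n                # frequency of (antecedent + label)
                confidence = joint / len(covered)  # how often the antecedent implies the label
                if support >= min_support and confidence >= min_confidence:
                    rules.append((antecedent, label, support, confidence))
    return rules

# Toy training data (invented): each record is a set of items plus a class label.
records = [{"milk", "bread"}, {"milk", "bread"}, {"milk"}, {"beer"}, {"beer", "chips"}]
labels = ["family", "family", "family", "single", "single"]

rules = class_rules(records, labels)
for antecedent, label, support, confidence in rules:
    print(antecedent, "->", label, f"support={support:.2f}", f"confidence={confidence:.2f}")
```

Each emitted rule corresponds to a frequent pattern in the n × (d + 1) view of the data that includes the class label; the confidence measures how strongly the remaining items in the pattern point to that label.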


The classification problem can be mapped to a specific version of the outlier detection problem, by incorporating supervision in the latter. While the outlier detection problem is assumed to be unsupervised by default, many variations of the problem are either partially or fully supervised. In supervised outlier detection, some examples of outliers are available. Such data records are tagged as belonging to a rare class, whereas the remaining data records belong to the normal class. Thus, the supervised outlier detection problem maps to a binary classification problem, with the caveat that the class labels are highly imbalanced.
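The effect of class imbalance on this binary formulation can be illustrated with a deliberately small sketch. It assumes a one-dimensional score, a threshold rule, and a cost weight on the rare class; the function names and the data are invented for illustration and are not taken from the text.

```python
def weighted_error(threshold, values, labels, rare_weight):
    """Cost of the rule 'predict rare (1) if value > threshold'."""
    cost = 0.0
    for v, y in zip(values, labels):
        pred = 1 if v > threshold else 0
        if pred != y:
            # Missing a rare example costs rare_weight; a false alarm costs 1.
            cost += rare_weight if y == 1 else 1.0
    return cost

def fit_threshold(values, labels, rare_weight=1.0):
    """Pick the observed value that minimizes the weighted error as threshold."""
    candidates = sorted(set(values))
    return min(candidates, key=lambda t: weighted_error(t, values, labels, rare_weight))

def rare_recall(threshold, values, labels):
    """Fraction of rare-class records that the threshold rule flags."""
    rare = [v for v, y in zip(values, labels) if y == 1]
    return sum(v > threshold for v in rare) / len(rare)

# Ten normal records (label 0) and two known outliers (label 1).
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 7.5, 12]
labels = [0] * 10 + [1, 1]

plain = fit_threshold(values, labels)                      # every error costs the same
weighted = fit_threshold(values, labels, rare_weight=5.0)  # rare-class misses cost 5x

print(plain, rare_recall(plain, values, labels))
print(weighted, rare_recall(weighted, values, labels))
```

With equal costs, the fitted rule sacrifices one of the two rare examples to avoid false alarms on the normal class; weighting the rare class more heavily moves the threshold so that both are caught. Cost weighting is only one of several ways to handle imbalance and is used here purely as an example.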


The incorporation of supervision makes the classification problem uniquely application-specific, because the class labels themselves are defined by the application. Compared to the other major data mining problems, the classification problem is relatively self-contained. For example, the clustering and frequent pattern mining problems are more often used as intermediate steps in larger application frameworks. Even the outlier analysis problem is sometimes used in an exploratory way. On the other hand, the classification problem is often used directly as a stand-alone tool in many applications. Some examples of applications where the classification problem is used are as follows:





  • Target marketing: Features about customers are related to their buying behavior with the use of a training model.

  • Intrusion detection: The sequences of customer activity in a computer system may be used to predict the possibility of intrusions.

  • Supervised anomaly detection: The rare class may be differentiated from the normal class when previous examples of outliers are available.

The data classification problem is discussed in detail in Chaps. 10 and 11.


1.4.5 Impact of Complex Data Types on Problem Definitions


The specific data type has a profound impact on the kinds of problems that may be defined. In particular, in dependency-oriented data types, the dependencies often play a critical role in the problem definition, the solution, or both. This is because the contextual attributes and dependencies are often fundamental to how the data may be evaluated. Furthermore, because complex data types are much richer, they allow the formulation of novel problem definitions that may not even exist in the context of multidimensional data. A tabular summary of the different variations of data mining problems for dependency-oriented data types is provided in Table 1.2. In the following, a brief review will be provided as to how the different problem definitions are affected by data type.


CHAPTER 1. AN INTRODUCTION TO DATA MINING


Table 1.2: Some examples of variation in problem definition with data type

Problem         Time series              Spatial                  Sequence                 Networks
--------------  -----------------------  -----------------------  -----------------------  -------------------------
Patterns        Motif-mining             Colocation patterns      Sequential patterns      Structural patterns
                Periodic pattern                                  Periodic sequence
                               Trajectory patterns

Clustering      Shape clusters           Spatial clusters         Sequence clusters        Community detection
                               Trajectory clusters

Outliers        Position outlier         Position outlier         Position outlier         Node outlier
                Shape outlier            Shape outlier            Combination outlier      Linkage outlier
                               Trajectory outliers                                         Community outliers

Classification  Position classification  Position classification  Position classification  Collective classification
                Shape classification     Shape classification     Sequence classification  Graph classification
                               Trajectory classification

1.4.5.1 Pattern Mining with Complex Data Types


The association pattern mining problem generally determines the patterns from the underlying data in the form of sets; however, this is not the case when dependencies are present in the data. This is because the dependencies and relationships often impose ordering among data items, and the direct use of frequent pattern mining methods fails to recognize the relationships among the different data values. For example, when a large number of time series are available, they can be used to determine different kinds of temporally frequent patterns, in which a temporal ordering is imposed on the items in the pattern. Furthermore, because of the presence of the additional contextual attribute representing time, temporal patterns may be defined in a much richer way than a set-based pattern as in association pattern mining. The patterns may be temporally contiguous, as in time-series motifs, or they may be periodic, as in periodic patterns. Some of these methods for temporal pattern mining will be discussed in Chap. 14. An analogous situation exists for the case of discrete sequence mining, except that the individual pattern constituents are categorical, as opposed to continuous. It is also possible to define 2-dimensional motifs for the spatial scenario, and such a formulation is useful for image processing. Finally, structural patterns are commonly defined in networks that correspond to frequent subgraphs in the data. Thus, the dependencies between the nodes are included within the definition of the patterns.
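The difference between a set-based pattern and a pattern with an imposed ordering can be made concrete with a short sketch. The helper names and the toy event sequences below are assumptions chosen for illustration; the sketch only counts support and does not implement any specific mining algorithm.

```python
def contains_as_set(sequence, pattern):
    """Set-based containment: the order of items is ignored."""
    return set(pattern) <= set(sequence)

def contains_as_subsequence(sequence, pattern):
    """Ordered containment: pattern items must appear in the same order."""
    it = iter(sequence)
    # `item in it` advances the iterator, so each match must occur
    # after the previous one.
    return all(item in it for item in pattern)

def support(database, pattern, matcher):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(matcher(seq, pattern) for seq in database) / len(database)

# Toy database of discrete event sequences (invented for illustration).
database = [
    ["login", "browse", "buy"],
    ["browse", "login", "buy"],
    ["buy", "browse", "login"],
    ["login", "buy"],
]
pattern = ["login", "buy"]

set_support = support(database, pattern, contains_as_set)
seq_support = support(database, pattern, contains_as_subsequence)
print(set_support, seq_support)
```

Every sequence contains both items, so the set-based support is 1.0, but only three of the four sequences contain "login" followed by "buy", so the sequential support is 0.75. The ordering constraint is exactly the information that a direct application of set-based frequent pattern mining would discard.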


1.4.5.2 Clustering with Complex Data Types

The techniques used for clustering are also affected significantly by the underlying data type. Most importantly, the similarity function is significantly affected by the data type. For example, in the case of time series, sequential, or graph data, the similarity between a pair of data objects cannot be easily defined by using straightforward metrics such as the Euclidean metric. Rather, it is necessary to use other kinds of metrics, such as the edit distance or structural similarity. In the context of spatial data, trajectory clustering is particularly useful in finding the relevant patterns for mobile data, or for multivariate





