Data Mining: The Textbook

page	19/423
date	07.01.2024
size	17.13 MB
	#211690

1-Data Mining tarjima

Relationships between columns: In this case, the frequent or infrequent relationships between the values in a particular row are determined. This maps into either the positive or negative association pattern mining problem, though the former is more commonly studied. In some cases, one particular column of the matrix is considered more important than other columns because it represents a target attribute of the data mining analyst. In such cases, one tries to determine how the relationships in the other columns relate to this special column. Such relationships can be used to predict the value of this special column, when the value of that special column is unknown. This problem is referred to as data classification. A mining process is referred to as supervised when it is based on treating a particular attribute as special and predicting it.

Relationships between rows: In these cases, the goal is to determine subsets of rows, in which the values in the corresponding columns are related. In cases where these subsets are similar, the corresponding problem is referred to as clustering. On the other hand, when the entries in a row are very different from the corresponding entries in other rows, then the corresponding row becomes interesting as an unusual data point, or as an anomaly. This problem is referred to as outlier analysis. Interestingly, the clustering problem is closely related to that of classification, in that the latter can be considered a supervised version of the former. The discrete values of a special column in the data correspond to the group identifiers of different desired or supervised groups of application-specific similar records in the data. For example, when the special column corresponds to whether or not a customer is interested in a particular product, this represents the two groups in the data that one is interested in learning, with the use of supervision. The term “supervision” refers to the fact that the special column is used to direct the data mining process in an application-specific way, just as a teacher may supervise his or her student toward a specific goal.
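The idea of a row whose entries differ strongly from those of the other rows can be illustrated with a small sketch. The scoring rule used here (mean Euclidean distance from a row to all other rows) is an assumption chosen for simplicity, not a method prescribed by the text, and the toy matrix and the function name `mean_distance_to_others` are hypothetical.

```python
import math

def mean_distance_to_others(D):
    """For each row of D, return its mean Euclidean distance to every other row."""
    scores = []
    for i, row in enumerate(D):
        dists = [math.dist(row, other) for j, other in enumerate(D) if j != i]
        scores.append(sum(dists) / len(dists))
    return scores

# Hypothetical toy data matrix: three similar rows and one very different row.
D = [
    [1.0, 1.1],
    [0.9, 1.0],
    [1.1, 0.9],
    [5.0, 5.2],
]

scores = mean_distance_to_others(D)
# Index of the row with the largest score, i.e. the most unusual row under this rule.
most_unusual = max(range(len(D)), key=scores.__getitem__)
```

Under this assumed rule, the three mutually close rows would form a natural cluster, while the last row stands out as the candidate anomaly.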

Thus, these four problems are important because they seem to cover an exhaustive range of scenarios representing different kinds of positive, negative, supervised, or unsupervised relationships between the entries of the data matrix. These problems are also related to one another in a variety of ways. For example, association patterns may be considered indirect representations of (overlapping) clusters, where each pattern corresponds to a cluster of data points of which it is a subset.

It should be pointed out that the aforementioned discussion assumes the (most commonly encountered) multidimensional data type, although these problems continue to retain their relative importance for more complex data types. However, the more complex data types have a wider variety of problem formulations associated with them because of their greater complexity. This issue will be discussed in detail later in this section.

It has consistently been observed that many application scenarios determine such relationships between rows and columns of the data matrix as an intermediate step. This is the reason that a good understanding of these building-block problems is so important for the data mining process. Therefore, the first part of this book will focus on these problems in detail before generalizing to complex scenarios.

1.4.1 Association Pattern Mining

In its most primitive form, the association pattern mining problem is defined in the context of sparse binary databases, where the data matrix contains only 0/1 entries, and most entries take on the value of 0. Most customer transaction databases are of this type. For example, if each column in the data matrix corresponds to an item, and each row represents a customer transaction, then the (i, j)th entry is 1 if customer transaction i contains item j as one of the items that was bought. A particularly commonly studied version of this problem is the frequent pattern mining problem or, more generally, the association pattern mining problem. In terms of the binary data matrix, the frequent pattern mining problem may be formally defined as follows:

Definition 1.4.1 (Frequent Pattern Mining) Given a binary n × d data matrix D, determine all subsets of columns such that all the values in these columns take on the value of 1 for at least a fraction s of the rows in the matrix. The relative frequency of a pattern is referred to as its support. The fraction s is referred to as the minimum support.
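Definition 1.4.1 can be read almost directly as code. The following is a minimal brute-force sketch that enumerates every non-empty subset of columns and keeps those whose support is at least s. It is meant only to make the definition concrete: the enumeration is exponential in d and is not one of the efficient algorithms from the literature. The toy matrix, the column meanings, and the function names `support` and `frequent_patterns` are hypothetical.

```python
from itertools import combinations

def support(D, columns):
    """Fraction of rows of the binary matrix D that have a 1 in every given column."""
    hits = sum(1 for row in D if all(row[j] == 1 for j in columns))
    return hits / len(D)

def frequent_patterns(D, s):
    """Return {column subset: support} for all non-empty subsets with support >= s."""
    d = len(D[0])
    result = {}
    for k in range(1, d + 1):
        for cols in combinations(range(d), k):
            sup = support(D, cols)
            if sup >= s:
                result[cols] = sup
    return result

# Hypothetical toy data: columns 0, 1, 2 stand for Bread, Butter, Milk;
# each row is one customer transaction.
D = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
]

patterns = frequent_patterns(D, s=0.5)
```

With a minimum support of 0.5 on this toy matrix, the pair (Bread, Butter) qualifies because it appears in three of the five rows, whereas (Butter, Milk) appears in only two rows and is excluded.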

Patterns that satisfy the minimum support requirement are often referred to as frequent patterns, or frequent itemsets. Frequent patterns represent an important class of association patterns. Many other definitions of relevant association patterns are possible that do not use absolute frequencies but use other statistical quantifications such as the χ² measure. These measures often lead to generation of more interesting rules from a statistical perspective. Nevertheless, this particular definition of association pattern mining has become the most popular one in the literature because of the ease in developing algorithms for it. This book therefore refers to this problem as association pattern mining as opposed to frequent pattern mining.

For example, if the columns of the data matrix D corresponding to Bread, Butter, and Milk take on the value of 1 together frequently in a customer transaction database, this implies that these items are often bought together. This is very useful information for the merchant from the perspective of physical placement of the items in the store, or from the perspective of product promotions. Association pattern mining is not restricted to the case of binary data and can be easily generalized to quantitative and numeric attributes by using appropriate data transformations, which will be discussed in Chap. 4.

Association pattern mining was originally proposed in the context of association rule mining, where an additional step was included based on a measure known as the confidence of the rule. For example, consider two sets of items A and B. The confidence of the rule A ⇒ B is defined as the fraction of transactions containing A that also contain B. In other words, the confidence is obtained by dividing the support of the pattern A ∪ B by the support of the pattern A. A combination of support and confidence is used to define association rules.
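The confidence computation described above can be sketched in a few lines. Transactions are represented here as Python sets of item names, which is an assumed representation equivalent to rows of the binary matrix. The toy transactions and the function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, A, B):
    """Confidence of the rule A => B: support(A ∪ B) divided by support(A)."""
    sup_A = support(transactions, A)
    if sup_A == 0:
        raise ValueError("antecedent A never occurs, so the confidence is undefined")
    return support(transactions, set(A) | set(B)) / sup_A

# Hypothetical toy transactions.
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Milk"},
    {"Bread", "Butter", "Milk"},
]

conf = confidence(transactions, {"Bread"}, {"Butter"})
```

In this toy data, Bread occurs in four transactions and Bread together with Butter in three, so the confidence of Bread ⇒ Butter is 3/4 (up to floating-point rounding). Note that confidence is not symmetric: Butter ⇒ Bread has confidence 1 here, because every transaction containing Butter also contains Bread.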

Definition 1.4.2 (Association Rules) Let A and B be two sets of items. The rule A ⇒ B is said to be valid at support level s and confidence level c, if the following two conditions are satisfied:
