Data Mining: The Textbook




cosine(X, Y) = ( ||X||^2 + ||Y||^2 − Dist(X, Y)^2 ) / ( 2 ||X|| ||Y|| )        (3.25)

Here, Dist(X, Y) denotes the Euclidean distance between X and Y.
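Equation 3.25 follows from the law of cosines and can be spot-checked numerically. In the sketch below (function names are my own), the cosine is computed both from the dot product and from the two norms together with the Euclidean distance:

```python
import math

def cosine(x, y):
    # Direct definition: (x . y) / (||x|| ||y||).
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def cosine_via_distance(x, y):
    # Equation 3.25: uses only the two norms and the Euclidean distance.
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    dist2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return (nx * nx + ny * ny - dist2) / (2 * nx * ny)

x, y = [3.0, 1.0, 2.0], [1.0, 4.0, 0.0]
assert abs(cosine(x, y) - cosine_via_distance(x, y)) < 1e-12
```

Both routes give the same value because expanding ||X − Y||^2 = ||X||^2 + ||Y||^2 − 2 X·Y cancels the squared norms in the numerator of Eq. 3.25.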
  6. Download the KDD Cup Network Intrusion Data Set from the UCI Machine Learning Repository [213]. Create a data set containing only the categorical attributes. Compute the nearest neighbor for each data point using (a) the match measure, and (b) the inverse occurrence frequency measure. Compute the number of cases in which there is a match on the class label.




  7. Repeat Exercise 6 using only the quantitative attributes of the data set, and using the Lp-norm for values of p = 1, 2, ∞.





  8. Repeat Exercise 6 using all attributes in the data set. Use the mixed-attribute function, with different combinations of the categorical and quantitative distance functions of Exercises 6 and 7.




  9. Write a computer program to compute the edit distance.
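A standard dynamic-programming sketch for such a program (parameter names and the default unit costs are my own choices):

```python
def edit_distance(x, y, ins=1, dele=1, repl=1):
    # dp[i][j] is the minimum cost of transforming the first i symbols
    # of x into the first j symbols of y.
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * dele          # delete all of x's prefix
    for j in range(1, n + 1):
        dp[0][j] = j * ins           # insert all of y's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + dele,                                       # delete x[i-1]
                dp[i][j - 1] + ins,                                        # insert y[j-1]
                dp[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else repl),  # match / replace
            )
    return dp[m][n]
```

For example, `edit_distance("kitten", "sitting")` is 3 under unit costs (two replacements and one insertion).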




  10. Write a computer program to compute the LCSS distance.
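A possible bottom-up sketch for the LCSS computation (the function returns the length of the longest common subsequence):

```python
def lcss(x, y):
    # dp[i][j] is the LCSS length of the prefixes x[:i] and y[:j].
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                # Matching symbols extend the best common subsequence.
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                # Otherwise skip a symbol of one string or the other.
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```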




  11. Write a computer program to compute the DTW distance.
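One possible sketch, assuming numeric time series and |a − b| as the local alignment cost (other local cost functions are equally valid):

```python
import math

def dtw(x, y):
    # Dynamic time warping with |a - b| as the local alignment cost.
    # dp[i][j] is the optimal warping cost between x[:i] and y[:j].
    m, n = len(x), len(y)
    dp = [[math.inf] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # A step may repeat an element of x, repeat an element of y,
            # or advance in both series simultaneously.
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[m][n]
```

Because the warping may stretch either series, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0: the repeated 2 is absorbed at no cost.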




  12. Assume that Edit(X, Y) represents the cost of transforming the string X to Y. Show that Edit(X, Y) and Edit(Y, X) are the same, as long as the insertion and deletion costs are the same.




  13. Compute the edit distance and LCSS similarity between (a) ababcabc and babcbc, and (b) cbacbacba and acbacbacb. For the edit distance, assume equal costs of insertion, deletion, and replacement.




  14. Show that Edit(i, j), LCSS(i, j), and DTW(i, j) are all monotonic functions in i and j.




  15. Compute the cosine measure using the raw frequencies between the following two sentences:

      (a) “The sly fox jumped over the lazy dog.”

      (b) “The dog jumped at the intruder.”
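One way to carry out this kind of computation (lowercased alphabetic tokenization is my own choice); the raw frequency vectors of the two sentences share the terms "the", "dog", and "jumped":

```python
import math
import re
from collections import Counter

def raw_freq_cosine(s1, s2):
    # Build raw term-frequency vectors and take their cosine.
    f1 = Counter(re.findall(r"[a-z]+", s1.lower()))
    f2 = Counter(re.findall(r"[a-z]+", s2.lower()))
    dot = sum(f1[w] * f2[w] for w in f1)   # missing keys count as 0
    n1 = math.sqrt(sum(c * c for c in f1.values()))
    n2 = math.sqrt(sum(c * c for c in f2.values()))
    return dot / (n1 * n2)

c = raw_freq_cosine("The sly fox jumped over the lazy dog.",
                    "The dog jumped at the intruder.")
```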




  16. Suppose that insertion and deletion costs are 1, and replacement costs are 2 units for the edit distance. Show that the optimal edit distance between two strings can be computed only with insertion and deletion operations. Under the aforementioned cost assumptions, show that the optimal edit distance can be expressed as a function of the optimal LCSS distance and the lengths of the two strings.
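The key observation is that at these costs a replacement (cost 2) can always be simulated by a deletion plus an insertion (also cost 2), so an optimal edit script keeps a longest common subsequence fixed and deletes/inserts everything else, suggesting Edit(X, Y) = |X| + |Y| − 2 · LCSS(X, Y). This can be spot-checked with a sketch (helper functions are my own):

```python
def edit_distance(x, y):
    # Insertion/deletion cost 1, replacement cost 2.
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 2
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + sub)
    return dp[m][n]

def lcss(x, y):
    # Longest common subsequence length.
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

for x, y in [("ababcabc", "babcbc"), ("cbacbacba", "acbacbacb"), ("abc", "xyz")]:
    assert edit_distance(x, y) == len(x) + len(y) - 2 * lcss(x, y)
```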



Chapter 4


Association Pattern Mining

“The pattern of the prodigal is: rebellion, ruin, repentance, reconciliation, restoration.”—Edwin Louis Cole


4.1 Introduction


The classical problem of association pattern mining is defined in the context of supermarket data containing sets of items bought by customers, which are referred to as transactions. The goal is to determine associations between groups of items bought by customers, which can intuitively be viewed as k-way correlations between items. The most popular model for association pattern mining uses the frequencies of sets of items as the quantification of the level of association. The discovered sets of items are referred to as large itemsets, frequent itemsets, or frequent patterns. The association pattern mining problem has a wide variety of applications:





  1. Supermarket data: The supermarket application was the original motivating scenario in which the association pattern mining problem was proposed. This is also the reason that the term itemset is used to refer to a frequent pattern in the context of supermarket items bought by a customer. The determination of frequent itemsets provides useful insights about target marketing and shelf placement of the items.




  2. Text mining: Because text data is often represented in the bag-of-words model, frequent pattern mining can help in identifying co-occurring terms and keywords. Such co-occurring terms have numerous text-mining applications.




  3. Generalization to dependency-oriented data types: The original frequent pattern mining model has been generalized to many dependency-oriented data types, such as time-series data, sequential data, spatial data, and graph data, with a few modifications. Such models are useful in applications such as Web log analysis, software bug detection, and spatiotemporal event detection.


C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8, © Springer International Publishing Switzerland 2015



  4. Other major data mining problems: Frequent pattern mining can be used as a subroutine to provide effective solutions to many data mining problems, such as clustering, classification, and outlier analysis.

Because the frequent pattern mining problem was originally proposed in the context of market basket data, a significant amount of the terminology used to describe both the data (e.g., transactions) and the output (e.g., itemsets) is borrowed from the supermarket analogy. From an application-neutral perspective, a frequent pattern may be defined as a frequent subset, defined on the universe of all possible sets. Nevertheless, because the market basket terminology is so widely used, this chapter will be consistent with it.


Frequent itemsets can be used to generate association rules of the form X ⇒ Y, where X and Y are sets of items. A famous example of an association rule, which has now become part¹ of the data mining folklore, is {Beer} ⇒ {Diapers}. This rule suggests that buying beer makes it more likely that diapers will also be bought. Thus, there is a certain directionality to the implication that is quantified as a conditional probability. Association rules are particularly useful for a variety of target marketing applications. For example, if a supermarket owner discovers that {Eggs, Milk} ⇒ {Yogurt} is an association rule, he or she can promote yogurt to customers who often buy eggs and milk. Alternatively, the supermarket owner may place yogurt on shelves that are located in proximity to eggs and milk.
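The conditional-probability view of a rule can be illustrated on toy transactions (the data below is invented for illustration; support and confidence are the standard quantifications):

```python
def support(transactions, itemset):
    # Fraction of transactions that contain the itemset as a subset.
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    # conf(X => Y) = sup(X ∪ Y) / sup(X): the empirical conditional
    # probability of observing Y in a transaction that contains X.
    return support(transactions, set(x) | set(y)) / support(transactions, x)

T = [
    {"Eggs", "Milk", "Yogurt"},
    {"Eggs", "Milk"},
    {"Eggs", "Milk", "Yogurt", "Bread"},
    {"Milk", "Bread"},
]
# 2 of the 3 transactions containing {Eggs, Milk} also contain Yogurt.
c = confidence(T, {"Eggs", "Milk"}, {"Yogurt"})
```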


The frequency-based model for association pattern mining is very popular because of its simplicity. However, the raw frequency of a pattern is not quite the same as the statistical significance of the underlying correlations. Therefore, numerous models for frequent pattern mining have been proposed that are based on statistical significance. This chapter will also explore some of these alternative models, which are also referred to as interesting patterns.


This chapter is organized as follows. Section 4.2 introduces the basic model for association pattern mining. The generation of association rules from frequent itemsets is discussed in Sect. 4.3. A variety of algorithms for frequent pattern mining are discussed in Sect. 4.4, including the Apriori algorithm, a number of enumeration tree algorithms, and a suffix-based recursive approach. Methods for finding interesting frequent patterns are discussed in Sect. 4.5. Meta-algorithms for frequent pattern mining are discussed in Sect. 4.6. Section 4.7 presents the summary and conclusions.


4.2 The Frequent Pattern Mining Model


The problem of association pattern mining is naturally defined on unordered set-wise data. It is assumed that the database T contains a set of n transactions, denoted by T1 . . . Tn. Each transaction Ti is drawn on the universe of items U and can also be represented as a multidimensional record of dimensionality, d = |U |, containing only binary attributes. Each binary attribute in this record represents a particular item. The value of an attribute in this record is 1 if that item is present in the transaction, and 0 otherwise. In practical settings, the universe of items U is very large compared to the typical number of items in each transaction Ti . For example, a supermarket database may have tens of thousands of items, and a single transaction will typically contain less than 50 items. This property is often leveraged in the design of frequent pattern mining algorithms.
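The set-to-binary-record mapping described above can be sketched as follows (the item universe and transactions are invented):

```python
# A small item universe U; here d = |U| = 5.
U = ["Bread", "Butter", "Eggs", "Milk", "Yogurt"]

# Transactions as sets of items drawn from U.
transactions = [{"Eggs", "Milk"}, {"Bread", "Butter", "Milk"}]

# Each transaction becomes a d-dimensional binary record: the attribute
# is 1 if the corresponding item is present, and 0 otherwise.
records = [[1 if item in t else 0 for item in U] for t in transactions]
```

In realistic data these records are extremely sparse, which is exactly the property that frequent pattern mining algorithms exploit.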


An itemset is a set of items. A k-itemset is an itemset that contains exactly k items. In other words, a k-itemset is a set of items of cardinality k. The fraction of transactions








¹This rule was derived in some early publications on supermarket data. No assertion is made here about the likelihood of such a rule appearing in an arbitrary supermarket data set.

