Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	82/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 78 79 80 81 82 83 84 85 ... 423

1-Data Mining tarjima

particular set of k columns, and it does not discover all the k-itemsets that satisfy the minimum criterion on the Jaccard coeﬃcient.

The min-hash trick can be used to eﬃciently perform the sorts in an implicit way and transform to a concise sampled representation on which traditional frequent pattern mining algorithms can be applied to discover combinations satisfying the Jaccard threshold. The basic idea is as follows. A random hash function h(·) is applied to each tid. For each column of binary values, the tid, with the smallest hash function value, is selected among all entries that have a unit value in that column. This results in a vector of d diﬀerent tids. What is the probability that the tids in the first k columns are the same? It is easy to see that this is equal to the Jaccard coeﬃcient because the hashing process simulates the sort, and reports the index of the first non-zero element in the binary matrix. Therefore, by using independent hash functions to create multiple samples, it is possible to estimate the Jaccard coeﬃcient. It is possible to repeat this process with r diﬀerent hash functions, to create r diﬀerent samples. Note that the r hash-functions can be applied simultaneously in a single pass over the transaction database. This creates a r × d categorical data matrix of tids. By determining the subsets of columns where the tid value is the same with support equal to a minimum support value, it is possible to estimate all sets of k- items whose Jaccard coeﬃcient is at least equal to the minimum support value. This is a standard frequent pattern mining problem, except that it is defined on categorical values instead of a binary data matrix.

One way of transforming this r × d categorical data matrix to a binary matrix is to pull out the column identifiers where the tids are the same from each row and create a new transaction of column-identifier “items.” Thus, a single row from the r × d matrix will map to multiple transactions. The resulting transaction data set can be represented by a new binary matrix D . Any oﬀ-the-shelf frequent pattern mining algorithm can be applied to this binary matrix to discover relevant column-identifier combinations. The advantage of an oﬀ-the-shelf approach is that many eﬃcient algorithms for the conventional frequent pattern mining model are available. It can be shown that the accuracy of the approach increases exponentially fast with the number of data samples.

4.5.7 Collective Strength

The collective strength of an itemset is defined in terms of its violation rate. An itemset I is said to be in violation of a transaction, if some of the items are present in the transaction, and others are not. The violation rate v(I) of an itemset I is the fraction of violations of the itemset I over all transactions. The collective strength C(I) of an itemset I is defined in terms of the violation rate as follows:

C(I) =		1 − v(I)	·	E[v(I)]	.	(4.11)
	1
		−

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 78 79 80 81 82 83 84 85 ... 423