Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə82/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   78   79   80   81   82   83   84   85   ...   423
1-Data Mining tarjima

particular set of k columns, and it does not discover all the k-itemsets that satisfy the minimum criterion on the Jaccard coefficient.

The min-hash trick can be used to efficiently perform the sorts in an implicit way and transform to a concise sampled representation on which traditional frequent pattern mining algorithms can be applied to discover combinations satisfying the Jaccard threshold. The basic idea is as follows. A random hash function h(·) is applied to each tid. For each column of binary values, the tid, with the smallest hash function value, is selected among all entries that have a unit value in that column. This results in a vector of d different tids. What is the probability that the tids in the first k columns are the same? It is easy to see that this is equal to the Jaccard coefficient because the hashing process simulates the sort, and reports the index of the first non-zero element in the binary matrix. Therefore, by using independent hash functions to create multiple samples, it is possible to estimate the Jaccard coefficient. It is possible to repeat this process with r different hash functions, to create r different samples. Note that the r hash-functions can be applied simultaneously in a single pass over the transaction database. This creates a r × d categorical data matrix of tids. By determining the subsets of columns where the tid value is the same with support equal to a minimum support value, it is possible to estimate all sets of k- items whose Jaccard coefficient is at least equal to the minimum support value. This is a standard frequent pattern mining problem, except that it is defined on categorical values instead of a binary data matrix.


One way of transforming this r × d categorical data matrix to a binary matrix is to pull out the column identifiers where the tids are the same from each row and create a new transaction of column-identifier “items.” Thus, a single row from the r × d matrix will map to multiple transactions. The resulting transaction data set can be represented by a new binary matrix D . Any off-the-shelf frequent pattern mining algorithm can be applied to this binary matrix to discover relevant column-identifier combinations. The advantage of an off-the-shelf approach is that many efficient algorithms for the conventional frequent pattern mining model are available. It can be shown that the accuracy of the approach increases exponentially fast with the number of data samples.


4.5.7 Collective Strength


The collective strength of an itemset is defined in terms of its violation rate. An itemset I is said to be in violation of a transaction, if some of the items are present in the transaction, and others are not. The violation rate v(I) of an itemset I is the fraction of violations of the itemset I over all transactions. The collective strength C(I) of an itemset I is defined in terms of the violation rate as follows:





C(I) =




1 − v(I)

·

E[v(I)]

.

(4.11)




1
















Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   78   79   80   81   82   83   84   85   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin