Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə246/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   242   243   244   245   246   247   248   249   ...   423
1-Data Mining tarjima

h1(.)

h1(x)




ELEMENT x HASHES



















h2(.)







h2(x)




INTO THESE CELLS




w













(Counts incremented)





















h3(x)



hw(.)







h4(x)
















ESTIMATED COUNT OF x = min { h1(x), h2(x), …, hw(x) }




Figure 12.4: The count-min sketch


12.2.2.2 Count-Min Sketch

While the bloom filter is effective for set-membership queries, it is not designed for methods that require count-based summaries. This is because the bloom filter tracks only binary values. The count-min sketch is designed for resolving such queries and is intuitively related to the bloom filter. A count-min sketch consists of a set of w different numeric arrays, each of which has a length m. Thus, the space requirement of the count-min sketch is equal to m · w cells containing numeric values. The elements of each of the w numeric arrays are indexed starting with 0, corresponding to an index range of {0 . . . m − 1}. The count-min sketch can also be viewed as a w × m 2-dimensional array of cells.


Each of the w numeric arrays corresponds to a hash function. The ith numeric array corresponds to the ith hash function hi(·). The hash functions have the following properties:





  1. The ith hash function hi(·) maps a stream element to an integer in the range [0 . . . m− 1]. This mapping can also be viewed as one of the index values in the ith numeric array.




  1. The w hash functions h1(·) . . . hw(·) are fully independent of one another, but pairwise independent over different arguments. In other words, for any two values x1 and x2, hi(x1) and hi(x2) are independent.

The pairwise independence requirement is a weaker one than the full independence require-ment. This is a convenient property of the count-min sketch because it is usually easier to construct pairwise independent hash functions rather than fully independent ones.


The procedure for updating the sketch is as follows. All m · w entries in the count-min sketch are initialized to 0. For each incoming stream element x, the functions h1 (x ) . . . hw(x) are applied to it. For the ith array, the element hi(x) is incremented by 1. Thus, if the count-min sketch CM is visualized as a 2-dimensional w × m numeric array, then the element (i, hi(x)) is incremented2 by 1. Note that the value of hi(x) maps to an integer in the range [0, m − 1]. This is also the range of the indices of each numeric array. A pictorial illustration of the count-min sketch and the corresponding update process is provided in Fig. 12.4. The pseudocode for the overall update process is illustrated in Fig. 12.5. In the pseudocode, the stream is denoted by S, and the count-min sketch data structure is denoted by CM. The inputs to the algorithm are the stream S and two parameters (w, m) specifying the size of the 2-dimensional array for the count-min sketch. A 2-dimensional w × m array CM is initialized with all values set to 0. For each incoming stream element, the counts of all the








  • In the event that each distinct element is associated with a nonnegative frequency, the count-min sketch can be updated with the frequency value. Only the simple case of unit updates is discussed here.

404 CHAPTER 12. MINING DATA STREAMS

Algorithm CountMinConstruct(Stream: S, Width: w, Height: m)


begin
Initialize all entries of w × m array CM to 0;


repeat
Receive next stream element x ∈ S;


for i = 1 to w do

Increment (i, hi(x))th element in CM by 1; until end of stream S;


return CM;


end

Figure 12.5: Update of count-min sketch

Algorithm CountMinQuery(Element: y, Count-min Sketch: CM)


begin
Initialize Estimate = ;


for i = 1 to w do


Estimate = min{Estimate, Vi(y)};


{ Vi(y) is the count of the (i, hi(y))th element in CM } return Estimate;

end

Figure 12.6: Frequency queries for count-min sketch

cells (i, hi(x)) are updated for i ∈ {1 . . . w}. In the pseudocode description, the resulting sketch CM is returned after processing all the stream elements. In practice, the count-min sketch can be used at any time during the progression of the stream S. As in the case of the bloom filter, it is possible for multiple stream elements to map to the same cell in the count-min sketch. Therefore, different stream elements will increment the same cell, and the resulting cell counts are always overestimates of one or more stream element counts.


The count-min sketch can be used for many different queries. The simplest query is to determine the frequency of an element y. The first step is to compute the hash functions h1(y) . . . h w(y). For the ith numeric array in CM, the value Vi (y) of the (i, h i (y))th array element is retrieved. Each value V i(y) is an overestimate of the true frequency of y because of potential collisions. Therefore, the tightest possible estimate may be obtained by using the minimum value min i{Vi(y)} over the different hash functions. The overall procedure for frequency estimation is illustrated in Fig. 12.6.


The count-min sketch causes an overestimation of frequency values because of collisions of nonnegative frequency counts of distinct stream items. It is therefore helpful to determine an upper bound on the estimation quality.


Lemma 12.2.3 Let E(y) be the estimate of the frequency of the item y, using a count-min sketch of size w × m. Let nf be the total frequencies of all items received so far, and G(y) be true frequency of item y. Then, with probability at least 1 − e−w, the upper bound on the estimate E(y) is as follows:




Yüklə 17,13 Mb.

Dostları ilə paylaş:
1   ...   242   243   244   245   246   247   248   249   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin