Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə263/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   259   260   261   262   263   264   265   266   ...   423
1-Data Mining tarjima

P (|Z − E[Z]) > α · E[Z]) ≤ δ

(Hint: This is the “mean–median trick” discussed in the chapter.)





  1. Discuss scenarios in which both the Hoeffding inequality and the Chernoff bound can be used. Which one applies more generally?




  1. Suppose that you have a reservoir of size k = 1000, and you have a sample of a stream containing an exactly equal distribution of two classes. Use the upper-tail Chernoff bound to determine the probability that the reservoir contains more than 600 samples of one of the two classes. Can the lower tail be used?




  1. (Difficult) Work out the full proof of the biased reservoir sampling algorithm.




  1. (Difficult) Work out the proof of correctness of the dot-product estimate obtained with the use of the count-min sketch.




  1. Discuss the generality of different synopsis construction methods to various stream mining problems. Why is it difficult to apply these methods to outlier analysis?




  1. Implement the CluStream algorithm.




  1. Extend the implementation of the previous exercise to the problem of classification with the microclustering method.




  1. Implement the Flajolet–Martin algorithm for distinct element counting.

12.9. EXERCISES

427




  1. Suppose that X is a random variable, which always lies in the range [1, 64]. Suppose that Y is the geometric mean of a large number n of independent and identical real-izations of X. Establish a bound on log2(Y ). Assume that you know the expected value of log2(X).




  1. Let Z be a random variable satisfying E[Z] = 0, and Z ∈ [a, b].




    1. Show that E[et·Z ] ≤ et2 ·(b−a)2 /8.




    1. Use the aforementioned result to complete the proof of the Hoeffding inequality.




  1. Suppose that n distinct items are loaded into a bloom filter of length m with w hash functions.




    1. Show that the probability of a bit taking on the value of 0 is equal to (11/m)nw.




    1. Show that the probability in (a) is approximately equal to e−nw/m.




    1. Show that the expected number of 0-bits m0 in the bloom filter is related to n,

m, and w as follows:

    • m · ln(m/m0) w




  1. Show the proof of the bound discussed in the chapter for the count-min sketch when items with negative counts are included in the sketch.




  1. Let a single component of an AMS sketch be constructed for each of two streams with the same hash-function. Show that the expected value of the product of these components is equal to the dot product of the frequency vector of distinct items in the two streams.




  1. Show that the variance of the square of an AMS sketch component is bounded above by twice the square of the second-order moment of the items in the data stream.




  1. Show the correctness of AMS point query frequency estimation methodology discussed in the chapter. In other words, the expected value of the ri · Q should be equal to the point query result.



Chapter 13


Mining Text Data

The first forty years of life give us the text; the next thirty supply the commentary on it.”—Arthur Schopenhauer


13.1 Introduction


Text data are copiously found in many domains, such as the Web, social networks, newswire services, and libraries. With the increasing ease in archival of human speech and expression, the volume of text data will only increase over time. This trend is reinforced by the increasing digitization of libraries and the ubiquity of the Web and social networks. Some examples of relevant domains are as follows:






  1. Yüklə 17,13 Mb.

    Dostları ilə paylaş:
1   ...   259   260   261   262   263   264   265   266   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin