Data Mining: The Textbook
12.6.4 Massive-Domain Streaming Classification Many streaming applications contain multidimensional discrete attributes with very high cardinality. In such cases, it becomes difficult to use conventional classifiers because of memory limitations. The count-min sketch can be used to address these challenges. Each class is associated with a sketch that is used to track frequent r-combinations of items in the training data, where r is bounded above by a small number k. For each incoming training data point, all possible r-combinations (for r ≤ k) are treated as pseudo-items that are added to the sketch of the relevant class. Different classes will have different relevant pseudo-items that will show up in the varying frequencies of the cells belonging to sketches of different classes. These differences can be used to determine the most discriminative cells in the different sketches. The frequent discriminative pseudo-items are determined to create implicit rules relating the pseudo-items to the different classes. These rules are implicit because they are not actually materialized, but implicitly stored in the sketches. They are retrieved only at the time of the classification of a test instance. For a given test instance, it is determined, which pseudo- items correspond to the combination of items inside them. The discriminative ones among them are determined by retrieving their statistics from the class- specific sketches. These are then used to perform the classification of the test instance, using the same general approach as a rule-based classifier. The bibliographic notes contain pointers to details of the massive-domain classification work. 12.7 Summary In this chapter, algorithms for stream mining were presented. Streams present several chal-lenges related to high volume, concept drift, the massive-domain nature of data items, and resource constraints. In this context, synopsis construction is one of the most fundamen-tal problems in the streaming scenario. As long as a high-quality stream synopsis can be constructed, it can be leveraged for stream mining algorithms. The major issue with the use of synopsis methods is that different synopsis structures are suited to different applica-tions. The most common synopsis structures used with data streams are reservoir samples and sketches. Reservoir samples provide the greatest flexibility and should be used where possible. The core problems of frequent pattern mining, clustering, outlier detection, and classi-fication have also been addressed in the streaming scenario. Most of these problems can be addressed with reservoir sampling effectively, where approximate solutions are desired. In the particular case of outlier detection, numerous variations of the problem definition are possible in the streaming scenario. 12.8 Bibliographic Notes A detailed discussion of streaming algorithms may be found in [40]. The reservoir-sampling method was originally proposed in [498]. The biased reservoir sampling approach with decay was proposed in [35]. The count-min sketch was described in [165]. Numerous other appli-cations of the count-min sketch are discussed in the same work. The AMS sketch was proposed in [72 ]. The Flajolet–Martin data structure for distinct element counting was pro-posed in [208]. A survey of synopsis construction algorithms in data streams is provided in [40]. A detailed discussion of the capabilities of some of these data structures may also be found in the same work. 426 CHAPTER 12. MINING DATA STREAMS The lossy frequent itemset counting algorithm was proposed in [376]. Surveys on stream-ing frequent pattern mining may be found in [34, 40]. The STREAM algorithm was proposed in [240]. The massive-domain scenario for stream clustering was addressed in [36]. A survey on stream clustering algorithms may be found in [32]. The STORM algorithm for point out-lier detection was discussed in [67], and the extension of the LOF algorithm to data streams was proposed in [426]. The aggregate change detection algorithm was proposed in [21]. Methods for outlier detection in data streams are discussed in [5]. The VFDT and CVFDT algorithms were proposed in [176, 279]. The microcluster-based classification method was discussed in [20], and the ensemble method was discussed in [503]. The massive-domain scenario for streaming classification was discussed in [47]. A survey on stream classification methods may be found in [33]. 12.9 Exercises Let X be a random variable in [0, 1] with mean of 0.5. Show that P (X > 0.9) ≤ 5/9. Suppose the standard deviation of a random variable X is r times its mean. Here, r can be any constant. Show how to combine the Chebychev inequality and Chernoff bound to show that repeated i.i.d. samples can be used to create a well-bounded estimate of X. In other words, we would like to create another random variable Z (using the multiple i.i.d. samples) with the same expected value of X, such that for small δ, we would like to show that: Yüklə 17,13 Mb. Dostları ilə paylaş: |