Data Mining: The Textbook


[Figure 12.7: stream progression from TIME = 0 (start) to TIME = 55 (current), with the stored snapshot clock times 16, 24, 32, 36, 40, 44, 46, 48, 50, …, 55 marked along the time axis]

Figure 12.7: Recent snapshots are stored more frequently by pyramidal time frame


To illustrate the snapshots, an example will be used. Consider the case in which the stream has been running since a clock time of 1, with α = 2 and l = 2. Therefore, α^l + 1 = 2^2 + 1 = 5 snapshots of each order are stored. Then, at a clock time of 55, snapshots at the clock times illustrated in Table 12.2 are stored. While some snapshots are redundant in this case, they are not stored in a redundant way. The corresponding pattern of storage is illustrated in Fig. 12.7. It is evident that recent snapshots are stored more frequently in the pyramidal pattern of storage.
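
The retention rule behind this example can be made concrete with a short Python sketch. This is only an illustration, assuming the convention of the preceding discussion that snapshots of order i are taken at clock times divisible by α^i and that only the last α^l + 1 snapshots of each order are retained; the function name retained_snapshots is ours, not part of any published implementation.

```python
# Minimal sketch of pyramidal snapshot retention.  Assumes snapshots of order
# i are taken at clock times divisible by alpha**i, and that only the last
# alpha**l + 1 snapshots of each order are retained (redundant copies of the
# same clock time are stored only once).
import math

def retained_snapshots(T, alpha=2, l=2):
    """Return {order: retained snapshot clock times} at current clock time T."""
    max_order = int(math.log(T, alpha))         # highest order with any snapshot
    keep = alpha ** l + 1                       # snapshots kept per order
    retained = {}
    for order in range(max_order + 1):
        step = alpha ** order
        times = list(range(step, T + 1, step))  # all snapshot times of this order
        retained[order] = times[-keep:]         # most recent alpha**l + 1 of them
    return retained

snaps = retained_snapshots(T=55, alpha=2, l=2)
for order, times in snaps.items():
    print(order, times)                         # the retained times, order by order
print(sorted({t for ts in snaps.values() for t in ts}))
# -> [16, 24, 32, 36, 40, 44, 46, 48, 50, 51, 52, 53, 54, 55],
#    the distinct stored clock times plotted in Fig. 12.7
```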


The following observations are true at any moment in time over the course of the data stream:





  • The maximum order of any snapshot stored at T time units since the beginning of the stream mining process is log_α(T).




  • The maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (α^l + 1) · log_α(T).




  • For any user-specified time horizon h, at least one stored snapshot can be found that corresponds to a horizon of length within a factor of (1 + 1/α^(l−1)) of the desired value h. This property is important because the microcluster statistics of the time horizon (tc − h, tc) can be constructed by subtracting the statistics at time (tc − h) from those at time tc. Therefore, the microcluster within the approximate temporal locality of (tc − h) can be used instead. This enables the approximate clustering of data stream points within an arbitrary time horizon (tc − h, tc) from the stored pyramidal pattern of microcluster statistics.
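
The subtractive construction in the last property relies on the additivity of the microcluster statistics. The fragment below is a minimal illustrative sketch, assuming the additive microcluster representation of size (2 · d + 3) mentioned later in this discussion (two d-dimensional sums, two temporal sums, and a point count); the class and field names are ours, and snapshot_at is a hypothetical lookup into the stored snapshots.

```python
# Hedged sketch of the subtractive property of microcluster statistics.
from dataclasses import dataclass
from typing import List

@dataclass
class MicroCluster:
    cf2x: List[float]   # d second-order sums of the data values
    cf1x: List[float]   # d first-order sums of the data values
    cf2t: float         # second-order sum of the time stamps
    cf1t: float         # first-order sum of the time stamps
    n: int              # number of points absorbed by the microcluster

    def __sub__(self, other: "MicroCluster") -> "MicroCluster":
        # All entries are additive over the stream, so horizon-specific
        # statistics are componentwise differences of two snapshots.
        return MicroCluster(
            [a - b for a, b in zip(self.cf2x, other.cf2x)],
            [a - b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t - other.cf2t,
            self.cf1t - other.cf1t,
            self.n - other.n,
        )

# Statistics over (tc - h, tc), using the stored snapshot nearest to tc - h:
# horizon_stats = snapshot_at(tc) - snapshot_at(tc - h_approx)
```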

For larger values of l, the time horizon can be approximated as closely as desired. It is instructive to use an example to illustrate the combination of effectiveness and compactness achieved by the pyramidal pattern of snapshot storage. For example, by choosing l = 10, it is possible to approximate any time horizon to within 0.2 %. At the same time, a total of only (2^10 + 1) · log_2(100 × 365 × 24 × 60 × 60) ≈ 32,343 snapshots are required for a stream with a clock granularity of 1 second and running over 100 years. If each snapshot of size (2 · d + 3) requires storage of less than a megabyte, the overall storage required is of the order of a few gigabytes. Because historical snapshots can be stored on disk and only the current snapshot needs to be maintained in main memory, this requirement is modest from a practical point of view. As the clustering algorithm progresses, only the relevant snapshots according to the pyramidal time frame are maintained; the remaining snapshots are discarded. This enables the computation of horizon-specific clusters at a modest storage cost.
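
The bound quoted above can be checked with a few lines of Python; this is only a back-of-the-envelope verification of the arithmetic, assuming α = 2, l = 10, and a one-second clock granularity as in the example.

```python
# Back-of-the-envelope check of the storage bound quoted above,
# assuming alpha = 2, l = 10, and a 100-year stream at one-second granularity.
import math

alpha, l = 2, 10
T = 100 * 365 * 24 * 60 * 60                      # clock ticks in 100 years
max_snapshots = (alpha ** l + 1) * math.log(T, alpha)
print(round(max_snapshots))                       # ~32,343 snapshots

# Horizon-approximation slack 1/alpha**(l - 1): about 0.2 percent for l = 10.
print(1 / alpha ** (l - 1))                       # 0.001953125
```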

12.4.3 Massive-Domain Stream Clustering


As discussed earlier, the massive-domain scenario is ubiquitous in the stream context. In many cases, one may need to work with a multidimensional data stream, in which the individual attributes are drawn from a massive domain of possible values. In such cases, stream analysis becomes much more difficult because “concise” summaries of the clusters become much more space-intensive. This is also the motivation for many synopsis structures, such as the Bloom filter, the count-min sketch, the AMS sketch, and the Flajolet–Martin algorithm.


The data clustering problem also becomes more challenging in the massive-domain scenario because of the difficulty in maintaining concise statistics of the clusters. A recent method, referred to as CSketch, has been designed for clustering massive-domain data streams. The idea in this method is to use a count-min sketch to store the frequencies of attribute–value combinations in each cluster. Thus, the number of count-min sketches used is equal to the number of clusters. An online k-means style clustering is applied, in which the sketch is used as the representative for the (discrete) attributes in the cluster. For any incoming data point, a dot product is computed with respect to each cluster.


The computation is performed as follows. For each attribute–value combination over the d dimensions, the hash function h_r(·) is applied to it for a particular value of r, and the frequency of the corresponding sketch cell is determined. The frequencies of the relevant sketch cells for the d different dimensions are added together. This provides an estimate of the dot product. To obtain a tighter estimate, the minimum value over the different hash functions (different values of r) is used. The dot product is divided by the total frequency of items in the cluster to avoid bias towards clusters with many data items.


This computation can be performed effectively because the count-min sketch estimates the dot product accurately in a small amount of space. The data point is assigned to the cluster with which it has the largest dot product. The statistics in the sketch representing that particular cluster are then updated. Thus, this approach shares a common characteristic with microclustering in terms of how data points are incrementally assigned to clusters. However, it does not implement the merging and removal steps. Furthermore, the sketch representation is used instead of the microcluster representation for cluster statistics maintenance. Theoretical guarantees can be shown on the clustering quality with respect to a clustering that has infinite space availability. The bibliographic notes contain pointers to these results.
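
The assignment loop just described can be sketched as follows. This is an illustrative reconstruction rather than the published CSketch implementation: the class ClusterSketch, the salted-tuple hashing, the helper assign, and the sketch width and depth are assumptions made for the example.

```python
# Hedged sketch of dot-product-based cluster assignment with one count-min
# sketch per cluster; not the published CSketch code.
import random

class ClusterSketch:
    """One count-min sketch per cluster, storing attribute-value frequencies."""

    def __init__(self, width=1000, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.counts = [[0] * width for _ in range(depth)]
        # One salt per hash function h_r (r = 0, ..., depth - 1).
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.total = 0                        # total frequency of items in the cluster

    def _cell(self, r, dim, value):
        # Cell of the r-th row for this (dimension, value) combination.
        return hash((self.salts[r], dim, value)) % self.width

    def update(self, point):
        """Absorb one data point (a sequence of d discrete attribute values)."""
        for dim, value in enumerate(point):
            for r in range(self.depth):
                self.counts[r][self._cell(r, dim, value)] += 1
            self.total += 1

    def score(self, point):
        """Estimated dot product with the cluster, normalized by its total frequency."""
        if self.total == 0:
            return 0.0
        estimates = []
        for r in range(self.depth):
            s = sum(self.counts[r][self._cell(r, dim, value)]
                    for dim, value in enumerate(point))
            estimates.append(s)               # one (over)estimate per hash function
        return min(estimates) / self.total    # tighter estimate via the minimum

def assign(point, clusters):
    """Assign the point to the highest-scoring cluster and update its sketch."""
    best = max(range(len(clusters)), key=lambda i: clusters[i].score(point))
    clusters[best].update(point)
    return best

# Example: k = 3 clusters over 4 discrete attributes.
clusters = [ClusterSketch() for _ in range(3)]
assign(("red", "small", "xyz.com", "mobile"), clusters)
```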


12.5 Streaming Outlier Detection





The problem of streaming outlier detection typically arises either in the context of multidimensional data or time-series data streams. Outlier detection in multidimensional data streams is generally quite different from time-series outlier detection. In the latter case, each time series is treated as a unit, whereas temporal correlations are much weaker for multidimensional data. This chapter will address only multidimensional streaming outlier detection, whereas time-series methods will be addressed in Chap. 14.


The multidimensional stream scenario is similar to static multidimensional outlier analysis. The only difference is the addition of a temporal component to the analysis, though this temporal component is much weaker than in the case of time series data. In the context of multidimensional data streams, efficiency is an important concern because the outliers need to be discovered quickly. There are two kinds of outliers that may arise in the context of multidimensional data streams.





  1. The first is based on the detection of individual records as outliers. For example, the first news story on a specific topic represents an outlier of this type. Such an outlier is also referred to as a novelty.




  2. The second is based on changes in the aggregate trends of the multidimensional data. For example, an unusual event such as a terrorist attack may lead to a burst of news stories on a specific topic. This represents an aggregated outlier based on a specific time window. The second kind of change point almost always begins with an individual outlier of the first type. However, an individual outlier of the first type may not always develop into an aggregate change point. This is closely related to the notion of concept drift. While concept drift is generally gentle, an abrupt change may be viewed as an outlier instant in time rather than an outlier data point.

Both kinds of outliers (or change points) will be discussed in this section.


12.5.1 Individual Data Points as Outliers


The problem of detecting individual data points as outliers is closely related to the problem of unsupervised novelty detection, especially when the entire history of the data stream is used. This problem has been studied extensively in the text domain in the context of first story detection. Such novelties are often trend setters and may eventually become a part of the normal data. However, when an individual record is declared an outlier with respect to a window of data points, it may not necessarily be a novelty. In this context, proximity-based algorithms are particularly easy to generalize to the incremental scenario by almost direct application of the corresponding algorithms to the window of data points.


Distance-based algorithms can be easily generalized to the streaming scenario. The original distance-based definition of outliers is modified in the following way:


