The outlier score of a data point is defined in terms of its k-nearest neighbor distance to the data points in a time window of length W.
Note that this is a relatively straightforward modification of the original distance-based definition. When the entire window of data points can be maintained in main memory, it is fairly easy to determine the outliers by computing the score of every data point in the window. However, incremental maintenance of the scores is more challenging because of the addition and removal of data points from the window. Furthermore, some algorithms, such as LOF, require the recomputation of statistics such as reachability distances. The LOF algorithm has been extended to the incremental scenario. Two steps are performed in the process:
1. The statistics of the newly inserted data point, such as its reachability distance and LOF score, are computed.
2. The LOF scores of the existing points in the window are updated, along with their densities and reachability distances. In other words, the scores of many existing data points need to be updated because they are affected by the addition of the new data point. However, not all scores need to be updated, because only the locality of the new data point is affected. Similarly, when data points are deleted, only the LOF scores in the locality of the deleted point are affected.
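As a point of reference for these incremental variants, the window-based k-nearest neighbor definition itself is straightforward to implement by brute force. The following is a minimal sketch; the class name, the parameter defaults, and the use of Euclidean distance are illustrative assumptions, and a genuinely incremental method would avoid the full scan of the window at every arrival.

```python
import math
from collections import deque

class WindowKNNScorer:
    """Brute-force k-NN distance outlier scoring over a sliding window.

    A point's outlier score is the Euclidean distance to its k-th
    nearest neighbor among the W most recent points; larger scores
    indicate stronger outliers.
    """

    def __init__(self, k=5, W=1000):
        self.k = k
        self.window = deque(maxlen=W)  # old points are evicted automatically

    def score(self, point):
        # Full scan of the window: O(W) work per arriving point.
        dists = sorted(math.dist(point, p) for p in self.window)
        # The first few arrivals lack k neighbors and score as infinity.
        s = dists[self.k - 1] if len(dists) >= self.k else float("inf")
        self.window.append(point)
        return s
```

Thresholding the returned score as each point arrives yields a simple, but computationally expensive, streaming outlier detector; the incremental LOF scheme above exists precisely to avoid this kind of per-point recomputation.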
Because distance-based methods are well known to be computationally expensive, many of the aforementioned methods remain quite expensive in the context of a data stream. Therefore, the complexity of the outlier detection process can be greatly reduced by using an online clustering-based approach. The microclustering approach discussed earlier in this chapter automatically discovers outliers together with clusters.
While clustering-based methods are generally not advisable when the number of data points is limited, this is not a concern in streaming analysis. In the context of a data stream, a sufficient number of data points is typically available to maintain the clusters at a very high level of granularity. In a streaming clustering algorithm, the formation of new clusters is often associated with unsupervised novelties. For example, the CluStream algorithm explicitly regulates the creation of new clusters when an incoming data point does not lie within a specified statistical radius of the existing clusters in the data. Such data points may be considered outliers. In many cases, this is the beginning of a new trend, as more data points are added to the cluster at later stages of the algorithm. In some cases, such data points may correspond to novelties, and in other cases, they may correspond to trends that were seen long ago but are no longer reflected in the current clusters. In either case, such data points are interesting outliers. However, it is not possible to distinguish between these different kinds of outliers unless one is willing to allow the number of clusters in the stream to increase over time.
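The following sketch illustrates this radius-based rule. The additive per-cluster statistics mirror the cluster-feature summaries that streaming microclustering methods maintain, but the threshold factor t_factor, the fallback radius for singleton clusters, and all names are illustrative assumptions rather than the exact CluStream policy.

```python
import math

class MicroCluster:
    """Additive per-cluster statistics: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)             # per-dimension linear sum
        self.ss = [x * x for x in point]  # per-dimension squared sum

    def centroid(self):
        return [s / self.n for s in self.ls]

    def rms_radius(self):
        # Root-mean-square deviation of cluster members from the centroid.
        c = self.centroid()
        var = sum(self.ss[d] / self.n - c[d] ** 2 for d in range(len(c)))
        return math.sqrt(max(var, 0.0))

    def absorb(self, point):
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x

def process(point, clusters, t_factor=3.0, default_radius=1.0):
    """Return True if the point falls outside the statistical radius of
    its closest microcluster and therefore seeds a new cluster."""
    if not clusters:
        clusters.append(MicroCluster(point))
        return False
    nearest = min(clusters, key=lambda c: math.dist(point, c.centroid()))
    dist = math.dist(point, nearest.centroid())
    # Singleton clusters have zero RMS radius; fall back to a default.
    radius = nearest.rms_radius() or default_radius
    if dist <= t_factor * radius:
        nearest.absorb(point)
        return False
    clusters.append(MicroCluster(point))  # possible novelty or new trend
    return True
```

Points for which process(...) returns True are exactly those that create new clusters, and are therefore reported as potential outliers; whether they turn out to be true novelties or the start of a new trend is revealed only by whether the new cluster subsequently accumulates points.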
12.5.2 Aggregate Change Points as Outliers
Sudden changes in aggregate local and global trends in the underlying data are often indicative of anomalous events. Many methods also provide statistical ways of quantifying the level of change in the underlying data stream. One way of measuring concept drift is to use the notion of velocity density. The idea in velocity density estimation is to construct a density-based velocity profile of the data, which is analogous to kernel density estimation in static data sets. The kernel density estimate $f(X)$ for $n$ data points and kernel function $K_h(\cdot)$ is defined as follows:
$$f(X) = \frac{1}{n} \sum_{i=1}^{n} K_h(X - X_i)$$
The kernel function used is a Gaussian kernel with width h.
$$K_h(X - X_i) \propto e^{-\|X - X_i\|^2 / (2h^2)}$$
The estimation error depends on the kernel width $h$, which is chosen in a data-driven manner based on Silverman's approximation rule [471].
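To make these definitions concrete, the following sketch evaluates a one-dimensional Gaussian kernel density estimate. The bandwidth uses one common form of Silverman's rule, $h = 1.06\,\hat{\sigma}\,n^{-1/5}$; the exact variant intended by [471] may differ, and the function names are illustrative.

```python
import numpy as np

def silverman_bandwidth(sample):
    """One common form of Silverman's rule of thumb for 1-D data."""
    x = np.asarray(sample, dtype=float)
    return 1.06 * x.std(ddof=1) * len(x) ** (-1 / 5)

def gaussian_kde(query, sample, h=None):
    """Evaluate f(query) = (1/n) * sum_i K_h(query - x_i) with a
    Gaussian kernel of width h."""
    x = np.asarray(sample, dtype=float)
    h = silverman_bandwidth(x) if h is None else h
    z = (query - x) / h
    # Normalized Gaussian kernel: exp(-z^2 / 2) / (h * sqrt(2 * pi))
    return np.mean(np.exp(-0.5 * z ** 2)) / (h * np.sqrt(2.0 * np.pi))
```

As a sanity check, gaussian_kde(0.0, np.random.normal(size=10000)) should be close to the standard normal density at zero, $1/\sqrt{2\pi} \approx 0.399$.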
The velocity density computations are performed over a temporal window of size $h_t$.
Intuitively, the value of $h_t$ defines the time horizon over which the evolution is measured.
Thus, if $h_t$ is chosen to be large, then the velocity density estimation technique captures long-term trends, whereas if $h_t$ is chosen to be small, the trends are relatively short term. This provides the user with flexibility in analyzing changes in the data over different time horizons. In addition, a spatial smoothing parameter $h_s$ is used that is analogous to the kernel width $h$ in conventional kernel density estimation.
Let $t$ be the current instant and $S$ be the set of data points that have arrived in the time window $(t - h_t, t)$. The rate of increase in density at spatial location $X$ and time $t$ is estimated with two measures: the forward time-slice density estimate and the reverse time-slice density estimate. Intuitively, the forward time-slice estimate measures the density function for all spatial locations at a given time $t$, based on the set of data points that have arrived in the past window $(t - h_t, t)$. Similarly, the reverse time-slice estimate measures the density function at a given time $t$, based on the set of data points that will arrive in the future window $(t, t + h_t)$. Obviously, this value cannot be computed until these points have actually arrived.
It is assumed that the $i$th data point in $S$ is denoted by $(X_i, t_i)$, where $i$ varies from 1 to $|S|$. Then, the forward time-slice estimate $F_{(h_s, h_t)}(X, t)$ of the set $S$ at the spatial location $X$ and time $t$ is given by:
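The equation itself is truncated in this copy of the text. A form consistent with the definitions above, in which the spatiotemporal kernel $K_{(h_s, h_t)}(\cdot, \cdot)$ (with spatial width $h_s$ and temporal width $h_t$) and the normalization constant $C_f$ are assumptions rather than quotations from the original, is:

$$F_{(h_s, h_t)}(X, t) = C_f \sum_{i=1}^{|S|} K_{(h_s, h_t)}(X - X_i,\ t - t_i)$$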