$$F_{(h_s, h_t)}(X, t) = C_f \cdot \sum_{i=1}^{|S|} K_{(h_s, h_t)}(X - X_i, t - t_i).$$
Here, $K_{(h_s,h_t)}(\cdot, \cdot)$ is a spatiotemporal kernel smoothing function, $h_s$ is the spatial kernel vector, and $h_t$ is the temporal kernel width. The sum is effectively over the set $S$ of points arriving in the interval $(t - h_t, t)$, because the kernel vanishes outside this temporal window. The kernel function $K_{(h_s,h_t)}(X - X_i, t - t_i)$ is a smooth distribution that decreases with increasing value of $t - t_i$. The value of $C_f$ is a suitably chosen normalization constant, so that the entire density over the spatial plane is one unit. Thus, $C_f$ is defined as follows:
$$\int_{\text{all } X} F_{(h_s, h_t)}(X, t) \, \delta X = 1.$$
The reverse time-slice density estimate is calculated differently from the forward time-slice density estimate. Assume that the set of points in the time interval $(t, t + h_t)$ is denoted by $U$. As before, the value of $C_r$ is chosen as a normalization constant. Correspondingly, the reverse time-slice density estimate $R_{(h_s,h_t)}(X, t)$ is defined as follows:
$$R_{(h_s, h_t)}(X, t) = C_r \cdot \sum_{i=1}^{|U|} K_{(h_s, h_t)}(X - X_i, t_i - t).$$
In this case, $t_i - t$ is used in the argument instead of $t - t_i$. Thus, the reverse time-slice density in the interval $(t, t + h_t)$ would be exactly the same as the forward time-slice density, if time were reversed, and the data stream arrived in reverse order, starting at $t + h_t$ and ending at $t$.
The velocity density $V_{(h_s,h_t)}(X, T)$ at spatial location $X$ and time $T$ is defined as follows:
$$V_{(h_s, h_t)}(X, T) = \frac{F_{(h_s, h_t)}(X, T) - R_{(h_s, h_t)}(X, T - h_t)}{h_t}.$$
Note that the reverse time-slice density estimate is defined with a temporal argument of $(T - h_t)$, and therefore the future points with respect to $(T - h_t)$ are known at time $T$.
A positive value of the velocity density corresponds to an increase in the data density at a given point. A negative value of the velocity density corresponds to a reduction in the data density at a given point. In general, it has been shown that when the spatiotemporal kernel function is defined as below, then the velocity density is directly proportional to the rate of change of the data density at a given point.
$$K_{(h_s, h_t)}(X, t) = (1 - t/h_t) \cdot K_{h_s}(X).$$
This kernel function is defined only for values of $t$ in the range $(0, h_t)$. The Gaussian spatial kernel function $K_{h_s}(\cdot)$ was used because of its well-known effectiveness. Specifically, $K_{h_s}(\cdot)$ is the product of $d$ identical Gaussian kernel functions, and $h_s = (h_s^1, \ldots, h_s^d)$, where $h_s^i$ is the smoothing parameter for dimension $i$.
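As an illustration, the following minimal sketch computes the forward and reverse time-slice estimates with this kernel, and the resulting velocity density. It assumes NumPy; the function names are hypothetical, and the normalization constants $C_f$ and $C_r$ are omitted for brevity (in practice, they would be fixed by numerically integrating each estimate to one unit over the spatial domain).

```python
import numpy as np

def gaussian_spatial_kernel(diff, h_s):
    # Product of d Gaussian kernels; h_s holds one bandwidth per dimension.
    z = diff / h_s
    return np.exp(-0.5 * np.sum(z * z, axis=-1)) / np.prod(np.sqrt(2.0 * np.pi) * h_s)

def st_kernel(diff, dt, h_s, h_t):
    # K_{(h_s,h_t)}(X, t) = (1 - t/h_t) * K_{h_s}(X), nonzero only for 0 < t < h_t.
    w = np.where((dt > 0) & (dt < h_t), 1.0 - dt / h_t, 0.0)
    return w * gaussian_spatial_kernel(diff, h_s)

def velocity_density(X, pts, times, T, h_s, h_t):
    # Forward estimate at time T: kernel argument (X - X_i, T - t_i).
    F = np.sum(st_kernel(X - pts, T - times, h_s, h_t))
    # Reverse estimate at time T - h_t: kernel argument (X - X_i, t_i - (T - h_t)).
    R = np.sum(st_kernel(X - pts, times - (T - h_t), h_s, h_t))
    return (F - R) / h_t  # unnormalized velocity density (C_f, C_r omitted)

# Example usage with synthetic data:
rng = np.random.default_rng(0)
pts = rng.normal(size=(500, 2))            # 500 points in 2 dimensions
times = np.sort(rng.uniform(0, 10, 500))   # arrival times in (0, 10)
v = velocity_density(np.zeros(2), pts, times, T=10.0, h_s=np.ones(2), h_t=2.0)
```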
The velocity density is associated with both a data point and a time instant, and therefore this definition allows the labeling of both data points and time instants as outliers. However, the interpretation of a data point as an outlier in the context of aggregate change analysis is slightly different from the previous definitions in this section. Here, an outlier is defined on an aggregate basis, rather than in terms of the specific properties of the point itself. Because outliers are data points in regions where abrupt change has occurred, outliers are defined as data points $X$ at time instants $t$ with unusually large absolute values of the local velocity density. If desired, a normal distribution could be used to determine the extreme values among the absolute velocity density values. Thus, the velocity density approach converts the multidimensional data distribution into a quantification that can be used in conjunction with extreme-value analysis.
It is important to note that the data point X is an outlier only in the context of aggregate changes occurring in its locality, rather than its own properties as an outlier. In the context of the news-story example, this corresponds to a news story belonging to a particular burst of related articles. Thus, such an approach could detect the sudden emergence of local clusters in the data, and report the corresponding data points in a timely fashion. Furthermore, it is also possible to compute the aggregate absolute level of change (over all regions) occurring in the underlying data stream. This is achieved by computing the average absolute velocity density over the entire data space by summing the changes at sample points in the space. Time instants with large values of the aggregate velocity density may be declared as outliers.
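For instance, a minimal sketch of this extreme-value step might fit a normal distribution to the aggregate absolute velocity densities and flag time instants in the upper tail. The function name and the significance level alpha are illustrative assumptions, not part of the original method's specification.

```python
import numpy as np
from scipy.stats import norm

def flag_extreme_instants(agg_abs_velocity, alpha=0.01):
    # agg_abs_velocity: aggregate absolute velocity density per time instant,
    # obtained by averaging |velocity density| over sample points in the space.
    mu, sigma = np.mean(agg_abs_velocity), np.std(agg_abs_velocity)
    # Flag instants in the upper tail of the fitted normal distribution.
    return norm.sf(agg_abs_velocity, loc=mu, scale=sigma) < alpha
```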
12.6 Streaming Classification
The problem of streaming classification is especially challenging because of the impact of concept drift. One simple approach is to use a reservoir sample to create a concise representation of the training data. This concise representation can be used to create an offline model. If desired, a decay-based reservoir sample can be used to handle concept drift. Such an approach has the advantage that any conventional classification algorithm can be used since the challenges associated with the streaming paradigm have already been addressed at the sampling stage. A number of dedicated methods have also been proposed for streaming classification.
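As an illustration, the following is a minimal sketch of a decay-based (exponentially biased) reservoir. The class name is hypothetical, and the insertion rule follows the common exponential-bias variant in which an arriving point always enters the reservoir and overwrites a uniformly chosen slot with probability equal to the current fill fraction; details beyond this are assumptions of the sketch.

```python
import random

class BiasedReservoir:
    """Decay-based reservoir sample with bias rate lam (capacity about 1/lam)."""
    def __init__(self, lam):
        self.capacity = max(1, int(1.0 / lam))  # maximum reservoir size
        self.reservoir = []

    def add(self, point):
        fill_fraction = len(self.reservoir) / self.capacity
        if random.random() < fill_fraction:
            # Overwrite a random slot, so older points decay away over time.
            self.reservoir[random.randrange(len(self.reservoir))] = point
        else:
            self.reservoir.append(point)
```

Any conventional classifier can then be trained offline on the reservoir contents, with recent points receiving greater representation under concept drift.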
12.6.1 VFDT Family
Very fast decision trees (VFDT) are based on the principle of Hoeffding trees. The basic idea is that a decision tree can be constructed on a sample of a very large data set, using a carefully designed approach, so that the resulting tree is the same as what would have been
achieved with the original data set with high probability. The Hoeffding bound is used to estimate this probability, and therefore the intermediate steps of the approach are designed with this bound in mind. This is the reason that such trees are referred to as Hoeffding trees.
The Hoeffding tree can be constructed incrementally by growing the tree simultaneously with stream arrival. An important assumption is that the stream does not evolve, and therefore the currently arrived set of points can be viewed as a sample of the full stream. The higher levels of the tree are constructed at earlier stages of the stream, when enough tuples have been collected to quantify the accuracy of the corresponding split criteria. The lower level nodes are constructed later because statistics about lower level nodes can be collected only after the higher level nodes have been constructed. Thus, successive levels of the tree are constructed, as more examples stream in and the tree continues to grow. The key in the Hoeffding tree algorithm is to quantify the point at which statistically sufficient tuples have been collected in order to perform a split, so that the split is approximately the same as what would have been performed with knowledge of the full stream.
The same decision tree will be constructed on the current stream sample and the full stream, as long as the same splits are used at each stage. Therefore, the goal of the approach is to ensure that the splits on the sample are identical to the splits on the full stream. For ease in discussion, consider the case where each attribute⁶ is binary. In this case, the two algorithms will produce exactly the same tree, as long as the same split attribute is selected at each point. The split attribute is selected using a measure such as the Gini index. Consider a particular node in the tree constructed on the original data, and the same node constructed on the sampled data. What is the probability that the same attribute will be selected for the stream sample as for the full stream?
Consider the best and second-best attributes for a split, indexed by $i$ and $j$, respectively, in the sampled data. Let $G_i$ and $G'_i$ be the Gini index values of split attribute $i$, as computed on the full stream and the sampled data, respectively. Because attribute $i$ was selected for a split in the sampled data, it is evident that $G'_i < G'_j$. The problem is that the sampling might cause an error. In other words, for the original data, it might be the case that $G_j < G_i$. Let the difference $G'_j - G'_i$ between $G'_j$ and $G'_i$ be $\epsilon > 0$. If the number of samples $n$ for evaluating the split is large enough, then it can be shown with the use of the Hoeffding bound that the undesirable case where $G_j < G_i$ will not occur with at least a user-defined probability $1 - \delta$. The required value of $n$ is a function of $\epsilon$ and $\delta$. In the context of data streams with continuously accumulating samples, the key is to wait for a large enough sample size $n$ before performing the split. In the Hoeffding tree, the Hoeffding bound is used to determine the required sample size $n$ as follows:

$$n = \frac{R^2 \ln(1/\delta)}{2\epsilon^2}. \qquad (12.32)$$
The value of $R$ denotes the range of the split criterion. For the Gini index, the value of $R$ is 1, and for the entropy criterion, the value is $\log(k)$, where $k$ is the number of classes. Near ties in the split criterion correspond to small values of $\epsilon$. According to Eq. 12.32, such ties will lead to large sample size requirements, and therefore a longer waiting time until one can be sufficiently confident of performing a split with the available stream sample.
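A minimal sketch of this computation is shown below, assuming the Gini index as the split criterion; the function names are illustrative. It inverts Eq. 12.32 to obtain the threshold on the Gini gap for the current sample size $n$.

```python
import math

def hoeffding_threshold(R, delta, n):
    # eps = sqrt(R^2 * ln(1/delta) / (2n)), obtained by solving Eq. 12.32 for eps.
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def should_split(gini_best, gini_second_best, n, delta, R=1.0):
    # R = 1 for the Gini index; R = log(k) for the entropy criterion.
    return (gini_second_best - gini_best) >= hoeffding_threshold(R, delta, n)
```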
The Hoeffding tree approach determines whether the current difference in the Gini index between the best and second-best split attributes is at least $\sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$ in order to initiate a split. This provides a guarantee on the quality of a split at a particular node. In cases,
⁶The argument also applies to general attributes by first transforming them to binary data with discretization and binarization.