\[
y = y_i + \frac{t - t_i}{t_j - t_i} \cdot (y_j - y_i) \qquad (14.1)
\]
This is simple linear interpolation, although other more complex methods, such as polynomial interpolation or spline interpolation, are possible. However, such methods require a larger number of data points in a time window for the estimation. In many cases, such methods do not provide significantly superior results over the straightforward linear interpolation method.
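For concreteness, the following is a minimal Python sketch of the linear interpolation in Eq. 14.1. The function name and the representation of the series as parallel lists of timestamps and values are illustrative assumptions rather than conventions from the text.

```python
# Minimal sketch of linear interpolation (Eq. 14.1); the series is assumed
# to be given as parallel lists of timestamps and observed values.

def interpolate_linear(timestamps, values, t):
    """Estimate the value at time t from the two surrounding observations."""
    for idx in range(len(timestamps) - 1):
        ti, tj = timestamps[idx], timestamps[idx + 1]
        yi, yj = values[idx], values[idx + 1]
        if ti <= t <= tj:
            # y = y_i + (t - t_i) / (t_j - t_i) * (y_j - y_i)
            return yi + (t - ti) / (tj - ti) * (yj - yi)
    raise ValueError("t lies outside the observed time range")

# Example: observations at t = 0 and t = 10; estimate the value at t = 4.
print(interpolate_linear([0, 10], [2.0, 12.0], 4))  # 6.0
```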
14.2.2 Noise Removal
Noise-prone hardware, such as sensors, is often used for time series data collection. Most noise removal methods work by removing short-term fluctuations. It should be pointed out that the distinction between noise and interesting outliers is often difficult to make. Interesting outliers are fluctuations caused by specific aspects of the data generation process, rather than artifacts of the data collection process. Therefore, such cleaning and smoothing methods are sometimes not appropriate for problems such as outlier detection. Two methods, referred to as binning and smoothing, are often used for noise removal.
Binning
The method of binning divides the data into time intervals of size k, denoted by [t_1, t_k], [t_{k+1}, t_{2k}], and so on. The timestamps are assumed to be equally spaced. Therefore, each bin has the same size and contains an equal number of points. The average value of the data points in each interval is reported as the smoothed value. Let y_{i·k+1} ... y_{i·k+k} be the values at timestamps t_{i·k+1} ... t_{i·k+k}. Then, the new binned value will be y_{i+1}', where
\[
y_{i+1}' = \frac{\sum_{r=1}^{k} y_{i \cdot k + r}}{k}
\]
Therefore, this approach uses the mean of the values in the bins. It is also possible to use the median of the behavioral attribute values. Typically, the median provides more robust estimates than the mean because the outlier points do not affect the median in a disproportionate way. The main problem with binning is that it reduces the number of available data points by a factor of k. Binning is also referred to as piecewise aggregate approximation (PAA). Such an approach can be rather lossy for large values of k, although it can also be advantageous for fast distance computations [309] because it provides a compressed representation.
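The following is a minimal sketch of binning, assuming equally spaced timestamps and a window size k; the function name and flag are hypothetical. Switching from the mean to the median yields the more robust variant discussed above.

```python
from statistics import median

# Minimal sketch of binning / piecewise aggregate approximation (PAA):
# each window of k consecutive values is replaced by a single aggregate.

def bin_series(values, k, use_median=False):
    aggregate = median if use_median else (lambda w: sum(w) / len(w))
    n = (len(values) // k) * k                 # drop any trailing partial window
    return [aggregate(values[i:i + k]) for i in range(0, n, k)]

# Example: k = 3 reduces the number of points by a factor of 3.
series = [3, 5, 4, 10, 12, 11, 7, 8, 9]
print(bin_series(series, 3))                   # [4.0, 11.0, 8.0]
print(bin_series(series, 3, use_median=True))  # [4, 11, 8]
```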
Moving-Average Smoothing
Moving-average methods reduce the loss incurred in binning by using overlapping bins, over which the averages are computed. As in the case of binning, averages are computed over windows of the time series. The main difference is that a bin is constructed starting at each timestamp in the series, rather than only at the timestamps at the bin boundaries. Therefore, the bin intervals are chosen to be [t_1, t_k], [t_2, t_{k+1}], and so on. This results in a set of overlapping intervals, and the time series values are averaged over each of these intervals. Moving averages are also referred to as rolling averages, and they reduce the noise in the time series because of the smoothing effect of averaging.
In a real-time application, the moving average becomes available only after the last timestamp of the window. Therefore, moving averages introduce lags into the analysis and also lose some points at the beginning of the series because of boundary effects. Furthermore, short-term trends are sometimes lost because of smoothing. Larger bin sizes result in greater smoothing and lag. Because of the impact of lag, it is possible for the moving average to contain troughs (or downtrends) where there are peaks (or uptrends) in the original series, and vice versa. This can sometimes lead to a misleading understanding of recent trends.
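The following is a minimal sketch of moving-average smoothing under the same assumptions as above (equally spaced timestamps, window size k); the function name is illustrative. Reporting each smoothed value at the last timestamp of its window makes the lag and the loss of the first k − 1 points explicit.

```python
# Minimal sketch of moving-average (rolling) smoothing with overlapping
# windows of size k.  The smoothed value for a window becomes available only
# at the window's last timestamp, so the first k - 1 points are lost.

def moving_average(values, k):
    return [sum(values[i - k + 1:i + 1]) / k
            for i in range(k - 1, len(values))]

series = [3, 5, 4, 10, 12, 11, 7, 8, 9]
print(moving_average(series, 3))
# [4.0, 6.33, 8.67, 11.0, 10.0, 8.67, 8.0] (rounded): 7 values from 9 points
```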