Data Mining: The Textbook


Date: 07.01.2024



$$y = y_i + \left(\frac{t - t_i}{t_j - t_i}\right) \cdot (y_j - y_i) \tag{14.1}$$

This is simple linear interpolation, although other more complex methods, such as polynomial interpolation or spline interpolation, are possible. However, such methods require a larger number of data points in a time window for the estimation. In many cases, such methods do not provide significantly superior results over the straightforward linear interpolation method.
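As a minimal sketch of Equation 14.1, the following Python fragment fills missing values from the nearest known neighbors on either side. The function names `interpolate_linear` and `fill_missing`, the use of `None` to mark a missing value, and the choice to leave leading and trailing gaps unfilled are illustrative assumptions, not part of the original text.

```python
def interpolate_linear(t, t_i, y_i, t_j, y_j):
    """Estimate y at time t from neighbors (t_i, y_i) and (t_j, y_j), as in Eq. 14.1."""
    if t_j == t_i:
        raise ValueError("t_i and t_j must be distinct timestamps")
    return y_i + (t - t_i) / (t_j - t_i) * (y_j - y_i)


def fill_missing(timestamps, values):
    """Replace None entries that lie between two known values.

    Assumption: gaps at the very start or end of the series have only one
    known neighbor, so they are left as None rather than extrapolated.
    """
    known = [idx for idx, v in enumerate(values) if v is not None]
    filled = list(values)
    for left, right in zip(known, known[1:]):
        for idx in range(left + 1, right):
            filled[idx] = interpolate_linear(
                timestamps[idx],
                timestamps[left], values[left],
                timestamps[right], values[right],
            )
    return filled


# Hypothetical series with unequally spaced timestamps and two gaps.
ts = [0, 1, 2, 4, 6]
ys = [10.0, None, 14.0, None, 20.0]
filled = fill_missing(ts, ys)
print(filled)  # [10.0, 12.0, 14.0, 17.0, 20.0]
```

Because the estimate uses the actual timestamps rather than the positions in the list, the same code applies when the timestamps are not equally spaced.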

CHAPTER 14. MINING TIME SERIES DATA

14.2.2 Noise Removal

Noise-prone hardware, such as sensors, is often used for time series data collection. Most noise removal methods work by removing short-term fluctuations. It should be pointed out that the distinction between noise and interesting outliers is often a difficult one to make. Interesting outliers are fluctuations caused by specific aspects of the data generation process, rather than artifacts of the data collection process. Therefore, such cleaning and smoothing methods are sometimes not appropriate for problems such as outlier detection. Two methods, referred to as binning and smoothing, are often used for noise removal.

Binning

The method of binning divides the data into time intervals of size $k$, denoted by $[t_1, t_k]$, $[t_{k+1}, t_{2k}]$, etc. It is assumed that the timestamps are equally spaced. Therefore, each bin is of the same size and contains an equal number of points. The average value of the data points in each interval is reported as the smoothed value. Let $y_{i \cdot k+1} \ldots y_{i \cdot k+k}$ be the values at timestamps $t_{i \cdot k+1} \ldots t_{i \cdot k+k}$. Then, the new binned value will be $y'_{i+1}$, where

$$y'_{i+1} = \frac{\sum_{r=1}^{k} y_{i \cdot k + r}}{k}$$

Therefore, this approach uses the mean of the values in the bins. It is also possible to use the median of the behavioral attribute values. Typically, the median provides more robust estimates than the mean because the outlier points do not affect the median in a disproportionate way. The main problem with binning is that it reduces the number of available data points by a factor of k. Binning is also referred to as piecewise aggregate approximation (PAA). Such an approach can be rather lossy for large values of k, although it can also be advantageous for fast distance computations [309] because it provides a compressed representation.
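The binning step can be sketched in a few lines of Python. The function name `bin_series`, the `reducer` parameter, and the decision to drop a trailing block with fewer than k values are assumptions made for illustration; the text does not say how a partial final bin should be handled.

```python
from statistics import mean, median


def bin_series(values, k, reducer=mean):
    """Reduce each consecutive, non-overlapping block of k values to one value.

    Assumption: a trailing block with fewer than k values is dropped.
    """
    if k <= 0:
        raise ValueError("bin size k must be positive")
    n_bins = len(values) // k
    return [reducer(values[i * k:(i + 1) * k]) for i in range(n_bins)]


# Hypothetical series of 9 points with one outlier (60.0) in the second bin.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 60.0, 7.0, 8.0, 9.0]
by_mean = bin_series(series, 3)            # [2.0, 23.0, 8.0]
by_median = bin_series(series, 3, median)  # [2.0, 5.0, 8.0]
print(by_mean, by_median)
```

In this small example the outlier pulls the mean of the second bin up to 23.0, whereas the median stays at 5.0, and the 9 original points are reduced to 3, which is the factor-of-k loss described above.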

Moving-Average Smoothing

Moving-average methods reduce the loss in binning by using overlapping bins, over which the averages are computed. As in the case of binning, averages are computed over windows of the time series. The main difference is that a bin is constructed starting at each timestamp in the series rather than only the timestamps at the boundaries of the bins. Therefore, the bin intervals are chosen to be $[t_1, t_k]$, $[t_2, t_{k+1}]$, etc. This results in a set of overlapping intervals. The time series values are averaged over each of these intervals. Moving averages are also referred to as rolling averages and they reduce the noise in the time series because of the smoothing effect of averages.
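A minimal sketch of the overlapping-window computation follows. The function name `moving_average` and the running-sum update are illustrative choices rather than anything prescribed by the text; a direct average over each slice would give the same result.

```python
def moving_average(values, k):
    """Average every window of k consecutive values: [t_1, t_k], [t_2, t_{k+1}], ...

    Returns len(values) - k + 1 averages, one per window start.
    """
    if not 1 <= k <= len(values):
        raise ValueError("window size k must be between 1 and len(values)")
    window_sum = sum(values[:k])
    averages = [window_sum / k]
    for j in range(k, len(values)):
        # Slide the window by one step: add the entering value, drop the leaving one.
        window_sum += values[j] - values[j - k]
        averages.append(window_sum / k)
    return averages


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
smoothed = moving_average(series, 3)
print(smoothed)  # [2.0, 3.0, 4.0, 5.0]
```

With n points and window size k, the output has n − k + 1 values, compared with roughly n / k values from non-overlapping bins.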

In a real-time application, the moving average becomes available only after the last timestamp of the window. Therefore, moving averages introduce lags into the analysis and also lose some points at the beginning of the series because of boundary effects. Furthermore, short-term trends are sometimes lost because of smoothing. Larger bin sizes result in greater smoothing and lag. Because of the impact of lag, it is possible for the moving average to contain troughs (or downtrends) where there are peaks (or uptrends) in the original series, and vice versa. This can sometimes lead to a misleading understanding of recent trends.
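The lag effect can be seen on a small, made-up triangular series. In this sketch each average is indexed by the last timestamp of its window, mirroring the real-time setting described above; the function name `trailing_average` and the example data are assumptions for illustration only.

```python
def trailing_average(values, k):
    """Map each window-end index to the mean of the k values ending there."""
    return {
        end: sum(values[end - k + 1:end + 1]) / k
        for end in range(k - 1, len(values))
    }


# Hypothetical series that rises to a single peak at index 4, then falls.
series = [0, 1, 2, 3, 4, 3, 2, 1, 0]
smoothed = trailing_average(series, 3)

raw_peak = max(range(len(series)), key=series.__getitem__)
smoothed_peak = max(smoothed, key=smoothed.get)
print(raw_peak, smoothed_peak)  # 4 5
```

Here the first k − 1 = 2 indices have no smoothed value, and the smoothed peak arrives one step after the true peak. At index 5 the original series is already falling while the moving average is still rising, which is the kind of misleading recent trend the text warns about.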
