Contextual attributes: These are the attributes that define the context on the basis of which the implicit dependencies occur in the data. For example, in the case of sensor data, the time stamp at which the reading is measured may be considered the contextual attribute. Sometimes, the time stamp is not explicitly used, but a position index is used. While the time-series data type contains only one contextual attribute, other data types may have more than one contextual attribute. A specific example is spatial data, which will be discussed later in this chapter.
Behavioral attributes: These represent the values that are measured in a particular context. In the sensor example, the temperature is the behavioral attribute value. It is possible to have more than one behavioral attribute. For example, if multiple sensors record readings at synchronized time stamps, then the result is a multidimensional time-series data set.
The contextual attributes typically have a strong impact on the dependencies between the behavioral attribute values in the data. Formally, time-series data are defined as follows:
Definition 1.3.2 (Multivariate Time-Series Data) A time series of length n and dimensionality d contains d numeric features at each of n time stamps $t_1 \ldots t_n$. Each time stamp contains a component for each of the d series. Therefore, the set of values received at time stamp $t_i$ is $Y_i = (y_{i1} \ldots y_{id})$. The value of the jth series at time stamp $t_i$ is $y_{ij}$.
For example, consider the case where two sensors at a particular location monitor the temperature and pressure every second for a minute. This corresponds to a multidimensional series with d = 2 and n = 60. In some cases, the time stamps $t_1 \ldots t_n$ may be replaced by index values from 1 through n, especially when the time-stamp values are equally spaced.
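The two-sensor example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the readings are synthetic, and the names `timestamps`, `series`, and `value` are hypothetical and not part of the definition.

```python
import random

# Assumed setup: two sensors (temperature, pressure) sampled once per
# second for one minute, so n = 60 time stamps and d = 2 series.
n, d = 60, 2
rng = random.Random(0)  # fixed seed so the synthetic readings are repeatable

# Contextual attribute: equally spaced time stamps, replaced by indices 1..n.
timestamps = list(range(1, n + 1))

# Behavioral attributes: series[i - 1] holds Y_i = (y_i1, ..., y_id).
# The numeric values are made up purely for illustration.
series = [
    (20.0 + rng.gauss(0, 0.5), 101.3 + rng.gauss(0, 0.2))
    for _ in timestamps
]


def value(series, i, j):
    """Return y_ij, the value of the j-th series at time stamp t_i (1-based)."""
    return series[i - 1][j - 1]
```

Here `value(series, 10, 2)` would return the pressure reading at the tenth time stamp. The 1-based indexing in `value` mirrors the notation of Definition 1.3.2 rather than Python's own 0-based convention.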
Time-series data are relatively common in many sensor applications, forecasting, and financial market analysis. Methods for analyzing time series are discussed in Chap. 14.
1.3.2.2 Discrete Sequences and Strings
Discrete sequences can be considered the categorical analog of time-series data. As in the case of time-series data, the contextual attribute is a time stamp or a position index in the ordering. The behavioral attribute is a categorical value. Therefore, discrete sequence data are defined in a similar way to time-series data.
Definition 1.3.3 (Multivariate Discrete Sequence Data) A discrete sequence of length n and dimensionality d contains d discrete feature values at each of n different time stamps $t_1 \ldots t_n$. Each of the n components $Y_i$ contains d discrete behavioral attributes $(y_{i1} \ldots y_{id})$, collected at the ith time stamp.
For example, consider a sequence of Web accesses, in which the Web page address and the originating IP address of the request are collected for 100 different accesses. This represents a discrete sequence of length n = 100 and dimensionality d = 2. A particularly common case in sequence data is the univariate scenario, in which the value of d is 1. Such sequence data are also referred to as strings.
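As a rough sketch of the Web access example, the following Python fragment builds a discrete sequence with n = 100 and d = 2 and then reduces it to a univariate sequence, that is, a string. The page addresses, IP addresses, and the one-letter alphabet in `symbol` are invented for illustration.

```python
# Hypothetical categorical values; none of these come from a real log.
pages = ["/home", "/cart", "/checkout", "/help"]
ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

# Multivariate discrete sequence: each access Y_i is a pair
# (page address, originating IP address), so d = 2.
n = 100
accesses = [(pages[i % len(pages)], ips[i % len(ips)]) for i in range(n)]
d = len(accesses[0])

# Univariate case (d = 1): keep only the page address at each position.
page_sequence = [page for page, _ in accesses]

# Mapping each page to a single symbol makes the "string" view explicit.
symbol = {"/home": "H", "/cart": "C", "/checkout": "K", "/help": "X"}
page_string = "".join(symbol[page] for page in page_sequence)
```

The position of each character in `page_string` plays the role of the contextual attribute, and the character itself is the behavioral attribute.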
It should be noted that the aforementioned definition is almost identical to the time-series case, with the main difference being that discrete sequences contain categorical attributes. In theory, it is possible to have series that mix categorical and numerical data. Another important variation is the case where each element of the sequence is not a single categorical value, but a set of any number of unordered categorical values. For example, supermarket transactions may contain a sequence of sets of items. Each set may contain any number of items. Such setwise sequences are not really multivariate sequences, but are univariate sequences, in which each element of the sequence is a set as opposed to a unit element. Thus, discrete sequences can be defined in a wider variety of ways than time-series data, because of the ability to define sets on discrete elements.
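A setwise sequence can be sketched in Python as an ordered list whose elements are unordered sets. The purchase history below is invented for illustration, and the names `transactions` and `sizes` are hypothetical.

```python
# Hypothetical purchase history of one customer. The list order is the
# contextual attribute (position in the sequence); each element is an
# unordered set of items rather than a single categorical value.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"eggs"}),
    frozenset({"bread", "butter", "jam"}),
]

# Unlike a multivariate sequence with fixed dimensionality d, the number of
# items may differ from one element of the sequence to the next.
sizes = [len(basket) for basket in transactions]

# Within one element, the order of items carries no meaning.
same_basket = transactions[0] == frozenset({"milk", "bread"})
```

Using `frozenset` is one possible design choice: it makes each element immutable and hashable, so that a whole basket can be treated as a single unit of the univariate sequence.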
In some cases, the contextual attribute may not refer to time explicitly, but it might be a position based on physical placement. This is the case for biological sequence data. In such cases, the time stamp may be replaced by an index representing the position of the value in the string, counting the leftmost position as 1. Some examples of common scenarios in which sequence data may arise are as follows: