$$\cos(X, Y) = \frac{\sum_{i=1}^{d} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \cdot \sqrt{\sum_{i=1}^{d} y_i^2}} \tag{3.10}$$
The aforementioned measure simply uses the raw frequencies between attributes. However, as in other data types, it is possible to use global statistical measures to improve the similarity computation. For example, if two documents match on an uncommon word, it is more indicative of similarity than the case where two documents match on a word that occurs very commonly. The inverse document frequency $id_i$, which is a decreasing function of the number of documents $n_i$ in which the $i$th word occurs, is commonly used for normalization:
$$id_i = \log(n / n_i) \tag{3.11}$$
Here, the number of documents in the collection is denoted by $n$. Another common adjustment is to ensure that the excessive presence of a single word does not throw off the similarity measure. A damping function $f(\cdot)$, such as the square root or the logarithm, is optionally applied to the frequencies before similarity computation.
$$f(x_i) = \sqrt{x_i}$$
$$f(x_i) = \log(x_i)$$
In many cases, the damping function is not used, which is equivalent to setting $f(x_i)$ to $x_i$.
Therefore, the normalized frequency $h(x_i)$ for the $i$th word may be defined as follows:
$$h(x_i) = f(x_i) \cdot id_i \tag{3.12}$$
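As a rough illustration of the normalized frequency, the sketch below computes $h(x_i)$ for a rare and a common word that both occur four times in a document. It assumes the logarithmic form $id_i = \log(n/n_i)$ for the inverse document frequency; the function names, the collection size of 1000, and the document counts are hypothetical values chosen for the example.

```python
import math

def inverse_document_frequency(n_docs, doc_count):
    """Assumed form id_i = log(n / n_i); decreases as n_i grows."""
    return math.log(n_docs / doc_count)

def normalized_frequency(x, idf, damping=None):
    """h(x_i) = f(x_i) * id_i, with f the identity when no damping is given."""
    f = damping if damping is not None else (lambda v: v)
    return f(x) * idf

# Hypothetical collection of 1000 documents.
idf_rare = inverse_document_frequency(1000, 10)     # word occurs in 10 documents
idf_common = inverse_document_frequency(1000, 900)  # word occurs in 900 documents

# Both words occur 4 times in the document; square-root damping is applied.
h_rare = normalized_frequency(4, idf_rare, damping=math.sqrt)
h_common = normalized_frequency(4, idf_common, damping=math.sqrt)

print(h_rare > h_common)  # the rare word carries more weight
```

Note that the logarithmic damping function is undefined for a frequency of 0, so an implementation that uses it has to treat absent words separately.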
Then, the cosine measure is defined as in Eq. 3.10, except that the normalized frequencies of the words are used:
$$\cos(X, Y) = \frac{\sum_{i=1}^{d} h(x_i) \cdot h(y_i)}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \cdot \sqrt{\sum_{i=1}^{d} h(y_i)^2}} \tag{3.13}$$
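The effect of the normalization on the cosine measure can be seen in a small sketch. It assumes the logarithmic inverse document frequency $\log(n/n_i)$, and the vocabulary, document counts, and frequency vectors are hypothetical. The function name `normalized_cosine` and the convention of returning 0 for an all-zero vector are also assumptions of this example.

```python
import math

def normalized_cosine(x, y, idf, damping=None):
    """Cosine of two frequency vectors after h(x_i) = f(x_i) * id_i."""
    f = damping if damping is not None else (lambda v: v)
    hx = [f(a) * w for a, w in zip(x, idf)]
    hy = [f(b) * w for b, w in zip(y, idf)]
    dot = sum(a * b for a, b in zip(hx, hy))
    norm_x = math.sqrt(sum(a * a for a in hx))
    norm_y = math.sqrt(sum(b * b for b in hy))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_x * norm_y)

# Hypothetical 3-word vocabulary in a collection of 1000 documents:
# word 0 occurs in 10 documents, word 1 in 900, word 2 in 500.
idf = [math.log(1000 / n_i) for n_i in (10, 900, 500)]
no_idf = [1.0, 1.0, 1.0]  # weights of 1 reproduce the raw cosine

x = [1, 5, 0]
y_rare = [1, 0, 3]    # shares only the rare word 0 with x
y_common = [0, 5, 3]  # shares only the common word 1 with x

raw_rare = normalized_cosine(x, y_rare, no_idf)
raw_common = normalized_cosine(x, y_common, no_idf)
idf_rare = normalized_cosine(x, y_rare, idf)
idf_common = normalized_cosine(x, y_common, idf)

print(raw_rare < raw_common)  # raw frequencies favor the common-word match
print(idf_rare > idf_common)  # idf weighting favors the rare-word match
```

With raw frequencies, the document that matches on the very common word looks more similar; once the inverse document frequency is applied, the match on the uncommon word dominates.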
Another measure that is less commonly used for text is the Jaccard coefficient $J(X, Y)$:
$$J(X, Y) = \frac{\sum_{i=1}^{d} h(x_i) \cdot h(y_i)}{\sum_{i=1}^{d} h(x_i)^2 + \sum_{i=1}^{d} h(y_i)^2 - \sum_{i=1}^{d} h(x_i) \cdot h(y_i)} \tag{3.14}$$
The Jaccard coefficient is rarely used for the text domain, but it is used commonly for sparse binary data sets.
3.3.1 Binary and Set Data
Binary multidimensional data are a representation of set-based data, where a value of 1 indicates the presence of an element in a set. Binary data occur commonly in market-basket domains in which transactions contain information corresponding to whether or not an item is present in a transaction. Such data can be considered a special case of text data in which word frequencies are either 0 or 1. If $S_X$ and $S_Y$ are two sets with binary representations $X$ and $Y$, then it can be shown that applying Eq. 3.14 to the raw binary representation of the two sets is equivalent to:
$$J(X, Y) = \frac{\sum_{i=1}^{d} x_i \cdot y_i}{\sum_{i=1}^{d} x_i^2 + \sum_{i=1}^{d} y_i^2 - \sum_{i=1}^{d} x_i \cdot y_i} = \frac{|S_X \cap S_Y|}{|S_X \cup S_Y|} \tag{3.15}$$
This is a particularly intuitive measure because it carefully accounts for the number of common and disjoint elements in the two sets.
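The equivalence between the vector form and the set form of the Jaccard coefficient can be checked numerically. The following is a minimal sketch; the item catalog and the two transactions are hypothetical, and returning 0 for two empty sets is an assumed convention.

```python
def jaccard_vectors(x, y):
    """Vector form: dot / (|x|^2 + |y|^2 - dot)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom else 0.0  # assumed convention for empty input

def jaccard_sets(sx, sy):
    """Set form: |S_X intersect S_Y| / |S_X union S_Y|."""
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0

# Hypothetical market-basket example.
items = ["bread", "butter", "eggs", "milk", "tea"]
sx = {"bread", "butter", "milk"}
sy = {"bread", "milk", "tea", "eggs"}

# Binary representations: 1 if the item is present in the transaction.
x = [1 if item in sx else 0 for item in items]
y = [1 if item in sy else 0 for item in items]

print(jaccard_vectors(x, y), jaccard_sets(sx, sy))  # both 0.4
```

For binary vectors, each product $x_i \cdot y_i$ is 1 only when the item is in both sets, and each $x_i^2$ equals $x_i$, so the denominator counts the elements of the union.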
3.4 Temporal Similarity Measures
Temporal data contain a single contextual attribute representing time and one or more behavioral attributes that measure the properties varying along a particular time period. Temporal data may be represented as continuous time series, or as discrete sequences, depending on the application domain. The latter representation may be viewed as the discrete version of the former. It should be pointed out that discrete sequence data are not always temporal because the contextual attribute may represent placement. This is typically the case in biological sequence data. Discrete sequences are also sometimes referred to as strings. Many of the similarity measures used for time series and discrete sequences can be reused across either domain, though some of the measures are more suited to one of the domains. Therefore, this section will address both data types, and each similarity measure will be discussed in a subsection on either continuous series or discrete series, based on its most common use. For some measures, the usage is common across both data types.
3.4.1 Time-Series Similarity Measures
The design of time-series similarity measures is highly application-specific. For example, the simplest possible similarity measure between two time series of equal length is the Euclidean metric. Although such a metric may work well in many scenarios, it does not account for several distortion factors that are common in many applications. Some of these factors are as follows:
Behavioral attribute scaling and translation: In many applications, the different time series may not be drawn on the same scales. For example, the time series representing various stock prices may show similar patterns of movements, but the absolute values may be very different both in terms of the mean and the standard deviation. The share prices of several different hypothetical stock tickers are illustrated in Fig. 3.7. All three series show similar patterns but with different scaling and some random variations, so they cannot be meaningfully compared if the absolute values of the series are used.
Temporal (contextual) attribute translation: In some applications, such as real-time analysis of financial markets, the different time series may represent the same periods in time. In other applications, such as the analysis of the time series obtained from medical measurements, the absolute time stamp of when the reading was taken is not important. In such cases, the temporal attribute value needs to be shifted in at least one of the time series to allow more effective matching.
Figure 3.7: Impact of scaling, translation, and noise
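The behavioral scaling and translation distortion can be illustrated with a short sketch. It compares two hypothetical price series that have the same shape but a different mean and standard deviation, first with the plain Euclidean metric and then after standardizing each series to zero mean and unit standard deviation. Standardization is one common remedy and is used here as an assumption of the example; the series values and function names are likewise made up.

```python
import math

def euclidean(a, b):
    """Euclidean metric between two series of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def standardize(series):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / len(series))
    if std == 0:
        return [0.0 for _ in series]  # a constant series carries no shape
    return [(v - mean) / std for v in series]

# Hypothetical share prices: stock_b has the same pattern as stock_a,
# but a different scale (factor 5) and offset (100).
stock_a = [10.0, 12.0, 11.0, 13.0, 12.0]
stock_b = [5 * v + 100 for v in stock_a]

raw_distance = euclidean(stock_a, stock_b)
normalized_distance = euclidean(standardize(stock_a), standardize(stock_b))

print(raw_distance > 100)          # raw values look very different
print(normalized_distance < 1e-9)  # the underlying pattern is the same
```

The raw distance is dominated by the difference in absolute values, whereas the standardized series coincide up to floating-point error because one is an increasing linear transformation of the other.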