Data Mining: The Textbook




$$\sqrt{\sum_{i=1}^{d} x_i^2} \cdot \sqrt{\sum_{i=1}^{d} y_i^2}$$

The aforementioned measure simply uses the raw frequencies between attributes. However, as in other data types, it is possible to use global statistical measures to improve the similarity computation. For example, if two documents match on an uncommon word, it is more indicative of similarity than the case where two documents match on a word that occurs very commonly. The inverse document frequency id_i, which is a decreasing function of the number of documents n_i in which the ith word occurs, is commonly used for normalization:

$$id_i = \log(n / n_i) \tag{3.11}$$

Here, the number of documents in the collection is denoted by n. Another common adjustment is to ensure that the excessive presence of a single word does not throw off the similarity measure. A damping function f(·), such as the square root or the logarithm, is optionally applied to the frequencies before similarity computation.

$$f(x_i) = \sqrt{x_i}$$

$$f(x_i) = \log(x_i)$$

In many cases, the damping function is not used, which is equivalent to setting f (x_i) to x_i.

Therefore, the normalized frequency h(x_i) for the ith word may be defined as follows:

$$h(x_i) = f(x_i) \cdot id_i \tag{3.12}$$
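As a rough illustration of Eqs. 3.11 and 3.12, the following minimal sketch computes the normalized frequency of a single word. The function names, the collection size, and the document counts are hypothetical choices made for this example; the square root is used as the default damping function, and the identity can be passed to disable damping.

```python
import math

def inverse_document_frequency(n, n_i):
    """Eq. 3.11: id_i = log(n / n_i)."""
    return math.log(n / n_i)

def normalized_frequency(x_i, n, n_i, damping=math.sqrt):
    """Eq. 3.12: h(x_i) = f(x_i) * id_i, where f is the damping function."""
    return damping(x_i) * inverse_document_frequency(n, n_i)

# Hypothetical collection of n = 1000 documents.
n = 1000

# Both words occur 9 times in the document, but one is common in the
# collection (900 documents) and the other is rare (10 documents).
common = normalized_frequency(9, n, n_i=900)
rare = normalized_frequency(9, n, n_i=10)

# Passing the identity as the damping function reproduces the undamped case.
undamped = normalized_frequency(9, n, n_i=10, damping=lambda v: v)
```

With the same raw frequency, the rare word receives a much larger normalized frequency than the common one, which is the intended effect of the inverse document frequency.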

Then, the cosine measure is defined as in Eq. 3.10, except that the normalized frequencies of the words are used:

$$\cos(\overline{X}, \overline{Y}) = \frac{\sum_{i=1}^{d} h(x_i) \cdot h(y_i)}{\sqrt{\sum_{i=1}^{d} h(x_i)^2} \cdot \sqrt{\sum_{i=1}^{d} h(y_i)^2}} \tag{3.13}$$
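The effect of the normalization on the cosine measure can be sketched as follows. This is only an illustrative example: the three-word vocabulary, the document counts, and the helper names `normalize` and `cosine` are assumptions, and returning 0 for an all-zero vector is a convention chosen here rather than something the text prescribes.

```python
import math

def normalize(x, doc_counts, n, damping=math.sqrt):
    """h(x_i) = f(x_i) * log(n / n_i) for every word i (Eqs. 3.11 and 3.12)."""
    return [damping(x_i) * math.log(n / n_i) for x_i, n_i in zip(x, doc_counts)]

def cosine(u, v):
    """Cosine of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms > 0 else 0.0

# Hypothetical collection: word 0 is very common, word 1 is rare.
n = 1000
doc_counts = [900, 10, 500]

# The two documents overlap only on the very common word 0.
x = [4, 1, 0]
y = [4, 0, 1]

raw = cosine(x, y)                                  # raw frequencies
weighted = cosine(normalize(x, doc_counts, n),      # Eq. 3.13
                  normalize(y, doc_counts, n))
```

On raw frequencies the two documents look very similar, because the shared common word dominates both vectors. After normalization the similarity drops sharply, since a match on a common word carries little weight.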

Another measure that is less commonly used for text is the Jaccard coefficient J(X, Y):

$$J(\overline{X}, \overline{Y}) = \frac{\sum_{i=1}^{d} h(x_i) \cdot h(y_i)}{\sum_{i=1}^{d} h(x_i)^2 + \sum_{i=1}^{d} h(y_i)^2 - \sum_{i=1}^{d} h(x_i) \cdot h(y_i)} \tag{3.14}$$

The Jaccard coefficient is rarely used for the text domain, but it is used commonly for sparse binary data sets.
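A minimal sketch of Eq. 3.14 is given below. It assumes the input vectors already contain the normalized frequencies h(x_i) and h(y_i); the function name and the sample values are hypothetical, and returning 0 when both vectors are all zeros is a convention chosen for this example.

```python
def jaccard(hx, hy):
    """Eq. 3.14 on vectors of normalized frequencies h(x_i) and h(y_i)."""
    dot = sum(a * b for a, b in zip(hx, hy))
    denom = sum(a * a for a in hx) + sum(b * b for b in hy) - dot
    # The denominator is zero only when both vectors are all zeros.
    return dot / denom if denom > 0 else 0.0

# Hypothetical normalized frequency vectors over three words.
hx = [1.0, 2.0, 0.0]
hy = [1.0, 0.0, 3.0]

similarity = jaccard(hx, hy)       # dot = 1, denominator = 5 + 10 - 1
self_similarity = jaccard(hx, hx)  # a vector compared with itself
```

A vector compared with itself yields a coefficient of 1, because the denominator then reduces to the same sum of squares as the numerator.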


3.3.1 Binary and Set Data

Binary multidimensional data are a representation of set-based data, where a value of 1 indicates the presence of an element in a set. Binary data occur commonly in market-basket domains in which transactions contain information corresponding to whether or not an item is present in a transaction. Such data can be considered a special case of text data in which word frequencies are either 0 or 1. If S_X and S_Y are two sets with binary representations X and Y, then it can be shown that applying Eq. 3.14 to the raw binary representation of the two sets is equivalent to:

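$$J(\overline{X}, \overline{Y}) = \frac{\sum_{i=1}^{d} x_i \cdot y_i}{\sum_{i=1}^{d} x_i^2 + \sum_{i=1}^{d} y_i^2 - \sum_{i=1}^{d} x_i \cdot y_i} = \frac{|S_X \cap S_Y|}{|S_X \cup S_Y|} \tag{3.15}$$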

This is a particularly intuitive measure because it carefully accounts for the number of common and disjoint elements in the two sets.
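The equivalence in Eq. 3.15 can be checked on a small market-basket example. The item names and the two transactions below are hypothetical, as are the function names; the sketch computes the coefficient once from the binary vectors and once directly from the sets.

```python
def jaccard_binary(x, y):
    """Left-hand side of Eq. 3.15, computed on 0/1 vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    return dot / denom if denom > 0 else 0.0

def jaccard_sets(s_x, s_y):
    """Right-hand side of Eq. 3.15: |S_X intersect S_Y| / |S_X union S_Y|."""
    union = s_x | s_y
    return len(s_x & s_y) / len(union) if union else 0.0

# Hypothetical universe of items and two transactions.
items = ["bread", "milk", "eggs", "beer", "diapers"]
s_x = {"bread", "milk", "eggs"}
s_y = {"milk", "eggs", "beer"}

# Binary representation: 1 if the item is present in the transaction.
x = [1 if item in s_x else 0 for item in items]
y = [1 if item in s_y else 0 for item in items]

from_vectors = jaccard_binary(x, y)
from_sets = jaccard_sets(s_x, s_y)
```

The two values agree because, for 0/1 entries, the sum of x_i · y_i counts the common items, while the sum of x_i² and the sum of y_i² count the items in each set.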

3.4 Temporal Similarity Measures

Temporal data contain a single contextual attribute representing time and one or more behavioral attributes that measure the properties varying along a particular time period. Temporal data may be represented as continuous time series, or as discrete sequences, depending on the application domain. The latter representation may be viewed as the discrete version of the former. It should be pointed out that discrete sequence data are not always temporal because the contextual attribute may represent placement. This is typically the case in biological sequence data. Discrete sequences are also sometimes referred to as strings. Many of the similarity measures used for time series and discrete sequences can be reused across either domain, though some of the measures are more suited to one of the domains. Therefore, this section will address both data types, and each similarity measure will be discussed in a subsection on either continuous series or discrete series, based on its most common use. For some measures, the usage is common across both data types.

3.4.1 Time-Series Similarity Measures

The design of time-series similarity measures is highly application-specific. For example, the simplest possible similarity measure between two time series of equal length is the Euclidean metric. Although such a metric may work well in many scenarios, it does not account for several distortion factors that are common in many applications. Some of these factors are as follows:

Behavioral attribute scaling and translation: In many applications, the different time series may not be drawn on the same scales. For example, the time series representing various stock prices may show similar patterns of movements, but the absolute values may be very different both in terms of the mean and the standard deviation. The share prices of several different hypothetical stock tickers are illustrated in Fig. 3.7. All three series show similar patterns but with different scaling and some random variations, so they cannot be meaningfully compared if the absolute values of the series are used.

Figure 3.7: Impact of scaling, translation, and noise

Temporal (contextual) attribute translation: In some applications, such as real-time analysis of financial markets, the different time series may represent the same periods in time. In other applications, such as the analysis of the time series obtained from medical measurements, the absolute time stamp of when the reading was taken is not important. In such cases, the temporal attribute value needs to be shifted in at least one of the time series to allow more effective matching.
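The effect of behavioral attribute scaling and translation on the Euclidean metric can be illustrated with a small sketch. The two series below are hypothetical and noise-free, and rescaling each series to zero mean and unit standard deviation is an assumption made for this example; it is one common remedy, not a method stated in the passage above.

```python
import math
import statistics

def euclidean(a, b):
    """Euclidean distance between two time series of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def standardize(series):
    """Rescale a series to zero mean and unit standard deviation (assumed remedy)."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    return [(v - mean) / std for v in series]

# Hypothetical price pattern shared by two stock tickers.
pattern = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0]
stock_a = [10 + 2 * v for v in pattern]     # low-priced ticker
stock_b = [100 + 20 * v for v in pattern]   # same shape, different mean and scale

raw_distance = euclidean(stock_a, stock_b)
standardized_distance = euclidean(standardize(stock_a), standardize(stock_b))
```

On the absolute values the two series appear far apart even though they follow exactly the same pattern. After standardization the distance is essentially zero, because both series reduce to the same shape. A constant series would have zero standard deviation and would need separate handling in this sketch.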
