Data Mining: The Textbook




S(x_i, y_i) = \begin{cases} 1/p_k(x_i)^2 & \text{if } x_i = y_i \\ 0 & \text{otherwise} \end{cases} \qquad (3.6)

A related measure is the Goodall measure. As in the case of the inverse occurrence frequency, a higher similarity value is assigned to a match when the value is infrequent. In a simple variant of this measure [104], the similarity on the kth attribute is defined as 1 − p_k(x_i)^2 when x_i = y_i, and 0 otherwise.



S(x_i, y_i) = \begin{cases} 1 - p_k(x_i)^2 & \text{if } x_i = y_i \\ 0 & \text{otherwise} \end{cases} \qquad (3.7)




The bibliographic notes contain pointers to various similarity measures for categorical data.
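The two measures above are straightforward to implement. The following Python sketch is illustrative rather than taken from the text: it assumes records are tuples of categorical values, estimates the fractions p_k(·) from the data set itself, and sums the per-attribute similarities into an overall score. The function names and the toy data set are hypothetical.

```python
from collections import Counter

def attribute_fractions(records, k):
    """Fraction p_k(v) of records taking each value v on the kth attribute."""
    counts = Counter(rec[k] for rec in records)
    n = len(records)
    return {v: c / n for v, c in counts.items()}

def inverse_occurrence_sim(x, y, p_k):
    """Per-attribute similarity of Eq. 3.6: 1 / p_k(x)^2 on a match, 0 otherwise."""
    return 1.0 / p_k[x] ** 2 if x == y else 0.0

def goodall_sim(x, y, p_k):
    """Per-attribute similarity of Eq. 3.7: 1 - p_k(x)^2 on a match, 0 otherwise."""
    return 1.0 - p_k[x] ** 2 if x == y else 0.0

def categorical_similarity(rec_x, rec_y, records, per_attribute_sim):
    """Sum a per-attribute similarity over all categorical attributes."""
    total = 0.0
    for k in range(len(rec_x)):
        p_k = attribute_fractions(records, k)
        total += per_attribute_sim(rec_x[k], rec_y[k], p_k)
    return total

# Matches on infrequent values receive higher similarity than matches on common ones.
data = [("Red", "PhD"), ("Blue", "BS"), ("Blue", "MS"), ("Red", "BS"), ("Blue", "BS")]
print(categorical_similarity(data[0], data[3], data, goodall_sim))             # match on "Red" only
print(categorical_similarity(data[1], data[4], data, inverse_occurrence_sim))  # match on "Blue" and "BS"
```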


3.2.3 Mixed Quantitative and Categorical Data


It is fairly straightforward to generalize the approach to mixed data by adding the weighted similarity values of the numerical and categorical components. The main challenge is in deciding how to assign the weights to the quantitative and categorical components. For example, consider two records X = (X_n, X_c) and Y = (Y_n, Y_c), where X_n, Y_n are the subsets of numerical attributes and X_c, Y_c are the subsets of categorical attributes. Then, the overall similarity between X and Y is defined as follows:





\text{Sim}(X, Y) = \lambda \cdot \text{NumSim}(X_n, Y_n) + (1 - \lambda) \cdot \text{CatSim}(X_c, Y_c) \qquad (3.8)




The parameter λ regulates the relative importance of the categorical and numerical attributes. The choice of λ is a difficult one. In the absence of domain knowledge about the relative importance of attributes, a natural choice is to use a value of λ that is equal to the fraction of numerical attributes in the data. Furthermore, the proximity in numerical data is often computed with the use of distance functions rather than similarity functions. However, distance values can be converted to similarity values as well. For a distance value of dist, a common approach is to use a kernel mapping that yields [104] the similarity value of 1/(1 + dist).
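As a small illustration of this kernel mapping (a sketch, not from the text; the helper names are hypothetical), the following converts the Euclidean distance between the numerical portions of two records into a similarity value in (0, 1]:

```python
import math

def euclidean_distance(xn, yn):
    """Euclidean distance between the numerical parts of two records."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xn, yn)))

def distance_to_similarity(dist):
    """Kernel mapping of a distance value into a similarity value: 1 / (1 + dist)."""
    return 1.0 / (1.0 + dist)

# A distance of 0 maps to similarity 1; larger distances decay toward 0.
print(distance_to_similarity(euclidean_distance([1.0, 2.0], [1.0, 5.0])))  # 0.25
```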


Further normalization is required to meaningfully compare the similarity value components on the numerical and categorical attributes, which may be on completely different scales. One way of achieving this goal is to determine the standard deviations in the similarity values over the two domains with the use of sample pairs of records. Each component of the similarity value (numerical or categorical) is divided by its standard deviation. Therefore, if σc and σn are the standard deviations of the similarity values in the categorical and numerical components, then Eq. 3.8 needs to be modified as follows:




\text{Sim}(X, Y) = \lambda \cdot \frac{\text{NumSim}(X_n, Y_n)}{\sigma_n} + (1 - \lambda) \cdot \frac{\text{CatSim}(X_c, Y_c)}{\sigma_c} \qquad (3.9)

By performing this normalization, the value of λ becomes more meaningful, as a true relative weight between the two components. By default, this weight can be set to be proportional to the number of attributes in each component unless specific domain knowledge is available about the relative importance of attributes.
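Putting the pieces of this subsection together, the following sketch combines a numerical similarity (the 1/(1 + dist) mapping over Euclidean distance) with a simple categorical similarity (fraction of matching attributes), estimates σn and σc from sampled record pairs, and applies the normalized combination of Eq. 3.9. The function names, the record layout (a tuple of numerical values paired with a tuple of categorical values), and the choice of categorical similarity are illustrative assumptions, and the sketch assumes the estimated standard deviations are nonzero.

```python
import math
import random
import statistics

def num_sim(xn, yn):
    """Numerical similarity via the kernel mapping 1 / (1 + Euclidean distance)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(xn, yn)))
    return 1.0 / (1.0 + dist)

def cat_sim(xc, yc):
    """A simple categorical similarity: fraction of matching attributes (illustrative only)."""
    return sum(a == b for a, b in zip(xc, yc)) / len(xc)

def estimate_sigmas(records, n_pairs=1000, seed=0):
    """Estimate sigma_n and sigma_c from sampled record pairs, as suggested in the text."""
    rng = random.Random(seed)
    num_vals, cat_vals = [], []
    for _ in range(n_pairs):
        (xn, xc), (yn, yc) = rng.sample(records, 2)
        num_vals.append(num_sim(xn, yn))
        cat_vals.append(cat_sim(xc, yc))
    return statistics.stdev(num_vals), statistics.stdev(cat_vals)

def mixed_similarity(x, y, lam, sigma_n, sigma_c):
    """Weighted, standard-deviation-normalized combination of Eq. 3.9."""
    (xn, xc), (yn, yc) = x, y
    return lam * num_sim(xn, yn) / sigma_n + (1 - lam) * cat_sim(xc, yc) / sigma_c

# Each record is (numerical tuple, categorical tuple); lambda = 0.5 here because half
# of the attributes are numerical (the default rule in the absence of domain knowledge).
data = [((1.0, 2.0), ("Red", "PhD")), ((1.5, 2.5), ("Blue", "BS")),
        ((9.0, 1.0), ("Blue", "MS")), ((1.2, 2.2), ("Red", "BS"))]
sigma_n, sigma_c = estimate_sigmas(data, n_pairs=200)
print(mixed_similarity(data[0], data[3], 0.5, sigma_n, sigma_c))
```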


3.3 Text Similarity Measures





Strictly speaking, text can be considered quantitative multidimensional data when it is treated as a bag of words. The frequency of each word can be treated as a quantitative attribute, and the base lexicon can be treated as the full set of attributes. However, the structure of text is sparse, in that most attributes take on 0 values. Furthermore, all word frequencies are nonnegative. This special structure of text has important implications for similarity computation and other mining algorithms. Measures such as the Lp-norm do not adjust well to the varying lengths of the different documents in the collection. For example, the L2-distance between two long documents will almost always be larger than that between two short documents, even if the two long documents have many words in common and the short documents are completely disjoint. How can one normalize for such irregularities? One way of doing so is by using the cosine measure. The cosine measure computes the angle between the two documents, which is insensitive to the absolute length of the document. Let X = (x_1 . . . x_d) and Y = (y_1 . . . y_d) be two documents on a lexicon of size d. Then, the cosine measure cos(X, Y) between X and Y can be defined as follows:

\cos(X, Y) = \frac{\sum_{i=1}^{d} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \cdot \sqrt{\sum_{i=1}^{d} y_i^2}} \qquad (3.10)
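The cosine measure is straightforward to compute on sparse representations. The sketch below is illustrative (the dictionary-based sparse vector format and the function name are assumptions, not from the text); it evaluates Eq. 3.10 while touching only the nonzero entries of each document.

```python
import math

def cosine_similarity(x, y):
    """Cosine measure of Eq. 3.10 for sparse word-frequency vectors stored as dicts."""
    # Dot product only over words that the two documents share.
    dot = sum(freq * y[word] for word, freq in x.items() if word in y)
    norm_x = math.sqrt(sum(freq ** 2 for freq in x.values()))
    norm_y = math.sqrt(sum(freq ** 2 for freq in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)

# The measure depends only on the angle between the documents, not on their lengths.
short_doc = {"data": 1, "mining": 1}
long_doc = {"data": 10, "mining": 10, "text": 1}
print(cosine_similarity(short_doc, long_doc))      # close to 1 despite the length difference
print(cosine_similarity(short_doc, {"graph": 3}))  # 0.0 for disjoint documents
```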




