$$
S(x_i, y_i) =
\begin{cases}
1/p_k(x_i)^2 & \text{if } x_i = y_i \\
0 & \text{otherwise}
\end{cases}
\tag{3.6}
$$
A related measure is the Goodall measure. As in the case of the inverse occurrence frequency, a higher similarity value is assigned to a match when the value is infrequent. In a simple variant of this measure [104], the similarity on the kth attribute is defined as $1 - p_k(x_i)^2$ when $x_i = y_i$, and 0 otherwise.
$$
S(x_i, y_i) =
\begin{cases}
1 - p_k(x_i)^2 & \text{if } x_i = y_i \\
0 & \text{otherwise}
\end{cases}
\tag{3.7}
$$
The bibliographic notes contain pointers to various similarity measures for categorical data.
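The two frequency-based measures above can be sketched in a few lines of Python. This is a minimal illustration, assuming that $p_k(x)$ denotes the fraction of records in which the kth attribute takes the value $x$; the function names and the toy column are hypothetical and chosen only for this example.

```python
from collections import Counter


def value_fractions(column):
    """Return p_k(x): the fraction of records taking each value x
    in a single categorical attribute (assumed definition of p_k)."""
    counts = Counter(column)
    n = len(column)
    return {value: count / n for value, count in counts.items()}


def inverse_occurrence_similarity(x, y, fractions):
    """Eq. 3.6: 1 / p_k(x)^2 on a match, 0 otherwise."""
    return 1.0 / fractions[x] ** 2 if x == y else 0.0


def goodall_similarity(x, y, fractions):
    """Eq. 3.7 (simple Goodall variant): 1 - p_k(x)^2 on a match, 0 otherwise."""
    return 1.0 - fractions[x] ** 2 if x == y else 0.0


# Toy attribute: "red" is frequent (3 of 4 records), "blue" is rare (1 of 4).
colors = ["red", "red", "red", "blue"]
p = value_fractions(colors)

rare_match_iof = inverse_occurrence_similarity("blue", "blue", p)
common_match_iof = inverse_occurrence_similarity("red", "red", p)
rare_match_goodall = goodall_similarity("blue", "blue", p)
common_match_goodall = goodall_similarity("red", "red", p)
mismatch = goodall_similarity("red", "blue", p)
```

In both measures a match on the rare value scores higher than a match on the frequent one. The Goodall variant stays within [0, 1), whereas the inverse occurrence frequency is unbounded as $p_k(x)$ approaches 0.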
3.2.3 Mixed Quantitative and Categorical Data
It is fairly straightforward to generalize the approach to mixed data by taking a weighted sum of the similarities on the numerical and categorical components. The main challenge is in deciding how to assign the weights of the two components. For example, consider two records $X = (X_n, X_c)$ and $Y = (Y_n, Y_c)$, where $X_n$, $Y_n$ are the subsets of numerical attributes and $X_c$, $Y_c$ are the subsets of categorical attributes. Then, the overall similarity between $X$ and $Y$ is defined as follows:
$$
Sim(X, Y) = \lambda \cdot NumSim(X_n, Y_n) + (1 - \lambda) \cdot CatSim(X_c, Y_c).
\tag{3.8}
$$
The parameter λ regulates the relative importance of the categorical and numerical attributes. The choice of λ is a difficult one. In the absence of domain knowledge about the relative importance of attributes, a natural choice is to use a value of λ that is equal to the fraction of numerical attributes in the data. Furthermore, the proximity in numerical data is often computed with the use of distance functions rather than similarity functions. However, distance values can be converted to similarity values as well. For a distance value of dist, a common approach is to use a kernel mapping that yields [104] the similarity value of 1/(1 + dist).
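As a rough sketch of Eq. 3.8, the snippet below combines a numerical similarity obtained from the 1/(1 + dist) mapping with a categorical similarity, and defaults λ to the fraction of numerical attributes. Two choices here are assumptions made only for illustration: the Euclidean distance on the numerical part, and a simple overlap measure (fraction of matching attributes) on the categorical part. All function names are hypothetical.

```python
import math


def numeric_similarity(xn, yn):
    """Convert a distance into a similarity with the 1 / (1 + dist) mapping.
    Assumption: Euclidean distance (math.dist requires Python 3.8+)."""
    dist = math.dist(xn, yn)
    return 1.0 / (1.0 + dist)


def categorical_similarity(xc, yc):
    """Assumption: simple overlap, i.e. the fraction of matching attributes.
    Expects at least one categorical attribute."""
    matches = sum(1 for a, b in zip(xc, yc) if a == b)
    return matches / len(xc)


def mixed_similarity(xn, yn, xc, yc, lam=None):
    """Eq. 3.8: lam * NumSim + (1 - lam) * CatSim.
    When lam is not given, use the fraction of numerical attributes."""
    if lam is None:
        lam = len(xn) / (len(xn) + len(xc))
    return (lam * numeric_similarity(xn, yn)
            + (1.0 - lam) * categorical_similarity(xc, yc))


# Two records with 2 numerical and 2 categorical attributes each.
sim_default = mixed_similarity((0.0, 0.0), (3.0, 4.0), ("a", "b"), ("a", "c"))
sim_numeric_only = mixed_similarity((0.0, 0.0), (3.0, 4.0), ("a", "b"), ("a", "c"), lam=1.0)
```

With λ = 1 the categorical part is ignored entirely, and with λ = 0 only the categorical part counts, which makes the role of λ as a relative weight explicit.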
Further normalization is required to meaningfully compare the similarity value components on the numerical and categorical attributes that may be on completely different scales. One way of achieving this goal is to determine the standard deviations in the similarity values over the two domains with the use of sample pairs of records. Each component of the similarity value (numerical or categorical) is divided by its standard deviation. Therefore, if $\sigma_c$ and $\sigma_n$ are the standard deviations of the similarity values in the categorical and numerical components, then Eq. 3.8 needs to be modified as follows:
$$
Sim(X, Y) = \lambda \cdot NumSim(X_n, Y_n)/\sigma_n + (1 - \lambda) \cdot CatSim(X_c, Y_c)/\sigma_c.
\tag{3.9}
$$
By performing this normalization, the value of λ becomes more meaningful, as a true relative weight between the two components. By default, this weight can be set to be proportional to the number of attributes in each component unless specific domain knowledge is available about the relative importance of attributes.
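The normalization in Eq. 3.9 can be sketched as follows. The example assumes that the numerical and categorical similarity values have already been computed on a sample of record pairs; the sample values are invented for illustration, and the use of the population standard deviation (rather than the sample standard deviation) is an arbitrary choice that the text does not specify.

```python
import statistics


def component_std(sample_similarities):
    """Standard deviation of one similarity component over sample pairs.
    Assumption: population standard deviation."""
    sigma = statistics.pstdev(sample_similarities)
    if sigma == 0.0:
        raise ValueError("component has zero variance; cannot normalize")
    return sigma


def normalized_mixed_similarity(num_sim, cat_sim, sigma_n, sigma_c, lam):
    """Eq. 3.9: each component is divided by its standard deviation
    before the lambda-weighted combination."""
    return lam * num_sim / sigma_n + (1.0 - lam) * cat_sim / sigma_c


# Hypothetical similarity values observed on three sample pairs of records.
sample_num_sims = [0.1, 0.2, 0.3]   # numerical component, narrow spread
sample_cat_sims = [0.0, 0.5, 1.0]   # categorical component, wide spread

sigma_n = component_std(sample_num_sims)
sigma_c = component_std(sample_cat_sims)

combined = normalized_mixed_similarity(0.2, 0.5, sigma_n, sigma_c, lam=0.5)
```

After division by its standard deviation, each component has unit spread over the sample, so neither one dominates the sum merely because of its scale.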
3.3 Text Similarity Measures
Strictly speaking, text can be considered quantitative multidimensional data when it is treated as a bag of words. The frequency of each word can be treated as a quantitative attribute, and the base lexicon can be treated as the full set of attributes. However, the
structure of text is sparse, with most attributes taking on 0 values. Furthermore, all word frequencies are nonnegative. This special structure of text has important implications for similarity computation and other mining algorithms. Measures such as the $L_p$-norm do not adjust well to the varying length of the different documents in the collection. For example, the $L_2$-distance between two long documents will almost always be larger than that between two short documents, even if the two long documents have many words in common and the short documents are completely disjoint. How can one normalize for such irregularities? One way of doing so is by using the cosine measure. The cosine measure computes the cosine of the angle between the two documents, which is insensitive to the absolute length of the document. Let $X = (x_1 \ldots x_d)$ and $Y = (y_1 \ldots y_d)$ be two documents on a lexicon of size $d$. Then, the cosine measure $\cos(X, Y)$ between $X$ and $Y$ can be defined as follows:
$$
\cos(X, Y) = \frac{\sum_{i=1}^{d} x_i \cdot y_i}{\sqrt{\sum_{i=1}^{d} x_i^2} \cdot \sqrt{\sum_{i=1}^{d} y_i^2}}
\tag{3.10}
$$
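The contrast between the $L_2$-distance and the cosine measure can be illustrated with a small sketch. The word-frequency vectors below are invented for illustration, and returning 0 for an all-zero document is a convention adopted here, not something stated in the text.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two word-frequency vectors:
    dot product divided by the product of the two vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_x * norm_y)


def l2_distance(x, y):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two long documents sharing two of their three frequent words.
long_a = [10, 10, 10, 0]
long_b = [10, 10, 0, 10]

# Two short documents with no words in common.
short_a = [1, 0, 0, 0]
short_b = [0, 1, 0, 0]

l2_long = l2_distance(long_a, long_b)
l2_short = l2_distance(short_a, short_b)
cos_long = cosine(long_a, long_b)
cos_short = cosine(short_a, short_b)
```

Here the $L_2$-distance ranks the disjoint short documents as closer than the overlapping long ones, while the cosine measure ranks the overlapping long documents as more similar. Scaling either vector by a positive constant leaves the cosine unchanged, which is the length insensitivity described above.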