Data Mining: The Textbook




f(Oi, Oj, Θ) = 0 if (Oi, Oj) ∈ S, and f(Oi, Oj, Θ) = 1 if (Oi, Oj) ∈ D.        (3.23)

This can be expressed as a least squares optimization problem over Θ, with the following error E:





E = Σ(Oi,Oj)∈S (f(Oi, Oj, Θ) − 0)² + Σ(Oi,Oj)∈D (f(Oi, Oj, Θ) − 1)².        (3.24)




This objective function can be optimized with respect to Θ with the use of any off-the-shelf optimization solver. If desired, the additional constraint Θ ≥ 0 can be added where appropriate. For example, when Θ represents the feature weights (a1 . . . ad) in the Minkowski metric, it is natural to assume nonnegativity of the coefficients. Such a constrained optimization problem can be easily solved using many nonlinear optimization methods. The use of a closed form such as f(Oi, Oj, Θ) ensures that the function can be computed efficiently after the one-time cost of computing the parameters Θ.
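As a minimal sketch of this idea, the following learns nonnegative Minkowski feature weights Θ by minimizing the least-squares error of Eq. 3.24 with a bound-constrained solver. The toy points, the pair sets S and D, and the choice p = 1 are all illustrative assumptions, not part of the text.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: 2-D points with user-labeled pairs.
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
S = [(0, 1), (2, 3)]   # pairs marked as similar
D = [(0, 2), (1, 3)]   # pairs marked as dissimilar

def f(theta, i, j, p=1.0):
    """Weighted Minkowski distance with feature weights theta."""
    return np.sum(theta * np.abs(X[i] - X[j]) ** p) ** (1.0 / p)

def error(theta):
    """Least-squares objective of Eq. 3.24: similar pairs should
    map near 0, dissimilar pairs near 1."""
    e = sum(f(theta, i, j) ** 2 for i, j in S)
    e += sum((f(theta, i, j) - 1.0) ** 2 for i, j in D)
    return e

# Nonnegativity constraint theta >= 0 expressed as bounds.
res = minimize(error, x0=np.ones(2), bounds=[(0, None)] * 2)
theta = res.x
```

After the one-time cost of fitting theta, each subsequent distance evaluation is a single closed-form computation, as the text notes.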


Where possible, user feedback should be used to improve the quality of the distance function. The problem of learning distance functions can be modeled more generally as that of classification. The classification problem will be studied in detail in Chaps. 10 and 11. Supervised distance function design with the use of Fisher’s method is also discussed in detail in the section on instance-based learning in Chap. 10.


3.7 Summary


The problem of distance function design is a crucial one in the context of data mining applications. This is because many data mining algorithms use the distance function as a key subroutine, and the design of the function directly impacts the quality of the results. Distance functions are highly sensitive to the type of the data, the dimensionality of the data, and the global and local nature of the data distribution.


The Lp-norm is the most common distance function used for multidimensional data. This distance function does not seem to work well with increasing dimensionality. Higher values of p work particularly poorly with increasing dimensionality. In some cases, it has been shown that fractional metrics are particularly effective when p is chosen in the range (0, 1). Numerous proximity-based measures have also been shown to work effectively with increasing dimensionality.
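The following sketch illustrates these observations; the data, query, and sample sizes are arbitrary assumptions. It computes the Lp distance for any p, including fractional values in (0, 1), and measures the relative contrast between the farthest and nearest neighbor, which tends to shrink as the dimensionality grows.

```python
import numpy as np

def lp_distance(x, y, p):
    """Minkowski (Lp) distance; p in (0, 1) gives a fractional
    measure (not a true metric: the triangle inequality fails)."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)

def relative_contrast(d, p, n=200):
    """(max - min) / min distance from a random query to n random
    points in [0, 1]^d; small values indicate distance concentration."""
    pts = rng.random((n, d))
    q = rng.random(d)
    dists = np.array([lp_distance(q, x, p) for x in pts])
    return (dists.max() - dists.min()) / dists.min()
```

Comparing `relative_contrast` at low and high d (and at small versus large p) reproduces the qualitative behavior described above.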


The data distribution also has an impact on the distance function design. The simplest possible distance function that uses global distributions is the Mahalanobis metric. This metric is a generalization of the Euclidean measure, and stretches the distance values along the principal components according to their variance. A more sophisticated approach, referred to as ISOMAP, uses nonlinear embeddings to account for the impact of nonlinear data distributions. Local normalization can often provide more effective measures when the distribution of the data is heterogeneous.
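A small sketch of the Mahalanobis metric on synthetic data (the data and test points are illustrative assumptions): a long displacement along a high-variance principal direction receives a smaller Mahalanobis distance than a shorter Euclidean displacement along a low-variance direction, which is exactly the variance-based stretching described above.

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance: Euclidean distance after whitening
    by the inverse covariance of the data."""
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

# Toy data elongated along the first axis (std 10 vs. std 1).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) * np.array([10.0, 1.0])
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

origin = np.zeros(2)
a = np.array([10.0, 0.0])  # far, but along the high-variance axis
b = np.array([0.0, 3.0])   # closer in Euclidean terms, but extreme
                           # relative to the low-variance axis
```

Here `a` is Euclidean-farther from the origin than `b`, yet Mahalanobis-closer, because 10 units is ordinary along a direction with standard deviation 10, while 3 units is extreme along a direction with standard deviation 1.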


Other data types such as categorical data, text, temporal, and graph data present further challenges. The determination of time-series and discrete-sequence similarity measures is closely related because the latter can be considered the categorical version of the former. The main problem is that two similar time series may exhibit different scaling of their behavioral and contextual attributes. This needs to be accounted for with the use of different normalization functions for the behavioral attribute, and the use of warping functions for the contextual attribute.



