Data Mining: The Textbook




\[
\mathrm{FPOF}(T_i) \;=\; \frac{\sum_{X \in \mathrm{FPS}(D, s_m),\; X \subseteq T_i} s(X, D)}{|\mathrm{FPS}(D, s_m)|} \qquad (9.4)
\]
Intuitively, a transaction containing a large number of frequent patterns with high support will have a high value of FPOF(T_i). Such a transaction is unlikely to be an outlier because it reflects the major patterns in the data. Therefore, lower scores indicate a greater propensity to be an outlier.
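To make Eq. 9.4 concrete, the following is a minimal sketch of the scoring step, assuming that the frequent pattern set FPS(D, s_m) and the relative supports s(X, D) have already been mined (the mining itself is omitted; the names fpof_scores, transactions, and frequent_patterns are illustrative, not from the book):

```python
# Minimal sketch of FPOF scoring (Eq. 9.4). Assumes the frequent patterns
# and their relative supports are already available, e.g., from an
# off-the-shelf Apriori or FP-growth run.

def fpof_scores(transactions, frequent_patterns):
    """transactions: list of item sets.
    frequent_patterns: dict mapping frozenset(pattern) -> support s(X, D)."""
    scores = []
    for t in transactions:
        t = set(t)
        # Sum the supports of all frequent patterns contained in t.
        covered = sum(s for X, s in frequent_patterns.items() if X <= t)
        # Normalize by the total number of frequent patterns.
        scores.append(covered / len(frequent_patterns))
    return scores

# Toy usage: lower scores indicate a greater propensity to be an outlier.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"x", "y"}]
patterns = {frozenset({"a"}): 0.9, frozenset({"b"}): 0.8,
            frozenset({"a", "b"}): 0.7}
print(fpof_scores(transactions, patterns))  # third transaction scores 0.0
```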


Such an approach is analogous to defining outliers by the nonmembership of data points in clusters, rather than by directly measuring the deviation or sparsity level of the transactions. The problem with this approach is that it may not be able to distinguish truly isolated data points from ambient noise, because neither kind of data point is likely to contain many frequent patterns. As a result, such an approach may sometimes fail to effectively identify the strongest anomalies in the data.


9.3 High-Dimensional Outlier Detection


High-dimensional outlier detection can be particularly challenging because the importance of the different attributes varies with data locality. The idea is that the causality of an anomaly can typically be perceived in only a small subset of the dimensions. The remaining dimensions are irrelevant and only add noise to the anomaly-detection process. Furthermore, different subsets of dimensions may be relevant to different anomalies. As a result, full-dimensional analysis often does not properly expose the outliers in high-dimensional data.


This concept is best understood with a motivating example. In Fig. 9.1, four different 2-dimensional views of a hypothetical data set have been illustrated. Each of these views corresponds to a disjoint set of dimensions. As a result, these views look very different from one another. Data point A is exposed as an outlier in the first view of the data set, whereas point B is exposed as an outlier in the fourth view of the data set. However, data points A and B are not exposed as outliers in the second and third views of the data set. These views are therefore not very useful from the perspective of measuring the outlierness of either A or B. Furthermore, from the perspective of any specific data point (e.g., point A), three of the four views are irrelevant. Therefore, the outliers are lost in the random distributions within these views when the distance measurements are performed in full dimensionality. In many scenarios, the proportion of irrelevant views (features) may increase with dimensionality. In such cases, outliers are lost in low-dimensional subspaces of the data because of irrelevant attributes.

[Figure 9.1: Impact of irrelevant attributes on outlier analysis. Four 2-dimensional scatter plots of Feature Y versus Feature X, with points A and B marked in each: (a) View 1, point A is an outlier; (b) View 2, no outliers; (c) View 3, no outliers; (d) View 4, point B is an outlier.]

The physical interpretation of this situation is quite clear in many application-specific scenarios. For example, consider a credit card fraud application in which different features, such as a customer's purchase location, frequency of purchase, and size of purchase, are tracked over time. For one particular customer, the anomaly may be exposed by the location and purchase-frequency attributes. For another anomalous customer, the size and timing of the purchases may be relevant. Therefore, all the features are useful from a global perspective, but only a small subset of features is useful from a local perspective.


The major problem here is that the dilution effects of the vast number of “normally noisy” dimensions will make the detection of outliers difficult. In other words, outliers are lost in low-dimensional subspaces when full-dimensional analysis is used because of the masking and dilution effects of the noise in full-dimensional computations. A similar problem is also discussed in Chap. 7 in the context of data clustering.
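The dilution effect can be seen concretely in a small simulation. The following is a minimal sketch with invented data (not taken from the book): a point that is a clear outlier in a 2-dimensional subspace becomes nearly indistinguishable by its k-nearest-neighbor distance once dozens of irrelevant, uniformly noisy dimensions are appended.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_noise = 300, 48

# Two "relevant" dimensions: a tight Gaussian cluster plus one planted outlier.
relevant = rng.normal(loc=5.0, scale=0.2, size=(n, 2))
relevant[0] = [7.0, 7.0]                      # the planted outlier (point A)

# Many irrelevant, uniformly noisy dimensions dilute the distances.
noise = rng.uniform(0.0, 10.0, size=(n, d_noise))
full = np.hstack([relevant, noise])

def knn_dist_rank(data, k=5):
    # Rank of point 0 when all points are ordered by their distance to
    # their k-th nearest neighbor (rank 1 = most outlying). Brute force
    # is fine at this scale.
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    knn = np.sort(dists, axis=1)[:, k]        # column 0 is the self-distance
    return int(np.argsort(np.argsort(-knn))[0]) + 1

print("rank of point A in the 2-d subspace:", knn_dist_rank(relevant))  # near 1
print("rank of point A in 50 dimensions:", knn_dist_rank(full))  # typically far from 1
```

The planted point dominates the k-NN distance ranking in the relevant 2-dimensional view, but its small deviation is swamped by the aggregate noise of the 48 irrelevant dimensions in the full-dimensional computation.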


In the case of data clustering, this problem is solved by defining subspace-specific clusters, or projected clusters. This approach also provides a natural path for outlier analysis in high dimensions. In other words, an outlier can now be defined by associating it with one or more subspaces that are specific to that outlier. While there is a clear analogy between the problems of subspace clustering and subspace outlier detection, the difficulty levels of the two problems are not even remotely similar.


Clustering is, after all, about the determination of frequent groups of data points, whereas outliers are about determination of rare groups of data points. As a rule, statistical learning methods find it much easier to determine frequent characteristics than rare characteristics of a data set. This problem is further magnified in high dimensionality. The number of possible subspaces of a d-dimensional data point is $2^d$. Of these, only a small fraction will expose the outlier behavior of individual data points. In the case of clustering, dense subspaces can be easily determined by aggregate statistical analysis of the data points. This is not true of outlier detection, where the subspaces need to be explicitly explored in a way that is specific to the individual data points.
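As a quick sanity check on this combinatorial claim, the following illustration (not from the book) prints the number of non-empty axis-parallel subspaces for a few dimensionalities; exhaustive enumeration is clearly ruled out even for moderate d:

```python
# A d-dimensional data set has 2^d - 1 non-empty axis-parallel subspaces.
for d in (10, 20, 50):
    print(f"d = {d:2d}: {2**d - 1:,} subspaces")
# d = 10: 1,023 subspaces
# d = 20: 1,048,575 subspaces
# d = 50: 1,125,899,906,842,623 subspaces
```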


An effective outlier detection method would need to search the data points and dimensions in an integrated way to reveal the most relevant outliers. This is because different subsets of dimensions may be relevant to different outliers, as is evident from the example of Fig. 9.1. The integration of point and subspace exploration leads to a further expansion in the number of possibilities that need to be examined for outlier analysis. This chapter will explore two methods for subspace exploration, though many other methods are pointed out in the bibliographic notes. These methods are as follows:





