Data Mining: The Textbook




1. Grid-based rare subspace exploration: In this case, rare subspaces of the data are explored after discretizing the data into a grid-like structure.

2. Random subspace sampling: In this case, subspaces of the data are sampled to discover the most relevant outliers.

The following subsections will discuss each of these methods in detail.


9.3.1 Grid-Based Rare Subspace Exploration


Projected outliers are determined by finding localized regions of the data in low-dimensional space that have abnormally low density. Because this is a density-based approach, a grid-based technique is the method of choice. To do so, nonempty grid-based subspaces whose density is very low need to be identified. This is complementary to the definitions that were used for subspace clustering methods such as CLIQUE, in which frequent subspaces were reported. It should be pointed out that the determination of frequent subspaces is a much simpler problem than the determination of rare subspaces, simply because there are many more rare subspaces than there are dense ones in a typical data set. This results in a combinatorial explosion of the number of possibilities, and level-wise algorithms, such as those used in CLIQUE, are no longer practical avenues for finding rare subspaces. The first step in all these models is to determine a proper statistical definition of rare lower dimensional projections.




9.3.1.1 Modeling Abnormal Lower Dimensional Projections


An abnormal lower dimensional projection is one in which the density of the data is exceptionally lower than the average. The concept of Z-number, introduced in the previous chapter, comes in very handy in this respect. The first step is to discretize the data. Each attribute of the data is divided into p ranges. These ranges are constructed on an equi-depth basis. In other words, each range contains a fraction f = 1/p of the records.
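As a concrete sketch of this discretization step (a minimal illustration, assuming numpy; the function name equidepth_discretize and the variable names are hypothetical, not from the text):

```python
import numpy as np

def equidepth_discretize(X, p):
    """Discretize each column of X into p equi-depth ranges.

    Returns an integer array of the same shape as X, where entry
    (i, j) is the index (0 .. p-1) of the range that record i falls
    into along attribute j. Each range holds roughly a fraction
    f = 1/p of the records.
    """
    n, d = X.shape
    bins = np.zeros((n, d), dtype=int)
    for j in range(d):
        # Interior quantile boundaries split the column into p
        # equally populated ranges (the equi-depth property).
        edges = np.quantile(X[:, j], np.linspace(0, 1, p + 1)[1:-1])
        bins[:, j] = np.searchsorted(edges, X[:, j], side="right")
    return bins
```

Quantile boundaries enforce the equi-depth property directly, so each of the p ranges holds roughly a 1/p fraction of the records even when an attribute is heavily skewed.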


When k different intervals from different dimensions are selected, they create a grid cell of dimensionality k. The expected fraction of the data records in this grid cell is equal to f^k, if the attributes are statistically independent. In practice, the data are not statistically independent, and, therefore, the distribution of the points in the cube differs significantly from the expected value. Deviations that correspond to rare regions in the data are of particular interest.

Let D be a database with n records and dimensionality d. Under the independence assumption mentioned earlier, the presence or absence of any point in a k-dimensional cube is a Bernoulli random variable with probability f^k. The expected number of points in a k-dimensional cube is therefore n · f^k, and the standard deviation is √(n · f^k · (1 − f^k)). When the value of n is large, the number of data points in a cube is a random variable that is approximated by a normal distribution, with the aforementioned mean and standard deviation.
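As a purely illustrative example (the numbers are not from the text): with n = 10,000 records, p = 10 ranges per attribute (so f = 0.1), and a cube of dimensionality k = 3, the expected count is 10,000 · 0.1^3 = 10 points, and the standard deviation is √(10,000 · 0.001 · 0.999) ≈ 3.16. A cube containing a single point would therefore lie almost three standard deviations below its expected count.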


Let R represent a cube in k-dimensional space, and let n_R represent the number of data points inside this cube. Then, the sparsity coefficient S(R) of the cube R can be computed as follows:

S(R) = (n_R − n · f^k) / √(n · f^k · (1 − f^k))
