Grid-based rare subspace exploration: In this case, rare subspaces of the data are explored after discretizing the data into a grid-like structure.
Random subspace sampling: In this case, subspaces of the data are sampled to discover the most relevant outliers.
The following subsections will discuss each of these methods in detail.
9.3.1 Grid-Based Rare Subspace Exploration
Projected outliers are determined by finding localized regions of the data in low-dimensional space that have abnormally low density. Because this is a density-based approach, a grid-based technique is the method of choice. To do so, nonempty grid-based subspaces whose density is very low need to be identified. This is complementary to the definitions used for subspace clustering methods such as CLIQUE, in which frequent (dense) subspaces are reported. It should be pointed out that the determination of rare subspaces is a much harder problem than the determination of frequent subspaces, simply because there are many more rare subspaces than dense ones in a typical data set. This results in a combinatorial explosion of the number of possibilities, and level-wise algorithms, such as those used in CLIQUE, are no longer practical for finding rare subspaces. The first step in all these models is to determine a proper statistical definition of rare lower dimensional projections.
9.3.1.1 Modeling Abnormal Lower Dimensional Projections
An abnormal lower dimensional projection is one in which the density of the data is exceptionally lower than the average. The concept of the Z-number, introduced in the previous chapter, comes in very handy in this respect. The first step is to discretize the data. Each attribute of the data is divided into p ranges. These ranges are constructed on an equi-depth basis. In other words, each range contains a fraction f = 1/p of the records.
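Because the ranges are equi-depth rather than equi-width, they follow the quantiles of each attribute. The following minimal sketch illustrates this discretization, assuming NumPy; the function name equidepth_discretize is illustrative and not part of the original method description:

```python
import numpy as np

def equidepth_discretize(X, p):
    """Discretize each attribute of X (an n x d array) into p equi-depth
    ranges, so that each range holds roughly a fraction f = 1/p of the
    records (ties in the data can make the counts only approximate).

    Returns an (n x d) integer array whose entry (i, j) is the range
    index (0 .. p-1) of record i along dimension j.
    """
    n, d = X.shape
    bins = np.zeros((n, d), dtype=int)
    for j in range(d):
        # Interior quantiles split the sorted attribute values into
        # p equally populated ranges.
        edges = np.quantile(X[:, j], np.linspace(0.0, 1.0, p + 1)[1:-1])
        bins[:, j] = np.searchsorted(edges, X[:, j], side="right")
    return bins
```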
When k intervals from different dimensions are selected, they define a grid cell of dimensionality k. The expected fraction of the data records in this grid cell is equal to f^k, if the attributes are statistically independent. In practice, the data are not statistically independent, and, therefore, the distribution of the points in the cube often differs significantly from the expected value. Deviations that correspond to rare regions in the data are of particular interest.
Let D be a database with n records and dimensionality d. Under the independence assumption mentioned earlier, the presence or absence of any point in a k-dimensional cube is a Bernoulli random variable with probability f^k. The expected number and standard deviation of the points in a k-dimensional cube are therefore given by n · f^k and √(n · f^k · (1 − f^k)), respectively. When the value of n is large, the number of data points in a cube is a random variable that is well approximated by a normal distribution with the aforementioned mean and standard deviation.
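As a hypothetical numerical illustration, with n = 1,000,000 records, p = 10 ranges per dimension (so that f = 0.1), and k = 3, a cube is expected to contain n · f^k = 1000 points, with a standard deviation of √(1000 · (1 − 0.001)) ≈ 31.6.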
Let R represent a cube in k-dimensional space, and let n_R represent the number of data points inside this cube. Then, the sparsity coefficient S(R) of the cube R can be computed as follows:

S(R) = (n_R − n · f^k) / √(n · f^k · (1 − f^k))

Negative values of S(R) indicate cubes that contain fewer points than expected under the independence assumption, and the most negative values correspond to the rarest regions.
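As a concrete sketch of this computation (assuming NumPy and the illustrative equidepth_discretize helper above; the function name sparsity_coefficients is likewise hypothetical), the following counts the points in every nonempty cube of a chosen k-dimensional subspace and standardizes the counts into sparsity coefficients:

```python
from collections import Counter

import numpy as np

def sparsity_coefficients(bins, dims, p):
    """Compute S(R) for every nonempty cube in the subspace `dims`.

    bins : (n x d) integer array of equi-depth range indices, e.g. as
           produced by the equidepth_discretize sketch above.
    dims : sequence of k dimension indices defining the subspace.
    p    : number of ranges per dimension, so that f = 1/p.
    """
    n = bins.shape[0]
    k = len(dims)
    f_k = (1.0 / p) ** k                   # expected fraction f^k
    mean = n * f_k                         # expected count n * f^k
    std = np.sqrt(n * f_k * (1.0 - f_k))   # binomial standard deviation
    # Count the records falling into each combination of range indices.
    counts = Counter(map(tuple, bins[:, list(dims)]))
    # Standardize each count; strongly negative values flag rare regions.
    return {cube: (n_R - mean) / std for cube, n_R in counts.items()}
```

Only nonempty cubes are enumerated in this sketch, because an empty cube, despite having the most negative possible sparsity coefficient, contains no candidate outliers to report.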