Data Mining: The Textbook



Yüklə 17,13 Mb.
səhifə168/423
tarix07.01.2024
ölçüsü17,13 Mb.
#211690
1   ...   164   165   166   167   168   169   170   171   ...   423
1-Data Mining tarjima

9.3. HIGH-DIMENSIONAL OUTLIER DETECTION

273




  1. Selection: This step achieves the hill climbing required by the method, though quite differently from traditional hill-climbing methods. The copies of a solution are repli-cated by ordering them by rank and biasing them in the population in the favor of higher ranked solutions. This is referred to as rank selection. This results in a bias in favor of more optimal solutions.




  1. Crossover: The crossover technique is key to the success of the algorithm because it implicitly defines the subspace exploration process. One solution is to use a uniform two-point crossover to create the recombinant children strings. It can be viewed as the process of combining the characteristics of two solutions to create two new recombinant solutions. Traditional hill-climbing methods only test an adjacent solution for a single string. The recombinant crossover approach examines a more complex neighborhood by combining the characteristics of two different strings, to yield two new neighborhood points.

The two-point crossover mechanism works by determining a point in the string at random, called the crossover point, and exchanging the segments to the right of this point. This is equivalent to creating two new subspaces, by sampling subspaces from both solutions and combining them. However, such a blind recombination process may create poor solutions too often. Therefore, an optimized crossover mechanism is defined. In this mechanism, it is guaranteed that both children solutions correspond to a feasible k-dimensional projection, and the children typically have high fitness values. This is achieved by examining a subset of the different possibilities for recombination and picking the best among them.





  1. Mutation: In this case, random positions in the string are flipped with a predefined mutation probability. Care must be taken to ensure that the dimensionality of the projection does not change after the flipping process. This is an exact analogue of traditional hill-climbing except that it is done to a population of solutions to ensure robustness. Thus, while genetic algorithms try to achieve the same goals as hill climb-ing, they do so in a different way to achieve better solutions.

At termination, the algorithm is followed by a postprocessing phase. In the postprocess-ing phase, all data points containing the abnormal projections are reported by the algorithm as the outliers. The approach also provides the relevant projections that provide the causal-ity (or intensional knowledge) for the outlier behavior of a data point. Thus, this approach also has a high degree of interpretability in terms of providing the reasoning for why a data point should be considered an outlier.


9.3.2 Random Subspace Sampling


The determination of the rare subspaces is a very difficult task, as is evident from the unusual genetic algorithm designed discussed in the previous section. A different way of addressing the same problem is to explore many possible subspaces and examine if at least one of them contains outliers. One such well known approach is feature bagging. The broad approach is to repeatedly apply the following two steps:





  1. Randomly, select between (d/2) and d features from the underlying data set in itera-tion t to create a data set Dt in the tth iteration.




  1. Apply the outlier detection algorithm Ot on the data set Dt to create score vectors St.

274 CHAPTER 9. OUTLIER ANALYSIS: ADVANCED CONCEPTS


Virtually any algorithm Ot can be used to determine the outlier score, as long as the scores across different instantiations are comparable. From this perspective, the LOF algorithm is the ideal algorithm to use because of its normalized scores. At the end of the process, the outlier scores from the different algorithms need to be combined. Two distinct methods are used to combine the different subspaces:






  1. Yüklə 17,13 Mb.

    Dostları ilə paylaş:
1   ...   164   165   166   167   168   169   170   171   ...   423




Verilənlər bazası müəlliflik hüququ ilə müdafiə olunur ©azkurs.org 2024
rəhbərliyinə müraciət

gir | qeydiyyatdan keç
    Ana səhifə


yükləyin