Model combination: The combination process refers to the approach used to integrate the outlier scores from different components. For example, in random subspace sampling, the cumulative outlier score over the different ensemble components is reported. Other combination functions include the use of the maximum score, or the highest rank of a point's score among all ensemble components. It is assumed that higher ranks imply a greater propensity to be an outlier. Therefore, the highest-rank combination is similar to the maximum outlier score, except that the rank is used instead of the raw score. The highest-rank combination function is also used by random subspace sampling.
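The three combination functions above can be illustrated with a small sketch. The score matrix below is hypothetical, and the rank convention (rank 1 = most outlier-like) follows the discussion above:

```python
import numpy as np

# Hypothetical outlier scores from 3 ensemble components for 5 points
# (rows = components, columns = points); higher = more outlier-like.
scores = np.array([
    [0.1, 0.9, 0.3, 0.2, 0.5],
    [0.2, 0.8, 0.2, 0.1, 0.6],
    [0.1, 0.7, 0.4, 0.3, 0.5],
])

# Cumulative (average) score, as in random subspace sampling:
avg_score = scores.mean(axis=0)

# Maximum score across components:
max_score = scores.max(axis=0)

# Rank-based combination: rank the points within each component so that
# rank 1 is the most outlier-like, then take the best (smallest) rank
# that each point achieves over all components.
n = scores.shape[1]
ranks = n - scores.argsort(axis=1).argsort(axis=1)  # per-component ranks
best_rank = ranks.min(axis=0)                       # highest rank per point
```

Here the second point receives the top score from every component, so all three combination functions flag it most strongly; with noisier components, the three functions can disagree, which is why the choice of combination function matters.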
Outlier ensemble methods can be categorized into different types, depending on the dependencies among different components and the process of selecting a specific model. The following section will study some of the common methods.
9.4.1 Categorization by Component Independence
This categorization examines whether or not the components are developed independently.
In sequential ensembles, a given algorithm or set of algorithms is applied sequentially, so that future applications of the algorithm are influenced by previous applications. This influence may be realized either through modifications of the base data for analysis, or through the specific choices of the algorithms. The final result is either a weighted combination of the results from the different executions, or simply the result of the last application of the outlier detection algorithm. The typical scenario for sequential ensembles is that of model refinement, where successive iterations continue to improve a particular base model.
In independent ensembles, different algorithms, or different instantiations of the same algorithm, are independently applied to either the complete data or portions of the data. In other words, the various ensemble components are independent of the results of each other’s executions.
In this section, both types of ensembles will be studied in detail.
9.4.1.1 Sequential Ensembles
In sequential ensembles, one or more outlier detection algorithms are applied sequentially to either all or portions of the data. The idea is that the result of a particular algorithmic execution may provide insights that help refine future executions. Thus, depending on the approach, either the data set or the algorithm may be changed in successive executions. For example, consider the case where a clustering model is created for outlier detection. Because outliers interfere with robust cluster generation, one possibility would be to apply the method to a successively refined data set, after removing the obvious outliers identified in earlier iterations of the ensemble. Typically, the quality of the
Algorithm SequentialEnsemble(Data Set: D,
    Base Algorithms: A1 . . . Ar)
begin
  j = 1;
  repeat
    Pick an algorithm Qj ∈ {A1 . . . Ar} based
      on results from past executions;
    Create a new data set fj(D) from D based
      on results from past executions;
    Apply Qj to fj(D);
    j = j + 1;
  until(termination);
  return outliers based on a combination of
    results from previous executions;
end

Figure 9.2: Sequential ensemble framework
outliers detected in later iterations will be better. This also facilitates a more robust outlier detection model. Thus, the sequential nature of the approach is used for successive refinement. If desired, this approach can either be applied for a fixed number of iterations or used to converge to a more robust solution. The broad framework of a sequential ensemble approach is provided in Fig. 9.2.
It is instructive to examine the execution of the algorithm in Fig. 9.2 in some detail. In each iteration, a successively refined algorithm may be used on a refined data set, based on the results from previous executions. The function fj(·) is used to create a refinement of the data, which could correspond to data subset selection, attribute-subset selection, or a generic data transformation method. The generality of the aforementioned description ensures that many natural variations of the method can be explored within this framework. For example, while the algorithm of Fig. 9.2 assumes that many different algorithms A1 . . . Ar are available, it is possible to select only one of them and use it on successive modifications of the data. Sequential ensembles are often hard to use effectively in outlier analysis because of the lack of available ground truth for interpreting the intermediate results. In many cases, the distribution of the outlier scores is used as a proxy for these insights.
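The refinement loop of Fig. 9.2 can be sketched concretely. The following is a minimal illustration, not the book's algorithm: it assumes a single, deliberately simple base detector (distance from the centroid of the remaining data), with the refinement function fj(D) realized as removal of the most obvious outliers from the fitting set at each iteration:

```python
import numpy as np

def sequential_ensemble(D, n_iter=3, remove_frac=0.05):
    """Sketch of a sequential ensemble in the spirit of Fig. 9.2.

    Base model (hypothetical choice): score = distance to the centroid
    of the current fitting set. Each iteration removes the top-scoring
    points from the fitting set, so later models are built on cleaner
    data; the final scores come from the last (most refined) model.
    """
    D = np.asarray(D, dtype=float)
    active = np.ones(len(D), dtype=bool)   # points still used for fitting
    for _ in range(n_iter):
        centroid = D[active].mean(axis=0)  # fit base model on f_j(D)
        dist = np.linalg.norm(D - centroid, axis=1)
        # Drop the top fraction of scorers from the fitting set.
        k = max(1, int(remove_frac * active.sum()))
        masked = np.where(active, dist, -np.inf)  # ignore removed points
        active[np.argsort(masked)[-k:]] = False
    # Report scores from the final refined model.
    centroid = D[active].mean(axis=0)
    return np.linalg.norm(D - centroid, axis=1)
```

A real instantiation would substitute a stronger base model, such as a clustering-based detector, but the control flow, i.e., refine the data, refit, and repeat, is the same.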
9.4.1.2 Independent Ensembles
In independent ensembles, different instantiations of the algorithm or different portions of the data are executed independently for outlier analysis. Alternatively, the same algorithm may be applied, but with either a different initialization, parameter set, or even random seed in the case of a randomized algorithm. The LOF method, the high-dimensional evolutionary exploration method, and the random subspace sampling method discussed earlier are all examples of independent ensembles. Independent ensembles are more common in outlier analysis than sequential ensembles. In this case, the combination function requires careful normalization, especially if the different components of the ensemble are heterogeneous. A general-purpose description of independent ensemble algorithms is provided in the pseudocode description of Fig. 9.3.
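A minimal sketch of an independent ensemble, illustrating both the independence of the components and the need for normalization, is shown below. The choice of base detector (distance to the k-th nearest neighbor in a random subsample) and the z-score normalization are illustrative assumptions, not prescribed by the text:

```python
import numpy as np

def independent_ensemble(D, n_components=10, sample_frac=0.5, k=5, seed=0):
    """Sketch of an independent ensemble. Each component applies the same
    k-NN distance detector to an independently drawn random subsample of
    the data. Because raw scores from different subsamples are not
    directly comparable, each component's scores are standardized
    (z-scores) before averaging."""
    rng = np.random.default_rng(seed)
    D = np.asarray(D, dtype=float)
    n = len(D)
    combined = np.zeros(n)
    for _ in range(n_components):
        # Each component draws its own subsample, independently of the others.
        idx = rng.choice(n, size=int(sample_frac * n), replace=False)
        sample = D[idx]
        # Score of every point = distance to its k-th nearest sample point.
        dists = np.linalg.norm(D[:, None, :] - sample[None, :, :], axis=2)
        scores = np.sort(dists, axis=1)[:, k - 1]
        # Normalize this component's scores before combining.
        combined += (scores - scores.mean()) / (scores.std() + 1e-12)
    return combined / n_components
```

Note that the components could run in parallel without any communication, which is what distinguishes this design from the sequential framework of Fig. 9.2.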