Data Mining: The Textbook

Yüklə 17,13 Mb.

səhifə	170/423
tarix	07.01.2024
ölçüsü	17,13 Mb.
	#211690

1 ... 166 167 168 169 170 171 172 173 ... 423

1-Data Mining tarjima

Model combination: The combination process refers to the approach used to integrate the outlier score from diﬀerent components. For example, in random subspace sam-pling, the cumulative outlier score over diﬀerent ensemble components is reported. Other combination functions include the use of the maximum score, or the highest rank of the score among all ensembles. It is assumed that higher ranks imply greater propensity to be an outlier. Therefore, the highest rank is similar to the maximum outlier score, except that the rank is used instead of the raw score. The highest-rank combination function is also used by random subspace sampling.

Outlier ensemble methods can be categorized into diﬀerent types, depending on the depen-dencies among diﬀerent components and the process of selecting a specific model. The following section will study some of the common methods.

9.4.1 Categorization by Component Independence

This categorization examines whether or not the components are developed independently.

In sequential ensembles, a given algorithm or set of algorithms is applied sequentially, so that future applications of the algorithm are influenced by previous applications. This influence may be realized in terms of either modifications of the base data for analysis, or in terms of the specific choices of the algorithms. The final result is either a weighted combination of, or the final result of the last application of the outlier algorithm component. The typical scenario for sequential ensembles is that of model refinement, where successive iterations continue to improve a particular base model.

In independent ensembles, diﬀerent algorithms, or diﬀerent instantiations of the same algorithm, are independently applied to either the complete data or portions of the data. In other words, the various ensemble components are independent of the results of each other’s executions.

In this section, both types of ensembles will be studied in detail.

9.4.1.1 Sequential Ensembles

In sequential ensembles, one or more outlier detection algorithms are applied sequentially to either all or portions of the data. The idea is that the result of a particular algorithmic execution may provide insights that may help refine future executions. Thus, depending upon the approach, either the data set or the algorithm may be changed in sequential executions. For example, consider the case where a clustering model is created for outlier detection. Because outliers interfere with the robust cluster generation, one possibility would be to apply the method to a successively refined data set after removing the obvious outliers through the insights gained in earlier iterations of the ensemble. Typically, the quality of the

276 CHAPTER 9. OUTLIER ANALYSIS: ADVANCED CONCEPTS

Algorithm SequentialEnsemble(Data Set: D

Base Algorithms: A₁ . . . A_r)
begin

j = 1;

repeat
Pick an algorithm Q_j ∈ {A₁ . . . A_r} based

on results from past executions;
Create a new data set f_j (D) from D based
on results from past executions;
Apply Q_j to f_j (D);

= j + 1;

until(termination);

return outliers based on combinations of results from previous executions;

end

Figure 9.2: Sequential ensemble framework

outliers in later iterations will be better. This also facilitates a more robust outlier detection model. Thus, the sequential nature of the approach is used for successive refinement. If desired, this approach can either be applied for a fixed number of times or used to converge to a more robust solution. The broad framework of a sequential ensemble approach is provided in Fig. 9.2.

It is instructive to examine the execution of the algorithm in Fig. 9.2 in some detail. In each iteration, a successively refined algorithm may be used on a refined data set, based on the results from previous executions. The function f_j (·) is used to create a refinement of the data, which could correspond to data subset selection, attribute-subset selection, or a generic data transformation method. The generality of the aforementioned description ensures that many natural variations of the method can be explored with the use of this ensemble. For example, while the algorithm of Fig. 9.2 assumes that many diﬀerent algorithms A₁ . . . A_r are available, it is possible to select only one of them, and use it on successive modifications of the data. Sequential ensembles are often hard to use eﬀectively in outlier analysis because of the lack of available groundtruth in interpreting the intermediate results. In many cases, the distribution of the outlier scores is used as a proxy for these insights.

9.4.1.2 Independent Ensembles

In independent ensembles, diﬀerent instantiations of the algorithm or diﬀerent portions of the data are executed independently for outlier analysis. Alternatively, the same algorithm may be applied, but with either a diﬀerent initialization, parameter set, or even random seed in the case of a randomized algorithm. The LOF method, the high-dimensional evolu-tionary exploration method, and the random subspace sampling method discussed earlier are all examples of independent ensembles. Independent ensembles are more common in outlier analysis than sequential ensembles. In this case, the combination function requires careful normalization, especially if the diﬀerent components of the ensemble are heteroge-neous. A general-purpose description of independent ensemble algorithms is provided in the pseudocode description of Fig. 9.3.

Yüklə 17,13 Mb.

Dostları ilə paylaş:

1 ... 166 167 168 169 170 171 172 173 ... 423