Outlier detection in categorical data: Because outlier models use notions such as nearest neighbor computation and clustering, these models need to be adjusted to the data type at hand. This chapter will address the changes required to handle categorical data types.
C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8_9
© Springer International Publishing Switzerland 2015
CHAPTER 9. OUTLIER ANALYSIS: ADVANCED CONCEPTS
High-dimensional data: This is a very challenging scenario for outlier detection because of the “curse-of-dimensionality.” Many of the attributes are irrelevant and contribute to the errors in model construction. A common approach to address these issues is that of subspace outlier detection.
Outlier ensembles: In many cases, the robustness of an outlier detection algorithm can be improved with ensemble analysis. This chapter will study the fundamental principles of ensemble analysis for outlier detection.
Outlier analysis has numerous applications in a very wide variety of domains such as data cleaning, fraud detection, financial markets, intrusion detection, and law enforcement. This chapter will also study some of the more common applications of outlier analysis.
This chapter is organized as follows: Section 9.2 discusses outlier detection models for categorical data. The difficult case of high-dimensional data is discussed in Sect. 9.3. Outlier ensembles are studied in Sect. 9.4. A variety of applications of outlier detection are discussed in Sect. 9.5. Section 9.6 provides the summary.
9.2 Outlier Detection with Categorical Data
As with other problems in data mining, the type of the underlying data has a significant impact on the specifics of the algorithm used to solve it. Outlier analysis is no exception. In the case of outlier detection, however, the required changes are relatively minor because, unlike clustering, many outlier detection algorithms (such as distance-based algorithms) use very simple definitions of outliers. These definitions can often be adapted to categorical data with only minor modifications. In this section, some of the models discussed in the previous chapter will be revisited for categorical data.
9.2.1 Probabilistic Models
Probabilistic models can be modified easily to work with categorical data. A probabilistic model represents the data as a mixture of cluster components. Therefore, each component of the mixture needs to reflect a set of discrete attributes rather than numerical attributes. In other words, a generative mixture model of categorical data needs to be designed. Data points that do not fit this mixture model are reported as outliers.
The k components of the mixture model are denoted by G1 . . . Gk. The generative process uses the following two steps to generate each point in the d-dimensional data set D:
1. Select a mixture component with prior probability α_i, where i ∈ {1 . . . k}.
2. If the rth component of the mixture was selected in the first step, then generate a data point from G_r.
The values of αi denote the prior probabilities. An example of a model for the mixture component is one in which the jth value of the ith attribute is generated by cluster m with probability pijm. The set of all model parameters is collectively denoted by the notation Θ.
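The two-step generative process can be sketched in code. This is a minimal illustration with hypothetical toy parameters (the arrays `alpha` and `p` below are invented for the example, not taken from the text): `alpha[m]` holds the prior probability of cluster m, and `p[i, j, m]` holds the probability that the ith attribute takes its jth value in cluster m.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: k = 2 clusters, d = 2 categorical attributes,
# each attribute with 3 possible values. For each attribute i and cluster m,
# the column p[i, :, m] is a probability distribution over the values.
alpha = np.array([0.6, 0.4])                  # priors alpha_1, alpha_2
p = np.array([
    [[0.7, 0.1], [0.2, 0.2], [0.1, 0.7]],     # attribute 0: p[0, j, m]
    [[0.5, 0.2], [0.3, 0.3], [0.2, 0.5]],     # attribute 1: p[1, j, m]
])

def generate_point(alpha, p, rng):
    """Two-step generative process for one data point."""
    # Step 1: select a mixture component with prior probability alpha_m.
    m = rng.choice(len(alpha), p=alpha)
    # Step 2: draw each categorical attribute from that cluster's distribution.
    point = [rng.choice(p.shape[1], p=p[i, :, m]) for i in range(p.shape[0])]
    return m, point
```

Repeating this process many times yields a synthetic categorical data set whose clusters follow the chosen distributions.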
Consider a data point X containing the attribute value indices j1 . . . jd where the rth attribute takes on the value jr. Then, the value of the generative probability gm,Θ(X) of a data point from cluster m is given by the following expression:
g_{m,\Theta}(\overline{X}) = \prod_{r=1}^{d} p_{r j_r m}.    (9.1)
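The generative probability of Eq. (9.1) is a simple product over the attributes, and combining it with the priors α_m gives an overall fit probability that can serve as an outlier score (low values suggest outliers). The following sketch uses the same kind of hypothetical toy parameters as above; the parameter values and function names are illustrative assumptions, not part of the text.

```python
import numpy as np

# Hypothetical toy parameters: alpha[m] is the prior of cluster m,
# and p[r, j, m] is the probability that attribute r takes value j in cluster m.
alpha = np.array([0.6, 0.4])
p = np.array([
    [[0.7, 0.1], [0.2, 0.2], [0.1, 0.7]],   # attribute 0
    [[0.5, 0.2], [0.3, 0.3], [0.2, 0.5]],   # attribute 1
])

def generative_probability(x, m, p):
    """Eq. (9.1): product over attributes r of p_{r, j_r, m},
    where x[r] is the value index j_r of attribute r."""
    return float(np.prod([p[r, x[r], m] for r in range(len(x))]))

def fit_probability(x, alpha, p):
    """Overall fit of point x to the mixture: sum_m alpha_m * g_m(x).
    Points with a low fit probability are candidate outliers."""
    return sum(alpha[m] * generative_probability(x, m, p)
               for m in range(len(alpha)))
```

For example, under these toy parameters the point x = (0, 0) matches cluster 0 well and receives a much higher fit probability than a point such as (2, 0) that matches neither cluster.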