Data Mining: The Textbook

Data cleaning: Outliers often represent noise in the data. This noise may arise as a result of errors in the data collection process. Outlier detection methods are, therefore, useful for removing such noise.

Credit card fraud: Unusual patterns of credit card activity may often be a result of fraud. Because such patterns are much rarer than the normal patterns, they can be detected as outliers.

Network intrusion detection: The traffic on many networks can be considered as a stream of multidimensional records. Outliers are often defined as unusual records in this stream or unusual changes in the underlying trends.

C. C. Aggarwal, Data Mining: The Textbook, DOI 10.1007/978-3-319-14142-8_8

© Springer International Publishing Switzerland 2015

CHAPTER 8. OUTLIER ANALYSIS

Most outlier detection methods create a model of normal patterns. Examples of such models include clustering, distance-based quantification, or dimensionality reduction. Outliers are defined as data points that do not naturally fit within this normal model. The “outlierness” of a data point is quantified by a numeric value, known as the outlier score. Consequently, most outlier detection algorithms produce an output that can be one of two types:

Real-valued outlier score: Such a score quantifies the tendency for a data point to be considered an outlier. Higher values of the score make it more (or, in some cases, less) likely that a given data point is an outlier. Some algorithms may even output a probability value quantifying the likelihood that a given data point is an outlier.

Binary label: A binary value is output, indicating whether or not a data point is an outlier. This type of output contains less information than the first one because a threshold can be imposed on the outlier scores to convert them into binary labels. However, the reverse is not possible. Therefore, outlier scores are more general than binary labels. Nevertheless, a binary label is required as the end result in most applications because it provides a crisp decision.
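The relationship between the two output types can be illustrated with a minimal Python sketch. The scores, the threshold of 0.6, and the helper name to_labels are all hypothetical and chosen only for illustration; the convention assumed here is that a higher score means a more outlying point.

```python
def to_labels(scores, threshold):
    """Convert real-valued outlier scores to binary labels (True = outlier)."""
    return [s > threshold for s in scores]

# Hypothetical scores from two different detectors; higher means more outlying.
scores_a = [0.1, 0.3, 0.2, 0.9, 0.4]
scores_b = [0.0, 0.5, 0.1, 5.0, 0.2]

labels_a = to_labels(scores_a, threshold=0.6)
labels_b = to_labels(scores_b, threshold=0.6)

# Both score vectors collapse to the same labels, so the labels alone
# cannot reveal how strongly the fourth point deviates in each case.
print(labels_a)
print(labels_a == labels_b)
```

Because two different score vectors map to the same labels, the scores cannot be recovered from the labels, which is the sense in which scores are the more general output.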

The generation of an outlier score requires the construction of a model of the normal patterns. In some cases, a model may be designed to produce specialized types of outliers based on a very restrictive model of normal patterns. Examples of such outliers are extreme values, and they are useful only for certain specific types of applications. In the following, some of the key models for outlier analysis are summarized. These will be discussed in more detail in later sections.

Extreme values: A data point is an extreme value if it lies at one of the two ends of a probability distribution. Extreme values can also be defined equivalently for multidimensional data by using a multivariate probability distribution instead of a univariate one. These are very specialized types of outliers but are useful in general outlier analysis because of their utility in converting scores to labels.

Clustering models: Clustering is considered a complementary problem to outlier analysis. The former problem looks for data points that occur together in a group, whereas the latter problem looks for data points that are isolated from groups. In fact, many clustering models determine outliers as a side product of the algorithm. It is also possible to optimize clustering models to specifically detect outliers.
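The two models above can be combined in a small Python sketch. It is only one possible illustration, not the method developed in later sections. It assumes a plain k-means procedure with caller-supplied starting centroids, uses the distance to the nearest centroid as the outlier score, and flags a point when its score lies in the upper tail of the score distribution (a z-score above 2, an arbitrary cutoff). All function names and data points are hypothetical.

```python
import math
from statistics import mean, stdev

def kmeans(points, centroids, iterations=10):
    """Plain k-means starting from caller-supplied centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Recompute each centroid; keep the old one if its group is empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]
    return centroids

def nearest_centroid_scores(points, centroids):
    """Outlier score = distance to the closest cluster centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

def extreme_value_labels(scores, z_threshold=2.0):
    """Flag scores that lie far in the upper tail of the score distribution."""
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return [False] * len(scores)
    return [(s - mu) / sigma > z_threshold for s in scores]

# Two tight groups plus one isolated point (the last one).
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1), (8.1, 8.2),
          (4.5, 12.0)]

centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
scores = nearest_centroid_scores(points, centroids)
labels = extreme_value_labels(scores)
print(labels)
```

Note that the isolated point still pulls its assigned centroid toward itself, which is one reason a clustering model tuned specifically for outlier detection may behave differently from this sketch.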
