Intrusion-detection systems: In many networked computer systems, different kinds of data are collected about the operating system calls, network traffic, or other activity in the system. These data may show unusual behavior because of malicious activity. The detection of such activity is referred to as intrusion detection.
Credit card fraud: Unauthorized use of credit cards may show different patterns, such as a buying spree from geographically obscure locations. Such patterns may show up as outliers in credit card transaction data.
Interesting sensor events: Sensors are often used to track various environmental and location parameters in many real applications. Sudden changes in the underlying patterns may represent events of interest. Event detection is one of the primary motivating applications in the field of sensor networks.
Medical diagnosis: In many medical applications, the data are collected from a variety of devices such as magnetic resonance imaging (MRI), positron emission tomography (PET) scans, or electrocardiogram (ECG) time series. Unusual patterns in such data typically reflect disease conditions.
Law enforcement: Outlier detection finds numerous applications in law enforcement, especially in cases where unusual patterns can only be discovered over time through multiple actions of an entity. The identification of fraud in financial transactions, trading activity, or insurance claims typically requires the determination of unusual patterns in the data generated by the actions of the criminal entity.
Earth science: A significant amount of spatiotemporal data about weather patterns, climate changes, or land-cover patterns is collected through a variety of mechanisms such as satellites or remote sensing. Anomalies in such data provide significant insights about hidden human or environmental trends that may have caused such anomalies.
The outlier detection problem is studied in detail in Chaps. 8 and 9.
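As a rough illustration of how an unusual value might be flagged in practice, consider the credit card example above. The sketch below is only one simple possibility and is not the method developed in the later chapters: it assumes a one-dimensional list of transaction amounts and uses the modified z-score based on the median absolute deviation (MAD). The function name flag_outliers, the sample amounts, and the threshold of 3.5 are illustrative assumptions.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    Assumption: a robust, one-dimensional rule (median and MAD) chosen for
    illustration only; real fraud detection uses far richer features.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are (nearly) identical, so no robust scale is available.
        return []
    # 0.6745 rescales the MAD so that it is comparable to a standard deviation
    # when the data are normally distributed.
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [12.5, 30.0, 22.0, 18.0, 25.0, 950.0, 20.0, 27.5]
print(flag_outliers(amounts))
```

A median-based score is used here rather than the mean and standard deviation because a single extreme value inflates the standard deviation and can mask itself in a small sample.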
1.4.4 Data Classification
Many data mining problems are directed toward a specialized goal that is sometimes represented by the value of a particular feature in the data. This particular feature is referred to as the class label. Such problems are therefore supervised: the relationships of the remaining features in the data with respect to this special feature are learned. The data used to learn these relationships are referred to as the training data. The learned model may then be used to estimate the class labels of records whose labels are missing.
For example, in a target marketing application, each record may be tagged by a particular label that represents the interest (or lack of it) of the customer toward a particular product. The labels associated with customers may have been derived from the previous buying behavior of the customer. In addition, a set of features corresponding to the customer demographics may also be available. The goal is to predict whether or not a customer, whose buying behavior is unknown, will be interested in a particular product by relating the demographic features to the class label. Therefore, a training model is constructed, which is then used to predict class labels. The classification problem is informally defined as follows:
Definition 1.4.5 (Data Classification) Given an n×d training data matrix D (database D), and a class label value in {1 ... k} associated with each of the n rows in D (records in D), create a training model M, which can be used to predict the class label of a d-dimensional record Y ∉ D.
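To make the definition concrete, the following minimal sketch builds a model M from a small training matrix D and uses it to label a test record Y. It assumes a nearest-centroid classifier, which is only one of many possible choices of model and is picked here for brevity. The function names train and predict, the two demographic features (age and income in thousands), and all data values are hypothetical.

```python
from collections import defaultdict
from math import dist

def train(D, labels):
    """Build the model M: one centroid (mean vector) per class label."""
    groups = defaultdict(list)
    for row, label in zip(D, labels):
        groups[label].append(row)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in groups.items()
    }

def predict(M, Y):
    """Return the class label whose centroid is closest to the test record Y."""
    return min(M, key=lambda label: dist(M[label], Y))

# Hypothetical n x d training matrix (n = 6 records, d = 2 features).
D = [(25, 30), (30, 35), (28, 32), (55, 90), (60, 95), (58, 85)]
# Class labels in {1 ... k} with k = 2: 1 = interested, 2 = not interested.
labels = [1, 1, 1, 2, 2, 2]

M = train(D, labels)
Y = (57, 88)  # test record that is not in D and has no known label
print(predict(M, Y))
```

The features are left unscaled in this sketch; with real demographic data, attributes measured in different units would normally be normalized before distances are computed.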
The record whose class label is unknown is referred to as the test record. It is interesting to examine the relationship between the clustering and the classification problems. In the case of the clustering problem, the data are partitioned into k groups on the basis of similarity. In the case of the classification problem, a (test) record is also categorized into one of k groups, except that this is achieved by learning a model from a training database D, rather than on the basis of similarity. In other words, the supervision from the training data redefines the