Given a set of training data points, each of which is associated with a class label, deter-mine the class label of one or more previously unseen test instances.
Most classification algorithms typically have two phases:
Training phase: In this phase, a training model is constructed from the training instances. Intuitively, this can be understood as a summary mathematical model of the labeled groups in the training data set.
Testing phase: In this phase, the training model is used to determine the class label (or group identifier) of one or more unseen test instances.
The classification problem is more powerful than clustering because, unlike clustering, it captures a user-defined notion of grouping from an example data set. Such an approach has almost direct applicability to a wide variety of problems, in which groups are defined naturally based on external application-specific criteria. Some examples are as follows:
Customer target marketing: In this case, the groups (or labels) correspond to the user interest in a particular product. For example, one group may correspond to customers interested in a product, and the other group may contain the remaining customers. In many cases, training examples of previous buying behavior are available. These can be used to provide examples of customers who may or may not be interested in a specific product. The feature variables may correspond to the demographic profiles of the customers. These training examples are used to learn whether or not a customer, with a known demographic profile, but unknown buying behavior, may be interested in a particular product.
Medical disease management: In recent years, the use of data mining methods in medical research has gained increasing traction. The features may be extracted from patient medical tests and treatments, and the class label may correspond to treatment outcomes. In these cases, it is desired to predict treatment outcomes with models constructed on the features.
Document categorization and filtering: Many applications, such as newswire services, require real-time classification of documents. These are used to organize the docu-ments under specific topics in Web portals. Previous examples of documents from each topic may be available. The features correspond to the words in the document. The class labels correspond to the various topics, such as politics, sports, current events, and so on.
Multimedia data analysis: It is often desired to perform classification of large volumes of multimedia data such as photos, videos, audio, or other more complex multimedia data. Previous examples of particular activities of users associated with example videos may be available. These may be used to determine whether a particular video describes a specific activity. Therefore, this problem can be modeled as a binary classification problem containing two groups corresponding to the occurrence or nonoccurrence of a specific activity.
The applications of classification are diverse because of the ability to learn by example.
It is assumed that the training data set is denoted by D with n data points and d features, or dimensions. In addition, each of the data points in D is associated with a label drawn from {1 . . . k}. In some models, the label is assumed to be binary (k = 2) for
10.2. FEATURE SELECTION FOR CLASSIFICATION
|
287
|
simplicity. In the latter case, a commonly used convention is to assume that the labels are drawn from {−1, +1}. However, it is sometimes notationally convenient to assume that the labels are drawn from {0, 1}. This chapter will use either of these conventions depending on the classifier. A training model is constructed from D, which is used to predict the label of unseen test instances. The output of a classification algorithm can be one of two types:
Dostları ilə paylaş: |