Data Mining: The Textbook


Feature extraction: An analyst may be confronted with vast volumes of raw documents, system logs, or commercial transactions with little guidance on how these raw data should be transformed into meaningful database features for processing. This phase depends heavily on the analyst's ability to abstract out the features that are most relevant to a particular application. For example, in a credit-card fraud detection application, the amount of a charge, the repeat frequency, and the location are often good indicators of fraud. However, many other features may be poorer indicators of fraud. Therefore, extracting the right features is often a skill that requires an understanding of the specific application domain at hand.

Data cleaning: The extracted data may have erroneous or missing entries. Therefore, some records may need to be dropped, or missing entries may need to be estimated. Inconsistencies may need to be removed.

Feature selection and transformation: When the data are very high dimensional, many data mining algorithms do not work effectively. Furthermore, many of the high-dimensional features are noisy and may add errors to the data mining process. Therefore, a variety of methods are used to either remove irrelevant features or transform the current set of features to a new data space that is more amenable to analysis. Another related aspect is data transformation, where a data set with a particular set of attributes may be transformed into a data set with another set of attributes of the same or a different type. For example, an attribute, such as age, may be partitioned into ranges to create discrete values for analytical convenience.
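As a rough illustration of the feature extraction and transformation steps described above, the following Python sketch summarizes raw card transactions into per-card features (largest charge, repeat frequency, number of distinct locations) and partitions an age attribute into ranges. The transaction values, the function names `extract_card_features` and `discretize_age`, and the age cut points are all assumptions made for this example; they do not come from the text.

```python
from collections import defaultdict

# Hypothetical raw transactions: (card_id, amount, city). Values are invented.
RAW_TRANSACTIONS = [
    ("card-1", 25.00, "Austin"),
    ("card-1", 30.00, "Austin"),
    ("card-2", 900.00, "Lagos"),
    ("card-2", 950.00, "Oslo"),
    ("card-2", 875.00, "Lima"),
]


def extract_card_features(transactions):
    """Summarize raw transactions into one feature row per card."""
    amounts = defaultdict(list)
    cities = defaultdict(set)
    for card_id, amount, city in transactions:
        amounts[card_id].append(amount)
        cities[card_id].add(city)
    return {
        card_id: {
            "max_amount": max(vals),                # amount of the largest charge
            "num_charges": len(vals),               # repeat frequency
            "num_locations": len(cities[card_id]),  # distinct charge locations
        }
        for card_id, vals in amounts.items()
    }


def discretize_age(age, edges=(18, 35, 50, 65)):
    """Map a numeric age to a range label; the edges are illustrative cut points."""
    lower = None
    for upper in edges:
        if age < upper:
            return f"<{upper}" if lower is None else f"{lower}-{upper - 1}"
        lower = upper
    return f"{edges[-1]}+"


features = extract_card_features(RAW_TRANSACTIONS)
print(features["card-2"])
print([discretize_age(a) for a in (12, 18, 42, 70)])
```

Which summaries are worth extracting remains a domain decision; the three used here simply mirror the fraud indicators named in the example above.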

6 CHAPTER 1. AN INTRODUCTION TO DATA MINING

The data cleaning process requires statistical methods that are commonly used for missing data estimation. In addition, erroneous data entries are often removed to ensure more accurate mining results. The topic of data cleaning is addressed in Chap. 2 on data preprocessing.
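A minimal sketch of these two cleaning actions, assuming records stored as Python dictionaries with `None` marking a missing entry: records with an implausible age are dropped, and remaining gaps are filled with the mean of the observed values. Mean imputation is only one simple choice of estimator, and the field names, the plausible age range, and the sample values are assumptions for illustration.

```python
from statistics import mean

# Hypothetical records; None marks a missing entry, and -5 is an invalid age.
records = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 61000.0},
    {"age": 46, "income": None},
    {"age": -5, "income": 48000.0},
]


def is_valid(record):
    """Reject records whose observed age falls outside an assumed plausible range."""
    age = record["age"]
    return age is None or 0 <= age <= 120


def impute_with_mean(rows, field):
    """Replace missing values of `field` with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill = mean(observed)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]


# Drop erroneous entries first so they do not distort the imputed means.
cleaned = [r for r in records if is_valid(r)]
for field in ("age", "income"):
    cleaned = impute_with_mean(cleaned, field)
print(cleaned)
```

Removing the erroneous record before imputing matters here: otherwise the invalid age of -5 would pull the estimated mean downward.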

Feature selection and transformation should not be considered a part of data preprocessing because the feature selection phase is often highly dependent on the specific analytical problem being solved. In some cases, the feature selection process can even be tightly integrated with the specific algorithm or methodology being used, in the form of a wrapper model or embedded model. Nevertheless, the feature selection phase is usually performed before applying the specific algorithm at hand.
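To make the wrapper idea concrete, here is a small sketch in which the selection procedure repeatedly calls the learning algorithm itself to score candidate feature subsets. It assumes greedy forward selection wrapped around a 1-nearest-neighbor classifier scored by leave-one-out accuracy; both choices, the function names, and the toy data are illustrative assumptions, not the method prescribed by the text. Embedded models, where selection happens inside the training procedure, are not shown.

```python
def loo_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on `features`."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # hold out the record being classified
            dist = sum((row[f] - other[f]) ** 2 for f in features)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)


def forward_select(rows, labels, n_features):
    """Greedily add the feature that most improves the wrapped model's score."""
    selected, best_score = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in remaining]
        score, feature = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no remaining feature improves the classifier
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Toy data: feature 0 separates the two classes, feature 1 is noise.
rows = [(1.0, 7.0), (1.2, 2.0), (0.9, 9.0), (5.0, 8.0), (5.3, 1.0), (4.8, 3.0)]
labels = [0, 0, 0, 1, 1, 1]

selected, score = forward_select(rows, labels, n_features=2)
print(selected, score)
```

Because every candidate subset is scored by running the classifier, the selected features are tailored to that classifier, which is the sense in which the selection is tied to the specific algorithm being used.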

1.2.2 The Analytical Phase

The vast majority of this book will be devoted to the analytical phase of the mining process. A major challenge is that each data mining application is unique, and it is, therefore, difficult to create general and reusable techniques across different applications. Nevertheless, many data mining formulations are repeatedly used in the context of different applications. These correspond to the major “superproblems” or building blocks of the data mining process. Determining how these different formulations may be used in the context of a particular data mining application depends on the skill and experience of the analyst. Although this book can provide a good overview of the fundamental data mining models, the ability to apply them to real-world applications can only be learned with practical experience.

1.3 The Basic Data Types

One of the interesting aspects of the data mining process is the wide variety of data types that are available for analysis. There are two broad types of data, of varying complexity, for the data mining process:

Nondependency-oriented data: This typically refers to simple data types such as multidimensional data or text data. These data types are the simplest and most commonly encountered. In these cases, the data records do not have any specified dependencies between either the data items or the attributes. An example is a set of demographic records about individuals containing their age, gender, and ZIP code.
