Data Mining: The Textbook




  1. Sensor data: Sensor data are often collected as large volumes of low-level signals. The low-level signals are sometimes converted to higher-level features using wavelet or Fourier transforms; in other cases, the time series is used directly after some cleaning. The field of signal processing has an extensive literature devoted to such methods. These technologies are also useful for porting time-series data to multidimensional data (a code sketch of Fourier-based feature extraction appears after this list).




  2. Image data: In its most primitive form, image data are represented as pixels. At a slightly higher level, color histograms can be used to represent the features in different segments of an image. More recently, the use of visual words has become more popular. This is a semantically rich representation that is similar to document data. One challenge in image processing is that the data are generally very high dimensional. Thus, feature extraction can be performed at different levels, depending on the application at hand.





  3. Web logs: Web logs are typically represented as text strings in a prespecified format. Because the fields in these logs are clearly specified and separated, it is relatively easy to convert Web access logs into a multidimensional representation of (the relevant) categorical and numeric attributes (a parsing sketch appears after this list).




  4. Network traffic: In many intrusion-detection applications, the characteristics of the network packets are used to analyze intrusions or other interesting activity. Depending on the underlying application, a variety of features may be extracted from these packets, such as the number of bytes transferred, the network protocol used, and so on.




  5. Document data: Document data is often available in raw and unstructured form, and the data may contain rich linguistic relations between different entities. One approach is to remove stop words, stem the data, and use a bag-of-words representation. Other methods use entity extraction to determine linguistic relationships.

Named-entity recognition is an important subtask of information extraction. This approach locates atomic elements in text and classifies them into predefined categories, such as names of persons, organizations, locations, actions, and numeric quantities. Clearly, the ability to identify such atomic elements is very useful because they can be used to understand the structure of sentences and complex events. Such an approach can also be used to populate a more conventional database of relational elements, or to represent the text as a sequence of atomic entities that is more easily analyzed. For example, consider the following sentence:


Bill Clinton lives in Chappaqua.

Here, “Bill Clinton” is the name of a person, and “Chappaqua” is the name of a place. The word “lives” denotes an action. Each type of entity may have a different significance to the data mining process depending on the application at hand. For example, if a data mining application is mainly concerned with mentions of specific locations, then the word “Chappaqua” needs to be extracted.


Popular techniques for named-entity recognition include linguistic grammar-based techniques and statistical models. The use of grammar rules is typically very effective, but it requires work by experienced computational linguists. On the other hand, statistical models require a significant amount of training data. The techniques designed are very often domain-specific. The area of named-entity recognition is vast in its own right and lies outside the scope of this book. The reader is referred to [400] for a detailed discussion of different methods for entity recognition.
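
As a concrete illustration of these text-processing steps, the short Python sketch below builds a bag-of-words representation with stop-word removal and applies a pretrained statistical named-entity recognizer to the example sentence. The particular libraries used (scikit-learn and spaCy with its en_core_web_sm model) are an assumed choice of off-the-shelf tools, not methods prescribed by the text.

# A minimal sketch: bag-of-words features and statistical NER for short documents.
# Assumes scikit-learn and spaCy (with the en_core_web_sm model) are installed.
from sklearn.feature_extraction.text import CountVectorizer
import spacy

documents = ["Bill Clinton lives in Chappaqua.",
             "Chappaqua is a hamlet in New York."]

# Bag-of-words: drop English stop words and count the remaining terms.
vectorizer = CountVectorizer(stop_words="english")
term_matrix = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())   # extracted vocabulary
print(term_matrix.toarray())                # one term-frequency row per document

# Statistical named-entity recognition with a pretrained model.
nlp = spacy.load("en_core_web_sm")
for entity in nlp(documents[0]).ents:
    print(entity.text, entity.label_)       # e.g., "Bill Clinton" tagged as a person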
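
Similarly, for the sensor data discussed at the beginning of this list, the sketch below shows one way to port a raw time series to a small multidimensional record by retaining the magnitudes of its leading Fourier coefficients. The synthetic signal and the choice of eight coefficients are illustrative assumptions.

# A minimal sketch of Fourier-based feature extraction from a raw sensor signal.
import numpy as np

# A synthetic noisy 5 Hz tone stands in for a real sensor reading.
t = np.linspace(0.0, 1.0, 256, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(256)

def fourier_features(series, num_coefficients=8):
    """Return the magnitudes of the leading Fourier coefficients as features."""
    spectrum = np.fft.rfft(series)              # real-input fast Fourier transform
    return np.abs(spectrum[:num_coefficients])  # low-frequency magnitudes only

features = fourier_features(signal)
print(features)   # an 8-dimensional numeric record representing the time series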
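
For Web logs, the conversion to categorical and numeric attributes amounts to parsing a prespecified format. The sketch below does this for a single line in the Common Log Format; the sample log line and field names are illustrative assumptions.

# A minimal sketch: converting a Common Log Format line into attribute-value pairs.
import re

log_line = '192.168.1.5 - frank [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

pattern = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<bytes>\d+)'
)

match = pattern.match(log_line)
if match:
    record = match.groupdict()
    record["status"] = int(record["status"])   # numeric attribute
    record["bytes"] = int(record["bytes"])     # numeric attribute
    print(record)   # host, user, time, method, and path remain categorical fields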


Feature extraction is an art form that is highly dependent on the skill of the analyst to choose the features and their representation that are best suited to the task at hand. While this particular aspect of data analysis typically belongs to the domain expert, it is perhaps the most important one. If the correct features are not extracted, the analysis can only be as good as the available data.




2.2.2 Data Type Portability


Data type portability is a crucial element of the data mining process because the data is often heterogeneous, and may contain multiple types. For example, a demographic data set may contain both numeric and mixed attributes. A time-series data set collected from an electrocardiogram (ECG) sensor may have numerous other meta-information and text attributes associated with it. This creates a bewildering situation for an analyst who is now faced with the difficult challenge of designing an algorithm with an arbitrary combination of data types. The mixing of data types also restricts the ability of the analyst to use off-the-shelf tools for processing. Note that porting data types does lose representational accuracy and expressiveness in some cases. Ideally, it is best to customize the algorithm to the particular combination of data types to optimize results. This is, however, time-consuming and sometimes impractical.


This section will describe methods for converting between various data types. Because the numeric data type is the simplest and most widely studied one for data mining algorithms, it is particularly useful to focus on how different data types may be converted to it. However, other forms of conversion are also useful in many scenarios. For example, for similarity-based algorithms, it is possible to convert virtually any data type to a graph and apply graph-based algorithms to this representation. The following discussion, summarized in Table 2.1, covers various ways of transforming data across different types.


2.2.2.1 Numeric to Categorical Data: Discretization


The most commonly used conversion is from the numeric to the categorical data type. This process is known as discretization. The process of discretization divides the range of the numeric attribute into φ ranges. Then, the attribute is assumed to contain φ different categorical labeled values from 1 to φ, depending on the range in which the original value lies. For example, consider the age attribute. One could create ranges [0, 10], [11, 20], [21, 30], and so on. The symbolic value for any record in the range [11, 20] is “2” and the symbolic value for a record in the range [21, 30] is “3”. Because these are symbolic values, no ordering is assumed between the values “2” and “3”. Furthermore, variations within a range are not distinguishable after discretization. Thus, the discretization process does lose some information for the mining process. However, for some applications, this loss of information is not too debilitating. One challenge with discretization is that the data may be nonuniformly distributed across the different intervals. For example, for the case of the salary attribute, a large subset of the population may be grouped in the [40,000, 80,000] range, but very few will be grouped in the [1,040,000, 1,080,000] range. Note that both ranges have the same size. Thus, the use of ranges of equal size may not be very helpful in discriminating between different data segments. On the other hand, many attributes, such as age, are not as nonuniformly distributed, and therefore ranges of equal size may work reasonably well. The discretization process can be performed in a variety of ways depending on application-specific goals:


1. Equi-width ranges: In this case, each range [a, b] is chosen in such a way that b − a is the same for each range. This approach has the drawback that it will not work well for data sets that are distributed nonuniformly across the different ranges. To determine the actual values of the ranges, the minimum and maximum values of each attribute are determined. This range [min, max] is then divided into φ ranges of equal length (both this variant and the equi-log variant below are illustrated in the code sketch after this list).





2. Equi-log ranges: Each range [a, b] is chosen in such a way that log(b) − log(a) has the same value for each range. This kind of range selection has the effect of geometrically increasing range widths, because a constant value of log(b) − log(a) corresponds to a constant ratio b/a.
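
To make the two variants concrete, the sketch below bins an age attribute into φ = 8 equi-width ranges and computes equi-log boundaries for a salary-like attribute; the sample values are illustrative assumptions.

# A minimal sketch of equi-width and equi-log discretization with NumPy.
import numpy as np

ages = np.array([3, 15, 22, 37, 41, 58, 64, 79])
phi = 8   # number of ranges

# Equi-width: split [min(age), max(age)] into phi ranges of equal length.
edges = np.linspace(ages.min(), ages.max(), phi + 1)
age_labels = np.digitize(ages, edges[1:-1], right=True) + 1   # symbolic values 1..phi
print(age_labels)

# Equi-log: boundaries with constant log(b) - log(a), i.e., geometrically
# increasing range widths, which suits skewed attributes such as salary.
salary_edges = np.geomspace(10_000, 1_280_000, num=phi + 1)
print(np.round(salary_edges))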

