Dependency-oriented data: In these cases, implicit or explicit relationships may exist between data items. For example, a social network data set contains a set of vertices (data items) that are connected together by a set of edges (relationships). On the other hand, time series contains implicit dependencies. For example, two successive values collected from a sensor are likely to be related to one another. Therefore, the time attribute implicitly specifies a dependency between successive readings.
In general, dependency-oriented data are more challenging because of the complexities created by preexisting relationships between data items. Such dependencies between data items need to be incorporated directly into the analytical process to obtain contextually meaningful results.
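As a rough illustration of the two kinds of dependencies described above, the following sketch stores explicit relationships as an edge list and exposes the implicit dependency in a time series through the differences between successive readings. The vertex names, the sensor values, and the helper functions are all hypothetical and chosen only for illustration.

```python
# Explicit dependencies: a tiny, made-up social network stored as an edge list.
edges = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]


def build_adjacency(edge_list):
    """Return a dict mapping each vertex to the set of its neighbors."""
    adjacency = {}
    for u, v in edge_list:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)  # undirected: store both directions
    return adjacency


adjacency = build_adjacency(edges)

# Implicit dependencies: hypothetical sensor readings, ordered by time.
readings = [20.1, 20.3, 20.2, 20.6, 20.9]


def successive_differences(series):
    """Differences between consecutive readings; small values suggest
    that neighboring readings are closely related."""
    return [b - a for a, b in zip(series, series[1:])]


diffs = successive_differences(readings)
```

In the first case the relationships are part of the data itself, whereas in the second they only become visible once the time ordering of the readings is taken into account.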
1.3. THE BASIC DATA TYPES
Table 1.1: An example of a multidimensional data set

| Name       | Age | Gender | Race             | ZIP code |
|------------|-----|--------|------------------|----------|
| John S.    | 45  | M      | African American | 05139    |
| Manyona L. | 31  | F      | Native American  | 10598    |
| Sayani A.  | 11  | F      | East Indian      | 10547    |
| Jack M.    | 56  | M      | Caucasian        | 10562    |
| Wei L.     | 63  | M      | Asian            | 90210    |

1.3.1 Nondependency-Oriented Data
This is the simplest form of data and typically refers to multidimensional data. This data typically contains a set of records. A record is also referred to as a data point, instance, example, transaction, entity, tuple, object, or feature-vector, depending on the application at hand. Each record contains a set of fields, which are also referred to as attributes, dimensions, and features. These terms will be used interchangeably throughout this book. These fields describe the different properties of that record. Relational database systems were traditionally designed to handle this kind of data, even in their earliest forms. For example, consider the demographic data set illustrated in Table 1.1. Here, the demographic properties of an individual, such as age, gender, and ZIP code, are illustrated. A multidimensional data set is defined as follows:
Definition 1.3.1 (Multidimensional Data) A multidimensional data set $D$ is a set of $n$ records, $X_1 \ldots X_n$, such that each record $X_i$ contains a set of $d$ features denoted by $(x_i^1 \ldots x_i^d)$.
Throughout the early chapters of this book, we will work with multidimensional data because it is the simplest form of data and establishes the broader principles on which the more complex data types can be processed. More complex data types will be addressed in later chapters of the book, and the impact of the dependencies on the mining process will be explicitly discussed.
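To make Definition 1.3.1 concrete, the sketch below stores the records of Table 1.1 as a list of tuples and reads off the number of records n and the number of features d. The names `Person`, `D`, `n`, and `d` are illustrative choices, and counting the name column as a feature is an assumption of this sketch.

```python
from typing import NamedTuple


class Person(NamedTuple):
    """One record X_i; each field corresponds to one feature of the record."""
    name: str
    age: int
    gender: str
    race: str
    zip_code: str  # kept as a string so the leading zero in "05139" survives


# The data set D of Table 1.1: a collection of n records with d features each.
D = [
    Person("John S.", 45, "M", "African American", "05139"),
    Person("Manyona L.", 31, "F", "Native American", "10598"),
    Person("Sayani A.", 11, "F", "East Indian", "10547"),
    Person("Jack M.", 56, "M", "Caucasian", "10562"),
    Person("Wei L.", 63, "M", "Asian", "90210"),
]

n = len(D)               # number of records
d = len(Person._fields)  # number of features per record

print(n, d)
```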
1.3.1.1 Quantitative Multidimensional Data
The attributes in Table 1.1 are of two different types. The age field has values that are numerical in the sense that they have a natural ordering. Such attributes are referred to as continuous, numeric, or quantitative. Data in which all fields are quantitative is also referred to as quantitative data or numeric data. Thus, when each value of $x_i^j$ in Definition 1.3.1 is quantitative, the corresponding data set is referred to as quantitative multidimensional data. In the data mining literature, this particular subtype of data is considered the most common, and many algorithms discussed in this book work with this subtype of data. This subtype is particularly convenient for analytical processing because it is much easier to work with quantitative data from a statistical perspective. For example, the mean of a set of quantitative records can be expressed as a simple average of these values, whereas such computations become more complex in other data types. Where possible and effective, many data mining algorithms therefore try to convert different kinds of data to quantitative values before processing. This is also the reason that many algorithms discussed in this (or virtually any other) data mining textbook assume a quantitative multidimensional representation. Nevertheless, in real applications, the data are likely to be more complex and may contain a mixture of different data types.
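The remark about the mean can be illustrated with a small sketch that averages a set of quantitative records attribute by attribute. The first attribute uses the ages from Table 1.1; the second attribute is made up solely so that the records have more than one dimension, and the function name `mean_record` is a hypothetical helper.

```python
def mean_record(records):
    """Return the attribute-wise average of equally long numeric records."""
    n = len(records)
    d = len(records[0])
    return [sum(record[j] for record in records) / n for j in range(d)]


# (age from Table 1.1, hypothetical second quantitative attribute)
records = [
    (45, 2.0),
    (31, 4.0),
    (11, 6.0),
    (56, 8.0),
    (63, 10.0),
]

centroid = mean_record(records)
print(centroid)
```

No comparably simple averaging operation exists for an attribute such as race, which is one reason quantitative representations are convenient.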
1.3.1.2 Categorical and Mixed Attribute Data
Many data sets in real applications may contain categorical attributes that take on discrete unordered values. For example, in Table 1.1, attributes such as gender, race, and ZIP code have discrete values without a natural ordering among them. If each value of $x_i^j$ in Definition 1.3.1 is categorical, then such data are referred to as unordered discrete-valued or categorical. In the case of mixed attribute data, there is a combination of categorical and numeric attributes. The full data in Table 1.1 are considered mixed-attribute data because they contain both numeric and categorical attributes.
The attribute corresponding to gender is special because it is categorical, but with only two possible values. In such cases, it is possible to impose an artificial ordering between these values and use algorithms designed for numeric data for this type. This is referred to as binary data, and it can be considered a special case of either numeric or categorical data. Chap. 2 will explain how binary data form the “bridge” to transform numeric or categorical attributes into a common format that is suitable for processing in many scenarios.
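One common way to turn categorical values into binary attributes is to create one 0/1 indicator per distinct category (often called one-hot encoding). The sketch below applies this to the gender and race columns of Table 1.1. Whether this matches the exact treatment given in Chap. 2 is an assumption; the function name `one_hot` and the choice of coding F as 1 are purely illustrative.

```python
def one_hot(values):
    """Encode categorical values as 0/1 indicator vectors.

    Returns the sorted list of categories and, for each input value,
    a vector with a 1 in the position of its category.
    """
    categories = sorted(set(values))
    encoded = [[1 if value == c else 0 for c in categories] for value in values]
    return categories, encoded


genders = ["M", "F", "F", "M", "M"]
races = ["African American", "Native American", "East Indian", "Caucasian", "Asian"]

# A two-valued attribute needs only a single bit; the ordering is artificial.
gender_bits = [1 if g == "F" else 0 for g in genders]

# An attribute with more than two values needs one indicator per category.
race_categories, race_bits = one_hot(races)

print(gender_bits)
print(race_categories)
print(race_bits[0])
```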
1.3.1.3 Binary and Set Data
Binary data can be considered a special case of either multidimensional categorical data or multidimensional quantitative data. It is a special case of multidimensional categorical data, in which each categorical attribute may take on one of at most two discrete values. It is also a special case of multidimensional quantitative data because an ordering exists between the two values. Furthermore, binary data is also a representation of setwise data, in which each attribute is treated as a set element indicator. A value of 1 indicates that the element should be included in the set. Such data is common in market basket applications. This topic will be studied in detail in Chaps. 4 and 5.
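The correspondence between set data and binary data can be sketched as follows. Each basket is a set of items, and each binary attribute indicates whether the corresponding item is present. The baskets and item names are hypothetical.

```python
# Hypothetical market baskets, each represented as a set of purchased items.
baskets = [
    {"bread", "milk"},
    {"milk", "eggs", "butter"},
    {"bread"},
]

# The universe of items fixes the order of the binary attributes.
items = sorted(set().union(*baskets))

# One row per basket; a value of 1 means the item is in the set.
binary_rows = [[1 if item in basket else 0 for item in items] for basket in baskets]

# The set can be recovered from its binary row, so no information is lost.
recovered = [{item for item, bit in zip(items, row) if bit == 1} for row in binary_rows]

print(items)
for row in binary_rows:
    print(row)
```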
1.3.1.4 Text Data
Text data can be viewed either as a string, or as multidimensional data, depending on how they are represented. In its raw form, a text document corresponds to a string. This is a dependency-oriented data type, which will be described later in this chapter. Each string is a sequence of characters (or words) corresponding to the document. However, text documents are rarely represented as strings. This is because it is difficult to directly use the ordering between words in an efficient way for large-scale applications, and the additional advantages of leveraging the ordering are often limited in the text domain.
In practice, a vector-space representation is used, where the frequencies of the words in the document are used for analysis. Words are also sometimes referred to as terms. Thus, the precise ordering of the words is lost in this representation. These frequencies are typically normalized with statistics such as the length of the document, or the frequencies of the individual words in the collection. These issues will be discussed in detail in Chap. 13 on text data. The corresponding n × d data matrix for a text collection with n documents and d terms is referred to as a document-term matrix.
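A minimal sketch of a document-term matrix is shown below for a made-up collection of three short documents. Splitting on whitespace and normalizing each row by the document length are simplifying assumptions of this sketch; practical systems typically use more careful tokenization and other normalization schemes.

```python
from collections import Counter

# A tiny, hypothetical text collection with n = 3 documents.
documents = [
    "data mining finds patterns in data",
    "text data is sparse",
    "mining text",
]

# Naive tokenization: split on whitespace (an assumption of this sketch).
tokenized = [doc.split() for doc in documents]

# The d distinct terms define the columns of the matrix.
vocabulary = sorted({word for words in tokenized for word in words})


def document_term_matrix(tokenized_docs, vocab):
    """Return an n x d matrix of raw term frequencies."""
    matrix = []
    for words in tokenized_docs:
        counts = Counter(words)
        matrix.append([counts[term] for term in vocab])
    return matrix


counts_matrix = document_term_matrix(tokenized, vocabulary)

# Normalize each row by the document length so that its entries sum to 1.
normalized = [
    [count / len(words) for count in row]
    for row, words in zip(counts_matrix, tokenized)
]

print(vocabulary)
for row in counts_matrix:
    print(row)
```

Note that the word order within each document cannot be recovered from its row, which only records how often each term occurs.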
When represented in vector-space form, text data can be considered multidimensional quantitative data, where the attributes correspond to the words, and the values correspond to the frequencies of these attributes. However, this kind of quantitative data is special because most attributes take on zero values, and only a few attributes have nonzero values. This is because a single document may contain only a relatively small number of words out of a dictionary of size $10^5$. This phenomenon is referred to as data sparsity, and it significantly impacts the data mining process. The direct use of a quantitative data mining