Data Mining: The Textbook
algorithm is often unlikely to work with sparse data without appropriate modifications. The sparsity also affects how the data are represented. For example, while it is possible to use the representation suggested in Definition 1.3.1, this is not a practical approach. Most values of xji in Definition 1.3.1 are 0 for the case of text data. Therefore, it is inefficient to explicitly maintain a d-dimensional representation in which most values are 0. Instead, a bag-of-words representation is used, containing only the words that occur in the document. In addition, the frequencies of these words are explicitly maintained. This approach is typically more efficient. Because of data sparsity issues, text data are often processed with specialized methods. Therefore, text mining is often studied as a separate subtopic within data mining. Text mining methods are discussed in Chap. 13.

1.3.2 Dependency-Oriented Data

Most of the preceding discussion in this chapter is about the multidimensional scenario, where it is assumed that the data records can be treated independently of one another. In practice, the different data values may be (implicitly) related to each other temporally, spatially, or through explicit network relationship links between the data items. Knowledge of preexisting dependencies greatly changes the data mining process because data mining is all about finding relationships between data items. The presence of preexisting dependencies therefore changes the expected relationships in the data, and what may be considered interesting from the perspective of these expected relationships. Several types of dependencies may exist, and they may be either implicit or explicit:

Implicit dependencies: In this case, the dependencies between data items are not explicitly specified but are known to “typically” exist in that domain. For example, consecutive temperature values collected by a sensor are likely to be extremely similar to one another.
Therefore, if the temperature value recorded by a sensor at a particular time is significantly different from that recorded at the next time instant, then this is extremely unusual and may be interesting for the data mining process. This is different from multidimensional data sets, where each data record is treated as an independent entity.

Explicit dependencies: This typically refers to graph or network data in which edges are used to specify explicit relationships. Graphs are a very powerful abstraction that is often used as an intermediate representation to solve data mining problems in the context of other data types.

In this section, the different dependency-oriented data types will be discussed in detail.

1.3.2.1 Time-Series Data

Time-series data contain values that are typically generated by continuous measurement over time. For example, an environmental sensor will measure the temperature continuously, whereas an electrocardiogram (ECG) will measure the parameters of a subject’s heart rhythm. Such data typically have implicit dependencies built into the values received over time. For example, the adjacent values recorded by a temperature sensor will usually vary smoothly over time, and this factor needs to be explicitly used in the data mining process.

The nature of the temporal dependency may vary significantly with the application. For example, some forms of sensor readings may show periodic patterns of the measured attribute over time. An important aspect of time-series mining is the extraction of such dependencies in the data. To formalize the issue of dependencies caused by temporal correlation, the attributes are classified into two types: