Transaction streams: Transaction streams are typically created by customer buying activity. Examples include the data created by a credit-card payment, a point-of-sale transaction at a supermarket, or the online purchase of an item.
Web click-streams: The activity of users at a popular Web site creates a Web click stream. If the site is sufficiently popular, the rate at which the data are generated may be large enough to necessitate a streaming approach.
Social streams: Online social networks such as Twitter continuously generate massive text streams because of user activity. The speed and volume of the stream typically scale superlinearly with the number of actors in the social network.
Network streams: Communication networks contain large volumes of traffic streams. Such streams are often mined for intrusions, outliers, or other unusual activity.
Data streams present a number of unique challenges because of the processing constraints associated with the large volumes of continuously arriving data. In particular, data streaming algorithms typically need to operate under the following constraints, some of which are always present, whereas others arise only occasionally:
One-pass constraint: Because volumes of data are generated continuously and rapidly, it is assumed that the data can be processed only once. This is a hard constraint in all streaming models. The data are almost never assumed to be archived for future processing. This has significant consequences for algorithmic development in streaming applications. In particular, many data mining algorithms are inherently iterative and require multiple passes over the data. Such algorithms need to be appropriately modified to be usable in the context of the streaming model.
Concept drift: In most applications, the data may evolve over time. This means that various statistical properties, such as correlations between attributes, correlations between attributes and class labels, and cluster distributions, may change over time. This aspect of data streams is almost always present in practical applications, but is not necessarily a universal assumption for all algorithms.
Resource constraints: The data stream is typically generated by an external process, over which a user may have very little control. Therefore, the user also has little control over the arrival rate of the stream. In cases where the arrival rates vary with time, it may be difficult to execute online processing continuously during peak periods. In such cases, it may be necessary to drop tuples that cannot be processed in a timely fashion. This is referred to as load shedding. Even though resource constraints are almost universal to the streaming paradigm, surprisingly few algorithms incorporate them.
Massive-domain constraints: In some cases, when the attribute values are discrete, they may have a large number of distinct values. For example, consider a scenario where analysis of pairwise communications in an e-mail network is desired. The number of distinct pairs of e-mail addresses in an e-mail network with 10^8 participants is of the order of 10^16. When expressed in terms of required storage, the number of possibilities easily exceeds the petabyte order. In such cases, storing even simple statistics such as the counts or the number of distinct stream elements becomes very challenging. Therefore, a number of specialized data structures for synopsis construction of massive-domain data streams have been designed; a small illustrative sketch of one such structure appears after this list.
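To make the massive-domain point concrete, the following is a minimal Python sketch of a Flajolet-Martin-style probabilistic counter for the number of distinct stream elements; it retains only a handful of integers of state regardless of the domain size. The class name, the MD5-based hashing, and the number of hash functions are illustrative assumptions rather than details prescribed by the text.

import hashlib
import statistics

def _hash32(item, seed):
    # Hash an item together with a seed into a 32-bit integer.
    digest = hashlib.md5(f"{seed}:{item}".encode()).hexdigest()
    return int(digest, 16) & 0xFFFFFFFF

def _trailing_zeros(x):
    # Number of trailing zero bits in x (32 if x is zero).
    if x == 0:
        return 32
    count = 0
    while x & 1 == 0:
        x >>= 1
        count += 1
    return count

class DistinctCounter:
    """Flajolet-Martin-style estimator of the number of distinct elements.

    Memory use is O(num_hashes) integers, independent of the domain size.
    """

    def __init__(self, num_hashes=32):
        self.num_hashes = num_hashes
        self.max_zeros = [0] * num_hashes  # largest trailing-zero count per hash

    def add(self, item):
        for seed in range(self.num_hashes):
            z = _trailing_zeros(_hash32(item, seed))
            if z > self.max_zeros[seed]:
                self.max_zeros[seed] = z

    def estimate(self):
        # Each hash function estimates roughly 2^R distinct elements, where R is
        # the largest trailing-zero count observed; 0.77351 is the standard
        # Flajolet-Martin bias-correction constant. Averaging R over several
        # hash functions reduces the variance of the estimate.
        mean_r = statistics.mean(self.max_zeros)
        return (2 ** mean_r) / 0.77351

For instance, feeding each observed sender-receiver pair into add() and calling estimate() at any time yields an approximate count of distinct communicating pairs without ever materializing the 10^16-element domain.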
Because of the large volume of data streams, virtually all streaming methods use an online synopsis construction approach in the mining process. The basic idea is to create an online synopsis that is then leveraged for mining. Many different kinds of synopses can be constructed depending upon the application at hand. The nature of a synopsis highly influences the type of insights that can be mined from it. Some examples of synopsis structures include random samples, Bloom filters, sketches, and distinct element-counting data structures. In addition, some traditional data mining applications, such as clustering, can be leveraged to create effective synopses from the data.
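As an illustration of the random-sample synopsis mentioned above, the following Python sketch maintains a reservoir sample (Vitter's Algorithm R) in a single pass over the stream; the function name and parameters are illustrative.

import random

def reservoir_sample(stream, k, rng=random):
    # One-pass uniform sample of size k: after n elements have been processed,
    # every element seen so far is present in the reservoir with probability k/n.
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)        # fill the reservoir with the first k items
        else:
            j = rng.randint(1, n)         # pick a slot in 1..n uniformly at random
            if j <= k:
                reservoir[j - 1] = item   # replace an existing item with probability k/n
    return reservoir

A reservoir of this kind requires only O(k) memory and constant time per arriving element, which is what makes it practical under the one-pass and resource constraints described above.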