Data Mining: The Textbook


[Figure omitted: a node A satisfies the Hoeffding bound and is split; subsequent nodes (B, C, D, E, F, G, ...) satisfy the bound and are split in turn, so that the decision tree grows as data streams in.]
Figure 12.8: Incremental process of Hoeffding tree construction


In cases where there are near ties in split quality (very small values of ε), the algorithm will need to wait for a larger value of n until the aforementioned split condition is satisfied. It can be shown that the probability that the Hoeffding tree makes the same classification on a test instance as a tree constructed with infinite data is at least 1 − δ/p, where p is the probability that the instance will be assigned to a particular leaf. The memory requirements are modest because only the counts of the different discrete values of the attributes (over the different classes) need to be maintained at the various nodes in order to make split decisions.
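For reference, the split condition alluded to above is typically expressed with the Hoeffding bound; the following statement is a reconstruction consistent with the notation used in this passage (ε, n, and δ), with R denoting the range of the split criterion:

\[
\overline{G}(A_{\mathrm{best}}) - \overline{G}(A_{\mathrm{second}}) > \epsilon,
\qquad
\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2\,n}},
\]

where \(\overline{G}(\cdot)\) is the split evaluation criterion estimated from the n tuples seen at the leaf, and δ is the allowed probability of choosing the wrong split attribute.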


The major theoretical implication of the Hoeffding tree algorithm is that one does not need all the data to grow exactly the same tree as what would be constructed by a potentially infinite data stream. Rather, the total number of required tuples is limited once the probabilistic certainty level δ is fixed. The major bottleneck of the approach is that the construction of some of the nodes is delayed because of near ties during tree construction; most of the time is spent in breaking these near ties. In the Hoeffding tree algorithm, once a decision is made about a split, it cannot be reversed, even if it later turns out to be a poor one. The incremental process of Hoeffding tree construction is illustrated in Fig. 12.8. It is noteworthy that test instances can be classified at any point during stream progression, but both the size of the tree and the classification accuracy increase over time.
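To make the sample-size argument concrete, the bound above can be inverted to estimate how many tuples a leaf must accumulate before a split is confirmed. The following is a minimal sketch in Python, assuming the standard form of the Hoeffding bound stated earlier; the function name and parameter values are illustrative only:

    import math

    def tuples_needed(epsilon, delta, criterion_range=1.0):
        # Smallest n for which R * sqrt(ln(1/delta) / (2n)) drops below epsilon,
        # the observed gap in split quality between the two best attributes.
        return math.ceil(criterion_range ** 2 * math.log(1.0 / delta)
                         / (2.0 * epsilon ** 2))

    print(tuples_needed(epsilon=0.10, delta=1e-7))   # clear winner: 806 tuples
    print(tuples_needed(epsilon=0.01, delta=1e-7))   # near tie: 80591 tuples

The quadratic dependence on 1/ε is exactly why near ties dominate the running time: shrinking the gap by a factor of 10 multiplies the required number of tuples by 100.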


The VFDT approach improves over the Hoeffding tree algorithm by breaking ties more aggressively and by deactivating less promising leaf nodes. It also has a number of optimizations to improve accuracy, such as dropping poor splitting attributes and batching intermediate computations over multiple data points. However, it is not designed to handle concept drift. The CVFDT approach was subsequently designed to address concept drift. CVFDT incorporates two main ideas to address the additional challenges of drift:





  1. A sliding window of training items is used to limit the impact of historical behavior.




  2. Alternate subtrees at each internal node i are constructed to account for the fact that the best split attribute may no longer remain the top choice because of stream evolution.

Because of the sliding window approach, the main difference from the previous method lies in how the attribute frequency statistics at the nodes are updated as the sliding window moves forward. The statistics of incoming items are added to the attribute-value frequencies of the current window, and the statistics of items expiring at the other end of the window are removed. When these statistics are updated, some nodes may no longer meet the Hoeffding bound, and such nodes are replaced.
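A minimal sketch of this sliding-window bookkeeping is given below. The data structures and names (NodeStats, the per-item path of nodes, the window size) are assumptions made for illustration and do not reproduce the actual CVFDT implementation:

    from collections import defaultdict, deque

    class NodeStats:
        """Counts of (attribute, value, class) triples maintained at a tree node."""
        def __init__(self):
            self.counts = defaultdict(int)

        def update(self, item, label, sign):
            # item is a dict mapping attribute names to discrete values
            for attribute, value in item.items():
                self.counts[(attribute, value, label)] += sign

    def slide(window, stats_by_node, item, label, path, window_size):
        # Add the incoming item to the statistics of every node on its path.
        for node in path:
            stats_by_node[node].update(item, label, +1)
        window.append((item, label, path))
        # Remove the item expiring at the other end of the window, if any.
        if len(window) > window_size:
            old_item, old_label, old_path = window.popleft()
            for node in old_path:
                stats_by_node[node].update(old_item, old_label, -1)
        # Nodes whose split no longer meets the Hoeffding bound after this
        # update become candidates for replacement by an alternate subtree.

    # Example setup: both structures start empty.
    window = deque()
    stats_by_node = defaultdict(NodeStats)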




CVFDT associates each internal node i with a list of alternate subtrees corresponding to splits on different attributes. These alternate subtrees are grown along with the main tree used for classification, and they are used periodically to perform the replacement once the best split attribute has changed. Experimental results show that the CVFDT approach generally achieves higher accuracy on concept-drifting data streams.


12.6.2 Supervised Microcluster Approach


The supervised microcluster approach is essentially an instance-based classification method. In this model, it is assumed that a training stream and a test stream are received simultaneously over time. Because of concept drift, it is important to adjust the model dynamically over time.


In the nearest-neighbor classification approach, the dominant class label among the top-k nearest neighbors is reported as the relevant result. In the streaming scenario, it is difficult to efficiently compute the k nearest neighbors for a particular test instance because of the increasing size of the stream. However, fine-grained microclustering can be used to create a fixed-size summary of the data stream that does not increase with stream progression. A supervised variant of microclustering is used in which data points of different classes are not allowed to mix within clusters. It is relatively easy to maintain such microclusters with minor changes to the CluStream algorithm. The main difference is that data points are assigned to microclusters belonging to the same class during the cluster update process. Thus, labels are associated with microclusters rather than individual data points. The dominant label of the top-k nearest microclusters is reported as the relevant label.
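The classification step itself reduces to a nearest-neighbor query over the labeled microclusters. The sketch below assumes that each microcluster is summarized by a centroid and a single class label; the field names are hypothetical and the CluStream maintenance details are omitted:

    import math
    from collections import Counter

    def classify_with_microclusters(test_point, microclusters, k=5):
        # Rank microclusters by distance to their centroids and report the
        # dominant label among the top-k; each microcluster carries a single
        # class label because points of different classes are never mixed.
        nearest = sorted(
            microclusters,
            key=lambda mc: math.dist(test_point, mc["centroid"]))[:k]
        votes = Counter(mc["label"] for mc in nearest)
        return votes.most_common(1)[0][0]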


This does not, however, account for the changes that need to be made to the algorithm as a result of concept drift. Because of concept drift, the trends in the stream will change. Therefore, it is more effective to use microclusters from specific time horizons to increase accuracy. While the most recent horizon may often be relevant, this is not always the case, because the trends in the stream may suddenly revert to older trends. Therefore, a part of the training stream is separated out as a validation stream. Recent parts of the validation stream are used as test cases to evaluate the accuracy over different time horizons, and the optimal horizon is selected. The k-nearest neighbor approach is then applied to test instances over this optimally selected horizon.
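Horizon selection can be viewed as a small model-selection loop over the validation stream. The sketch below assumes a helper that returns the microclusters falling within a given horizon and a classifier of the form shown above; both names are illustrative:

    def select_horizon(candidate_horizons, validation_cases,
                       microclusters_in_horizon, classify):
        # For each candidate horizon, classify the recent validation instances
        # using only the microclusters from that horizon, and keep the horizon
        # with the highest accuracy for subsequent test instances.
        def accuracy(horizon):
            clusters = microclusters_in_horizon(horizon)
            correct = sum(classify(x, clusters) == y
                          for x, y in validation_cases)
            return correct / len(validation_cases)
        return max(candidate_horizons, key=accuracy)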


12.6.3 Ensemble Method

A robust ensemble method was also proposed for the classification of data streams. The method is also designed to handle concept drift because it can effectively account for evolution in the underlying data. The data stream is partitioned into chunks, and multiple classifiers are trained on each of these chunks. The final classification score is computed as a function of the scores from each of these chunks. In particular, ensembles of classification models, such as C4.5, RIPPER, and naive Bayes classifiers, are constructed from sequential chunks of the data stream. The classifiers in the ensemble are weighted based on their expected classification accuracy in the time-evolving environment. This ensures that the approach is able to achieve a higher degree of accuracy, because the classifiers are dynamically tuned to optimize the accuracy for that part of the data stream. It was shown that an ensemble classifier produces a smaller error than a single classifier if the weights of all classifiers are assigned based on their expected classification accuracy.
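The weighting scheme can be summarized with the following sketch, in which each model trained on a historical chunk is weighted by its accuracy on the most recent labeled chunk as a proxy for its expected accuracy under the current concept. The model interface and the accuracy proxy are assumptions rather than the exact published scheme:

    from collections import defaultdict

    def ensemble_predict(chunk_models, recent_chunk, test_point):
        # chunk_models: classifiers trained on earlier chunks, each exposing .predict()
        # recent_chunk: list of (x, y) pairs from the latest labeled chunk
        weighted_votes = defaultdict(float)
        for model in chunk_models:
            accuracy = (sum(model.predict(x) == y for x, y in recent_chunk)
                        / len(recent_chunk))
            weighted_votes[model.predict(test_point)] += accuracy
        return max(weighted_votes, key=weighted_votes.get)

Models that reflect outdated concepts receive low weight on the recent chunk and therefore contribute little to the final vote, which is how the ensemble adapts to drift.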



