Data Mining: The Textbook




$$\text{Objective}(S, Y) = \sum_{X_i \in S,\; X_i \Leftarrow Y_{j_i}} \text{dist}(X_i, Y_{j_i}) \qquad (12.31)$$
The assignment operator is denoted by "⇐" above. The squared distance between a data point and its assigned cluster center is denoted by dist(Xi, Yji), where the data record Xi is assigned to the representative Yji. In principle, any partitioning algorithm, such as k-means or k-medoids, can be applied to the segment Si in order to determine the representatives Y1 . . . Yk. For the purpose of discussion, this algorithm will be treated as a black box.
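As a concrete illustration, the objective of Eq. 12.31 can be evaluated by assigning each point to its nearest representative and summing the squared distances. The helper below is a hypothetical sketch (not pseudocode from the book), assuming Euclidean data stored in NumPy arrays:

```python
import numpy as np

def stream_objective(points, reps):
    """Evaluate Objective(S, Y) of Eq. 12.31: each point X_i is assigned
    to its closest representative Y_{j_i} (the step X_i <= Y_{j_i}), and
    the squared distances dist(X_i, Y_{j_i}) are summed over all of S.
    Hypothetical helper, not part of the STREAM paper's pseudocode."""
    total = 0.0
    assignments = []
    for x in points:
        # squared Euclidean distance from x to every representative
        d2 = np.sum((reps - x) ** 2, axis=1)
        j = int(np.argmin(d2))          # assignment: X_i <= Y_{j_i}
        assignments.append(j)
        total += float(d2[j])           # add dist(X_i, Y_{j_i})
    return total, assignments
```

For example, with points (0,0), (1,0), (10,0) and representatives (0,0) and (10,0), the first two points are assigned to the first representative and the objective is 1.0.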


After the first segment S1 has been processed, we now have a set of k medians that are stored away. The number of points assigned to each representative is stored as a "weight" for that representative. Such representatives are considered level-1 representatives. The next segment S2 is independently processed to find its k optimal median representatives. Thus, at the end of processing the second segment, one will have 2 · k such representatives. The memory requirement for storing the representatives therefore also increases with time, and after processing r segments, one will have a total of r · k representatives. When the number of representatives exceeds m, a second level of clustering is applied to this set of r · k points, except that the stored weights on the representatives are also used in the clustering process. The resulting representatives are stored as level-2 representatives. In general, when the number of level-p representatives reaches m, they are converted to k level-(p + 1) representatives. Thus, the process results in an increasing number of representatives at all levels, though the number of representatives at higher levels increases exponentially more slowly than at the lower levels. At the end of processing the entire data stream (or when a specific need for the clustering result arises), all remaining representatives of the different levels are clustered together in one final application of the k-medians subroutine.
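The cascading of weighted representatives through levels can be sketched as follows. This is an assumed rendering of the structure just described, not the paper's exact pseudocode; `cluster_fn(points, weights, k)` stands for the black-box weighted k-medians subroutine and returns the k representatives together with their weights:

```python
def process_stream(segments, k, m, cluster_fn):
    """Hierarchical merging of weighted representatives (sketch of the
    STREAM scheme described above). levels[p] holds the level-(p+1)
    representatives as (point, weight) pairs; whenever a level reaches
    m representatives, they are re-clustered into k representatives at
    the next level, using the stored weights."""
    levels = []  # levels[p] = list of (point, weight) at level p+1
    for segment in segments:
        # cluster the raw segment into k weighted level-1 representatives
        reps, wts = cluster_fn(segment, [1.0] * len(segment), k)
        if not levels:
            levels.append([])
        levels[0].extend(zip(reps, wts))
        # cascade upward while any level has accumulated m representatives
        p = 0
        while p < len(levels) and len(levels[p]) >= m:
            pts = [x for x, _ in levels[p]]
            ws = [w for _, w in levels[p]]
            reps, wts = cluster_fn(pts, ws, k)  # weighted re-clustering
            levels[p] = []
            if p + 1 == len(levels):
                levels.append([])
            levels[p + 1].extend(zip(reps, wts))
            p += 1
    # final pass: cluster all surviving representatives of every level
    all_pts = [x for lev in levels for x, _ in lev]
    all_ws = [w for lev in levels for _, w in lev]
    return cluster_fn(all_pts, all_ws, k)
```

Because each level stores at most m weighted points, the total memory is bounded by roughly m times the number of levels, which grows only logarithmically with the stream length.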


The specific choice of the algorithm used for the k-medians problem is critical in ensuring a high-quality clustering. The other factor that affects the quality of the final output is the decomposition of the problem into chunks followed by hierarchical clustering. How does such a decomposition affect the final quality of the output? It has been shown in the STREAM paper [240] that the final quality of the output cannot be arbitrarily worse than that of the particular subroutine used at the intermediate stage for k-medians clustering.


Lemma 12.4.1 Let the subroutine used for k-medians clustering in the STREAM algorithm have an approximation factor of c. Then, the STREAM algorithm will have an approximation factor of no worse than 5 · c.

A variety of solutions are possible for the k-medians problem. In principle, virtually any approximation algorithm can be used as a black box. A particularly effective solution is based on the problem of facility location. The reader is referred to the bibliographic notes for pointers to the relevant approach.








  • This terminology is different from the k-medians approach introduced in Chap. 6. The relevant subroutines in the STREAM algorithm are more similar to a k-medoids algorithm. Nevertheless, the "k-medians" terminology is used here to ensure consistency with the original research paper describing STREAM [240].

