$$\text{Objective}(S, Y) = \sum_{X_i \in S,\; X_i \Leftarrow Y_{j_i}} \text{dist}\!\left(X_i, Y_{j_i}\right). \tag{12.31}$$
The assignment operator is denoted by “⇐” above. The squared distance between a data point and its assigned cluster center is denoted by dist(Xi, Yji), where the data record Xi is assigned to the representative Yji. In principle, any partitioning algorithm, such as k-means or k-medoids, can be applied to the segment Si in order to determine the representatives Y1 . . . Yk. For the purpose of discussion, this algorithm will be treated as a black box.
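To make the notation concrete, the following is a minimal sketch of how the objective of Eq. 12.31 can be evaluated for a segment and a fixed set of representatives. It assumes squared Euclidean distance for dist(·, ·) and NumPy arrays as inputs; the function name and the optional weights argument are illustrative only, the latter anticipating the later case in which the “points” are themselves weighted representatives.

```python
import numpy as np

def kmedians_objective(segment, representatives, weights=None):
    """Evaluate Objective(S, Y) of Eq. 12.31 (sketch).

    Each point X_i in `segment` (an (n, d) array) is assigned to its closest
    representative Y_{j_i} in `representatives` (a (k, d) array), and the
    assignment distances are summed.  The optional `weights` array is used
    when the points being clustered are themselves weighted representatives.
    """
    # dist(X_i, Y_j) for all pairs: squared Euclidean distance (assumed here)
    diffs = segment[:, None, :] - representatives[None, :, :]
    dist = (diffs ** 2).sum(axis=-1)        # shape (n, k)
    # the assignment X_i <= Y_{j_i} picks the closest representative
    assigned_dist = dist.min(axis=1)
    if weights is not None:
        assigned_dist = assigned_dist * weights
    return assigned_dist.sum()
```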
After the first segment S1 has been processed, we now have a set of k medians that are stored away. The number of points assigned to each representative is stored as a “weight” for that representative. Such representatives are considered level-1 representatives. The next segment S2 is independently processed to find its k optimal median representatives. Thus, at the end of processing the second segment, one will have 2 · k such representatives. The memory requirement for storing the representatives therefore also increases with time, and after processing r segments, one will have a total of r · k representatives. When the number of representatives exceeds m, a second level of clustering is applied to this set of r · k points, except that the stored weights on the representatives are also used in the clustering process. The resulting representatives are stored as level-2 representatives. In general, when the number of level-p representatives reaches m, they are converted to k level-(p + 1) representatives. Thus, the number of representatives at all levels increases over time, though the number at higher levels increases exponentially more slowly than at the lower levels. At the end of processing the entire data stream (or whenever a clustering result is specifically required), all remaining representatives of the different levels are clustered together in one final application of the k-medians subroutine.
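The level-wise merging described above can be sketched as follows. Here, cluster_black_box is a hypothetical placeholder for any weighted k-medians (or k-medoids) subroutine that returns k representatives together with the total weight assigned to each; its name and signature are assumptions made for illustration, not part of the STREAM specification.

```python
def stream_cluster(segments, k, m, cluster_black_box):
    """Level-wise clustering of a stream of segments (sketch).

    levels[p] holds the level-(p+1) representatives as (point, weight) pairs.
    `cluster_black_box(points, weights, k)` is any weighted k-medians routine
    returning (representatives, total_weights).
    """
    levels = [[]]                                   # levels[0]: level-1 representatives
    for segment in segments:
        # cluster the raw segment (unit weights) into k level-1 representatives
        reps, wts = cluster_black_box(segment, [1] * len(segment), k)
        levels[0].extend(zip(reps, wts))
        # promote a level to the next one whenever it holds more than m representatives
        p = 0
        while len(levels[p]) > m:
            pts = [x for x, _ in levels[p]]
            ws = [w for _, w in levels[p]]
            reps, wts = cluster_black_box(pts, ws, k)
            levels[p] = []
            if p + 1 == len(levels):
                levels.append([])
            levels[p + 1].extend(zip(reps, wts))
            p += 1
    # final step: cluster all remaining representatives of every level together
    all_pts = [x for level in levels for x, _ in level]
    all_ws = [w for level in levels for _, w in level]
    final_reps, _ = cluster_black_box(all_pts, all_ws, k)
    return final_reps
```

Note that only the weighted representatives of the various levels are retained; the raw points of a segment can be discarded as soon as its k level-1 representatives have been computed, which is what bounds the memory requirement discussed above.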
The specific choice of the algorithm used for the k-medians problem is critical in ensuring a high-quality clustering. The other factor that affects the quality of the final output is the decomposition of the problem into chunks, followed by hierarchical clustering. How does such a decomposition affect the final quality of the output? It has been shown in the STREAM paper [240] that the quality of the final output cannot be arbitrarily worse than that of the particular subroutine used at the intermediate stage for k-medians clustering.
Lemma 12.4.1 Let the subroutine used for k-medians clustering in the STREAM algorithm have an approximation factor of c. Then, the STREAM algorithm will have an approximation factor of no worse than 5 · c.
A variety of solutions are possible for the k-medians problem. In principle, virtually any approximation algorithm can be used as a black box. A particularly effective solution is based on the problem of facility location. The reader is referred to the bibliographic notes for pointers to the relevant approach.
This terminology is different from the k-medians approach introduced in Chap. 6. The relevant subroutines in the STREAM algorithm are more similar to a k-medoids algorithm. Nevertheless, the “k-medians” terminology is used here to ensure consistency with the original research paper describing STREAM [240].