Data Reduction for Clustering on Streams
By Harry Lang
We explore clustering problems in the streaming sliding window model in both general metric spaces and Euclidean space. We present the first polylogarithmic space O(1)-approximation to the metric k-median and metric k-means problems in the sliding window model, answering the main open problem posed by Babcock, Datar, Motwani and O'Callaghan, which has remained unanswered for over a decade. Our algorithm uses O(k^3log^6 n) space and poly(k,log n) update time. This is an exponential improvement on the space required by the technique due to Babcock, et al. We introduce a data structure that extends smooth histograms as introduced by Braverman and Ostrovsky to operate on a broader class of functions. In particular, we show that using only polylogarithmic space we can maintain a summary of the current window from which we can construct an O(1)-approximate clustering solution. Merge-and-reduce is a generic method in computational geometry for adapting offline algorithms to the insertion-only streaming model. Several well-known coreset constructions are maintainable in the insertion-only streaming model using this method, including well-known coreset techniques for the k-median and k-means problems in both low-and high-dimensional Euclidean space. Previous work has adapted coreset techniques to the insertion-deletion model, but translating them to the sliding window model has remained a challenge. We give the first algorithm that, given an insertion-only streaming coreset of space s (maintained using merge-and-reduce method), maintains this coreset in the sliding window model using O(s^2ϵ^{−2}log n)space. For clustering problems, our results constitute the first significant step towards resolving problem number 20 from the List of Open Problems in Sublinear Algorithms.