On Reservoir Sampling with Deletions. Rainer GEMULLA, Wolfgang LEHNER, Peter J. HAAS. La revue MODULAD, numéro 36, Juillet 2007
Abstract: Perhaps the most flexible synopsis of a database is a random sample of the data; such samples are widely used to speed up processing of analytic queries and data-mining tasks, enhance query optimization, and facilitate information integration. In this paper, we describe a recently proposed method for incrementally maintaining a uniform random sample of the items in a dataset in the presence of an arbitrary sequence of insertions and deletions.
Our scheme, called “random pairing” (RP), maintains a bounded-size uniform sample by using newly inserted data items to compensate for previous deletions. The RP algorithm is the first extension of the almost 40-year-old reservoir sampling algorithm to handle deletions; RP reduces to the “passive” algorithm in [1] when the insertions and deletions correspond to a moving window over a data stream. We also prove that it is not possible to “resize” a bounded-size random sample upwards without accessing the base data.
Keywords: Reservoir sampling, Sample maintenance, Synopsis.
Download paper : On Reservoir Sampling with Deletions
|