On
Reservoir Sampling with Deletions. Rainer GEMULLA, Wolfgang
LEHNER, Peter J. HAAS.
La revue MODULAD, numéro 36, Juillet 2007
Abstract:
Perhaps the most flexible synopsis
of a database is a random sample of the data;
such samples are widely used to speed up processing of analytic
queries and data-mining
tasks, enhance query optimization, and facilitate information integration.
In this paper, we
describe a recently proposed method for incrementally maintaining
a uniform random sample
of the items in a dataset in the presence of an arbitrary sequence
of insertions and deletions.
Our scheme, called “random pairing” (RP), maintains
a bounded-size uniform sample by using newly inserted
data items to compensate for previous deletions. The RP algorithm
is the first extension of the almost 40-year-old reservoir
sampling algorithm to handle deletions; RP reduces
to the “passive” algorithm in [1] when the
insertions and deletions correspond to a moving window
over a data stream. We also prove that it is not possible to “resize” a
bounded-size random sample upwards without accessing the base data.
Keywords: Reservoir sampling, Sample maintenance,
Synopsis.
Download paper : On Reservoir
Sampling with Deletions
|