Duplicate insensitive cardinality estimation in distributed databases
From wiki.eser.org
Contents |
[edit] approach: Flajolet-Martin sketches (fm sketches)
[edit] videos
- Peer to Peer Web Search with Minerva -- 2006 google video
[edit] papers
- Probabilistic counting algorithms for data base applications -- original paper on technique
- Approximate Aggregation Techniques for Sensor Databases -- improved "sumation FM sketch" technique which supports efficient execution of "bulk insertions" to the FM sketch
- A Simple and Efficient Estimation Method for Stream Expression Cardinalities
- Global Document Frequency Estimation in Peer to Peer Web Search -- discussion of the hash sketching technique used in the minerva p2p search engine
- Overlap-Aware Global df Estimation in Distributed Information Retrieval Systems -- a much more extensive discussion
- Estimating Join-Distinct Aggregates over Update Streams -- powerpoint slides
- Robust Sketching and Aggregation of Distributed Data Streams -- great overview -- discusses the following sketches:
- fm sketches
- count-min sketches
- count-min-fm sketches
- quantile digest
- quantile fm digest's
- accompanying slides
[edit] implementations
- Berkeley "P2" distributed dataflow overlay network project
- AT&T research sketch library
- AT&T research implementation of 'count sketch
- a one-pass algorithm for finding most frequent items in a datastream
- Finding Frequent Items in Data Streams -- paper
- source code
- Rutgers MassDAL Public Code Bank: Sketches, Frequent Items, Changes (Deltoids)
- java and C source code downloads of various sketching algorithm implementations
[edit] count sketches
- proposed in 2002
- problem: in sublinear space, estimate the number of distinct item IDs in a data set with only one pass
- must support union operator of results from subsets
- must support small per item processing overhead
- must support small space relative to stream space
- note: exact count is impossible without linear space
[edit] papers and presentations
- Approximate Aggregation Techniques for Sensor Databases -- very clear powerpoint presentation on both count sketches and sum sketches (used in distributed, low-power wireless sensor nets)