Discrete Math/Theory of Computing seminar April 29, 2003

Tracking Frequent Items Dynamically

Graham Cormode, DIMACS

Tuesday, April 29, 4:30 PM, CORE 431
Abstract.

Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the "hot items" in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in networking applications.

I will describe some novel solutions to this problem based on viewing this as a group testing problem. We give approaches using both adaptive and non-adaptive group testing. These algorithms maintain a small space data structure that monitors the transactions on the relation, and when required, quickly output all hot items, without rescanning the relation in the database. With user-specified probability, it is able to report all hot items. I will then go on to describe some work in progress on extensions to this model, when we wish to find items whose frequency has changed by a significant amount, either in absolute or relative terms; and also finding related items which together are "hot" but individually are not.

Back to Discrete Math/Theory of Computing seminar