### ICDE2012_TopKPairs - School of Computer Science and

```Efficiently Monitoring Top-k
Pairs over Sliding Windows
Computer Science and Engineering
Presented By: Zhitao Shen1
Joint work with
Muhammad Aamir Cheema1, Xuemin Lin21, Wenjie Zhang1, Haixun Wang3
1The
University of New South Wales, Australia
2 East China Normal University
3 Microsoft Research Asia
Introduction
Top-k Pairs Query:
• Given a scoring function score() that computes the score of a pair of
objects, return k pairs of objects with the smallest scores.
Examples:
• k closest pairs queries
• k furthest pairs queries
Top-k Pairs against sliding windows
• Given a data stream, return top-k pairs among the most recent N objects.
Applications
• Wireless sensor network, stock market, traffic monitoring and transaction
monitoring
2
Motivation
No existing work for general pairs queries over sliding windows
Select a.id, b.id from trans a, trans b
Support arbitrary scoring functions.
where a.id <> b.id and a.account = b.account
order by |a.time - b.time| - dist(a.loc, b.loc)
limit k
Example:
window [24 hours]
Fraud detection over transaction streams
– Query the transaction pairs that have small time difference but the
locations are far away.
203-13845
10:15:20
New York
\$1000
203-13845
3
10:18:10
L.A.
\$1000
Problem Definitions (Preliminaries)
Sliding Windows
– A sliding window contains most recent N objects of
the data stream.
– The number of pairs is N(N – 1) / 2
The age of a pair
depends on the
older object.
older
. . . . .
o7
o6
Age of an object: 5
o5
o4
o3
o2
o01
4 Sliding
3 window
2
1
of size0 5
Lower bound runtime cost : O(N) for each new object
Lower bound storage cost : O(N)
4
Contributions
Unified framework
• First to study top-k pairs queries over sliding windows.
• Support arbitrarily complex scoring functions
• Support efficient queries for any window size n ≤ N and any k ≤ K
Lower bound
5
Expected cost for our algorithms
Storage requirement
O(N)
O(N) + O(K log(N/K)) for each
scoring function
Skyband maintenance
cost for each object
O(N)
O(N (log (log N) + log K))
O(k)
O(log(log n) + log K + k)
Preliminaries
o4
Score
Map all the pairs to an age–score space
Top-2 pairs
efficiently
maintain the
K-skyband
use the Kskyband to
efficiently
obtain top-k
pairs against
any sliding
window n ≤ N
p6
p1
p9
p8
p2
p7
p4
2
3
o1
p2 dominates p5 because
p2.score < p5.score and p2
expires no later than p5.
Naive: O(N |SKB|) for
checking all N-1 pairs
Our: O(N log|SKB|)
4
Age
K-skyband[Papadias et al., TODS05] keeps the minimum set for the
candidate results.
6
o0
Expected size of skyband is
O(K log(N/K))
p5
1
o2
p1(o0, o1) (p1.age, p1.score)
(1, 3)
p10
p3
o3
Efficient Skyband Maintenance
p5 s1
How can we efficiently
compute the K-staircase
and K-skyband?
Update the K-staircase
and K-skyband in
O(|SKB| log K)),
Score
Can we find a boundary between the
skyband points and non-skyband points?
p1
p2
K-staircase
K-staircase
p7
s2
p6
Check if a pair is
dominated by K-skyband
in O(log |SKB|) time for
each new pair by doing
binary search.
p3
s1
p5
p4
2-skyband
7
s2
Age
p1
Score
Can we do better for
any sliding window
size n < N?
Use Priority Search
Tree to index the
skyband points
AnyWindow
windowsize
size==Nn < N
p2
p1
p3
p5
p1
p4
p7
p6
p5
p7
2-skyband
p2
3
p6
1
p8
p8
8
6
2
4
p3
p4
9
8
5
Priority Search Tree
Age
Self-balancing tree
Efficient 3-sides range query
Score
Our contribution: Retrieve
top-k pairs in the 1-sided
range.
An algorithm similar to
post-order traversal
costs O(log|SKB| + k)
Any window size = n < N
p2
p1
p3
p5
p1
p4
p7
p6
p5
p7
2-skyband
p2
3
p6
1
p8
p8
9
6
2
4
p3
p4
5
Priority Search Tree
Age
8
9
What else in the paper?
Efficient continuous queries on the skyband.
• Continuously monitoring the top-k results for any fixed k (k ≤ K) and
n (n ≤ N).
• Amortized O(k/n (log |SKB| + k)) time per update.
Optimization on monotonic scoring functions.
• Handling the k-closest pairs, k-furthest pairs queries.
• Applying Threshold Algorithm on sorted lists
• Improving the number of considered pairs for each new object from
N to (d+1) N d/(d+1) K 1/(d+1)
10
Experimental Settings
Real dataset.
– Sensor data in the Intel research lab
– 2.3 million records.
score(ox , o y ) 
| o x .time- o y .time|
| o x .temp- o y .temp| | o x .humidity- o y .humidity|
Synthetic data.
– Uniform, correlated and anti-correlated distributions.
– 2 million objects
– Closest and furthest pairs in Manhattan distance
11
Experiments (Overall Cost on real data)
SCase: our algorithm using K-staircase to maintain the skyband.
Naïve: maintains kN pairs and sort them on their scores.
LB: shows lower bound cost
Varying K
12
Varying N (in thousands)
Linear: scan the skyband points to find the top-k pairs.
Snapshot: our snapshot query algorithm.
Continuous: our continuous query algorithm.
LB: an algorithm to obtain top-k results in O(k) time.
Varying K
13
Varying |Q| (in thousands)
Conclusion:
• First to study a broad class of top-k pairs queries over
sliding windows.
• We present efficient algorithms and show that the
performance of our algorithm is reasonably close to the
lower bound cost.
• We provide extensive experiment results on both real
and synthetic data sets to show the efficiency and
scalability of the proposed algorithms.
14
Thank You!
Any Questions?
15
Related Work
Top-k Query Processing
• Fagin’s Algorithm (FA), threshold Algorithm (TA), no-random access
(NRA)
Top-k Pairs Queries Processing
• k-closest pairs queries
• k-furthest pairs queries
• Top-k pairs queries [Cheema et al., ICDE’11]
Data Stream Processing
• Top-k query processing over data stream [Mouratidis et al.,
SIGMOD’06]
• k-nearest neighbour queries [Böhm et al., ICDE’07]
16
Experiments (Skyband Maintenance algorithm)
Basic: maintening algorithm without K-staircase
SCase: our algorithm using K-staircase to maintain the skyband.
TA: Optimized algorithm for monotonic scoring functions.
LB: show lower bound cost
Varying K
17
# of attributes
```