
Sublinear Algorithms for Big Data
Lecture 2
Grigory Yaroslavtsev
http://grigory.us
Recap
• (Markov) For every a > 0:
  Pr[X ≥ a ⋅ E[X]] ≤ 1/a
• (Chebyshev) For every a > 0:
  Pr[|X − E[X]| ≥ a] ≤ Var[X] / a²
• (Chernoff) Let X_1, …, X_t be independent and identically distributed r.v.s with range [0, c] and expectation μ. Then if X = (1/t) ∑_i X_i and 1 > a > 0,
  Pr[|X − μ| ≥ aμ] ≤ 2 exp(−a²μt / 3c)
Today
• Approximate Median
• Alon-Matias-Szegedy Sampling
• Frequency Moments
• Distinct Elements
• Count-Min
Data Streams
• Stream: m elements from universe [n] = {1, 2, …, n}, e.g.
  ⟨x_1, x_2, …, x_m⟩ = ⟨5, 8, 1, 1, 1, 4, 3, 5, …, 10⟩
• f_i = frequency of i in the stream = # of occurrences of value i
  f = ⟨f_1, …, f_n⟩
Approximate Median
• S = {x_1, …, x_n} (all distinct) and let r(x) = |{y ∈ S : y ≤ x}|
• Problem: Find an ε-approximate median, i.e. x such that
  n/2 − εn < r(x) < n/2 + εn
• Exercise: Can we approximate the value of the median with additive error ±εn in sublinear time?
• Algorithm: Return the median of a sample of size t taken from S (with replacement).
Approximate Median
• Problem: Find an ε-approximate median, i.e. x such that
  n/2 − εn < r(x) < n/2 + εn
• Algorithm: Return the median of a sample of size t taken from S (with replacement).
• Claim: If t = (7/ε²) log(2/δ) then this algorithm gives an ε-approximate median with probability 1 − δ.
Approximate Median
• Partition S into 3 groups:
  S_L = {x ∈ S : r(x) ≤ n/2 − εn}
  S_M = {x ∈ S : n/2 − εn < r(x) < n/2 + εn}
  S_R = {x ∈ S : r(x) ≥ n/2 + εn}
• Key fact: If less than t/2 elements from each of S_L and S_R are in the sample then its median is in S_M.
• Let X_i = 1 if the i-th sample is in S_L and 0 otherwise.
• Let X = ∑_i X_i. In the worst case E[X] = (1/2 − ε)t.
• By Chernoff, if t > (7/ε²) log(2/δ) (and ε ≤ 1/4):
  Pr[X ≥ t/2] ≤ Pr[|X − E[X]| ≥ 2ε E[X]] ≤ 2 exp(−4ε²(1/2 − ε)t / 3) ≤ δ/2
• Same for S_R + union bound ⇒ error probability ≤ δ
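The sampling algorithm above can be sketched in a few lines of Python (a minimal sketch; the fixed RNG seed is an illustration choice, not part of the algorithm):

```python
import math
import random

def approx_median(S, eps, delta, rng=random.Random(0)):
    # Sample t = (7/eps^2) * log(2/delta) elements with replacement
    # and return the median of the sample.
    t = math.ceil((7 / eps**2) * math.log(2 / delta))
    sample = sorted(rng.choice(S) for _ in range(t))
    return sample[len(sample) // 2]
```

Note that the sample size t depends only on ε and δ, not on n, which is what makes the algorithm sublinear.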
AMS Sampling
• Problem: Estimate ∑_{i∈[n]} g(f_i), for an arbitrary function g with g(0) = 0.
• Estimator: Sample x_J, where J is sampled uniformly at random from [m], and compute:
  r = |{j ≥ J : x_j = x_J}|
  Output: X = m (g(r) − g(r − 1))
• Expectation:
  E[X] = ∑_{i=1}^n Pr[x_J = i] ⋅ E[X | x_J = i]
       = ∑_{i=1}^n (f_i / m) ⋅ (1/f_i) ∑_{r=1}^{f_i} m (g(r) − g(r − 1))
       = ∑_{i=1}^n g(f_i)
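A single AMS estimate can be sketched as follows (a minimal, non-streaming sketch that stores the stream in a list to keep the logic visible; averaging many independent estimates approximates ∑_i g(f_i)):

```python
import random

def ams_estimate(stream, g, rng=random.Random(0)):
    # Pick a uniformly random position J; let r be the number of
    # occurrences of stream[J] at or after position J;
    # output m * (g(r) - g(r-1)).
    m = len(stream)
    J = rng.randrange(m)
    v = stream[J]
    r = sum(1 for x in stream[J:] if x == v)
    return m * (g(r) - g(r - 1))
```

In a true one-pass implementation, J is chosen via reservoir sampling and r is maintained as a counter from the moment position J is selected.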
Frequency Moments
• Define F_k = ∑_i f_i^k for k ∈ {0, 1, 2, …}
  – F_0 = # of distinct elements
  – F_1 = # of elements
  – F_2 = “Gini index”, “surprise index”
Frequency Moments
• Define F_k = ∑_i f_i^k for k ∈ {0, 1, 2, …}
• Use the AMS estimator with g(x) = x^k:
  X = m (r^k − (r − 1)^k)
• Exercise: 0 ≤ X ≤ m k f*^{k−1}, where f* = max_i f_i
• Repeat t times and take the average Z. By Chernoff:
  Pr[|Z − F_k| ≥ ε F_k] ≤ 2 exp(−ε² F_k t / (3 m k f*^{k−1}))
• Taking t = 3 k m f*^{k−1} log(2/δ) / (ε² F_k) gives Pr[|Z − F_k| ≥ ε F_k] ≤ δ
Frequency Moments
• Lemma:
  m f*^{k−1} / F_k ≤ n^{1−1/k}
• Result: t = 3 k m f*^{k−1} log(2/δ) / (ε² F_k) ≤ (3 k n^{1−1/k} / ε²) log(2/δ), so O(k n^{1−1/k} ε^{−2} log(1/δ)) parallel estimators (each using O(log n + log m) bits) suffice for an (ε, δ)-approximation of F_k
• Question: What if we don’t know m?
• Then we can use probabilistic guessing (similar to Morris’s algorithm), replacing log m with log nm.
Frequency Moments
• Lemma:
  m f*^{k−1} / F_k ≤ n^{1−1/k}
• Exercise: F_k ≥ m^k / n^{k−1} (Hint: worst case when f_1 = ⋯ = f_n = m/n. Use convexity of g(x) = x^k.)
• Case 1: f* ≤ m / n^{1−1/k}. Then:
  m f*^{k−1} ≤ m^k / n^{(k−1)(1−1/k)} ≤ n^{k−1} F_k / n^{(k−1)²/k} = n^{1−1/k} F_k
Frequency Moments
• Lemma:
  m f*^{k−1} / F_k ≤ n^{1−1/k}
• Case 2: f* ≥ m / n^{1−1/k}. Then, since F_k ≥ f*^k:
  m f*^{k−1} = (m / f*) ⋅ f*^k ≤ (m / f*) ⋅ F_k ≤ n^{1−1/k} F_k
Hash Functions
• Definition: A family H of functions from A → B is k-wise independent if for any distinct x_1, …, x_k ∈ A and any y_1, …, y_k ∈ B:
  Pr_{h∈H}[h(x_1) = y_1, h(x_2) = y_2, …, h(x_k) = y_k] = 1/|B|^k
• Example: If A ⊆ {0, …, p − 1}, B = {0, …, p − 1} for prime p:
  H = { h(x) = ∑_{i=0}^{k−1} a_i x^i mod p : 0 ≤ a_0, a_1, …, a_{k−1} ≤ p − 1 }
  is a k-wise independent family of hash functions.
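The polynomial family above is straightforward to implement (a minimal sketch; the particular prime P is an assumption and must exceed the universe size):

```python
import random

P = 2**61 - 1  # a Mersenne prime, assumed larger than the universe size

def make_kwise_hash(k, rng=random.Random(0)):
    # h(x) = a_0 + a_1*x + ... + a_{k-1}*x^{k-1} mod P,
    # with coefficients drawn uniformly from {0, ..., P-1}.
    coeffs = [rng.randrange(P) for _ in range(k)]
    def h(x):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule
            acc = (acc * x + a) % P
        return acc
    return h
```

Each draw from the family costs only k random coefficients (O(k log p) bits), which is what lets streaming algorithms avoid storing an explicit random function.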
Linear Sketches
• Sketching algorithm: pick a random matrix M ∈ ℝ^{k×n}, where k ≪ n, and compute Mf.
• Can be incrementally updated:
  – We have a sketch Mf
  – When i arrives, the new frequency vector is f′ = f + e_i
  – Updating the sketch: Mf′ = M(f + e_i) = Mf + Me_i = Mf + (i-th column of M)
• Need to choose random matrices carefully
• Problem: (ε, δ)-approximation for F_2 = ∑_i f_i²
• Algorithm:
  – Let M ∈ {−1, 1}^{k×n}, where the entries of each row are 4-wise independent and the rows are independent
  – Don’t store the matrix: k 4-wise independent hash functions
  – Compute Mf, average the squared entries “appropriately”
• Analysis:
  – Let Y be any entry of Mf.
  – Lemma: E[Y²] = F_2
  – Lemma: Var[Y²] ≤ 2F_2²
F_2: Expectation
• Let σ be a row of M with entries σ_i ∈ {−1, 1}, so Y = ∑_{i=1}^n σ_i f_i.
• E[Y²] = E[(∑_{i=1}^n σ_i f_i)²]
       = ∑_{i=1}^n f_i² E[σ_i²] + ∑_{i≠j} f_i f_j E[σ_i σ_j]
       = ∑_{i=1}^n f_i² + ∑_{i≠j} f_i f_j E[σ_i] E[σ_j]
       = F_2
• We used 2-wise independence for E[σ_i σ_j] = E[σ_i] E[σ_j] = 0.
F_2: Variance
• Var[Y²] = E[(Y² − F_2)²] = E[Y⁴] − F_2²
• E[Y⁴] = ∑_i f_i⁴ + 3 ∑_{i≠j} f_i² f_j²
  (terms where some σ appears an odd number of times vanish by 4-wise independence)
• Hence Var[Y²] = 2 ∑_{i≠j} f_i² f_j² ≤ 2F_2²
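The two lemmas justify the "tug-of-war" estimator. A minimal sketch follows; generating the ±1 signs from the low bit of a degree-3 polynomial hash is an illustration choice (it is only approximately unbiased), and a plain mean is used where a production version would take a median of means:

```python
import random
import statistics

P = 2**31 - 1  # prime modulus, assumed larger than the universe size

def make_sign_hash(rng):
    # 4-wise independent-style +/-1 signs from a random degree-3
    # polynomial mod P (low bit decides the sign).
    coeffs = [rng.randrange(P) for _ in range(4)]
    def sigma(x):
        acc = 0
        for a in reversed(coeffs):  # Horner's rule
            acc = (acc * x + a) % P
        return 1 if acc % 2 == 0 else -1
    return sigma

def f2_estimate(stream, k, rng=random.Random(0)):
    # k independent rows; each maintains Y = sum_i sigma(i) * f_i,
    # updated by +/-1 per stream element; estimate F_2 as mean of Y^2.
    signs = [make_sign_hash(rng) for _ in range(k)]
    Y = [0] * k
    for x in stream:
        for j in range(k):
            Y[j] += signs[j](x)
    return statistics.mean(y * y for y in Y)
```

Each row is one entry of Mf maintained incrementally, exactly as in the linear-sketch update rule above.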
F_0: Distinct Elements
• Problem: (ε, δ)-approximation for F_0 = |{i : f_i > 0}|
• Simplified: For fixed T > 0, with prob. 1 − δ distinguish:
  F_0 > (1 + ε)T vs. F_0 < (1 − ε)T
• The original problem reduces to this by trying O(log n / ε) values of T:
  T = 1, (1 + ε), (1 + ε)², …, n
F_0: Distinct Elements
• Simplified: For fixed T > 0, with prob. 1 − δ distinguish:
  F_0 > (1 + ε)T vs. F_0 < (1 − ε)T
• Algorithm:
  – Choose random sets S_1, …, S_t ⊆ [n] where Pr[i ∈ S_j] = 1/T
  – Compute Y_j = ∑_{i∈S_j} f_i
  – If at least t/e of the values Y_j are zero, output F_0 < (1 − ε)T; otherwise output F_0 > (1 + ε)T
F_0 > (1 + ε)T vs. F_0 < (1 − ε)T
• Algorithm:
  – Choose random sets S_1, …, S_t ⊆ [n] where Pr[i ∈ S_j] = 1/T
  – Compute Y_j = ∑_{i∈S_j} f_i
  – If at least t/e of the values Y_j are zero, output F_0 < (1 − ε)T; otherwise output F_0 > (1 + ε)T
• Analysis:
  – If F_0 > (1 + ε)T, then Pr[Y_j = 0] < 1/e − ε/3
  – If F_0 < (1 − ε)T, then Pr[Y_j = 0] > 1/e + ε/3
  – Chernoff: t = O((1/ε²) log(1/δ)) gives correctness w.p. 1 − δ
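The distinguisher can be sketched directly (a minimal sketch: the sets S_j are sampled lazily per element, and the constant 9 in t is an illustrative choice for the O(ε⁻² log(1/δ)) repetition count):

```python
import math
import random

def f0_distinguish(stream, T, eps, delta, rng=random.Random(0)):
    # t = O(eps^-2 log(1/delta)) random sets; each i is in S_j w.p. 1/T.
    t = math.ceil((9 / eps**2) * math.log(1 / delta))
    zeros = 0
    for _ in range(t):
        member = {}  # lazily sampled membership of values in S_j
        Y = 0        # Y_j = sum of f_i over i in S_j
        for x in stream:
            if x not in member:
                member[x] = rng.random() < 1 / T
            if member[x]:
                Y += 1
        zeros += (Y == 0)
    # many zero counters => few distinct elements
    return "small" if zeros >= t / math.e else "large"
```

Here "small" stands for the output F_0 < (1 − ε)T and "large" for F_0 > (1 + ε)T.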
F_0 > (1 + ε)T vs. F_0 < (1 − ε)T
• Analysis:
  – If F_0 > (1 + ε)T, then Pr[Y_j = 0] < 1/e − ε/3
  – If F_0 < (1 − ε)T, then Pr[Y_j = 0] > 1/e + ε/3
• If T is large and ε is small then:
  Pr[Y_j = 0] = (1 − 1/T)^{F_0} ≈ e^{−F_0/T}
• If F_0 > (1 + ε)T:
  e^{−F_0/T} ≤ e^{−(1+ε)} ≈ (1/e)(1 − ε) ≤ 1/e − ε/3
• If F_0 < (1 − ε)T:
  e^{−F_0/T} ≥ e^{−(1−ε)} ≈ (1/e)(1 + ε) ≥ 1/e + ε/3
Count-Min Sketch
• Stream: m elements from universe [n] = {1, 2, …, n}, e.g.
  ⟨x_1, x_2, …, x_m⟩ = ⟨5, 8, 1, 1, 1, 4, 3, 5, …, 10⟩
• f_i = frequency of i in the stream = # of occurrences of value i, f = ⟨f_1, …, f_n⟩
• Problems:
  – Point Query: For i ∈ [n] estimate f_i
  – Range Query: For i, j ∈ [n] estimate f_i + ⋯ + f_j
  – Quantile Query: For φ ∈ [0, 1] find j with f_1 + ⋯ + f_j ≈ φm
  – Heavy Hitters: For φ ∈ [0, 1] find all i with f_i ≥ φm
Count-Min Sketch: Construction
• Let h_1, …, h_d : [n] → [w] be 2-wise independent hash functions
• We maintain d ⋅ w counters with values:
  c_{j,v} = # of elements e in the stream with h_j(e) = v
• For every i the value c_{j,h_j(i)} ≥ f_i and so:
  f_i ≤ f̃_i = min(c_{1,h_1(i)}, …, c_{d,h_d(i)})
• If w = 2/ε and d = log_2(1/δ) then:
  Pr[f_i ≤ f̃_i ≤ f_i + εm] ≥ 1 − δ
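The construction maps directly to code. A minimal sketch (the slides call for 2-wise independent hashes; the affine mod-prime family used here provides that, and the fixed seed is an illustration choice):

```python
import math
import random

class CountMin:
    # d = ceil(log2(1/delta)) rows of w = ceil(2/eps) counters;
    # row hashes h(x) = ((a*x + b) mod P) mod w are 2-wise independent.
    P = 2**31 - 1

    def __init__(self, eps, delta, rng=random.Random(0)):
        self.w = math.ceil(2 / eps)
        self.d = math.ceil(math.log2(1 / delta))
        self.counters = [[0] * self.w for _ in range(self.d)]
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(self.d)]

    def _h(self, j, x):
        a, b = self.hashes[j]
        return ((a * x + b) % self.P) % self.w

    def add(self, x):
        for j in range(self.d):
            self.counters[j][self._h(j, x)] += 1

    def query(self, x):
        # min over rows: never underestimates f_x,
        # and overestimates by at most eps*m w.p. >= 1 - delta
        return min(self.counters[j][self._h(j, x)] for j in range(self.d))
```

The sketch uses d ⋅ w counters regardless of n, which is the whole point: memory depends on ε and δ, not on the universe size.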
Count-Min Sketch: Analysis
• Define random variables X_1, …, X_d such that c_{j,h_j(i)} = f_i + X_j, where:
  X_j = ∑_{k≠i : h_j(k) = h_j(i)} f_k
• Define Y_{j,k} = 1 if h_j(k) = h_j(i) and 0 otherwise:
  X_j = ∑_{k≠i} f_k Y_{j,k}
• By 2-wise independence:
  E[X_j] = ∑_{k≠i} f_k E[Y_{j,k}] = ∑_{k≠i} f_k / w ≤ εm/2
• By Markov's inequality:
  Pr[X_j ≥ εm] ≤ E[X_j] / εm ≤ 1/2
Count-Min Sketch: Analysis
• All X_j are independent, so:
  Pr[X_j ≥ εm for all 1 ≤ j ≤ d] ≤ 1/2^d = δ
• Thus w.p. 1 − δ there exists j such that X_j ≤ εm:
  f̃_i = min(c_{1,h_1(i)}, …, c_{d,h_d(i)}) = min(f_i + X_1, …, f_i + X_d) ≤ f_i + εm
• Count-Min estimates the values f_i up to ±εm with total memory O((1/ε) log(1/δ) log m)
Count-Min: Range Queries and Quantiles
• Define log n partitions of [n] into dyadic intervals:
  I_0 = {{1}, {2}, {3}, …, {n}}
  I_1 = {{1, 2}, {3, 4}, …, {n − 1, n}}
  I_2 = {{1, 2, 3, 4}, {5, 6, 7, 8}, …, {n − 3, n − 2, n − 1, n}}
  …
  I_{log n} = {{1, 2, 3, …, n}}
• Exercise: Any interval (i, j) can be written as a disjoint union of at most 2 log n such intervals.
• Example: For n = 256: [48, 107] = [48, 48] ∪ [49, 64] ∪ [65, 96] ∪ [97, 104] ∪ [105, 106] ∪ [107, 107]
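The decomposition in the exercise can be computed greedily, always peeling off the largest dyadic block that starts at the left endpoint (a minimal sketch with 1-based inclusive intervals):

```python
def dyadic_decompose(i, j):
    # A dyadic block of size 2^k may start only at positions
    # congruent to 1 mod 2^k. Greedily take the largest such
    # block starting at i that fits inside [i, j].
    out = []
    while i <= j:
        size = 1
        while (i - 1) % (2 * size) == 0 and i + 2 * size - 1 <= j:
            size *= 2
        out.append((i, i + size - 1))
        i += size
    return out
```

On the slide's example, dyadic_decompose(48, 107) reproduces the six intervals shown above.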
Count-Min: Range Queries and Quantiles
• Range Query: For i, j ∈ [n] estimate f_i + ⋯ + f_j
• Approximate median: Find j such that:
  f_1 + ⋯ + f_j ≥ m/2 − εm and
  f_1 + ⋯ + f_{j−1} ≤ m/2 + εm
Count-Min: Range Queries and Quantiles
• Algorithm: Construct log n Count-Min sketches, one for each partition I_k, so that for any I ∈ I_k we have an estimate f̃_I of f_I = ∑_{i∈I} f_i with:
  Pr[f_I ≤ f̃_I ≤ f_I + εm] ≥ 1 − δ
• To estimate [i, j], let I_1, …, I_ℓ (ℓ ≤ 2 log n) be its dyadic decomposition:
  f_{[i,j]} = f_{I_1} + ⋯ + f_{I_ℓ}
• Hence:
  Pr[f_{[i,j]} ≤ f̃_{[i,j]} ≤ f_{[i,j]} + 2εm log n] ≥ 1 − 2δ log n
Count-Min: Heavy Hitters
• Heavy Hitters: For φ ∈ [0, 1] find all i with f_i ≥ φm but no elements with f_i ≤ (φ − ε)m
• Algorithm:
  – Consider a binary tree whose leaves are [n] and associate internal nodes with the intervals corresponding to their descendant leaves
  – Compute Count-Min sketches for each level
  – Level by level from the root, mark children I of marked nodes if f̃_I ≥ φm
  – Return all marked leaves
• Finds heavy hitters in O(φ^{−1} log n) steps
Thank you!
• Questions?