Report

Sublinear Algorihms for Big Data Lecture 2 Grigory Yaroslavtsev http://grigory.us Recap • (Markov) For every > 0: 1 Pr ≥ ≤ • (Chebyshev) For every > 0: Pr − ≥ ≤ 2 • (Chernoff) Let 1 … be independent and identically distributed r.vs with range [0, c] and 1 expectation . Then if = and 1 > > 0, 2 Pr − ≥ ≤ 2 exp − 3 Today • • • • • Approximate Median Alon-Mathias-Szegedy Sampling Frequency Moments Distinct Elements Count-Min Data Streams • Stream: elements from universe = {1, 2, … , }, e.g. 1 , 2 , … , = 〈5, 8, 1, 1, 1, 4, 3, 5, … , 10〉 • = frequency of in the stream = # of occurrences of value = 〈1 , … , 〉 Approximate Median • = {1 , … , } (all distinct) and let = | ∈ ∶ ≤ | • Problem: Find -approximate median, i.e. such that − < < + 2 2 • Exercise: Can we approximate the value of the median with additive error ± in sublinear time? • Algorithm: Return the median of a sample of size taken from (with replacement). Approximate Median • Problem: Find -approximate median, i.e. such that − < < + 2 2 • Algorithm: Return the median of a sample of size taken from (with replacement). 7 2 log 2 • Claim: If = then this algorithm gives median with probability 1 − Approximate Median • Partition into 3 groups = ∈ : ≤ − 2 = ∈ : − ≤ ≤ + 2 2 = ∈ : ≥ + 2 • Key fact: If less than elements from each of and are 2 in sample then its median is in • Let = 1 if -th sample is in and 0 otherwise. • Let = . By Chernoff, if > 7 2 log 2 1 2 2− − 3 Pr ≥ ≤ Pr ≥ 1 + ≤ ≤ 2 2 • Same for + union bound ⇒ error probability ≤ AMS Sampling • Problem: Estimate ∈[] ( ), for an arbitrary function with 0 = 0. • Estimator: Sample , where is sampled uniformly at random from [] and compute: = ≥ ∶ = Output: = ( − ( − 1)) • Expectation: = = =1 Pr = [| = ] − −1 = Frequency Moments • Define = for ∈ {0,1,2, … } – 0 = # number of distinct elements – 1 = # elements – 2 = “Gini index”, “surprise index” Frequency Moments • Define = for ∈ {0,1,2, … } • Use AMS estimator with = − − 1 = • Exercise: 0 ≤ ≤ ∗−1 , where ∗ = max • Repeat times and take average . By Chernoff: 2 Pr − ≥ ≤ 2 exp − 3 ∗−1 • Taking = 3∗−1 log 2 1 gives Pr − ≥ ≤ Frequency Moments • Lemma: ∗−1 ≤ 1−1/ 1 3∗−1 log 2 1 1− 1 log • Result: = = log 2 memory suffices for , -approximation of • Question: What if we don’t know ? • Then we can use probabilistic guessing (similar to Morris’s algorithm), replacing log with log . Frequency Moments • Lemma: ∗−1 ≤ 1−1/ • Exercise: ≥ (Hint: worst-case when 1 = ⋯ = = . Use convexity of = ). • Case 1: ∗ ≤ ∗−1 1 −1 1 1− ≤ = 1− Frequency Moments • Lemma: • ∗−1 ≤ 1−1/ Case 2: ∗ ≥ ∗−1 ∗−1 ≤ ≤ ∗ ∗ ≤ 1 1− 1− = 1 Hash Functions • Definition: A family of functions from → is -wise independent if for any distinct 1 , … , ∈ and 1 , … ∈ : 1 Pr ℎ 1 = 1 , ℎ 2 = 2 , … , ℎ = = ℎ∈ • Example: If ⊆ 0, … , − 1 , = 0, … , − 1 for prime −1 : 0 ≤ 0 , 1 , … , −1 ≤ − 1 = ℎ = =0 is a -wise independent family of hash functions. Linear Sketches • Sketching algorithm: picks a random matrix ∈ × , where ≪ and computes . • Can be incrementally updated: – We have a sketch – When arrives, new frequencies are ′ = + – Updating the sketch: ′ = + = + = + − ℎ • Need to choose random matrices carefully 2 • Problem: , -approximation for 2 = • Algorithm: 2 – Let ∈ −1,1 × , where entries of each row are 4wise independent and rows are independent – Don’t store the matrix: 4-wise independent hash functions – Compute , average squared entries “appropriately” • Analysis: – Let be any entry of . – Lemma: 2 = 2 – Lemma: Var 2 ≤ 422 2 : Expectaton • Let be a row of with entries ∈ −1,1 . 2 2 = =1 2 2 + = =1 [ ] ≠ 2 + = =1 [ ] ≠ = 2 + ≠ = 2 • We used 2-wise independence for = . 2 : Variance 2 ( 2 − 2 2 ] = ≠ 2 2 2 2 + 4 = 2 ≠ 2 2 ≠≠ 0 : Distinct Elements • Problem: (, )-approximation for 0 = 0 • Simplified: For fixed > 0, with prob. 1 − distinguish: 0 > 1 + vs. 0 < 1 − • Original problem reduces by trying values of T: = 1, 1 + , 1 + 2 , … , log 0 : Distinct Elements • Simplified: For fixed > 0, with prob. 1 − distinguish: 0 > 1 + vs. 0 < 1 − • Algorithm: – Choose random sets 1 , … , ⊆ [] where 1 Pr ∈ = – Compute = ∈ – If at least / of the values are zero, output 0 < 1 − 0 > 1 + vs. 0 < 1 − • Algorithm: – Choose random sets 1 , … , ⊆ [] where Pr ∈ = 1 – Compute = ∈ – If at least / of the values are zero, output 0 < 1 − • Analysis: – If 0 > 1 + , then Pr = 0 < – If 0 < 1 − , then Pr = 0 > – Chernoff: = 1 1 log 2 1 1 3 + 3 − gives correctness w.p. 1 − 0 > 1 + vs. 0 < 1 − • Analysis: – If 0 > 1 + , then Pr = 0 < – If 0 < 1 − , then Pr = 0 > • If is large and is small then: Pr = 0 = 1 − • If 0 > 1 + : − 0 − + ≈ ≤ − 1+ 1 ≤ − 3 ≥ − 1 − 1 ≥ + 3 • If 0 < 1 − : − 0 1 0 1 1 3 3 − 0 Count-Min Sketch • https://sites.google.com/site/countminsketch/ • Stream: elements from universe = {1, 2, … , }, e.g. 1 , 2 , … , = 〈5, 8, 1, 1, 1, 4, 3, 5, … , 10〉 • = frequency of in the stream = # of occurrences of value , = 〈1 , … , 〉 • Problems: – Point Query: For ∈ estimate – Range Query: For , ∈ [] estimate + ⋯ + – Quantile Query: For ∈ [0,1] find with 1 + ⋯ + ≈ – Heavy Hitters: For ∈ [0,1] find all with ≥ Count-Min Sketch: Construction • Let 1 , … , : → [] be 2-wise independent hash functions • We maintain ⋅ counters with values: , = # elements in the stream with = • For every the value ,() ≥ and so: ≤ = min(1,1 , … , ,1 ) • If = 2 and = 1 log 2 then: Pr ≤ ≤ + ≥ 1 − . Count-Min Sketch: Analysis • Define random variables 1 … , such that , () = + = ≠, = () • Define , = 1 if = () and 0 otherwise: = • By 2-wise independence: = ≠ , = • By Markov inequality, , ≠ ≠ Pr = 1 1 Pr ≥ ≤ = 2 ≤ Count-Min Sketch: Analysis • All are independent 1 Pr ≥ 1 ≤ ≤ ≤ = 2 • The w.p. 1 − there exists such that ≤ = min 1,1 , … , , = = min , +1 … , + ≤ + • CountMin estimates values up to ± with total memory log log 2 1 Dyadic Intervals • Define log partitions of : 0 = 1,2,3, … 1 = 1,2 , 3,4 , … , − 1, 2 = { 1,2,3,4 , 5,6,7,8 , … , { − 3, − 2, − 1, }} … Ilog n = {{1, 2,3, … , }} • Exercise: Any interval (, ) can be written as a disjoint union of at most 2 log such intervals. • Example: For = 256: 48,107 = 48,48 ∪ 49,64 ∪ 65,96 ∪ 97,104 ∪ 105,106 ∪ 107,107 Count-Min: Range Queries and Quantiles • Range Query: For , ∈ [] estimate + ⋯ • Approximate median: Find such that: 1 + ⋯ + ≥ 1 + ⋯ + −1 2 + and ≤ − 2 Count-Min: Range Queries and Quantiles • Algorithm: Construct log Count-Min sketches, one for each such that for any ∈ we have an estimate for such that: Pr ≤ ≤ + ≥ 1 − • To estimate , , let 1 … , be decomposition: [,] = 1 + ⋯ + • Hence, Pr , ≤ , ≤ 2 log ≥ 1 − 2 log Count-Min: Heavy Hitters • Heavy Hitters: For ∈ [0,1] find all with ≥ but no elements with ≤ ( − ) • Algorithm: – Consider binary tree whose leaves are [n] and associate internal nodes with intervals corresponding to descendant leaves – Compute Count-Min sketches for each – Level-by-level from root, mark children of marked nodes if ≥ – Return all marked leaves • Finds heavy-hitters in ( −1 log ) steps Thank you! • Questions?