Sketching, Sampling and other Sublinear Algorithms: Streaming
Alex Andoni (MSR SVC)

A scenario
Challenge: compute something on the table, using small space.
Example of “something”:
• # distinct IPs
• max frequency
• other statistics…
Stream of IPs: 131.107.65.14, 18.9.22.69, 131.107.65.14, 80.97.56.20, 18.9.22.69, 80.97.56.20, 131.107.65.14

IP               Frequency
131.107.65.14    3
18.9.22.69       2
80.97.56.20      2
128.112.128.81   9
127.0.0.1        8
257.2.5.7        0
7.8.20.13        1

Sublinear: a panacea?
Sub-linear space algorithm for solving Travelling Salesperson Problem? Sorry, perhaps a different lecture.
Hard to solve sublinearly even very simple problems. Ex: what is the count of distinct IPs seen?
Will settle for:
• Approximate algorithms: $1+\epsilon$ approximation, i.e. true answer ≤ output ≤ $(1+\epsilon)$ · (true answer)
• Randomized: above holds with probability 95%
Quick and dirty way to get a sense of the data.

Streaming data
• Data through a router
• Data stored on a hard drive, or streamed remotely
  • More efficient to do a linear scan on a hard drive
  • Working memory is the (smaller) main memory

Application areas
Data can come from:
• Network logs, sensor data
• Real time data
• Search queries, served ads
• Databases (query planning)
• …

Problem 1: # distinct elements
Problem: compute the number of distinct elements in the stream.
Trivial solution: $O(m)$ space for $m$ distinct elements.
Will see: $O(\log m)$ space (approximate).
Example stream: 2 5 7 5 5

i    Frequency
2    1
5    3
7    1

Distinct Elements: idea 1 [Flajolet-Martin’85, Alon-Matias-Szegedy’96]
Algorithm:
• Hash function $h$ mapping elements into $[0,1]$
• Compute $minHash = \min_i h(i)$ over the elements $i$ of the stream
• Output is $1/minHash - 1$
“Analysis”:
• repeats of the same element $i$ don’t matter
• $E[minHash] = 1/(m+1)$, for $m$ distinct elements
(figure: the interval [0,1] with the points h(5), h(7), h(2) marked, and 1/(m+1) near the smallest one)

Algorithm DISTINCT:
  Initialize:
    minHash = 1
    hash function h into [0,1]
  Process(int i):
    if (h(i) < minHash)
      minHash = h(i);
  Output: 1/minHash - 1

Distinct Elements: idea 2
• Store minHash approximately
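A minimal Python sketch of Algorithm DISTINCT from idea 1 (exact minHash) may help here. The hash function is an assumption made for illustration: it derives a pseudo-random value in (0, 1] from salted BLAKE2b. The names `unit_hash` and `Distinct` are hypothetical and do not come from the lecture. A single minHash gives only a rough estimate.

```python
import hashlib


def unit_hash(item, salt=b"distinct"):
    """Map an item to a pseudo-random float in (0, 1] (illustrative hash, an assumption)."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=8, salt=salt).digest()
    return (int.from_bytes(digest, "big") + 1) / 2**64


class Distinct:
    """Keeps only the minimum hash value seen so far."""

    def __init__(self, salt=b"distinct"):
        self.salt = salt
        self.min_hash = 1.0  # Initialize: minHash = 1

    def process(self, item):
        # Repeats of the same item hash to the same value, so they change nothing.
        value = unit_hash(item, self.salt)
        if value < self.min_hash:
            self.min_hash = value

    def output(self):
        # E[minHash] = 1/(m+1) for m distinct elements, hence the estimate 1/minHash - 1.
        return 1.0 / self.min_hash - 1.0


with_repeats = Distinct()
for item in [2, 5, 7, 5, 5]:
    with_repeats.process(item)

without_repeats = Distinct()
for item in [2, 5, 7]:
    without_repeats.process(item)

print(with_repeats.output(), without_repeats.output())
```

Both runs print the same estimate, because the state depends only on the set of distinct elements seen, not on how often each one repeats.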
  • Store just the count of leading zeros of its binary expansion: ZEROS(x), e.g. x=0.0000001100101 has ZEROS(x) = 6
  • Need only $O(\log \log n)$ bits
• Randomness: 2-wise enough! $O(\log n)$ bits
• Better accuracy using more space: error $1+\epsilon$
  • repeat $O(1/\epsilon^2)$ times with different hash functions
  • HyperLogLog: can also do this with just one hash function [FFGM’07]

Algorithm DISTINCT (with approximate minHash):
  Initialize:
    minHash2 = 0
    hash function h into [0,1]
  Process(int i):
    if (h(i) < 1/2^minHash2)
      minHash2 = ZEROS(h(i));
  Output: 2^minHash2

Problem 2: max count, heavy hitters
Problem: compute the maximum frequency of an element in the stream.
Bad news: hard to distinguish whether an element repeated (max = 1 vs 2).
Good news: can find “heavy hitters”, i.e. elements with frequency > total frequency / s, using space proportional to s.
Example stream: 2 5 7 5 5

i    Frequency
2    1
5    3
7    1

Heavy Hitters: CountMin [Charikar-Chen-FarachColton’04, Cormode-Muthukrishnan’05]
Algorithm CountMin:
  Initialize(w, L):
    array Sketch[L][w]
    L hash functions h[L], into {0,…w-1}
  Process(int i):
    for(j=0; j<L; j++)
      Sketch[j][ h[j](i) ] += 1;
  Output:
    foreach i in PossibleIP {
      freq[i] = int.MaxValue;
      for(j=0; j<L; j++)
        freq[i] = min(freq[i], Sketch[j][h[j](i)]);
    }
    // freq[] is the frequency estimate
(figure: the stream 2 5 7 5 5 hashed by h1, h2, h3 into the rows of the sketch; resulting estimates freq[2] = 1, freq[5] = 3, freq[7] = 1, freq[11] = 1)

Heavy Hitters: analysis
• A cell Sketch[j][h[j](5)] = frequency of 5, plus “extra mass”
• Expected “extra mass” ≤ total mass / w
• Markov's inequality: “extra mass” is O(total mass / w) with probability > 1/2
• $L = O(\log n)$ to get high probability (for all elements)
• Compute heavy hitters from freq[]

Problem 3: Moments
Problem: compute frequency moment (variance) $F_2 = \sum_i f(i)^2$, or higher moments $F_k = \sum_i f(i)^k$ for $k > 2$.
• Skewness (k=3), kurtosis (k=4), etc
• a different proxy for max: $\lim_{k \to \infty} F_k^{1/k} = \max_i f(i)$

i    f(i)    f(i)^2    f(i)^4
2    1       1         1
5    3       9         81
7    2       4         16

$F_2 = 1+9+4 = 14$, $\sqrt{F_2} = 3.74$
$F_4 = 1+81+16 = 98$, $F_4^{1/4} = 3.15$

$F_2$ moment
• Use Johnson-Lindenstrauss lemma! (2nd lecture)
• Store sketch $Gf$, where $f$ = frequency vector and $G$ = $d$ by $n$ matrix of Gaussian entries
• Update on element $i$: $G(f + e_i) = Gf + Ge_i$
• Guarantees:
  • $d = O(1/\epsilon^2)$ counters (words)
  • $O(d)$ time to update
• Better: ±1 entries, $O(1)$ update [AMS’96, TZ’04]
• $F_k$: precision sampling => next

Scenario 2: distributed traffic
Statistics on traffic difference/aggregate between two routers. Eg: traffic different by how many packets?
Linearity is the power!
  Sketch(data 1) + Sketch(data 2) = Sketch(data 1 + data 2)
  Sketch(data 1) - Sketch(data 2) = Sketch(data 1 - data 2)
Two sketches should be sufficient to compute something on the difference or sum.
(figure: two routers, each with its own IP frequency table; one has 131.107.65.14: 1, 18.9.22.69: 1, 35.8.10.140: 1, the other has 131.107.65.14: 1, 18.9.22.69: 2)

Common primitive: estimate sum
Given: $n$ quantities $a_1, a_2, \ldots, a_n$ in the range [0,1]
Goal: estimate $S = a_1 + a_2 + \cdots + a_n$ “cheaply”
Standard sampling: pick random set $J = \{j_1, \ldots, j_m\}$ of size $m$
Estimator: $\tilde{S} = \frac{n}{m} \cdot (a_{j_1} + a_{j_2} + \cdots + a_{j_m})$
Chebyshev bound: with 90% success probability, $\frac{1}{2} S - O(n/m) < \tilde{S} < 2S + O(n/m)$
For constant additive error, need $m = \Omega(n)$
(figure: compute an estimate $\tilde{S}$ from $a_1, a_3$, sampled out of $a_1, a_2, a_3, a_4$)

Precision Sampling Framework
Alternative “access” to $a_i$’s:
• For each term $a_i$, we get a (rough) estimate $\tilde{a}_i$ up to some precision $u_i$, chosen in advance: $|a_i - \tilde{a}_i| < u_i$
Challenge: achieve good trade-off between
• quality of approximation to $S$
• use only weak precisions $u_i$ (minimize “cost” of estimating $\tilde{a}_i$)
(figure: compute an estimate $\tilde{S}$ from $\tilde{a}_1, \tilde{a}_2, \tilde{a}_3, \tilde{a}_4$, each within $u_i$ of $a_i$)

Formalization
1. Sum Estimator: fix precisions $u_i$
1. Adversary: fix $a_1, a_2, \ldots, a_n$
2. Adversary: fix $\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_n$ s.t. $|a_i - \tilde{a}_i| < u_i$
3. Sum Estimator: given $\tilde{a}_1, \tilde{a}_2, \ldots, \tilde{a}_n$, output $\tilde{S}$ s.t. $|\sum_i a_i - \tilde{S}| < 1$
What is cost?
• Here, average cost = $\frac{1}{n} \cdot \sum_i 1/u_i$
• to achieve precision $u_i$, use $1/u_i$ “resources”: e.g., if $a_i$ is itself a sum $a_i = \sum_j a_{ij}$ computed by subsampling, then one needs $\Theta(1/u_i)$ samples
For example, can choose all $u_i = 1/n$
• Average cost ≈ $n$

Precision Sampling Lemma [A-Krauthgamer-Onak’11]
Goal: estimate $\sum_i a_i$ from $\{\tilde{a}_i\}$ satisfying $|a_i - \tilde{a}_i| < u_i$.
Precision Sampling Lemma: can get, with 90% success:
• $O(1)$ additive error and 1.5 multiplicative error: $S - O(1) < \tilde{S} < 1.5 \cdot S + O(1)$, with average cost equal to $O(\log n)$
• more generally, $\epsilon$ additive error and $1+\epsilon$ multiplicative error: $S - \epsilon < \tilde{S} < (1+\epsilon) S + \epsilon$, with average cost equal to $O(\epsilon^{-3} \log n)$
Example: distinguish $\sum a_i = 3$ vs $\sum a_i = 0$. Consider two extreme cases:
• if three $a_i = 1$: enough to have crude approx for all ($u_i = 0.1$)
• if all $a_i = 3/n$: only few with good approx $u_i = 1/n$, and the rest with $u_i = 1$

Precision Sampling Algorithm
Algorithm (for $O(1)$ additive error and 1.5 multiplicative error):
• Choose each $u_i \in [0,1]$ i.i.d.
• Estimator: $\tilde{S}$ = count number of $i$’s s.t. $\tilde{a}_i / u_i > 6$ (up to a normalization constant)
Algorithm (for $\epsilon$ additive error and $1+\epsilon$ multiplicative error):
• Choose each $u_i \in [0,1]$ i.i.d., with distribution = minimum of $O(\epsilon^{-3})$ uniform random variables
• Estimator: $\tilde{S}$ = concrete function of $[\tilde{a}_i / u_i - 4/\epsilon]^+$ and $u_i$’s
Proof of correctness:
• we use only $\tilde{a}_i$ which are 1.5-approximation to $a_i$
• $E[\tilde{S}] \approx \sum_i \Pr[a_i / u_i > 6] = \sum_i a_i / 6$
• $E[1/u_i] = O(\log n)$ w.h.p.

Moments ($F_k$) via precision sampling
Theorem: linear sketch for $F_k$ with $O(1)$ approximation, and $O(n^{1-2/k} \log n)$ space (90% succ. prob.).
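For the special case k = 2, a linear sketch is especially simple: the ±1 counters mentioned under the F_2 moment [AMS’96]. The following Python sketch is a minimal illustration under stated assumptions. Signs come from a salted BLAKE2b hash rather than a limited-independence family, every counter is touched on each update (so an update costs time proportional to the number of counters, not O(1)), and the names `sign` and `F2Sketch` are hypothetical.

```python
import hashlib


def sign(row, item):
    """Pseudo-random +1/-1 for (row, item); an illustrative stand-in for a random sign."""
    digest = hashlib.blake2b(
        str(item).encode(), digest_size=1, salt=str(row).encode()
    ).digest()
    return 1 if digest[0] & 1 else -1


class F2Sketch:
    """Counter j holds sum_i sign(j, i) * f(i), which is linear in the frequency vector f."""

    def __init__(self, counters=64):
        self.z = [0] * counters

    def process(self, item, count=1):
        for row in range(len(self.z)):
            self.z[row] += sign(row, item) * count

    def minus(self, other):
        """Sketch of the difference of two streams (assumes equal numbers of counters)."""
        result = F2Sketch(len(self.z))
        result.z = [a - b for a, b in zip(self.z, other.z)]
        return result

    def estimate(self):
        # Each z[row]**2 has expectation F_2; averaging the rows reduces the variance.
        return sum(v * v for v in self.z) / len(self.z)


first = F2Sketch()
for item in [2, 5, 7, 5, 5, 7]:  # frequencies 2:1, 5:3, 7:2, so F_2 = 14
    first.process(item)

second = F2Sketch()
for item in [5, 5, 7, 2, 7]:  # the same stream with one occurrence of 5 missing
    second.process(item)

print(first.estimate(), first.minus(second).estimate())
```

The second printed value illustrates linearity: subtracting the two sketches yields a sketch of the difference of the two frequency vectors, without access to either stream.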
Sketch:
• Pick random $u_i \in [0,1]$, $r_i \in \{\pm 1\}$, and let $y_i = x_i \cdot r_i / u_i^{1/k}$
• throw $y_i$’s into one hash table $H$, with $m = O(n^{1-2/k} \log n)$ cells
Estimator: $\max_c |H[c]|^k$
Randomness: $O(1)$ independence suffices
(figure: x = (x1, …, x6) mapped to y1, …, y6 and hashed into H, whose cells hold y1+y4, y3, and y2+y5+y6)

Streaming++
LOTS of work in the area:
Surveys
• Muthukrishnan: http://algo.research.googlepages.com/eight.ps
• McGregor: http://people.cs.umass.edu/~mcgregor/papers/08graphmining.pdf
• Chakrabarti: http://www.cs.dartmouth.edu/~ac/Teach/CS49Fall11/Notes/lecnotes.pdf
Open problems: http://sublinear.info
Examples:
• Moments, sampling
• Median estimation, longest increasing sequence
• Graph algorithms
  • E.g., dynamic graph connectivity [AGG’12, KKM’13,…]
• Numerical algorithms (e.g., regression, SVD approximation)
  • Fastest (sparse) regression […CW’13,MM’13,KN’13,LMP’13]
  • related to Compressed Sensing
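
As a runnable companion to the CountMin pseudocode under Heavy Hitters, here is a minimal Python sketch. It rests on assumptions made for illustration: the L hash functions are simulated with salted BLAKE2b, and the class name `CountMin` and the default sizes are hypothetical, not taken from the lecture.

```python
import hashlib


class CountMin:
    def __init__(self, L=4, w=16):
        self.L, self.w = L, w
        self.sketch = [[0] * w for _ in range(L)]  # array Sketch[L][w]

    def _h(self, j, item):
        """j-th hash function into {0, ..., w-1} (salted BLAKE2b, an assumption)."""
        digest = hashlib.blake2b(
            str(item).encode(), digest_size=8, salt=str(j).encode()
        ).digest()
        return int.from_bytes(digest, "big") % self.w

    def process(self, item):
        for j in range(self.L):
            self.sketch[j][self._h(j, item)] += 1

    def estimate(self, item):
        # Every row overcounts (true frequency plus "extra mass"), so take the minimum.
        return min(self.sketch[j][self._h(j, item)] for j in range(self.L))


cm = CountMin()
stream = [2, 5, 7, 5, 5]
for item in stream:
    cm.process(item)

print({item: cm.estimate(item) for item in (2, 5, 7, 11)})
```

Estimates never fall below the true frequency; how far above they land depends on collisions, which is what the width w and the number of rows L control.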