
Leveraging Big Data: Lecture 2
http://www.cohenwang.com/edith/bigdataclass2013
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

Counting Distinct Elements

Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4
Elements occur multiple times; we want to count the number of distinct elements.
The number of distinct elements is n (= 6 in the example).
The total number of elements is 11 in this example.
Exact counting of distinct elements requires a structure of size Ω(n)!
We are happy with an approximate count that uses small working memory.

Distinct Elements: Approximate Counting

We want to be able to compute and maintain a small sketch s(N) of the set N of distinct items seen so far: N = {32, 12, 14, 7, 6, 4}.
- Size of sketch: |s(N)| ≪ |N| = n.
- We can query s(N) to get a good estimate n̂(s) of n (small relative error).
- For a new element x, it is easy to compute s(N ∪ {x}) from s(N) and x - for data stream computation.
- If N1 and N2 are (possibly overlapping) sets, we can compute the union sketch from their sketches: s(N1 ∪ N2) from s(N1) and s(N2) - for distributed computation.

Size-estimation / minimum-value technique [Flajolet-Martin 85, C 94]:
h(x) ∼ U[0,1] is a random hash function from element IDs to uniform random numbers in [0,1].
Maintain the min-hash value y:
  Initialize y ← 1.
  Processing an element x: y ← min{y, h(x)}.

Example (stream 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4):
  h(x):          0.45  0.35  0.74  0.45  0.21  0.35  0.45  0.21  0.14  0.35  0.92
  y:             0.45  0.35  0.35  0.35  0.21  0.21  0.21  0.21  0.14  0.14  0.14
  distinct so far:  1     2     3     3     4     4     4     4     5     5     6

The minimum hash value y = min_{x∈N} h(x):
- is unaffected by repeated elements;
- is non-increasing with the number of distinct elements n.

How does the minimum hash give information on the number of distinct elements n?
The expectation of the minimum is E[y] = 1/(n+1).
A single value gives only limited information.
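The min-hash maintenance above is tiny. The following sketch (helper names are illustrative; a seeded SHA-256 digest stands in for the random hash h ∼ U[0,1]) shows that the minimum is unaffected by repetitions and depends only on the set of distinct elements:

```python
import hashlib

def h(x, salt="seed0"):
    """Deterministic stand-in for a random hash of an element ID to [0,1)."""
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def min_hash(stream, salt="seed0"):
    """Maintain y <- min(y, h(x)) over a stream; y is initialized to 1."""
    y = 1.0
    for x in stream:
        y = min(y, h(x, salt))
    return y

stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]
# Repetitions and order do not matter: y depends only on the distinct set.
assert min_hash(stream) == min_hash(sorted(set(stream)))
```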
To boost information, we maintain k ≥ 1 values.

Why is the expectation 1/(n+1)? Take a circle of length 1. Throw a random red point to "mark" the start of a segment (circle points map to [0,1]). Throw the n points independently at random. The circle is cut into n+1 segments by these points, and the expected length of each segment is 1/(n+1). This holds in particular for the segment clockwise from the red point, whose length is distributed like the minimum of n uniform values.

Min-Hash Sketches
These sketches maintain k ≥ 1 values y1, y2, ..., yk from the range of the hash function (distribution).
- k-mins sketch: use k "independent" hash functions h1, h2, ..., hk; track the respective minimum y1, y2, ..., yk of each function.
- Bottom-k sketch: use a single hash function h; track the k smallest values y1, y2, ..., yk.
- k-partition sketch: use a single hash function h'; use the first log2 k bits of h'(x) to map x uniformly to one of k parts, and call the remaining bits h(x). For i = 1, ..., k, track the minimum hash value yi of the elements in part i.
All three sketches are the same for k = 1.

Why study all 3 variants? They offer different tradeoffs between update cost, accuracy, and usage.
Beyond distinct counting, Min-Hash sketches correspond to sampling schemes of large data sets: similarity queries between datasets, selectivity/subset queries. These patterns generally apply as methods to gather increased confidence from a random "projection"/sample.

Min-Hash Sketches: Examples (k-mins, k-partition, bottom-k)
N = {32, 12, 14, 7, 6, 4}
The min-hash value and sketches depend only on:
- the random hash function(s);
- the set N of distinct elements;
not on the order in which elements appear or their multiplicity.

Min-Hash Sketches: Example k-mins (k = 3)
  x:      32    12    14    7     6     4
  h1(x):  0.45  0.35  0.74  0.21  0.14  0.92
  h2(x):  0.19  0.51  0.07  0.70  0.55  0.20
  h3(x):  0.10  0.71  0.93  0.50  0.89  0.18
  (y1, y2, y3) = (0.14, 0.07, 0.10)

Min-Hash Sketches: k-mins
Use k "independent" hash functions h1, h2, ..., hk; track the respective minimum y1, y2, ..., yk for each function.
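The three sketch variants can be sketched in a few lines each (a minimal illustration; the salted-hash helper is a stand-in for the random hash functions, and part selection via `int(h*k)` stands in for taking the first log2 k bits):

```python
import hashlib

def h(x, salt):
    """Deterministic stand-in for a random hash into [0,1)."""
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def k_mins(stream, k):
    """k independent hash functions; track the minimum of each."""
    ys = [1.0] * k
    for x in stream:
        for i in range(k):
            ys[i] = min(ys[i], h(x, f"h{i}"))
    return ys

def bottom_k(stream, k):
    """Single hash function; track the k smallest values."""
    ys = []
    for x in stream:
        v = h(x, "h")
        if v not in ys:  # distinct elements get distinct hashes (w.h.p.)
            ys = sorted(ys + [v])[:k]
    return ys

def k_partition(stream, k):
    """Single hash; part from one salted hash, value from another."""
    ys = [1.0] * k
    for x in stream:
        i = int(h(x, "part") * k)  # stand-in for the first log2(k) bits
        ys[i] = min(ys[i], h(x, "val"))
    return ys

stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]
```

All three outputs are invariant to element order and multiplicity, as the slides note.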
Processing a new element x: for i = 1, ..., k: yi ← min{yi, hi(x)}.
Example, processing x = 12: h1(12) = 0.35, h2(12) = 0.51, h3(12) = 0.71.
Computation: O(k) per element, whether or not the sketch is actually updated.

Min-Hash Sketches: Example k-partition (k = 3)
  x:                  32    12    14    7     6     4
  part (part-hash):   1     2     1     3     2     3
  h(x) (value-hash):  0.07  0.70  0.55  0.20  0.19  0.51
  (y1, y2, y3) = (0.07, 0.19, 0.20)

Min-Hash Sketches: k-partition
Use a single hash function h'. Use the first log2 k bits of h'(x) to map x uniformly to one of k parts; call the remaining bits h(x). For i = 1, ..., k, track the minimum value-hash yi of the elements in part i.
Processing a new element x:
  i ← first log2 k bits of h'(x)
  h ← remaining bits of h'(x)
  yi ← min{yi, h}
Example, processing x = 6: i = 2, h = 0.19, so y2 ← min{y2, 0.19}.
Computation: O(1) to test or update.

Min-Hash Sketches: Example bottom-k (k = 3)
  x:     32    12    14    7     6     4
  h(x):  0.19  0.51  0.07  0.70  0.55  0.20
  (y1, y2, y3) = (0.07, 0.19, 0.20)

Min-Hash Sketches: bottom-k
Use a single hash function h; track the k smallest values y1 < y2 < ... < yk.
Processing a new element x: if h(x) < yk: (y1, ..., yk) ← sort{y1, ..., y(k-1), h(x)}.
Computation: the sketch (y1, ..., yk) is maintained as a sorted list or as a priority queue:
  O(1) to test if an update is needed;
  O(k) to update a sorted list;
  O(log k) to update a priority queue.
We will see that #changes ≪ #distinct elements.

Min-Hash Sketches: Number of updates
Claim: the expected number of actual updates (changes) of the min-hash sketch is O(k ln n).
Proof: first consider k = 1. Look at the distinct elements in the order they first occur. The i-th distinct element has a lower hash value than the current minimum with probability 1/i: this is the probability of being first in a random permutation of i elements.
⟹ The total expected number of updates is Σ_{i=1..n} 1/i = H_n ≤ 1 + ln n.
  Stream:        32   12   14   32   7    12   32   7    6    12   4
  Update prob.:  1    1/2  1/3  0    1/4  0    0    0    1/5  0    1/6
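As a quick check of the k = 1 update-count claim, we can replay the example stream with the hash values from the slides, count how often the minimum actually changes, and compare against the harmonic-number expectation:

```python
from fractions import Fraction

# Hash values h(x) from the slides' k = 1 example.
h = {32: 0.45, 12: 0.35, 14: 0.74, 7: 0.21, 6: 0.14, 4: 0.92}
stream = [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]

y, updates = 1.0, 0
for x in stream:
    if h[x] < y:  # the minimum actually changes
        y, updates = h[x], updates + 1

# The minimum changed at 0.45, 0.35, 0.21, 0.14: four updates.
assert updates == 4

# Expected number of updates: sum of the update probabilities
# 1, 1/2, ..., 1/n over the n = 6 distinct elements (H_6 = 49/20).
H6 = sum(Fraction(1, i) for i in range(1, 7))
assert H6 == Fraction(49, 20)
```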
Min-Hash Sketches: Number of updates (continued)
Claim: the expected number of actual updates (changes) of the min-hash sketch is O(k ln n).
Proof (continued): recap for k = 1 (a single min-hash value): the i-th distinct element causes an update with probability 1/i ⟹ the expected total is Σ_{i=1..n} 1/i ≤ 1 + ln n.
- k-mins: k min-hash values, so apply the k = 1 bound k times.
- Bottom-k: we keep the k smallest values, so the update probability of the i-th distinct element is min{1, k/i} (the probability of being among the first k in a random permutation of i elements).
- k-partition: k min-hash values, each over ≈ n/k distinct values.

Merging Min-Hash Sketches
We apply the same set of hash functions to all elements/data sets/streams.
The union sketch from sketches of two sets N', N'':
- k-mins: take the minimum per hash function: yi ← min{y'i, y''i}.
- k-partition: take the minimum per part: yi ← min{y'i, y''i}.
- Bottom-k: the k smallest values in the union must be among the k smallest of their own set: {y1, ..., yk} = bottom-k{y'1, ..., y'k, y''1, ..., y''k}.

Using Min-Hash Sketches
Recap: we defined Min-Hash sketches (3 types); adding elements and merging Min-Hash sketches; some properties of these sketches.
Next: we put Min-Hash sketches to work - estimating the distinct count from a Min-Hash sketch, with tools from estimation theory.

The Exponential Distribution Exp(λ)
PDF: λe^{-λy}, y ≥ 0; CDF: 1 - e^{-λy}; E[y] = σ[y] = 1/λ.
Very useful properties:
- Memorylessness: ∀s, t ≥ 0: Pr[y > s + t | y > s] = Pr[y > t].
- Min-to-Sum conversion: min{Exp(λ1), ..., Exp(λk)} ∼ Exp(λ1 + ... + λk).
- Relation with uniform: u ∼ U[0,1] ⇔ -ln(1-u)/λ ∼ Exp(λ); equivalently, y ∼ Exp(λ) ⇔ 1 - e^{-λy} ∼ U[0,1].

Estimating Distinct Count from a Min-Hash Sketch: k-mins
- Change to an exponential distribution: h(x) ∼ Exp(1).
- By the Min-to-Sum property, each minimum yi ∼ Exp(n).
  In fact, we can just work with h ∼ U[0,1] and use y ← -ln(1-y) when estimating.
- The number of distinct elements becomes a parameter estimation problem: given k independent samples from Exp(n), estimate n.

Each yi ∼ Exp(n) has expectation 1/n and variance 1/n². The average ȳ = (1/k) Σ_{i=1..k} yi has expectation μ = 1/n and variance σ² = 1/(kn²).
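The merge rules above are one-liners. A minimal sketch (the salted-hash helper and builder functions are illustrative, mirroring the earlier definitions) verifying that merging two sketches gives exactly the sketch of the union stream:

```python
import hashlib

def h(x, salt):
    """Deterministic stand-in for a random hash into [0,1)."""
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def k_mins(stream, k):
    ys = [1.0] * k
    for x in stream:
        for i in range(k):
            ys[i] = min(ys[i], h(x, f"h{i}"))
    return ys

def bottom_k(stream, k):
    return sorted({h(x, "h") for x in stream})[:k]

def merge_k_mins(a, b):
    # Coordinate-wise minimum; the same rule merges k-partition sketches.
    return [min(u, v) for u, v in zip(a, b)]

def merge_bottom_k(a, b, k):
    # The k smallest of the union are among each side's k smallest.
    return sorted(set(a) | set(b))[:k]

s1 = [32, 12, 14, 32]
s2 = [7, 12, 32, 7, 6, 12, 4]
assert merge_k_mins(k_mins(s1, 3), k_mins(s2, 3)) == k_mins(s1 + s2, 3)
assert merge_bottom_k(bottom_k(s1, 3), bottom_k(s2, 3), 3) == bottom_k(s1 + s2, 3)
```

Crucially, this only works because the same hash functions are applied to both streams.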
The cv (coefficient of variation) is σ/μ = 1/√k.
ȳ is a good unbiased estimator, but for 1/n, which is the inverse of what we want. What about estimating n?

Estimating Distinct Count from a Min-Hash Sketch: k-mins
What about estimating n?
1) We can use the biased estimator n̂ = 1/ȳ = k / Σ_{i=1..k} yi.
   To say something useful about the estimate quality, we apply Chebyshev's inequality to bound the probability that ȳ is far from its expectation 1/n, and thus that 1/ȳ is far from n.
2) Maximum likelihood estimation (a general and powerful technique).

Chebyshev's Inequality
For any random variable X with expectation μ and standard deviation σ, and any c ≥ 1:
  Pr[|X - μ| ≥ cσ] ≤ 1/c².
For ȳ we have μ = 1/n and σ = 1/(√k n). Using c = (ε/2)√k:
  Pr[|ȳ - 1/n| ≥ (ε/2)(1/n)] ≤ 4/(ε²k).

Using Chebyshev's Inequality
For 0 < ε < 1, using 1/(1+ε/2) ≥ 1-ε and 1/(1-ε/2) ≤ 1+ε:
  Pr[|1/ȳ - n| ≥ εn]
    = 1 - Pr[(1-ε)n ≤ 1/ȳ ≤ (1+ε)n]
    = 1 - Pr[1/((1+ε)n) ≤ ȳ ≤ 1/((1-ε)n)]
    ≤ 1 - Pr[(1-ε/2)(1/n) ≤ ȳ ≤ (1+ε/2)(1/n)]
    = Pr[|ȳ - 1/n| > (ε/2)(1/n)] ≤ 4/(ε²k).

Maximum Likelihood Estimation
A set of k independent samples yi ∼ Fi(θ); we do not know θ.
The MLE θ̂ is the value of θ that maximizes the likelihood (joint density) function f(y; θ): the maximum over θ of the "probability" of observing {yi}.
Properties:
- A principled way of deriving estimators.
- Converges in probability to the true value (with enough i.i.d. samples), but is generally biased.
- (Asymptotically!) optimal: minimizes the MSE (mean square error), meets the Cramér-Rao lower bound.

Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
Given k independent samples y1, ..., yk from Exp(n), estimate n.
Likelihood function (joint density):
  f(y; n) = Π_{i=1..k} n e^{-n yi} = n^k e^{-n Σ yi}
Take a logarithm (does not change the maximum):
  ℓ(y1, ..., yk; n) = ln f(y; n) = k ln n - n Σ_{i=1..k} yi
Differentiate to find the maximum:
  ∂ℓ(y; n)/∂n = k/n - Σ_{i=1..k} yi = 0  ⟹  MLE estimate n̂ = k / Σ_{i=1..k} yi
We get the same estimator; it depends only on the sum!

Given k independent samples from Exp(n), we can think of several ways to combine the samples and decrease the variance:
- average (sum)
- median
- remove outliers and average the remaining, ...
We want to get the most value (best estimate) from the information we have (the sketch).
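A Monte Carlo sanity check of the k-mins MLE n̂ = k/Σyi (a simulation sketch: rather than hashing elements, it draws each sketch coordinate directly as an Exp(n) variate, which is what the Min-to-Sum property guarantees; parameters are illustrative):

```python
import random

random.seed(42)

def kmins_sketch(n, k):
    """Simulate a k-mins sketch of n distinct elements with Exp(1) hashes:
    each coordinate is the min of n Exp(1) values, i.e. Exp(n)."""
    return [random.expovariate(n) for _ in range(k)]

def mle_estimate(ys):
    """k-mins MLE: n-hat = k / sum(y_i)."""
    return len(ys) / sum(ys)

n, k = 10_000, 1024
est = mle_estimate(kmins_sketch(n, k))
# The cv is about 1/sqrt(k), roughly 3% here, so a 15% band is comfortable.
assert abs(est - n) / n < 0.15
```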
What combinations should we consider?

Sufficient Statistic
A function T(y) = T(y1, ..., yk) is a sufficient statistic for estimating some function of the parameter θ if the likelihood function has the factored form
  f(y; θ) = g(y) · h(T(y); θ).
Likelihood function (joint density) for k i.i.d. exponential random variables from Exp(n):
  f(y; n) = Π_{i=1..k} n e^{-n yi} = n^k e^{-n Σ yi}
⟹ The sum T(y) = Σ_{i=1..k} yi is a sufficient statistic for n.

In particular, the MLE depends on y only through T(y): the maximum with respect to θ does not depend on the factor g(y), and the maximum of h(T(y); θ), computed by differentiating with respect to θ, is a function of T(y).

Lemma: T is sufficient ⟺ the conditional distribution of y given T(y) does not depend on θ.
(If we fix T(y), the density is f(y; θ) ∝ g(y); if we know the density up to a fixed factor, it is determined completely by normalizing to 1.)

Rao-Blackwell Theorem
Recap: T(y) is a sufficient statistic for θ ⟺ the conditional distribution of y given T(y) does not depend on θ.
Rao-Blackwell Theorem: given an estimator n̂(y) of θ that is not a function of the sufficient statistic T(y), we can get an estimator with at most the same MSE that depends only on T(y):
  n̂'(y) = E[n̂(y) | T(y)]
E[n̂(y) | T(y)] does not depend on θ (critical - otherwise it would not be an estimator).
The process is called Rao-Blackwellization of n̂(y).

[Figure in the slides: the density f(y1, y2; θ) shown on the nine points (1,3), (2,2), (4,0), (1,2), (3,1), (2,1), (3,2), (3,0), (1,4); the points are then grouped by the sufficient statistic T(y1, y2) = y1 + y2, giving the conditional distribution f(y1, y2; θ | y1 + y2).]
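The slides' nine-point illustration can be reproduced directly. Taking the nine points as equally likely, the (deliberately wasteful) estimator n̂(y1, y2) = y2 from the slides, and averaging it within each level of T(y1, y2) = y1 + y2:

```python
from collections import defaultdict
from statistics import mean

# The nine equally likely points (y1, y2) from the slides' illustration,
# with the estimator n-hat(y1, y2) = y2.
points = [(1, 3), (2, 2), (4, 0), (1, 2), (3, 1), (2, 1), (3, 2), (3, 0), (1, 4)]
est = {p: p[1] for p in points}

# Rao-Blackwellize: average the estimator over points sharing the
# sufficient statistic T(y1, y2) = y1 + y2.
groups = defaultdict(list)
for p in points:
    groups[sum(p)].append(p)
rb = {p: mean(est[q] for q in groups[sum(p)]) for p in points}

# Averages per level of T: sum 3 -> 1, sum 4 -> 1.5, sum 5 -> 3.
assert rb[(1, 2)] == 1 and rb[(2, 2)] == 1.5 and rb[(1, 4)] == 3
```

The expectation is preserved (law of total expectation) while the MSE against any fixed true value can only decrease.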
Rao-Blackwell Theorem (illustration continued)
Take the estimator n̂(y1, y2) = y2 on the nine equally likely points:
  point: (1,3) (4,0) (2,2) (1,2) (3,1) (2,1) (3,0) (3,2) (1,4)
  n̂:     3     0     2     2     1     1     0     2     4
Rao-Blackwellize with T(y1, y2) = y1 + y2, i.e. n̂' = E[n̂(y1, y2) | y1 + y2]:
  sum 3: points (1,2), (2,1), (3,0) → n̂' = (2+1+0)/3 = 1
  sum 4: points (1,3), (4,0), (2,2), (3,1) → n̂' = (3+0+2+1)/4 = 1.5
  sum 5: points (3,2), (1,4) → n̂' = (2+4)/2 = 3
By the law of total expectation, E[n̂'] = E[n̂]: the expectation (bias) remains the same.
The MSE (mean square error) can only decrease: MSE[n̂'] ≤ MSE[n̂].

Why does the MSE decrease? Suppose we have two points with equal probabilities, and an estimator of n that gives estimates a and b on these points. We replace it by an estimator that instead returns the average (a+b)/2 on both. The (scaled) contribution of these two points to the square error changes from (a-n)² + (b-n)² to 2((a+b)/2 - n)².
Exercise: show that (a-n)² + (b-n)² ≥ 2((a+b)/2 - n)².

Sufficient Statistic for estimating n from k-mins sketches
Given k independent samples from Exp(n):
  f(y; n) = Π_{i=1..k} n e^{-n yi} = n^k e^{-n Σ yi}
The sum Σ_{i=1..k} yi is a sufficient statistic for estimating any function of n (including n, 1/n, n²).
Rao-Blackwell ⇒ we cannot gain by using estimators with a different dependence on {yi} (e.g. functions of the median or of a smaller sum).

Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE
MLE estimate: n̂ = k / Σ_{i=1..k} yi.
S = Σ_{i=1..k} yi, the sum of k i.i.d. Exp(n) random variables, has the (Erlang) PDF
  f_{n,k}(s) = n e^{-ns} (ns)^{k-1} / (k-1)!
The expectation of the MLE estimate is
  E[k/S] = ∫_0^∞ (k/s) f_{n,k}(s) ds = (k/(k-1)) n,
so the MLE overestimates n by a factor of k/(k-1).

Estimating Distinct Count from a Min-Hash Sketch: k-mins
Unbiased estimator: n̂ = (k-1) / Σ_{i=1..k} yi   (for k > 1).
The variance of the unbiased estimator is
  Var[n̂] = ∫_0^∞ ((k-1)/s)² f_{n,k}(s) ds - n² = n²/(k-2).
The cv is √Var[n̂]/n = 1/√(k-2).
Is this the best we can do? Are we using the information in the sketch in the best possible way?

Cramér-Rao lower bound (CRLB)
An information-theoretic lower bound on the variance of any unbiased estimator of θ.
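The bias factor k/(k-1) of the MLE k/S, and the unbiased fix (k-1)/S, can be checked numerically (a simulation sketch over the sufficient statistic S; the sample sizes are illustrative):

```python
import random

random.seed(1)

def sum_of_exp(n, k):
    """The sufficient statistic S: sum of k i.i.d. Exp(n) variates."""
    return sum(random.expovariate(n) for _ in range(k))

n, k, trials = 100, 10, 20_000
sums = [sum_of_exp(n, k) for _ in range(trials)]

mle_mean = sum(k / s for s in sums) / trials        # biased: about n*k/(k-1)
unb_mean = sum((k - 1) / s for s in sums) / trials  # about n

assert abs(unb_mean - n) / n < 0.03
assert abs(mle_mean - n * k / (k - 1)) / n < 0.03
```

With k = 10 the MLE overshoots by about 11% (factor 10/9), which the (k-1)/S correction removes.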
Likelihood function: f(y; θ). Log likelihood: ℓ(y; θ) = ln f(y; θ).
Fisher information: I(θ) = -E[∂²ℓ(y; θ)/∂θ²].
CRLB: any unbiased estimator θ̂ has Var[θ̂] ≥ 1/I(θ).

CRLB for estimating n
Likelihood function for y1, ..., yk:
  f(y; n) = Π_{i=1..k} n e^{-n yi} = n^k e^{-n Σ yi}
Log likelihood: ℓ(y; n) = k ln n - n Σ_{i=1..k} yi.
Negated second derivative: -∂²ℓ(y; n)/∂n² = k/n².
Fisher information: I(n) = -E[∂²ℓ(y; n)/∂n²] = k/n².
CRLB: Var[n̂] ≥ 1/I(n) = n²/k, i.e. cv ≥ 1/√k.

Estimating Distinct Count from a Min-Hash Sketch: k-mins
Our unbiased estimator n̂ = (k-1)/Σ yi (for k > 1) has cv 1/√(k-2).
The Cramér-Rao lower bound on the cv is 1/√k.
⇒ We are using the information in the sketch nearly optimally!

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
Bottom-k sketch: y1 < y2 < ... < yk. Can we specify the distribution? Use exponential hashes; the k = 1 case is the same as k-mins:
- y1 ∼ Exp(n).
- The minimum y2 of the remaining n-1 elements is Exp(n-1) conditioned on y2 > y1. Since Exp is memoryless, y2 - y1 ∼ Exp(n-1).
- More generally, y_{i+1} - y_i ∼ Exp(n-i).
What is the relation with k-mins sketches?

Bottom-k versus k-mins sketches
Bottom-k sketch: samples from Exp(n), Exp(n-1), ..., Exp(n-k+1) (as successive gaps).
k-mins sketch: k samples from Exp(n).
To obtain y ∼ Exp(n) from y' ∼ Exp(n-i), without knowing n, we can take y = min{y', z} where z ∼ Exp(i) (Min-to-Sum).
⟹ We can use k-mins estimators with bottom-k sketches, and can do even better by taking the expectation over the choices of z.
Bottom-k sketches carry strictly more information than k-mins sketches!

Estimating Distinct Count from a Min-Hash Sketch: Bottom-k
Likelihood function of y1, ..., yk (with y0 = 0):
  f(y; n) = Π_{i=1..k} (n+1-i) e^{-(n+1-i)(y_i - y_{i-1})}
          = e^{-Σ_{i=1..k-1} yi}                 (does not depend on n)
            × [n!/(n-k)!] e^{-(n-k+1) y_k}       (depends on n)
What does estimation theory tell us?
⟹ y_k (the maximum value in the sketch) is a sufficient statistic for estimating n (or any function of n). It captures everything we can glean from the bottom-k sketch about n.

Bottom-k: MLE for Distinct Count
Likelihood function (probability density):
  f(y; n) = e^{-Σ_{i=1..k-1} yi} · [n!/(n-k)!] e^{-(n-k+1) y_k}
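The bottom-k MLE equation Σ_{i=0..k-1} 1/(n-i) = y_k (derived just below) has no closed form, but its left side is strictly decreasing in n, so a simple bisection solves it (a sketch treating n as continuous; function names are illustrative):

```python
def harmonic_tail(n, k):
    """Left side of the MLE equation: sum_{i=0}^{k-1} 1/(n-i)."""
    return sum(1.0 / (n - i) for i in range(k))

def bottomk_mle(yk, k, hi=1e12):
    """Solve sum_{i=0}^{k-1} 1/(n-i) = yk for n by bisection.
    The left side decreases from +inf (as n -> k-1) to 0 (as n -> inf)."""
    lo = k - 1 + 1e-9
    for _ in range(200):
        mid = (lo + hi) / 2
        if harmonic_tail(mid, k) > yk:
            lo = mid  # left side too big: n must be larger
        else:
            hi = mid
    return (lo + hi) / 2

# For k = 1 the equation is 1/n = y1, so the solution is 1/y1.
assert abs(bottomk_mle(0.01, 1) - 100.0) < 1e-6
```

For k > 1 and small y_k the solution is close to k/y_k, consistent with the k-mins picture when n ≫ k.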
Find the value of n which maximizes f(y; n): look only at the part that depends on n, and take the logarithm (same maximum):
  ℓ(y; n) = Σ_{i=0..k-1} ln(n-i) - (n-k+1) y_k

Bottom-k: MLE for Distinct Count
We look for the n which maximizes ℓ(y; n) = Σ_{i=0..k-1} ln(n-i) - (n-k+1) y_k:
  ∂ℓ(y; n)/∂n = Σ_{i=0..k-1} 1/(n-i) - y_k
The MLE is the solution of Σ_{i=0..k-1} 1/(n-i) = y_k. (Needs to be solved numerically.)

Summary: k-mins count estimators
k-mins sketch with U[0,1] dist: y'1, ..., y'k; with Exp dist: yi = -ln(1 - y'i).
- Sufficient statistic for (any function of) n: Σ_{i=1..k} yi.
- MLE/unbiased estimator for 1/n: (1/k) Σ yi; cv: 1/√k; CRLB: 1/√k.
- MLE for n: k / Σ yi.
- Unbiased estimator for n: (k-1) / Σ yi; cv: 1/√(k-2); CRLB: 1/√k.

Summary: bottom-k count estimators
Bottom-k sketch with U[0,1] dist: y'1 < ... < y'k; with Exp dist: yi = -ln(1 - y'i).
- Sufficient statistic for (any function of) n: y_k.
- Contains strictly more information than k-mins; when n ≫ k, approximately the same as k-mins.
- MLE for n is the solution of Σ_{i=0..k-1} 1/(n-i) = y_k.

Bibliography
- [Flajolet-Martin 85] P. Flajolet and G. N. Martin, "Probabilistic counting algorithms for data base applications", JCSS 31(2), 1985.
- [C 94] E. Cohen, "Size-estimation framework with applications to transitive closure and reachability", JCSS 55 (1997); preliminary version 1994.

Next (see lecture 3): we will continue with Min-Hash sketches - use as random samples, applications to similarity, and inverse-probability based distinct count estimators.