Leveraging Big Data: Lecture 3
http://www.cohenwang.com/edith/bigdataclass2013
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

Overview: More on Min-Hash Sketches
- Subset/Selection size queries from random samples
- Min-hash sketches as samples
- Other uses of the sampling “view” of sketches:
  - Sketch-based similarity estimation
  - Inverse-probability distinct count estimators
- Min-hash sketches on a small range (fewer bits)

How samples are useful
We often want to know more than just the number of distinct elements:
- How many distinct search queries (or distinct query/location pairs)...
  - involve the recent election?
  - are related to flu?
  - reflect financial uncertainty?
- How many distinct IP flows going through our network...
  - use a particular protocol?
  - originate from a particular location?
Such subset queries are specified by a predicate. They can be answered approximately from the sample.

Min-hash Sketches as Random Samples
- A min-hash sketch as a random sample: a distinct element $x$ is sampled if it “contributed” to the sketch, that is, $s(N \setminus \{x\}) \neq s(N)$.
- To facilitate subset queries, we need to retain meta-data/IDs of sampled elements.
- Min-hash samples can be efficiently computed over data streams and over distributed data (using mergeability).

K-mins sketch as a sample
k-mins, $k = 3$

x        32    12    14    7     6     4
h_1(x)   0.45  0.35  0.74  0.21  0.14  0.92
h_2(x)   0.19  0.51  0.07  0.70  0.55  0.20
h_3(x)   0.10  0.71  0.93  0.50  0.89  0.18

$(y_1, y_2, y_3) = (0.14, 0.07, 0.10)$
k-mins sample: (6, 14, 32)
Sampling scheme: $k$ times with replacement.

k-partition sketch as a sample
k-partition, $k = 3$

x                   32    12    14    7     6     4
i(x) (part-hash)    2     3     1     1     2     3
h(x) (value-hash)   0.19  0.51  0.07  0.70  0.55  0.20

$(y_1, y_2, y_3) = (0.07, 0.19, 0.20)$
k-partition sample: (14, 32, 4)
Sampling scheme: throw elements into $k$ buckets, then choose one uniformly from each nonempty bucket.

Bottom-k sketch as a sample
Bottom-k, $k = 3$

x      32    12    14    7     6     4
h(x)   0.19  0.51  0.07  0.70  0.55  0.20

$\{y_1, y_2, y_3\} = \{0.07, 0.19, 0.20\}$
Bottom-k sample: {14, 32, 4}
Sampling scheme: choose $k$ without replacement.

Selection/Subset queries from min-hash samples
- $k' \le k$ distinct elements are sampled.
- The sample is exchangeable (fixing the sample size, all subsets are equally likely).
- When $n \gg k$ all three schemes are similar.
- Let $P$ be the subset of elements satisfying our selection predicate.
- We want to estimate the number $|P \cap N|$ of distinct elements satisfying the predicate, or their fraction $\alpha \equiv \frac{|P \cap N|}{|N|}$.

Subset queries: k-mins samples
One uniform sample $x \in N$ has probability $\alpha$ to be from $P$. Its “presence” $I_{x \in P}$ is 1 with probability $\alpha$ and 0 with probability $1 - \alpha$.
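The expectation and variance of $I_{x \in P}$ are
$\mu = \alpha \cdot 1 + (1-\alpha) \cdot 0 = \alpha$
$\sigma^2 = \alpha \cdot 1^2 + (1-\alpha) \cdot 0^2 - \mu^2 = \alpha(1-\alpha)$
Our estimator for a k-mins sample $(x_1, \ldots, x_k)$ ($k$ times with replacement) is:
$\hat{\alpha} = \frac{1}{k} \sum_{i=1}^{k} I_{x_i \in P}$
Expectation: $\mu(\hat{\alpha}) = \frac{1}{k} \cdot k\mu = \alpha$
Variance: $\sigma^2(\hat{\alpha}) = \frac{1}{k^2} \cdot k\sigma^2 = \frac{\alpha(1-\alpha)}{k}$

Subset queries: bottom-k and k-partition samples
Sampling is without replacement:
- Exactly $k' = k$ times with bottom-k
- $1 \le k' \le k$ times with k-partition ($k'$ is the number of nonempty “buckets” when tossing $n$ balls into $k$ buckets)
We use the estimator: $\hat{\alpha} = \frac{|P \cap S|}{|S|} = \frac{|P \cap S|}{k'}$
The expectation is: $E[\hat{\alpha}] = \frac{|P \cap N|}{|N|} \equiv \alpha$
The variance (conditioned on $k'$) is: $\sigma^2(\hat{\alpha}) = \frac{\alpha(1-\alpha)}{k'} \left(1 - \frac{k'-1}{n-1}\right)$
We show both below.

Expectation of $\hat{\alpha}$ (k-partition and bottom-k)
We condition on the number of sampled (distinct) elements $k' \ge 1$.
Consider the “positions” $i = 1, \ldots, k'$ in the sample and their “contributions” $T_i$ to $\hat{\alpha}$. We have $\hat{\alpha} = \sum_{i=1}^{k'} T_i$.
If position $i$ gets an element $x \in P \cap N$ (probability $\alpha$), then $T_i = \frac{1}{k'}$. Otherwise, $T_i = 0$. Therefore,
$E[T_i] = \frac{\alpha}{k'}$
$\mathrm{Var}[T_i] = \frac{\alpha}{k'^2} - \frac{\alpha^2}{k'^2} = \frac{\alpha(1-\alpha)}{k'^2}$
From linearity of expectation: $E[\hat{\alpha}] = \sum_{i=1}^{k'} E[T_i] = k' \cdot \frac{\alpha}{k'} = \alpha$
k-partition: since this is the expectation for every possible $k'$, it is also the expectation overall.

Variance of $\hat{\alpha}$ (k-partition and bottom-k)
Conditioned on $k' \ge 1$:
$\mathrm{Var}[\hat{\alpha}] = \sum_{i,j \in \{1, \ldots, k'\}} \mathrm{Cov}[T_i, T_j]$
For $i \ne j$:
$\mathrm{Cov}[T_i, T_j] = E[T_i T_j] - \frac{\alpha^2}{k'^2} = \alpha \cdot \frac{\alpha n - 1}{n-1} \cdot \frac{1}{k'^2} - \frac{\alpha^2}{k'^2} = -\frac{\alpha(1-\alpha)}{(n-1)k'^2}$
For $i = j$:
$\mathrm{Cov}[T_i, T_i] = \mathrm{Var}[T_i] = \frac{\alpha(1-\alpha)}{k'^2}$
Summing the $k'$ diagonal terms and the $k'(k'-1)$ off-diagonal terms:
$\mathrm{Var}[\hat{\alpha}] = k' \cdot \frac{\alpha(1-\alpha)}{k'^2} - k'(k'-1) \cdot \frac{\alpha(1-\alpha)}{(n-1)k'^2} = \frac{\alpha(1-\alpha)}{k'} \left(1 - \frac{k'-1}{n-1}\right)$

Subset estimation: Summary
For any predicate, we obtain an unbiased estimator $\hat{\alpha}$ of the fraction $\alpha = \frac{|P \cap N|}{|N|} \in [0,1]$ with standard deviation $\sigma \le \frac{1}{2\sqrt{k'}}$ (since $\alpha(1-\alpha) \le \frac{1}{4}$), where $k'$ is the sample size ($k' = k$ for k-mins and bottom-k).
- More accurate when $\alpha$ is close to 0 or to 1
- With bottom-k, more accurate when $n = O(k)$

Next: Sketch-based similarity estimation
- Applications of similarity
- Modeling using features
- Scalability using sketches
- Terms and shingling technique for text documents
- Jaccard and cosine similarity
- Sketch-based similarity estimators

Search example
User issues a query (over images, movies, text documents, Webpages).
Search engine finds many matching documents: Doc 1, Doc 1′, Doc 1′′, Doc 2, Doc 2′, Doc 2′′, Doc 3, Doc 3′, Doc 3′′

Elimination of near duplicates
A lot of redundant information: many documents are very similar.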
We want to eliminate near-duplicates among Doc 1, Doc 1′, Doc 1′′, Doc 2, Doc 2′, Doc 2′′, Doc 3, Doc 3′, Doc 3′′, keeping one representative of each group: Doc 1, Doc 2′, Doc 3.
Return to the human user a concise, informative result: Doc 1, Doc 2′, Doc 3.

Identifying similar documents in a collection of documents (text/images)
Why is similarity interesting?
- Search (query is also treated as a “document”)
- Find text documents on a similar topic
- Face recognition
- Labeling documents (collection of images, only some are labeled, extend label from similarity)
- ...

Identifying near-duplicates (very similar documents)
Why do we want to find near-duplicates?
- Plagiarism
- Copyright violations
- Clean up search results
Why do we find many near-duplicates?
- Mirror pages
- Variations on the same source
Exact match is easy: use hash/signature.

Document Similarity
Modeling:
- Identify a set of features for our similarity application.
- Similar documents should have similar features: similarity is captured by the similarity of the feature sets/vectors (use a similarity measure).
- Analyse each document to extract the set of relevant features.
Sketch-based similarity (making it scalable):
- Sketch the set of features of each document such that similar sets imply similar sketches.
- Estimate similarity of two feature sets from the similarity of the two sketches.
Doc 1 → (0,0,1,0,1,1,0,…) → Sketch 1
Doc 2 → (1,0,1,1,1,1,0,…) → Sketch 2

Similarity of text documents
What is a good set of features?
Approach: Features = words (terms)
- View each document as a bag of words.
- Similar documents have similar bags.
This works well (with TF/IDF weighting…) to detect documents on a similar topic. It is not geared toward detecting near-duplicates.
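The bag-of-words view can be illustrated with a minimal sketch, assuming lowercasing and whitespace tokenization (both are illustrative choices, not part of the lecture; `bag_of_words` is a hypothetical helper name). It shows that a bag discards word order: two different sentences over the same words get identical bags, which is consistent with bags capturing topic rather than near-identical text.

```python
from collections import Counter

def bag_of_words(text):
    """Multiset of lowercase, whitespace-separated tokens (assumed tokenization)."""
    return Counter(text.lower().split())

doc_a = "the cat chased the dog"
doc_b = "the dog chased the cat"

# Different sentences, identical bags: word order is not a feature here.
same_bag = bag_of_words(doc_a) == bag_of_words(doc_b)
print(same_bag)  # True
```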
Shingling technique for text documents (Web pages) [Broder 97]
For a parameter $t$:
- Each feature corresponds to a $t$-gram (shingle): an ordered set of $t$ “tokens” (words).
- Very similar documents have similar sets of features (even if sentences are shifted or replicated).
All 3-shingles in the title “Shingling technique for text documents Web pages”:
- Shingling technique for
- technique for text
- for text documents
- text documents Web
- documents Web pages

Similarity measures
We measure similarity of two documents by the similarity of their feature sets/vectors.
Comment: we will focus on sets/binary vectors today. In general, we sometimes want to associate “weights” with presence of features in a document.
Two popular measures are:
- The Jaccard coefficient
- Cosine similarity

Jaccard Similarity
A common similarity measure of two sets: features $N_1$ of document 1 and features $N_2$ of document 2.
Ratio of size of intersection to size of union:
$J(N_1, N_2) = \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$
Example: $J = \frac{3}{8} = 0.375$

Comment: Weighted Jaccard
Similarity of weighted (nonnegative) vectors: sum of min over sum of max.
$J(V, U) = \frac{\sum_i \min\{V_i, U_i\}}{\sum_i \max\{V_i, U_i\}}$
$V$   = (0.00, 0.23, 0.00, 0.00, 0.03, 0.00, 1.00, 0.13)
$U$   = (0.34, 0.21, 0.00, 0.03, 0.05, 0.00, 1.00, 0.00)
min = (0.00, 0.21, 0.00, 0.00, 0.03, 0.00, 1.00, 0.00)
max = (0.34, 0.23, 0.00, 0.03, 0.05, 0.00, 1.00, 0.13)
$J(V, U) = \frac{1.24}{1.78}$

Cosine Similarity
Similarity measure between two vectors: the cosine of the angle between the two vectors.
$C(U, V) = \frac{U \cdot V}{\|U\|_2 \|V\|_2}$
Euclidean norm: $\|V\|_2 = \sqrt{\sum_i V_i^2}$

Cosine Similarity (binary)
View each set $N' \subset N$ as a vector $v(N')$ with an entry for each element in the domain $N$:
$i \in N' \Leftrightarrow v_i(N') = 1$
$i \notin N' \Leftrightarrow v_i(N') = 0$
Cosine similarity between $N_1$ and $N_2$:
$C(N_1, N_2) = \frac{v(N_1) \cdot v(N_2)}{\|v(N_1)\|_2 \|v(N_2)\|_2} = \frac{|N_1 \cap N_2|}{\sqrt{|N_1| |N_2|}}$
Example: $C = \frac{3}{\sqrt{5 \cdot 6}} \approx 0.55$

Estimating Similarity of sets using their Min-Hash sketches
We sketch all sets using the same hash functions. There is a special relation between the sketches: we say the sketches are “coordinated”.
Coordination is what allows the sketches to be mergeable. If we had used different hash functions for each set, the sketches would not have been mergeable.
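The role of coordination in mergeability can be checked with a small sketch. It assumes bottom-k sketches stored as sorted lists of hash values and uses SHA-256 as a stand-in for a random hash function; the names `h`, `bottom_k`, and `merge`, the `seed` parameter, and the example sets are all hypothetical choices for illustration.

```python
import hashlib

def h(x, seed=0):
    """Map an element to a pseudo-uniform value in [0, 1]; `seed` picks the hash function."""
    digest = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def bottom_k(elements, k, seed=0):
    """Bottom-k sketch: the k smallest hash values over the distinct elements."""
    return sorted({h(x, seed) for x in elements})[:k]

def merge(s1, s2, k):
    """Merge two bottom-k sketches: the k smallest values in their union."""
    return sorted(set(s1) | set(s2))[:k]

k = 8
n1 = set(range(0, 60))
n2 = set(range(40, 100))

# Coordinated sketches (same hash function): the merge equals the sketch of the union.
merged = merge(bottom_k(n1, k), bottom_k(n2, k), k)
direct = bottom_k(n1 | n2, k)
print(merged == direct)  # True

# Uncoordinated sketches (different hash functions): the merged values are not
# the sketch of the union under either hash function, except by rare coincidence.
mixed = merge(bottom_k(n1, k, seed=1), bottom_k(n2, k, seed=2), k)
print(mixed == bottom_k(n1 | n2, k, seed=1))
```

The first comparison holds for any input sets, because each of the k smallest hash values in the union is also among the k smallest of whichever set contains it.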
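Coordination also implies that similar sets have similar sketches (LSH property). This allows us to obtain good estimates of the similarity of two sets from the similarity of sketches of the sets.

Jaccard Similarity from Min-Hash sketches
$J(N_1, N_2) = \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$
- For each $N_i$ we have a Min-Hash sketch $s(N_i)$ (use the same hash function/s $h$ for all sets).
- Merge $s(N_1)$ and $s(N_2)$ to obtain $s(N_1 \cup N_2)$.
- For each $x \in s(N_1 \cup N_2)$ we know everything about its membership in $N_1$ or $N_2$: $x \in s(N_1 \cup N_2)$ is in $N_i$ if and only if $x \in s(N_i)$.
- In particular, we know if $x \in N_1 \cap N_2$.
- $J$ is the fraction of union members that are intersection members: apply the subset estimator to $s(N_1 \cup N_2)$.

k-mins sketches: Jaccard estimation
$k = 4$
$s(N_1) = (0.22, 0.11, 0.14, 0.22)$
$s(N_2) = (0.18, 0.24, 0.14, 0.35)$
$s(N_1 \cup N_2) = (0.18, 0.11, 0.14, 0.22)$
- $\in N_1 \setminus N_2$: $\frac{2}{4} = \frac{1}{2}$
- $\in N_2 \setminus N_1$: $\frac{1}{4}$
- $\in N_1 \cap N_2$: $\frac{1}{4}$
Can estimate $\alpha = \frac{|N_1 \setminus N_2|}{|N_1 \cup N_2|}, \frac{|N_2 \setminus N_1|}{|N_1 \cup N_2|}, \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$ unbiasedly with $\sigma^2 = \frac{\alpha(1-\alpha)}{k}$.

k-partition sketches: Jaccard estimation
$k = 4$
$s(N_1) = (1.00, 1.00, 0.14, 0.21)$, $k' = 2$
$s(N_2) = (0.18, 1.00, 0.14, 0.35)$, $k' = 3$
$s(N_1 \cup N_2) = (0.18, 1.00, 0.14, 0.21)$, $k' = 3$
- $\in N_1 \setminus N_2$: $\frac{1}{3}$
- $\in N_2 \setminus N_1$: $\frac{1}{3}$
- $\in N_1 \cap N_2$: $\frac{1}{3}$
Can estimate $\alpha = \frac{|N_1 \setminus N_2|}{|N_1 \cup N_2|}, \frac{|N_2 \setminus N_1|}{|N_1 \cup N_2|}, \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$ unbiasedly with $\sigma^2 \le \frac{\alpha(1-\alpha)}{k'}$ (conditioned on $k'$).

Bottom-k sketches: Jaccard estimation
$k = 4$
$s(N_1) = \{0.09, 0.14, 0.18, 0.21\}$
$s(N_2) = \{0.14, 0.17, 0.19, 0.35\}$
Smallest $k = 4$ in the union of sketches: $s(N_1 \cup N_2) = \{0.09, 0.14, 0.17, 0.18\}$
- $\in N_1 \setminus N_2$: $\frac{2}{4}$
- $\in N_2 \setminus N_1$: $\frac{1}{4}$
- $\in N_1 \cap N_2$: $\frac{1}{4}$
Can estimate $\alpha = \frac{|N_1 \setminus N_2|}{|N_1 \cup N_2|}, \frac{|N_2 \setminus N_1|}{|N_1 \cup N_2|}, \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$ unbiasedly with $\sigma^2 = \frac{\alpha(1-\alpha)}{k} \left(1 - \frac{k-1}{n-1}\right)$.

Bottom-k sketches: better estimate
$k = 4$
$s(N_1) = \{0.09, 0.14, 0.18, 0.21\}$
$s(N_2) = \{0.14, 0.17, 0.19, 0.35\}$
$s(N_1 \cup N_2) = \{0.09, 0.14, 0.17, 0.18\}$, followed by 0.19, 0.21: $k' = 6 > 4$
We can look beyond the union sketch: we have complete membership information on all elements with $h(x) \le \min\{\max s(N_1), \max s(N_2)\}$. We have $2k > k' \ge k$ elements!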
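The bottom-k example above can be reproduced with a short sketch. It assumes each sketch is given as a set of hash values from the same hash function with no collisions, so a hash value identifies its element; the function name `membership_fractions` and its `extended` flag are hypothetical.

```python
from fractions import Fraction

def membership_fractions(s1, s2, k, extended=False):
    """Estimate the fractions of N1 ∪ N2 in N1∖N2, N2∖N1, and N1∩N2
    from two coordinated bottom-k sketches given as sets of hash values."""
    union_vals = sorted(set(s1) | set(s2))
    if extended:
        # Membership is fully known for every value up to this threshold.
        tau = min(max(s1), max(s2))
        sample = [v for v in union_vals if v <= tau]
    else:
        # Plain union sketch: the k smallest values overall.
        sample = union_vals[:k]
    only1 = sum(1 for v in sample if v in s1 and v not in s2)
    only2 = sum(1 for v in sample if v in s2 and v not in s1)
    both = sum(1 for v in sample if v in s1 and v in s2)
    size = len(sample)
    return Fraction(only1, size), Fraction(only2, size), Fraction(both, size)

s1 = {0.09, 0.14, 0.18, 0.21}
s2 = {0.14, 0.17, 0.19, 0.35}

basic = membership_fractions(s1, s2, k=4)
better = membership_fractions(s1, s2, k=4, extended=True)
print(*basic)   # 1/2 1/4 1/4
print(*better)  # 1/2 1/3 1/6
```

The third fraction in each result is the Jaccard estimate: 1/4 from the union sketch (4 elements) and 1/6 when looking beyond it (6 elements).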
Bottom-k sketches: better estimate
Using all $k' = 6 > 4$ elements with known membership, $\{0.09, 0.14, 0.17, 0.18, 0.19, 0.21\}$:
- $\in N_1 \setminus N_2$: $\frac{3}{6} = \frac{1}{2}$
- $\in N_2 \setminus N_1$: $\frac{2}{6} = \frac{1}{3}$
- $\in N_1 \cap N_2$: $\frac{1}{6}$
Can estimate $\alpha = \frac{|N_1 \setminus N_2|}{|N_1 \cup N_2|}, \frac{|N_2 \setminus N_1|}{|N_1 \cup N_2|}, \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$ unbiasedly with $\sigma^2 = \frac{\alpha(1-\alpha)}{k'} \left(1 - \frac{k'-1}{n-1}\right)$ (conditioned on $k'$).

Cosine Similarity from Min-Hash sketches: Crude estimator
$C(N_1, N_2) = \frac{|N_1 \cap N_2|}{\sqrt{|N_1| |N_2|}}$ and $J(N_1, N_2) = \frac{|N_1 \cap N_2|}{|N_1 \cup N_2|}$, so
$C(N_1, N_2) = J(N_1, N_2) \cdot \frac{|N_1 \cup N_2|}{\sqrt{|N_1| |N_2|}}$
We have estimates with good relative error (and concentration) for $|N_1 \cup N_2|$, $\frac{1}{|N_1|}$, $\frac{1}{|N_2|}$. Plug them in.

Next: Back to distinct counting
- Inverse-probability distinct count estimators: separately estimate “presence” of each element
- Historic inverse-probability distinct count estimators
- General approach for deriving estimators: for all distributions, all Min-Hash sketch types
- $\frac{1}{2}$ the variance of purely sketch-based estimators

Inverse probability estimators [Horvitz Thompson 1952]
Model: There is a hidden value $x$. It is observed/sampled with probability $p > 0$. We want to estimate $f(x) \ge 0$. If $x$ is sampled we know both $p$ and $x$, and can compute $f(x)$.
Inverse Probability Estimator:
- If $x$ is sampled, $\hat{f} = \frac{f(x)}{p}$
- Else, $\hat{f} = 0$
$\hat{f}$ is unbiased: $E[\hat{f}] = (1-p) \cdot 0 + p \cdot \frac{f(x)}{p} = f(x)$
$\mathrm{Var}[\hat{f}] = E[\hat{f}^2] - f(x)^2 = p \left(\frac{f(x)}{p}\right)^2 - f(x)^2 = f(x)^2 \left(\frac{1}{p} - 1\right)$
Comment: the variance is the minimum possible for an unbiased nonnegative estimator if the domain includes $x$ with $f(x) = 0$.

Inverse-Probability estimate for a sum
We want to estimate the sum: $n = \sum_x f(x)$.
We have a sample $S$ of elements. $f(x) > 0 \Rightarrow p_x > 0$, and we know $f(x)$, $p_x$ when $x \in S$.
We use:
- $\hat{f}_x = \frac{f(x)}{p_x}$ when $x \in S$
- $\hat{f}_x = 0$ otherwise
Sum estimator: $\hat{n} = \sum_x \hat{f}_x = \sum_{x \in S} \hat{f}_x$
Unbiased $\hat{f}_x$ implies unbiased $\hat{n}$. This is important, so that bias does not add up.
For distinct count, $f(x) = I_{x \in N}$ (indicator function).
$p_x$ can be conditioned on a part in some partition of outcomes.
But elements with $f(x) > 0$ must have $p_x > 0$ in all parts (otherwise we get bias).

Bottom-k sketches: Inverse probability estimator
We work with the uniform distribution $h(x) \sim U[0,1]$.
For each distinct element, we consider the probability that it is one of the lowest-hash $k-1$ elements.
For sketch $y_1 < \cdots < y_k$, we say element $x$ is “sampled” $\Leftrightarrow$ for some $i \le k-1$, $y_i = h(x)$.
Caveat: the probability is $p_x = \frac{k-1}{n}$ for all elements, but we do not know $n$. ⇒ Need to use conditioning.

Bottom-k sketches: Inverse probability estimator
We use an inverse probability estimate: if $x$ is not sampled (not one of the $k-1$ smallest-hash elements) the estimate is 0. Otherwise, it is $\frac{1}{p(x)}$.
But we do not know $p$! What can we do?
We compute $p(x)$ conditioned on fixing $h$ on $N \setminus \{x\}$ but taking $h(x) \sim U[0,1]$.
We need to be able to compute $p(x)$ only for “sampled” elements.

Bottom-k sketches: Inverse probability estimator
What is the probability that $x$ is sampled if we fix $h$ on $N \setminus \{x\}$ but take $h(x) \sim U[0,1]$?
$x$ is sampled $\Leftrightarrow h(x) < (k-1)\text{th}\,\{h(z) \mid z \in N \setminus \{x\}\}$
For sampled $x$: $(k-1)\text{th}\,\{h(z) \mid z \in N \setminus \{x\}\} = y_k \Rightarrow p(x) = y_k$
⇒ The inverse probability estimate is $\frac{1}{p(x)} = \frac{1}{y_k}$.
Summing over the $k-1$ “sampled” elements: $\hat{n} = \frac{k-1}{y_k}$

Explaining conditioning in Inverse Probability Estimate for bottom-k
- Probability space on $\{h(z) \mid z \in N \setminus \{x\}\}$.
- Partitioned according to $\tau = (k-1)\text{th}\,\{h(z) \mid z \in N \setminus \{x\}\}$.
- The conditional probability that $x$ is sampled in the part is $\Pr[h(x) < \tau] = \tau$.
- If $x$ is “sampled” in the outcome, we know $\tau$ (it is equal to $y_k$), and the estimate is $\frac{1}{\tau}$. (If $x$ is not sampled then $\tau = y_{k-1} > 0$; this is needed for unbiasedness, but the estimate for $x$ is 0.)

Explaining conditioning in Inverse Probability Estimate for bottom-k
$N = \{a, b, c, d, e\}$, $k = 3$
The probability that $a$ has one of the $k - 1 = 2$ smallest values in $h(a), h(b), \ldots, h(e)$ is $\Pr[a \in S] = \frac{2}{5}$, but we cannot compute it since we do not know $n$ ($= 5$).
The conditional probability $\Pr[a \in S \mid h(b), \ldots, h(e)]$, with $k = 3$, depends only on the part that the outcome $(h(b), \ldots, h(e))$ falls in:
$\Pr[a \in S \mid h(b), \ldots, h(e)] = 0.3$ for outcomes (.1,.3,.5,.6), (.2,.3,.5,.71), (.15,.3,.32,.4)
$\Pr[a \in S \mid h(b), \ldots, h(e)] = 0.2$ for outcomes (.11,.2,.28,.3), (.03,.2,.4,.66), (.1,.2,.7,.8)
$\Pr[a \in S \mid h(b), \ldots, h(e)] = 0.4$ for outcomes (.1,.4,.5,.8), (.12,.4,.45,.84)

Bottom-k sketches: Inverse probability estimators
$\hat{n} = \frac{k-1}{y_k}$
- We obtain an unbiased estimator.
- No need to track element IDs (the sample view is only used for analysis).
- How good is this estimator? We can (do not) show: CV is $\le \frac{1}{\sqrt{k-2}}$, at least as good as the k-mins estimator.

Better distinct count estimators?
- Recap: our estimators (k-mins, bottom-k) have CV $\le \frac{1}{\sqrt{k-2}}$.
- CRLB (k-mins) says CV $\ge \frac{1}{\sqrt{k}}$.
- Can we improve? Also, what about k-partition?
- CRLB applies when we are limited to using only the information in the sketch.
- Idea: use information we discard along the way.

“Historic” Inverse Probability Estimators
We maintain an approximate count together with the sketch: $y_1, \ldots, y_k, c$
- Initially $(y_1, \ldots, y_k) \leftarrow (1, \ldots, 1)$, $c \leftarrow 0$.
- When the sketch is updated, we compute the probability $p$ that a new distinct element would cause an update to the current sketch.
- We increase the counter: $c \leftarrow c + \frac{1}{p}$
- Easy to apply with all min-hash sketches.
- The estimate is unbiased.
- We can (do not) show CV $\le \frac{1}{\sqrt{2k-2}} < \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{k-2}}$.

Maintaining a k-mins “historic” sketch
k-mins sketch: use $k$ “independent” hash functions $h_1, h_2, \ldots, h_k$. Track the respective minimum $y_1, y_2, \ldots, y_k$ for each function.
Update probability: the probability $p$ that for at least one $i = 1, \ldots, k$ we get $h_i(x) < y_i$:
$p = 1 - \prod_{i=1}^{k} (1 - y_i)$
Processing a new element $x$:
  $p \leftarrow 1 - \prod_{i=1}^{k} (1 - y_i)$
  For $i = 1, \ldots, k$: $y_i \leftarrow \min\{y_i, h_i(x)\}$
  If change in $y$: $c \leftarrow c + \frac{1}{p}$

Maintaining a k-partition “historic” sketch
Processing a new element $x$:
  $i \leftarrow$ first $\log_2 k$ bits of $h'(x)$
  $h \leftarrow$ remaining bits of $h'(x)$
  If $h < y_i$:
    $p \leftarrow \frac{1}{k} \sum_{j=1}^{k} y_j$
    $c \leftarrow c + \frac{1}{p}$
    $y_i \leftarrow h$
Update probability: the probability $p$ that $h(x) < y_i$ for part $i$ selected uniformly at random.

Maintaining a bottom-k “historic” sketch
Bottom-k sketch: use a single hash function $h$. Track the $k$ smallest values $y_1 < y_2 < \cdots < y_k$.
Processing a new element $x$:
  If $h(x) < y_k$:
    $c \leftarrow c + \frac{1}{y_k}$
    $(y_1, y_2, \ldots, y_k) \leftarrow \mathrm{sort}\{y_1, y_2, \ldots, y_{k-1}, h(x)\}$
Probability of update is $y_k$.

Summary: Historic distinct estimators
Recap: Maintain sketch and count. CV is $\le \frac{1}{\sqrt{2k-2}} < \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{k-2}}$. Easy to apply.
Trivial to query. Unbiased.
More (we do not show here):
- CV is almost tight for this type of estimator (estimate the presence of each distinct element entering the sketch). ⇒ Can’t do better than $\frac{1}{\sqrt{2k}}$.
- Mergeability: stated for streams. The “sketch” parts are mergeable, but merging the “counts” requires work (which uses the sketch parts).
- Approach: carefully estimate the overlap (say, using similarity estimators).

Next: Working with a small range
So far Min-Hash sketches were stated/analyzed for distributions (random hash functions) with a continuous range.
We explain how to work with a discrete range, how small the representation can be, and how estimators are affected, using back-of-the-envelope calculations.

Working with a small (discrete) range
When implementing min-hash sketches:
- We work with a discrete distribution of the hash range.
- We want to use as few bits as possible to represent the sketch.
Natural discrete distribution: $h(x) = 2^{-i}$ with probability $2^{-i}$.
- Same as using $u \sim U[0,1]$ and retaining only the negated exponent $\lceil \log_2 \frac{1}{u} \rceil$.
- The expectation of the min is about $\frac{1}{n} = 2^{-\log_2 n}$.
- The expected max exponent size (in bits) is $\approx \log_2 \log_2 n$.
Elements sorted by hash, with negated exponent:
  0.1xxxxx   → 1
  0.01xx     → 2
  0.001xx    → 3
  0.0001xx   → 4

Working with a small (discrete) range
- Can also retain a few ($b$) bits beyond the exponent.
- Sketch size is $\approx kb + k \log_2 \log_2 n$.
- Can be reduced further to $\log_2 \log_2 n + O(kb)$ by noting that the exponent parts are very similar, so we can store the minimum only once, plus “offsets”.
- How does this rounding affect the estimators (properties and accuracy)? We do “back-of-the-envelope” calculations.

Working with a small (discrete) range
“Parameter estimation” estimators; similarity estimators
- We need to keep enough bits to ensure distinctness of min-hash values in the same sketch (for similarity, two sketches) with good probability.
- To apply “continuous” estimators, we can take a random completion and apply the estimators.
- k-mins and k-partition: we can separately look at each “coordinate”.
The expected number of elements with the same “minimum” exponent is fixed. (The probability of exponent $-i$ is $2^{-i}$, so the expectation is $n 2^{-i}$.) So we can work with a fixed $b$.
- bottom-k: we need to separate the $k$ smallest values. We expect about $k/2$ to have the maximum represented exponent. So we need $\log \log n + \log k$ bits per register. We work with $b = \log k$.

Working with a small (discrete) range
Inverse probability (also historic) estimators:
- Estimators apply directly to a discrete range: simply work with the probability that a hash from the discrete domain is strictly below the current “threshold”.
- Unbiasedness still holds (on streams) even with likely hash collisions (with k-mins and k-partition).
- Variance increases by a factor of $\frac{1}{1 - 2^{-b}}$ ⇒ we get most of the value of the continuous domain with small $b$.
- For mergeability (to support the similarity-like estimates needed to merge counts) or with bottom-k, we need to work with a larger $b$ to ensure that hash collisions are not likely (on the same sketch or on two sketches).

Distinct counting/Min-Hash sketches bibliography 1
- First use of k-mins Min-Hash sketches for distinct counting; first streaming algorithm for approximate distinct counting: P. Flajolet, G. N. Martin. “Probabilistic Counting Algorithms for Data Base Applications”, JCSS (31), 1985.
- Use of Min-Hash sketches for similarity, union size, mergeability, size estimation (k-mins, proposes bottom-k): E. Cohen. “Size estimation framework with applications to transitive closure and reachability”, JCSS (55), 1997.
- Use of shingling with k-mins sketches for Jaccard similarity of text documents: A. Broder. “On the Resemblance and Containment of Documents”, Sequences 1997. A. Broder, S.
Glassman, M. Manasse, G. Zweig. “Syntactic Clustering of the Web”, SRC technical note, 1997.
- Better similarity estimators (beyond the union sketch) from bottom-k samples: E. Cohen, H. Kaplan. “Leveraging discarded samples for tighter estimation of multiple-set aggregates”, SIGMETRICS 2009.
- Asymptotic lower bound on distinct counter size (taking into account hash representation): N. Alon, Y. Matias, M. Szegedy. “The space complexity of approximating the frequency moments”, STOC 1996.
- Introducing k-partition sketches for distinct counting: Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, L. Trevisan. “Counting distinct elements in a data stream”, RANDOM 2002.

Distinct counting/Min-Hash sketches bibliography 2
- Practical distinct counters based on k-partition sketches: P. Flajolet, E. Fusy, O. Gandouet, F. Meunier. “Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm”. S. Heule, M. Nunkesser, A. Hall. “Hyperloglog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm”, EDBT 2013.
- Theoretical algorithm with asymptotic bounds that match the AMS lower bound: D. M. Kane, J. Nelson, D. P. Woodruff. “An optimal algorithm for the distinct elements problem”, PODS 2010.
- Inverse probability “historic” estimators, application of Cramer-Rao on min-hash sketches: E. Cohen. “All-Distances Sketches, Revisited: Scalable Estimation of the Distance Distribution and Centralities in Massive Graphs”, arXiv 2013.
- The concepts of min-hash sketches and sketch coordination are related to concepts from the survey sampling literature: order samples (bottom-k), coordination of samples using the PRN method (Permanent Random Numbers).
- More on bottom-k sketches, ML estimator for bottom-k: E. Cohen, H. Kaplan. “Summarizing data using bottom-k sketches”, PODS 2007. “Tighter Estimation using bottom-k sketches”, VLDB 2008.
- Inverse probability estimator with priority (type of bottom-k) sketches: N. Alon, N. Duffield, M. Thorup, C.
Lund: “Estimating arbitrary subset sums with a few probes” PODS 2005