### Lecture 2

Leveraging Big Data: Lecture 2
http://www.cohenwang.com/edith/bigdataclass2013
Instructors: Edith Cohen, Amos Fiat, Haim Kaplan, Tova Milo

#### Counting Distinct Elements
Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

- Elements occur multiple times; we want to count the number of *distinct* elements.
- The number of distinct elements is $n$ ($n = 6$ in the example).
- The total number of elements is 11 in this example.

Exact counting of $n$ distinct elements requires a structure of size $\Omega(n)$!
We are happy with an approximate count that uses a small working memory.
#### Distinct Elements: Approximate Counting

Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

We want to compute and maintain a small sketch $s(N)$ of the set $N$ of distinct items seen so far, $N = \{32, 12, 14, 7, 6, 4\}$, such that:

- Size of sketch: $|s(N)| \ll n = |N|$.
- We can query $s(N)$ to get a good estimate $\hat{n}$ of $n$ (small relative error).
- For a new element $x$, it is easy to compute $s(N \cup \{x\})$ from $s(N)$ and $x$ (for data stream computation).
- If $N_1$ and $N_2$ are (possibly overlapping) sets, we can compute the union sketch from their sketches: $s(N_1 \cup N_2)$ from $s(N_1)$ and $s(N_2)$ (for distributed computation).
#### Distinct Elements: Approximate Counting

Stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4

Size-estimation / minimum-value technique [Flajolet-Martin 85, C 94]:

$h(x) \sim U[0,1]$ is a random hash function from element IDs to uniform random numbers in $[0,1]$.

Maintain the min-hash value $y$:
- Initialize $y \leftarrow 1$.
- Processing an element $x$: $y \leftarrow \min\{y, h(x)\}$.
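The update rule above fits in a few lines of Python. This is an illustrative sketch: the seeded `random.Random` keyed by a string is a stand-in for a random hash function into $[0,1)$, not part of the original lecture.

```python
import random

class MinHash:
    """Maintains the minimum hash value y over a stream of element IDs."""
    def __init__(self, seed=0):
        self.seed = seed   # selects one "random hash function" (illustrative)
        self.y = 1.0       # initialize y <- 1

    def h(self, x):
        # Map element ID x deterministically to a pseudo-uniform value in [0, 1).
        return random.Random(f"{self.seed}:{x}").random()

    def process(self, x):
        self.y = min(self.y, self.h(x))   # y <- min{y, h(x)}

sketch = MinHash(seed=42)
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    sketch.process(x)
# Repeats have no effect: 32 contributes the same hash value every time.
```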
#### Distinct Elements: Approximate Counting

Running example (stream, hash values, and the evolving minimum $y$):

| element $x$     | 32   | 12   | 14   | 32   | 7    | 12   | 32   | 7    | 6    | 12   | 4    |
|-----------------|------|------|------|------|------|------|------|------|------|------|------|
| $h(x)$          | 0.45 | 0.35 | 0.74 | 0.45 | 0.21 | 0.35 | 0.45 | 0.21 | 0.14 | 0.35 | 0.92 |
| distinct so far | 1    | 2    | 3    | 3    | 4    | 4    | 4    | 4    | 5    | 5    | 6    |
| min-hash $y$    | 0.45 | 0.35 | 0.35 | 0.35 | 0.21 | 0.21 | 0.21 | 0.21 | 0.14 | 0.14 | 0.14 |

The minimum hash value $y$ is:
- Unaffected by repeated elements.
- Non-increasing with the number of distinct elements $n$.
#### Distinct Elements: Approximate Counting

How does the minimum hash value $y$ give information on the number of distinct elements $n$? The $n$ distinct hash values are uniform in $[0,1]$, and the expectation of their minimum is
$$E[y] = \frac{1}{n+1}.$$
A single value gives only limited information. To boost the information, we maintain $k \geq 1$ values.

Why is the expectation $\frac{1}{n+1}$?
- Take a circle of circumference 1 (circle points map to $[0,1]$).
- Throw a random red point to "mark" the start of a segment.
- Throw the $n$ points on the circle independently at random.
- The circle is cut into $n+1$ segments by these points; by symmetry, the expected length of each segment is $\frac{1}{n+1}$.
- The same holds for the segment clockwise from the red point, whose length is distributed like the minimum hash value.
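The expectation $1/(n+1)$ is easy to sanity-check by simulation (a sketch; the trial counts are arbitrary choices):

```python
import random

def mean_min_of_uniforms(n, trials=20000, seed=0):
    """Estimate E[min of n i.i.d. U[0,1] values] by Monte Carlo."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(n))
    return total / trials

for n in (1, 5, 20):
    est = mean_min_of_uniforms(n)
    print(n, round(est, 4), round(1 / (n + 1), 4))  # empirical mean vs 1/(n+1)
```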
#### Min-Hash Sketches

These sketches maintain $k \geq 1$ values $y_1, y_2, \ldots, y_k$ from the range of the hash function (distribution).

- **k-mins sketch:** Use $k$ "independent" hash functions $h_1, h_2, \ldots, h_k$. Track the respective minimum $y_1, y_2, \ldots, y_k$ for each function.
- **Bottom-k sketch:** Use a single hash function $h$. Track the $k$ smallest values $y_1, y_2, \ldots, y_k$.
- **k-partition sketch:** Use a single hash function $h'$. Use the first $\log_2 k$ bits of $h'(x)$ to map $x$ uniformly to one of $k$ parts; call the remaining bits $h(x)$. For $i = 1, \ldots, k$: track the minimum hash value $y_i$ of the elements in part $i$.

All three sketches are the same for $k = 1$.
#### Min-Hash Sketches: k-mins, bottom-k, k-partition

Why study all 3 variants? They offer different tradeoffs between update cost, accuracy, and usage.

Beyond distinct counting:
- Min-Hash sketches correspond to sampling schemes of large data sets.
- Similarity queries between datasets.
- Selectivity/subset queries.
- These patterns apply generally as methods to gather increased confidence from a random "projection"/sample.
#### Min-Hash Sketches: Examples

k-mins, k-partition, bottom-k with $k = 3$, over $N = \{32, 12, 14, 7, 6, 4\}$.

The min-hash value and sketches depend only on:
- the random hash function(s), and
- the set $N$ of distinct elements,

not on the order in which elements appear or on their multiplicity.

#### Min-Hash Sketches: Example (k-mins, $k = 3$)

| $x$      | 32   | 12   | 14   | 7    | 6    | 4    |
|----------|------|------|------|------|------|------|
| $h_1(x)$ | 0.45 | 0.35 | 0.74 | 0.21 | 0.14 | 0.92 |
| $h_2(x)$ | 0.19 | 0.51 | 0.07 | 0.70 | 0.55 | 0.20 |
| $h_3(x)$ | 0.10 | 0.71 | 0.93 | 0.50 | 0.89 | 0.18 |

$(y_1, y_2, y_3) = (0.14, 0.07, 0.10)$
#### Min-Hash Sketches: k-mins

k-mins sketch: use $k$ "independent" hash functions $h_1, h_2, \ldots, h_k$ and track the respective minima $y_1, y_2, \ldots, y_k$.

Processing a new element $x$: for $i = 1, \ldots, k$: $y_i \leftarrow \min\{y_i, h_i(x)\}$.

Example ($x = 12$): $h_1(12) = 0.35$, $h_2(12) = 0.51$, $h_3(12) = 0.71$.

Computation: $O(k)$ per element, whether or not the sketch is actually updated.
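A k-mins sketch in Python (a sketch under the same illustrative assumption as before: seeded `random.Random` instances stand in for $k$ independent hash functions):

```python
import random

class KMinsSketch:
    """k-mins sketch: k 'independent' hash functions, track the minimum of each."""
    def __init__(self, k, seed=0):
        self.k = k
        self.seed = seed
        self.y = [1.0] * k        # y_i <- 1

    def h(self, i, x):
        # i-th hash function: pseudo-uniform in [0, 1), keyed by (seed, i, x).
        return random.Random(f"{self.seed}:{i}:{x}").random()

    def process(self, x):
        # O(k) per element, whether or not any y_i changes.
        for i in range(self.k):
            self.y[i] = min(self.y[i], self.h(i, x))

sk = KMinsSketch(k=3, seed=7)
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    sk.process(x)
```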
#### Min-Hash Sketches: Example (k-partition, $k = 3$)

Each element gets a part-hash $i(x)$ (which of the $k$ parts it falls in) and a value-hash $h(x)$:

| $x$    | 32   | 12   | 14   | 7    | 6    | 4    |
|--------|------|------|------|------|------|------|
| $i(x)$ | 2    | 3    | 1    | 1    | 2    | 3    |
| $h(x)$ | 0.19 | 0.51 | 0.07 | 0.70 | 0.55 | 0.20 |

$(y_1, y_2, y_3) = (0.07, 0.19, 0.20)$
#### Min-Hash Sketches: k-partition

k-partition sketch: use a single hash function $h'$. The first $\log_2 k$ bits of $h'(x)$ map $x$ uniformly to one of $k$ parts; the remaining bits are $h(x)$. For $i = 1, \ldots, k$: track the minimum hash value $y_i$ of the elements in part $i$.

Processing a new element $x$:
- $i \leftarrow$ first $\log_2 k$ bits of $h'(x)$
- $h \leftarrow$ remaining bits of $h'(x)$
- $y_i \leftarrow \min\{y_i, h\}$

Example: $i = 2$, $h(x) = 0.19$, so $y_2 \leftarrow \min\{y_2, 0.19\}$.

Computation: $O(1)$ to test or update.
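A k-partition sketch in Python (a sketch; drawing the part index and the value from one seeded generator plays the role of splitting the bits of $h'(x)$, and assumes $k$ is a power of 2 in spirit only):

```python
import random

class KPartitionSketch:
    """k-partition sketch: one hash; part index from some bits, value from the rest."""
    def __init__(self, k, seed=0):
        self.k = k
        self.seed = seed
        self.y = [1.0] * k

    def process(self, x):
        rng = random.Random(f"{self.seed}:{x}")
        i = rng.randrange(self.k)   # "first log2(k) bits": uniform part index
        h = rng.random()            # "remaining bits": uniform value in [0, 1)
        if h < self.y[i]:           # O(1) to test or update
            self.y[i] = h

sk = KPartitionSketch(k=4, seed=7)
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    sk.process(x)
```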
#### Min-Hash Sketches: Example (bottom-k, $k = 3$)

| $x$    | 32   | 12   | 14   | 7    | 6    | 4    |
|--------|------|------|------|------|------|------|
| $h(x)$ | 0.19 | 0.51 | 0.07 | 0.70 | 0.55 | 0.20 |

The three smallest hash values give $(y_1, y_2, y_3) = (0.07, 0.19, 0.20)$.
#### Min-Hash Sketches: bottom-k

Bottom-k sketch: use a single hash function $h$ and track the $k$ smallest values $y_1 < y_2 < \cdots < y_k$.

Processing a new element $x$: if $h(x) < y_k$:
$$(y_1, \ldots, y_k) \leftarrow \mathrm{sort}\{y_1, \ldots, y_{k-1}, h(x)\}$$

Computation: the sketch $(y_1, \ldots, y_k)$ is maintained as a sorted list or as a priority queue.
- $O(1)$ to test whether an update is needed.
- $O(k)$ to update a sorted list; $O(\log k)$ to update a priority queue.

We will see that #changes $\ll$ #distinct elements.
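A bottom-k sketch with a sorted list in Python (a sketch; the repeat check assumes distinct elements never collide to the same hash value, and the string-seeded generator is again an illustrative stand-in for the hash function):

```python
import random

class BottomKSketch:
    """Bottom-k sketch: the k smallest hash values under one hash function."""
    def __init__(self, k, seed=0):
        self.k = k
        self.seed = seed
        self.ys = []              # sorted list of at most k smallest hash values

    def h(self, x):
        return random.Random(f"{self.seed}:{x}").random()

    def process(self, x):
        v = self.h(x)
        # O(1) test: no update needed if the sketch is full and v >= y_k.
        if len(self.ys) == self.k and v >= self.ys[-1]:
            return
        if v in self.ys:          # repeat of an element already in the sketch
            return
        self.ys.append(v)         # O(k) update of the sorted list
        self.ys.sort()
        del self.ys[self.k:]

sk = BottomKSketch(k=3, seed=7)
for x in [32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4]:
    sk.process(x)
```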
Claim: the expected number of actual updates (changes) of the min-hash sketch is $O(k \ln n)$.

Proof: first consider $k = 1$. Look at the distinct elements in the order they first occur. The $i$-th distinct element has a lower hash value than the current minimum with probability $\frac{1}{i}$: this is the probability of being first in a random permutation of $i$ elements.

$\Rightarrow$ The total expected number of updates is
$$\sum_{i=1}^{n} \frac{1}{i} = H_n \leq \ln n + 1.$$

Update probabilities on the example stream (repeats never cause an update):

| element      | 32 | 12  | 14  | 32 | 7   | 12 | 32 | 7  | 6   | 12 | 4   |
|--------------|----|-----|-----|----|-----|----|----|----|-----|----|-----|
| update prob. | 1  | 1/2 | 1/3 | 0  | 1/4 | 0  | 0  | 0  | 1/5 | 0  | 1/6 |
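The harmonic-sum bound on the number of updates can be checked empirically (a simulation sketch; the parameters are arbitrary):

```python
import random

def count_updates(n, trials=2000, seed=0):
    """Average number of times the running minimum changes over n distinct elements."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        y, updates = 1.0, 0
        for _ in range(n):
            v = rng.random()
            if v < y:
                y, updates = v, updates + 1
        total += updates
    return total / trials

n = 1000
harmonic = sum(1 / i for i in range(1, n + 1))
print(round(count_updates(n), 2), round(harmonic, 2))  # both close to H_n
```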
Claim: the expected number of actual updates (changes) of the min-hash sketch is $O(k \ln n)$.

Proof (continued): recap for $k = 1$ (single min-hash value): the $i$-th distinct element causes an update with probability $\frac{1}{i}$, so the expected total is $\sum_{i=1}^{n} \frac{1}{i} \leq \ln n + 1$.

- **k-mins:** $k$ min-hash values, so the argument applies $k$ times: $O(k \ln n)$.
- **Bottom-k:** we keep the $k$ smallest values, so the update probability of the $i$-th distinct element is $\min\{1, \frac{k}{i}\}$ (the probability of being among the first $k$ in a random permutation of $i$ elements). The expected total is $\sum_{i=1}^{n} \min\{1, \frac{k}{i}\} \leq k(1 + \ln(n/k)) = O(k \ln n)$.
- **k-partition:** $k$ min-hash values, each over $\approx n/k$ distinct elements.
#### Merging Min-Hash Sketches

!! We must apply the same hash function(s) to all elements/data sets/streams.

The union sketch $s$ from the sketches $s', s''$ of two sets $N', N''$:
- **k-mins:** take the minimum per hash function: $y_i \leftarrow \min\{y_i', y_i''\}$.
- **k-partition:** take the minimum per part: $y_i \leftarrow \min\{y_i', y_i''\}$.
- **Bottom-k:** the $k$ smallest values in the union of the data must be among the $k$ smallest of their own set:
  $$\{y_1, \ldots, y_k\} = \mathrm{bottom}_k\{y_1', \ldots, y_k', y_1'', \ldots, y_k''\}$$
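The merge rules are one-liners once the sketches are plain lists (a sketch; the sample values are made up for illustration):

```python
def merge_kmins(y1, y2):
    """Union sketch for k-mins (and, per part, for k-partition): coordinate-wise min."""
    return [min(a, b) for a, b in zip(y1, y2)]

def merge_bottomk(y1, y2, k):
    """Union sketch for bottom-k: the k smallest distinct values of the two sketches."""
    return sorted(set(y1) | set(y2))[:k]

# Merging per-stream sketches yields the sketch of the combined stream.
print(merge_kmins([0.14, 0.07, 0.10], [0.20, 0.05, 0.33]))      # [0.14, 0.05, 0.10]
print(merge_bottomk([0.07, 0.19, 0.20], [0.05, 0.19, 0.31], 3))  # [0.05, 0.07, 0.19]
```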
#### Using Min-Hash Sketches

Recap:
- We defined Min-Hash sketches (3 types).
- Adding elements; merging Min-Hash sketches.
- Some properties of these sketches.

Next, we put Min-Hash sketches to work:
- Estimating the distinct count from a Min-Hash sketch.
- Tools from estimation theory.
#### The Exponential Distribution Exp($\lambda$)

- PDF $\lambda e^{-\lambda x}$ for $x \geq 0$; CDF $1 - e^{-\lambda x}$; expectation $=$ standard deviation $= \frac{1}{\lambda}$.
- Very useful properties:
  - **Memorylessness:** $\forall s, t \geq 0$: $\Pr[x > s + t \mid x > s] = \Pr[x > t]$.
  - **Min-to-Sum conversion:** $\min\{\mathrm{Exp}(\lambda_1), \ldots, \mathrm{Exp}(\lambda_k)\} \sim \mathrm{Exp}(\lambda_1 + \cdots + \lambda_k)$.
- Relation with the uniform distribution:
  $$u \sim U[0,1] \iff \frac{-\ln(1-u)}{\lambda} \sim \mathrm{Exp}(\lambda); \qquad x \sim \mathrm{Exp}(\lambda) \iff 1 - e^{-\lambda x} \sim U[0,1].$$
#### Estimating Distinct Count from a Min-Hash Sketch: k-mins

- Change to the exponential distribution: $h(x) \sim \mathrm{Exp}(1)$.
- Using the Min-to-Sum property, each $y_i \sim \mathrm{Exp}(n)$.
  - In fact, we can just work with $h(x) \sim U[0,1]$ and use $y \leftarrow -\ln(1 - y')$ when estimating.
- Estimating the number of distinct elements becomes a parameter-estimation problem: given $k$ independent samples from $\mathrm{Exp}(n)$, estimate $n$.
#### Estimating Distinct Count from a Min-Hash Sketch: k-mins

- Each $y_i \sim \mathrm{Exp}(n)$ has expectation $\frac{1}{n}$ and variance $\frac{1}{n^2}$.
- The average $\bar{y} = \frac{1}{k}\sum_{i=1}^{k} y_i$ has expectation $\mu = \frac{1}{n}$ and variance $\sigma^2 = \frac{1}{k n^2}$. The CV is $\frac{\sigma}{\mu} = \frac{1}{\sqrt{k}}$.
- So $\bar{y}$ is a good unbiased estimator for $\frac{1}{n}$.
- But $\frac{1}{n}$ is the inverse of what we want.
#### Estimating Distinct Count from a Min-Hash Sketch: k-mins

1) We can use the biased estimator
$$\hat{n} = \frac{k}{\sum_{i=1}^{k} y_i}.$$
To say something useful about the estimate quality, we apply Chebyshev's inequality to bound the probability that $\bar{y}$ is far from its expectation $\frac{1}{n}$, and thus that $\hat{n}$ is far from $n$.

2) Maximum Likelihood Estimation (a general and powerful technique).
#### Chebyshev's Inequality

For any random variable with expectation $\mu$ and standard deviation $\sigma$, for any $c \geq 1$:
$$\Pr[|X - \mu| \geq c\sigma] \leq \frac{1}{c^2}.$$

For $\bar{y}$: $\mu = \frac{1}{n}$ and $\sigma = \frac{1}{\sqrt{k}\, n}$. Using $c = \frac{\epsilon \sqrt{k}}{2}$:
$$\Pr\left[\left|\bar{y} - \frac{1}{n}\right| \geq \frac{\epsilon}{2n}\right] \leq \frac{4}{\epsilon^2 k}.$$
#### Using Chebyshev's Inequality

For $0 < \epsilon < \frac{1}{2}$ we use $\frac{1}{1-\epsilon} \geq 1 + \epsilon$ and $\frac{1}{1+\epsilon} \leq 1 - \frac{\epsilon}{2}$:

$$
\Pr[|\hat{n} - n| \geq \epsilon n]
= 1 - \Pr[(1-\epsilon) n \leq \hat{n} \leq (1+\epsilon) n]
= 1 - \Pr\left[\frac{1}{(1+\epsilon) n} \leq \bar{y} \leq \frac{1}{(1-\epsilon) n}\right]
$$
$$
\leq 1 - \Pr\left[\frac{1 - \epsilon/2}{n} \leq \bar{y} \leq \frac{1+\epsilon}{n}\right]
\leq \Pr\left[\left|\bar{y} - \frac{1}{n}\right| \geq \frac{\epsilon}{2n}\right]
\leq \frac{4}{\epsilon^2 k}.
$$
#### Maximum Likelihood Estimation

Given a set of independent observations $y_i \sim F_i(n)$, where we do not know $n$: the MLE $\hat{n}$ is the value of $n$ that maximizes the likelihood (joint density) function $f(y; n)$, i.e., the value of $n$ that maximizes the probability (density) of observing $\{y_i\}$.

Properties:
- A principled way of deriving estimators.
- Converges in probability to the true value (with enough i.i.d. samples)... but is generally biased.
- (Asymptotically!) optimal: minimizes the MSE (mean square error) and meets the Cramér-Rao lower bound.
#### Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE

Given $k$ independent samples $y_1, \ldots, y_k$ from $\mathrm{Exp}(n)$, estimate $n$.

- Likelihood function for $y$ (joint density):
  $$f(y; n) = \prod_{i=1}^{k} n e^{-n y_i} = n^k e^{-n \sum_{i=1}^{k} y_i}.$$
- Take a logarithm (does not change the maximizer):
  $$\ell(y_1, \ldots, y_k; n) = \ln f(y; n) = k \ln n - n \sum_{i=1}^{k} y_i.$$
- Differentiate to find the maximum:
  $$\frac{\partial \ell(y; n)}{\partial n} = \frac{k}{n} - \sum_{i=1}^{k} y_i = 0.$$
- MLE estimate: $\hat{n} = \dfrac{k}{\sum_{i=1}^{k} y_i}$.

We get the same estimator as before; it depends only on the sum!
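The full pipeline, from uniform min-hash values to the MLE, can be exercised numerically (a simulation sketch; hash values are drawn directly as uniforms and all parameters are arbitrary):

```python
import math
import random

def kmins_estimate(k, n, seed=0):
    """Simulate a k-mins sketch over n distinct elements; return the MLE k / sum(y_i)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        u = min(rng.random() for _ in range(n))  # one min-hash value under U[0,1]
        total += -math.log(1.0 - u)              # transform to an Exp(n) sample
    return k / total

print(round(kmins_estimate(k=100, n=2000)))  # close to 2000 (CV roughly 1/sqrt(k) = 10%)
```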
Given $k$ independent samples from $\mathrm{Exp}(n)$, estimate $n$.

We can think of several ways to combine these $k$ samples and decrease the variance:
- average (sum),
- median,
- remove outliers and average the rest, ...

We want to get the most value (the best estimate) from the information we have (the sketch). Which combinations should we consider?
#### Sufficient Statistic

A function $T(y) = T(y_1, \ldots, y_k)$ is a sufficient statistic for estimating some function of the parameter $n$ if the likelihood function has the factored form $g(y)\, f(T(y); n)$.

Likelihood function (joint density) for i.i.d. exponential random variables from $\mathrm{Exp}(n)$:
$$f(y; n) = \prod_{i=1}^{k} n e^{-n y_i} = n^k e^{-n \sum_{i=1}^{k} y_i}$$
$\Rightarrow$ the sum $\sum_{i=1}^{k} y_i$ is a sufficient statistic for $n$.
#### Sufficient Statistic

$T(y) = T(y_1, \ldots, y_k)$ is a sufficient statistic if the likelihood function has the form $f(y; n) = g(y)\, f(T(y); n)$.

In particular, the MLE depends on $y$ only through $T(y)$:
- The maximum with respect to $n$ does not depend on $g(y)$.
- The maximum of $f(T(y); n)$, computed by differentiating with respect to $n$, is a function of $T(y)$.

Lemma: $T(y)$ is sufficient $\iff$ the conditional distribution of $y$ given $T(y)$ does not depend on $n$.

If we fix $T(y)$, the density function satisfies $f(y; n) \propto g(y)$. If we know the density up to a fixed factor, it is determined completely by normalizing to 1.
#### Rao-Blackwell Theorem

Recap: $T(y)$ is a sufficient statistic for $n$ $\iff$ the conditional distribution of $y$ given $T(y)$ does not depend on $n$.

Rao-Blackwell Theorem: given an estimator $f(y)$ of $n$ that is not a function of the sufficient statistic, we can get an estimator with at most the same MSE that depends only on $T(y)$: namely $E[f(y) \mid T(y)]$.
- $E[f(y) \mid T(y)]$ does not depend on $n$ (critical, otherwise it would not be an estimator).
- The process is called Rao-Blackwellization of $f(y)$.
#### Rao-Blackwell Theorem: Illustration

[Figure: the density $f(y_1, y_2; n)$ over the sample points $(1,3), (2,2), (4,0), (1,2), (3,1), (2,1), (3,2), (3,0), (1,4)$. The sufficient statistic $T(y_1, y_2) = y_1 + y_2$ groups the points into diagonals of equal sum. An estimator $f(y_1, y_2)$ assigns a value to each point; Rao-Blackwellization $f' = E[f(y_1, y_2) \mid y_1 + y_2]$ replaces the values on each diagonal by their conditional average, e.g., 1.5, 1, and 3 on the sums 4, 3, and 5 in the example.]
#### Rao-Blackwell Theorem

Rao-Blackwell: $f' = E[f(y_1, y_2) \mid y_1 + y_2]$.

- Law of total expectation: $E[f'] = E[f]$, so the expectation (bias) remains the same.
- The MSE (mean square error) can only decrease: $\mathrm{MSE}[f'] \leq \mathrm{MSE}[f]$.

Why does the MSE decrease?
- Suppose we have two points with equal probabilities, and an estimator of $n$ that gives estimates $a$ and $b$ on these points.
- We replace it by an estimator that instead returns the average $\frac{a+b}{2}$ on both points.
- The (scaled) contribution of these two points to the square error changes from $(a - n)^2 + (b - n)^2$ to $2\left(\frac{a+b}{2} - n\right)^2$.
#### Why does the MSE decrease?

Show that
$$(a - n)^2 + (b - n)^2 \geq 2\left(\frac{a+b}{2} - n\right)^2.$$
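One way to verify the inequality is to expand the difference of the two sides (a standard convexity computation):

```latex
\begin{align*}
(a-n)^2 + (b-n)^2 - 2\left(\tfrac{a+b}{2} - n\right)^2
&= a^2 + b^2 - 2(a+b)n + 2n^2 - \tfrac{(a+b)^2}{2} + 2(a+b)n - 2n^2 \\
&= a^2 + b^2 - \tfrac{a^2 + 2ab + b^2}{2}
 = \tfrac{(a-b)^2}{2} \;\geq\; 0.
\end{align*}
```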
#### Sufficient Statistic for Estimating $n$ from k-mins Sketches

Given $k$ independent samples from $\mathrm{Exp}(n)$, estimate $n$.

- $f(y; n) = \prod_{i=1}^{k} n e^{-n y_i} = n^k e^{-n \sum_{i=1}^{k} y_i}$.
- The sum $\sum_{i=1}^{k} y_i$ is a sufficient statistic for estimating any function of $n$ (including $n$, $\frac{1}{n}$, $n^2$).
- Rao-Blackwell $\Rightarrow$ we cannot gain by using estimators with a different dependence on $\{y_i\}$ (e.g., functions of the median or of a smaller sum).
#### Estimating Distinct Count from a Min-Hash Sketch: k-mins MLE

MLE estimate: $\hat{n} = \dfrac{k}{\sum_{i=1}^{k} y_i}$.

- $S = \sum_{i=1}^{k} y_i$, the sum of $k$ i.i.d. $\mathrm{Exp}(n)$ random variables, has the (Erlang) PDF
  $$f_{S,n}(s) = \frac{n^k s^{k-1}}{(k-1)!}\, e^{-n s}.$$
- The expectation of the MLE estimate is
  $$E[\hat{n}] = \int_0^\infty \frac{k}{s}\, f_{S,n}(s)\, ds = \frac{k}{k-1}\, n,$$
  so the MLE is biased upward by a factor of $\frac{k}{k-1}$.
#### Estimating Distinct Count from a Min-Hash Sketch: k-mins

Unbiased estimator (for $k > 1$): $\hat{n} = \dfrac{k-1}{\sum_{i=1}^{k} y_i}$.

The variance of the unbiased estimate is
$$\mathrm{Var}[\hat{n}] = \int_0^\infty \left(\frac{k-1}{s} - n\right)^2 f_{S,n}(s)\, ds = \frac{n^2}{k-2},$$
so the CV is $\frac{\sigma}{n} = \dfrac{1}{\sqrt{k-2}}$.
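A simulation sketch checking both the unbiasedness and the CV of $\approx 1/\sqrt{k-2}$ (the trial counts and parameters are arbitrary; Exp($n$) samples are drawn directly rather than via hashing):

```python
import math
import random

def unbiased_estimates(k, n, trials=5000, seed=0):
    """Draw k Exp(n) samples per trial and return the (k-1)/sum estimates."""
    rng = random.Random(seed)
    ests = []
    for _ in range(trials):
        s = sum(rng.expovariate(n) for _ in range(k))
        ests.append((k - 1) / s)
    return ests

k, n = 20, 1000
ests = unbiased_estimates(k, n)
mean = sum(ests) / len(ests)
cv = math.sqrt(sum((e - mean) ** 2 for e in ests) / len(ests)) / mean
print(round(mean), round(cv, 3), round(1 / math.sqrt(k - 2), 3))
```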
Is this the best we can do?

#### Cramér-Rao Lower Bound (CRLB)

Are we using the information in the sketch in the best possible way?

The CRLB is an information-theoretic lower bound on the variance of any unbiased estimator $\hat{n}$ of $n$.

- Likelihood function: $f(y; n)$.
- Log-likelihood: $\ell(y; n) = \ln f(y; n)$.
- Fisher information:
  $$I(n) = -E\left[\frac{\partial^2 \ell(y; n)}{\partial n^2}\right].$$
- CRLB: any unbiased estimator has $\mathrm{Var}[\hat{n}] \geq \dfrac{1}{I(n)}$.
#### CRLB for Estimating $n$

- Likelihood function: $f(y; n) = \prod_{i=1}^{k} n e^{-n y_i} = n^k e^{-n \sum_{i=1}^{k} y_i}$.
- Log-likelihood: $\ell(y; n) = k \ln n - n \sum_{i=1}^{k} y_i$.
- Negated second derivative: $-\dfrac{\partial^2 \ell(y; n)}{\partial n^2} = \dfrac{k}{n^2}$.
- Fisher information: $I(n) = -E\left[\dfrac{\partial^2 \ell(y; n)}{\partial n^2}\right] = \dfrac{k}{n^2}$.
- CRLB: $\mathrm{Var}[\hat{n}] \geq \dfrac{1}{I(n)} = \dfrac{n^2}{k}$.
#### Estimating Distinct Count from a Min-Hash Sketch: k-mins

Unbiased estimator (for $k > 1$): $\hat{n} = \dfrac{k-1}{\sum_{i=1}^{k} y_i}$, with CV $\dfrac{1}{\sqrt{k-2}}$.

The Cramér-Rao lower bound on the CV is $\dfrac{1}{\sqrt{k}}$
$\Rightarrow$ we are using the information in the sketch nearly optimally!
#### Estimating Distinct Count from a Min-Hash Sketch: Bottom-k

Bottom-k sketch: $y_1 < y_2 < \cdots < y_k$.

Can we specify the distribution? Use the exponential distribution $h(x) \sim \mathrm{Exp}(1)$:
- $y_1$ is the same as in k-mins: $y_1 \sim \mathrm{Exp}(n)$.
- The minimum $y_2$ of the remaining $n - 1$ elements is distributed as $\mathrm{Exp}(n-1)$ conditioned on $y_2 > y_1$; by memorylessness, $y_2 - y_1 \sim \mathrm{Exp}(n-1)$.
- More generally, $y_{i+1} - y_i \sim \mathrm{Exp}(n - i)$.

What is the relation with k-mins sketches?

#### Bottom-k versus k-mins Sketches

- Bottom-k sketch: samples from $\mathrm{Exp}(n), \mathrm{Exp}(n-1), \ldots, \mathrm{Exp}(n-k+1)$.
- k-mins sketch: $k$ samples from $\mathrm{Exp}(n)$.

To obtain $z \sim \mathrm{Exp}(n)$ from $y \sim \mathrm{Exp}(n-k)$ (without knowing $n$) we can take $z = \min\{y, u\}$ where $u \sim \mathrm{Exp}(k)$, by Min-to-Sum conversion.

So we can use k-mins estimators with bottom-k sketches, and can do even better by taking the expectation over the choices of $u$. Bottom-k sketches carry strictly more information than k-mins sketches!
#### Estimating Distinct Count from a Min-Hash Sketch: Bottom-k

Likelihood function of $y_1 < \cdots < y_k$ (with $y_0 = 0$):
$$
f(y; n) = \prod_{i=1}^{k} (n + 1 - i)\, e^{-(n+1-i)(y_i - y_{i-1})}
        = e^{-\sum_{i=1}^{k-1} y_i}\; \frac{n!}{(n-k)!}\, e^{-(n-k+1) y_k}.
$$
The factor $e^{-\sum_{i=1}^{k-1} y_i}$ does not depend on $n$; the factor $\frac{n!}{(n-k)!}\, e^{-(n-k+1) y_k}$ depends on $n$.

What does estimation theory tell us?
#### Estimating Distinct Count from a Min-Hash Sketch: Bottom-k

What does estimation theory tell us? The likelihood function factors as
$$f(y; n) = e^{-\sum_{i=1}^{k-1} y_i}\; \frac{n!}{(n-k)!}\, e^{-(n-k+1) y_k},$$
so $y_k$ (the maximum value in the sketch) is a sufficient statistic for estimating $n$ (or any function of $n$). It captures everything we can glean from the bottom-k sketch about $n$.
#### Bottom-k: MLE for Distinct Count

The likelihood function (probability density) is
$$f(y; n) = e^{-\sum_{i=1}^{k-1} y_i}\; \frac{n!}{(n-k)!}\, e^{-(n-k+1) y_k}.$$
To find the value of $n$ that maximizes $f(y; n)$:
- Look only at the part that depends on $n$.
- Take the logarithm (same maximizer):
$$\ell(y_k; n) = \sum_{j=0}^{k-1} \ln(n - j) - (n - k + 1)\, y_k.$$
#### Bottom-k: MLE for Distinct Count

We look for the $n$ that maximizes
$$\ell(y_k; n) = \sum_{j=0}^{k-1} \ln(n - j) - (n - k + 1)\, y_k.$$
Differentiating:
$$\frac{\partial \ell(y_k; n)}{\partial n} = \sum_{j=0}^{k-1} \frac{1}{n - j} - y_k.$$
The MLE is the solution of
$$\sum_{j=0}^{k-1} \frac{1}{n - j} = y_k,$$
which we need to solve numerically.
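Since the left-hand side is strictly decreasing in $n$, a simple bisection solves the MLE equation. A sketch, treating $n$ as continuous and restricting to $n \geq k$ (the search bound `hi` is an arbitrary choice):

```python
def bottomk_mle(yk, k, hi=1e12):
    """Solve sum_{j=0}^{k-1} 1/(n - j) = y_k for n (continuous, n >= k) by bisection."""
    def lhs(n):
        return sum(1.0 / (n - j) for j in range(k))
    lo = float(k)
    for _ in range(200):          # lhs is strictly decreasing in n
        mid = (lo + hi) / 2
        if lhs(mid) > yk:
            lo = mid              # value still too large -> n lies further right
        else:
            hi = mid
    return (lo + hi) / 2

# Consistency check: plug in the y_k produced by some true n.
true_n, k = 1000, 10
yk = sum(1.0 / (true_n - j) for j in range(k))
print(round(bottomk_mle(yk, k)))  # -> 1000
```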
#### Summary: k-mins Count Estimators

- k-mins sketch with $U[0,1]$ dist: $y_1', \ldots, y_k'$.
- With Exp dist: $y_1, \ldots, y_k$ where $y_i = -\ln(1 - y_i')$.
- Sufficient statistic for (any function of) $n$: $\sum_{i=1}^{k} y_i$.
- MLE / unbiased estimator for $\frac{1}{n}$: $\frac{1}{k}\sum_{i=1}^{k} y_i$; CV: $\frac{1}{\sqrt{k}}$, which meets the CRLB.
- MLE for $n$: $\dfrac{k}{\sum_{i=1}^{k} y_i}$.
- Unbiased estimator for $n$: $\dfrac{k-1}{\sum_{i=1}^{k} y_i}$; CV: $\frac{1}{\sqrt{k-2}}$; CRLB: $\frac{1}{\sqrt{k}}$.
#### Summary: bottom-k Count Estimators

- Bottom-k sketch with $U[0,1]$: $y_1' < \cdots < y_k'$.
- With Exp dist: $y_1 < \cdots < y_k$ where $y_i = -\ln(1 - y_i')$.
- Sufficient statistic for (any function of) $n$: $y_k$.
- When $n \gg k$, approximately the same as k-mins.
- MLE for $n$ is the solution of $\sum_{j=0}^{k-1} \frac{1}{n - j} = y_k$.
#### Bibliography

#### See Lecture 3

We will continue with Min-Hash sketches:
- Use as random samples
- Applications to similarity
- Inverse-probability based distinct count estimators