### Entropy and Shannon's First Theorem

#### Information

Information is a quantitative measure of the amount of information an event represents: $I(p)$ denotes the amount of information in the occurrence of an event of probability $p$ (for example, a single symbol from a source).

Axioms:

- A. $I(p) \ge 0$ for any event of probability $p$.
- B. $I(p_1 \cdot p_2) = I(p_1) + I(p_2)$ when $p_1$ and $p_2$ are independent events (the Cauchy functional equation).
- C. $I(p)$ is a continuous function of $p$.

Existence: $I(p) = \log(1/p)$ (in any base) satisfies the axioms.

Units of information: base 2 gives a **bit**, base $e$ a **nat**, base 10 a **Hartley**.
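As a quick numeric illustration (a minimal sketch; the helper name `information` is ours, not from the notes), the additivity axiom can be checked directly:

```python
import math

def information(p: float, base: float = 2.0) -> float:
    """Information (in units of the given base) of an event with probability p."""
    return math.log(1.0 / p, base)

p1, p2 = 0.5, 0.25
# Axiom B: for independent events, information adds.
assert math.isclose(information(p1 * p2), information(p1) + information(p2))
print(information(0.5))   # 1.0 bit
print(information(0.25))  # 2.0 bits
```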
#### Uniqueness

Suppose $I'(p)$ satisfies the axioms. Since $I'(p) \ge 0$, take any $0 < p_0 < 1$ and the base $k = (1/p_0)^{1/I'(p_0)}$, so that $k^{I'(p_0)} = 1/p_0$ and hence $\log_k(1/p_0) = I'(p_0)$. Now any $z \in (0,1)$ can be written as $p_0^r$ with $r \in \mathbb{R}^+$ (namely $r = \log_{p_0} z$). The Cauchy functional equation implies $I'(p_0^n) = n\,I'(p_0)$ and, for $m \in \mathbb{Z}^+$, $I'(p_0^{1/m}) = \tfrac{1}{m} I'(p_0)$, which gives $I'(p_0^{n/m}) = \tfrac{n}{m} I'(p_0)$; by continuity, $I'(p_0^r) = r\,I'(p_0)$. Hence

$$I'(z) = r \log_k \frac{1}{p_0} = \log_k \frac{1}{p_0^{\,r}} = \log_k \frac{1}{z}. \qquad \blacksquare$$

Note: in this proof we introduce an arbitrary $p_0$, show how any $z$ relates to it, and then eliminate the dependence on that particular $p_0$.
#### Entropy

The entropy of a source $S = \{s_1, \dots, s_q\}$, where symbol $s_i$ has probability $p_i$, is the average amount of information received per symbol; it measures the information rate. In radix $r$, when all the probabilities are independent:

$$H_r(S) = \sum_{i=1}^{q} p_i \log_r \frac{1}{p_i}
         = \sum_{i=1}^{q} \log_r \left(\frac{1}{p_i}\right)^{p_i}
         = \log_r \prod_{i=1}^{q} \left(\frac{1}{p_i}\right)^{p_i},$$

the weighted arithmetic mean of the information, which equals the information of the weighted geometric mean.

- Entropy is the amount of information in the probability distribution itself.

Alternative approach: consider a long message of $N$ symbols from $S = \{s_1, \dots, s_q\}$ with probabilities $p_1, \dots, p_q$. You expect $s_i$ to appear $N p_i$ times, and the probability of this typical message is

$$P = \prod_{i=1}^{q} p_i^{\,N p_i},$$

whose information is

$$\log \frac{1}{P} = N \sum_{i=1}^{q} p_i \log \frac{1}{p_i} = N \cdot H(S).$$
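A small numeric check of this "typical message" view (a sketch; the `entropy` helper and the sample distribution are ours):

```python
import math

def entropy(probs, base=2.0):
    """H(S) = sum of p_i * log_base(1/p_i), skipping zero-probability symbols."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

probs = [0.5, 0.25, 0.125, 0.125]
N = 1000
H = entropy(probs)

# Information of the "typical" message in which s_i appears N*p_i times.
typical_info = sum(N * p * math.log2(1.0 / p) for p in probs)
print(H)                    # 1.75 bits/symbol
print(typical_info, N * H)  # both 1750.0
```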
#### The function f(p) = p ln(1/p)

Consider $f(p) = p \ln(1/p)$ (the argument works for any base, not just $e$):

$$f'(p) = (-p \ln p)' = -\ln p - 1 = \ln\frac{1}{p} - 1,
\qquad
f''(p) = -\frac{1}{p} < 0 \quad \text{for } p \in (0,1),$$

so $f$ is concave down, with $f'(0^+) = +\infty$, $f'(1/e) = 0$, $f'(1) = -1$, maximum value $f(1/e) = 1/e$, and $f(1) = 0$.

[Figure: graph of $f(p)$ on $(0,1)$, rising from $0$ to its maximum $1/e$ at $p = 1/e$ and falling back to $0$ at $p = 1$.]

At the left endpoint, by L'Hôpital's rule,

$$\lim_{p \to 0^+} f(p)
 = \lim_{p \to 0^+} \frac{\ln(1/p)}{1/p}
 = \lim_{p \to 0^+} \frac{(-\ln p)'}{(1/p)'}
 = \lim_{p \to 0^+} \frac{-1/p}{-1/p^2}
 = \lim_{p \to 0^+} p = 0.$$
#### The logarithm function

The tangent line to $y = \ln x$ at $x = 1$ is

$$(y - \ln 1) = (\ln x)'\big|_{x=1}\,(x - 1) \quad\Longrightarrow\quad y = x - 1.$$

Since $(\ln x)'' = (1/x)' = -1/x^2 < 0$ for all $x > 0$, $\ln x$ is concave down, so it lies below its tangent line.

[Figure: graphs of $\ln x$ and $y = x - 1$, touching only at $x = 1$.]

Conclusion: $\ln x \le x - 1$, with equality only at $x = 1$.
#### The fundamental Gibbs inequality

Let $\sum_{i=1}^{q} x_i = 1$ and $\sum_{i=1}^{q} y_i = 1$ be two probability distributions, and consider

$$\sum_{i=1}^{q} x_i \log \frac{y_i}{x_i}
  \;\le\; \sum_{i=1}^{q} x_i \left(\frac{y_i}{x_i} - 1\right)
  = \sum_{i=1}^{q} (y_i - x_i)
  = \sum_{i=1}^{q} y_i - \sum_{i=1}^{q} x_i = 1 - 1 = 0,$$

using $\ln t \le t - 1$; equality holds only when $x_i = y_i$ for all $i$.

- Minimum entropy occurs when one $p_i = 1$ and all the others are $0$.
- Maximum entropy occurs when? Apply Gibbs with the uniform distribution $y_i = 1/q$:

$$H(S) - \log q
  = \sum_{i=1}^{q} p_i \log \frac{1}{p_i} - \log q \sum_{i=1}^{q} p_i
  = \sum_{i=1}^{q} p_i \log \frac{1}{q\,p_i} \;\le\; 0.$$

- Hence $H(S) \le \log q$, and equality occurs only when every $p_i = 1/q$.
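A quick random check of the Gibbs inequality and of the $\log q$ bound (a sketch; the distributions are randomly generated here):

```python
import math, random

def gibbs_lhs(x, y):
    """sum_i x_i * log2(y_i / x_i); by Gibbs this is never positive."""
    return sum(xi * math.log2(yi / xi) for xi, yi in zip(x, y) if xi > 0)

def normalize(v):
    s = sum(v)
    return [t / s for t in v]

random.seed(0)
q = 5
for _ in range(1000):
    x = normalize([random.random() for _ in range(q)])
    y = normalize([random.random() for _ in range(q)])
    assert gibbs_lhs(x, y) <= 1e-12

# Maximum entropy: H(S) <= log2 q, with equality for the uniform distribution.
H = lambda p: sum(pi * math.log2(1 / pi) for pi in p if pi > 0)
print(H([1 / q] * q), math.log2(q))  # both log2(5) ~ 2.3219
```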
#### Entropy Examples

| Source | Probabilities | Entropy |
|---|---|---|
| $S = \{s_1\}$ | $p_1 = 1$ | $H(S) = 0$ (no information) |
| $S = \{s_1, s_2\}$ | $p_1 = p_2 = \tfrac12$ | $H_2(S) = 1$ (1 bit per symbol) |
| $S = \{s_1, \dots, s_r\}$ | $p_1 = \dots = p_r = \tfrac1r$ | $H_r(S) = 1$, but $H_2(S) = \log_2 r$ |

- Run-length coding (for instance, in binary predictive coding): let $q$ be the probability of a 1 and $p = 1 - q$ the probability of a 0, so $H_2(S) = p \log_2(1/p) + q \log_2(1/q)$. As $q \to 0$ the term $q \log_2(1/q)$ dominates (compare the slopes). Compare: the average run length is $1/q$ and the average number of bits needed to code a run is $\log_2(1/q)$, so $q \log_2(1/q)$ is the average amount of information per bit of the original code.
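Numerically, for small $q$ the term $q \log_2(1/q)$ is indeed most of $H_2(S)$ (a quick sketch; `h2` is our helper name):

```python
import math

def h2(q: float) -> float:
    """Binary entropy: p*log2(1/p) + q*log2(1/q) with p = 1 - q."""
    p = 1.0 - q
    return p * math.log2(1 / p) + q * math.log2(1 / q)

for q in (0.1, 0.01, 0.001):
    print(q, h2(q), q * math.log2(1 / q))
# As q -> 0, q*log2(1/q) accounts for nearly all of H2(S).
```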
#### Entropy as a Lower Bound for Average Code Length

Given an instantaneous code with lengths $l_i$ in radix $r$, let

$$K = \sum_{i=1}^{q} \frac{1}{r^{l_i}} \le 1, \qquad
  Q_i = \frac{r^{-l_i}}{K}, \qquad \sum_{i=1}^{q} Q_i = 1.$$

So by Gibbs, $\sum_{i=1}^{q} p_i \log_r (Q_i/p_i) \le 0$; applying $\log(Q_i/p_i) = \log(1/p_i) - \log(1/Q_i)$,

$$H_r(S) = \sum_{i=1}^{q} p_i \log_r \frac{1}{p_i}
  \;\le\; \sum_{i=1}^{q} p_i \log_r \frac{1}{Q_i}
  = \sum_{i=1}^{q} p_i \left(\log_r K + l_i \log_r r\right)
  = \log_r K + \sum_{i=1}^{q} p_i\, l_i.$$

Since $K \le 1$, $\log_r K \le 0$, and hence $H_r(S) \le L$.

By the McMillan inequality, this holds for all uniquely decodable codes. Equality occurs when $K = 1$ (the decoding tree is complete) and $p_i = r^{-l_i}$.
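A small verification of $H_r(S) \le L$ for a binary instantaneous code (a sketch; the code and probabilities are ours, chosen so that $K = 1$ and $p_i = 2^{-l_i}$, the equality case):

```python
import math

probs   = [0.5, 0.25, 0.125, 0.125]   # p_i
lengths = [1, 2, 3, 3]                # l_i of an instantaneous binary code
r = 2

K = sum(r ** -l for l in lengths)               # Kraft sum
L = sum(p * l for p, l in zip(probs, lengths))  # average code length
H = sum(p * math.log(1 / p, r) for p in probs)  # entropy in radix r

print(K, H, L)   # K = 1.0, H = 1.75, L = 1.75  (equality: p_i = r**-l_i)
assert H <= L + 1e-12
```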
#### Shannon-Fano Coding

The simplest variable-length method. It is less efficient than Huffman coding, but it allows one to code symbol $s_i$ with length $l_i$ directly from the probability $p_i$: take

$$l_i = \left\lceil \log_r \frac{1}{p_i} \right\rceil
\;\Longrightarrow\;
\log_r \frac{1}{p_i} \le l_i < \log_r \frac{1}{p_i} + 1
\;\Longrightarrow\;
\frac{1}{p_i} \le r^{l_i} < \frac{r}{p_i}
\;\Longrightarrow\;
p_i \ge \frac{1}{r^{l_i}} > \frac{p_i}{r}.$$

Summing this inequality over $i$:

$$1 = \sum_{i=1}^{q} p_i \;\ge\; \sum_{i=1}^{q} \frac{1}{r^{l_i}} = K \;>\; \sum_{i=1}^{q} \frac{p_i}{r} = \frac{1}{r}.$$

The Kraft inequality is satisfied, therefore there is an instantaneous code with these lengths.
Also, summing $\log_r(1/p_i) \le l_i < \log_r(1/p_i) + 1$ multiplied by $p_i$:

$$H_r(S) = \sum_{i=1}^{q} p_i \log_r \frac{1}{p_i}
 \;\le\; \sum_{i=1}^{q} p_i\, l_i = L \;<\; H_r(S) + 1.$$

Example: $p$'s: $\tfrac14, \tfrac14, \tfrac18, \tfrac18, \tfrac18, \tfrac18$; $l$'s: $2, 2, 3, 3, 3, 3$; $K = 1$; $H_2(S) = 2.5$; $L = 5/2$.

[Figure: the complete binary decoding tree for these six codewords.]
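A sketch that computes Shannon-Fano lengths and checks the example above (the helper name is ours):

```python
import math

def shannon_fano_lengths(probs, r=2):
    """l_i = ceil(log_r(1/p_i)) for each symbol probability."""
    return [math.ceil(math.log(1 / p, r)) for p in probs]

probs = [1/4, 1/4, 1/8, 1/8, 1/8, 1/8]
lengths = shannon_fano_lengths(probs)

K = sum(2 ** -l for l in lengths)
H = sum(p * math.log2(1 / p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))

print(lengths)   # [2, 2, 3, 3, 3, 3]
print(K, H, L)   # 1.0, 2.5, 2.5  -> H <= L < H + 1
```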
#### The Entropy of Code Extensions

Recall: the $n$th extension of a source $S = \{s_1, \dots, s_q\}$ with probabilities $p_1, \dots, p_q$ is the set of symbols

$$T = S^n = \{\, s_{i_1} \cdots s_{i_n} : s_{i_j} \in S,\; 1 \le j \le n \,\}$$

(concatenation of symbols, multiplication of probabilities), where $t_i = s_{i_1} \cdots s_{i_n}$ has probability $Q_i = p_{i_1} \cdots p_{i_n}$, assuming independent probabilities. Let $i = (i_1 - 1, \dots, i_n - 1)_q + 1$, an $n$-digit number base $q$. The entropy is:

$$H(S^n) = H(T) = \sum_{i=1}^{q^n} Q_i \log \frac{1}{Q_i}
 = \sum_{i=1}^{q^n} Q_i \log \frac{1}{p_{i_1} \cdots p_{i_n}}
 = \sum_{i=1}^{q^n} Q_i \left( \log \frac{1}{p_{i_1}} + \cdots + \log \frac{1}{p_{i_n}} \right)
 = \sum_{k=1}^{n} \sum_{i=1}^{q^n} Q_i \log \frac{1}{p_{i_k}}.$$
Consider the $k$th term:

$$\sum_{i=1}^{q^n} Q_i \log \frac{1}{p_{i_k}}
 = \sum_{i_1=1}^{q} \cdots \sum_{i_n=1}^{q} p_{i_1} \cdots p_{i_n} \log \frac{1}{p_{i_k}}
 = \left( \sum_{i_k=1}^{q} p_{i_k} \log \frac{1}{p_{i_k}} \right)
   \sum_{\substack{i_1, \dots, i_n = 1 \\ (i_k \text{ omitted})}}^{q}
   p_{i_1} \cdots \hat{p}_{i_k} \cdots p_{i_n}
 = H(S) \cdot 1 = H(S),$$

since $p_{i_1} \cdots \hat{p}_{i_k} \cdots p_{i_n}$ (the factor $p_{i_k}$ omitted) is just a probability in the $(n-1)$st extension, and adding them all up gives $1$. Therefore

$$H(S^n) = n \cdot H(S).$$

Hence the average Shannon-Fano code length $L_n$ for $T$ satisfies:

$$H(T) \le L_n < H(T) + 1
 \;\Longrightarrow\; n\,H(S) \le L_n < n\,H(S) + 1
 \;\Longrightarrow\; H(S) \le \frac{L_n}{n} < H(S) + \frac{1}{n}
 \quad \text{[now let $n$ go to infinity]}.$$
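A numeric check of $H(S^n) = n\,H(S)$ by brute-force enumeration of the extension (a sketch; the distribution is ours):

```python
import math
from itertools import product

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

probs = [2/3, 1/3]
n = 4

# Probabilities of the n-th extension: products over all q**n words.
ext = [math.prod(word) for word in product(probs, repeat=n)]
print(entropy(ext), n * entropy(probs))   # both ~3.6732
```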
#### Extension Example

$S = \{s_1, s_2\}$ with $p_1 = 2/3$, $p_2 = 1/3$:

$$H_2(S) = \tfrac{2}{3}\log_2\tfrac{3}{2} + \tfrac{1}{3}\log_2 3 \approx 0.9182958\ldots$$

Huffman: $s_1 = 0$, $s_2 = 1$; average coded length $= \tfrac23 \cdot 1 + \tfrac13 \cdot 1 = 1$.

Shannon-Fano: $l_1 = 1$, $l_2 = 2$; average length $= \tfrac23 \cdot 1 + \tfrac13 \cdot 2 = 4/3$.

2nd extension: $p_{11} = 4/9$, $p_{12} = p_{21} = 2/9$, $p_{22} = 1/9$. Shannon-Fano: $l_{11} = \lceil \log_2 \tfrac94 \rceil = 2$, $l_{12} = l_{21} = \lceil \log_2 \tfrac92 \rceil = 3$, $l_{22} = \lceil \log_2 9 \rceil = 4$, so the average coded length is

$$L_{SF}^{(2)} = \tfrac49 \cdot 2 + \tfrac29 \cdot 3 \cdot 2 + \tfrac19 \cdot 4 = \tfrac{24}{9} = 2.666\ldots$$

In general, $S^n = (s_1 + s_2)^n$: the probabilities are the corresponding terms in $(p_1 + p_2)^n = \sum_{i=0}^{n} \binom{n}{i} p_1^i\, p_2^{\,n-i}$, so there are $\binom{n}{i}$ symbols with probability $\left(\tfrac23\right)^i \left(\tfrac13\right)^{n-i} = \tfrac{2^i}{3^n}$. The corresponding Shannon-Fano length is

$$\left\lceil \log_2 \frac{3^n}{2^i} \right\rceil = \lceil n \log_2 3 - i \rceil = \lceil n \log_2 3 \rceil - i.$$
#### Extension Example (cont.)

$$L_{SF}^{(n)} = \sum_{i=0}^{n} \binom{n}{i} \frac{2^i}{3^n}
    \left( \lceil n \log_2 3 \rceil - i \right)
 = \frac{1}{3^n} \left( \lceil n \log_2 3 \rceil \sum_{i=0}^{n} \binom{n}{i} 2^i
    - \sum_{i=0}^{n} \binom{n}{i} 2^i\, i \right)
 = \frac{1}{3^n} \left( \lceil n \log_2 3 \rceil\, 3^n - 2n\,3^{n-1} \right),$$

using $(2+1)^n = 3^n$ and $\sum_{i=0}^{n} \binom{n}{i} 2^i\, i = 2n\,3^{n-1}$ (*). Hence

$$\frac{L_{SF}^{(n)}}{n} = \frac{\lceil n \log_2 3 \rceil}{n} - \frac{2}{3}
 \;\longrightarrow\; \log_2 3 - \frac{2}{3} = H_2(S) \quad \text{as } n \to \infty.$$

(*) Differentiate $(2 + x)^n = \sum_{i=0}^{n} \binom{n}{i} 2^i x^{n-i}$ with respect to $x$:

$$n (2 + x)^{n-1} = \sum_{i=0}^{n} \binom{n}{i} 2^i (n - i)\, x^{n-i-1},$$

and set $x = 1$:

$$n\,3^{n-1} = \sum_{i=0}^{n} \binom{n}{i} 2^i (n - i)
 = n \sum_{i=0}^{n} \binom{n}{i} 2^i - \sum_{i=0}^{n} \binom{n}{i} 2^i\, i
 = n\,3^n - \sum_{i=0}^{n} \binom{n}{i} 2^i\, i,$$

so $\sum_{i=0}^{n} \binom{n}{i} 2^i\, i = n\,3^n - n\,3^{n-1} = 2n\,3^{n-1}$.
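A sketch that evaluates $L_{SF}^{(n)}/n$ for increasing $n$ and watches it approach $H_2(S) = \log_2 3 - 2/3$ (the function name is ours):

```python
import math

def avg_sf_length_per_symbol(n: int) -> float:
    """L_SF^(n) / n for the source with p1 = 2/3, p2 = 1/3."""
    c = math.ceil(n * math.log2(3))
    total = sum(math.comb(n, i) * (2 ** i / 3 ** n) * (c - i) for i in range(n + 1))
    return total / n

H = math.log2(3) - 2 / 3      # = H2(S) ~ 0.9183
for n in (1, 2, 4, 8, 16, 64):
    print(n, avg_sf_length_per_symbol(n))
# The values approach H2(S) ~ 0.9183 as n grows.
```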
#### Markov Process Entropy

Let $p(s_i \mid s_{i_1} \cdots s_{i_m})$ be the conditional probability that $s_i$ follows $s_{i_1} \cdots s_{i_m}$. For an $m$th-order process, think of letting the state be $s = (s_{i_1}, \dots, s_{i_m})$. Hence

$$I(s_i \mid s) = \log \frac{1}{p(s_i \mid s)}, \qquad
  H(S \mid s) = \sum_{s_i \in S} p(s_i \mid s)\, I(s_i \mid s).$$

Now let $p(s)$ be the probability of being in state $s$. Then

$$H(S) = \sum_{s \in S^m} p(s)\, H(S \mid s)
 = \sum_{s \in S^m} \sum_{s_i \in S} p(s)\, p(s_i \mid s)\, I(s_i \mid s)
 = \sum_{s, s_i \in S^{m+1}} p(s, s_i)\, I(s_i \mid s)
 = \sum_{s, s_i \in S^{m+1}} p(s, s_i) \log \frac{1}{p(s_i \mid s)}.$$
#### Example

A second-order binary Markov process: the state is the previous two bits, and the transition probabilities (see the table below) are $0.8$, $0.2$, and $0.5$.

[Figure: state diagram on the four states $0{,}0$; $0{,}1$; $1{,}0$; $1{,}1$ with transition probabilities $0.8$, $0.2$, $0.5$, $0.5$.]

Equilibrium probabilities: $p(0,0) = p(1,1) = 5/14$, $p(0,1) = p(1,0) = 2/14$.

| $s_{i_1}$ | $s_{i_2}$ | $s_i$ | $p(s_i \mid s_{i_1}, s_{i_2})$ | $p(s_{i_1}, s_{i_2})$ | $p(s_{i_1}, s_{i_2}, s_i)$ |
|---|---|---|------|------|------|
| 0 | 0 | 0 | 0.8 | 5/14 | 4/14 |
| 0 | 0 | 1 | 0.2 | 5/14 | 1/14 |
| 0 | 1 | 0 | 0.5 | 2/14 | 1/14 |
| 0 | 1 | 1 | 0.5 | 2/14 | 1/14 |
| 1 | 0 | 0 | 0.5 | 2/14 | 1/14 |
| 1 | 0 | 1 | 0.5 | 2/14 | 1/14 |
| 1 | 1 | 0 | 0.2 | 5/14 | 1/14 |
| 1 | 1 | 1 | 0.8 | 5/14 | 4/14 |

$$H_2(S) = \sum_{\{0,1\}^3} p(s_{i_1}, s_{i_2}, s_i) \log_2 \frac{1}{p(s_i \mid s_{i_1}, s_{i_2})}
 = 2 \cdot \frac{4}{14} \log_2 \frac{1}{0.8}
 + 2 \cdot \frac{1}{14} \log_2 \frac{1}{0.2}
 + 4 \cdot \frac{1}{14} \log_2 \frac{1}{0.5}
 \approx 0.801377.$$
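A direct computation of this entropy from the table (a sketch):

```python
import math
from fractions import Fraction

# (p(s_i | state), p(state, s_i)) for each of the 8 rows of the table.
rows = [
    (0.8, Fraction(4, 14)), (0.2, Fraction(1, 14)),
    (0.5, Fraction(1, 14)), (0.5, Fraction(1, 14)),
    (0.5, Fraction(1, 14)), (0.5, Fraction(1, 14)),
    (0.2, Fraction(1, 14)), (0.8, Fraction(4, 14)),
]

H = sum(float(joint) * math.log2(1 / cond) for cond, joint in rows)
print(H)   # ~0.801377 bits per symbol
```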
#### The Fibonacci Numbers

Let $f_0 = 1$, $f_1 = 2$, $f_2 = 3$, $f_3 = 5$, $f_4 = 8$, ... be defined by $f_{n+1} = f_n + f_{n-1}$. Then

$$\lim_{n \to \infty} \frac{f_{n+1}}{f_n} = \frac{1 + \sqrt 5}{2} = \varphi,$$

the golden ratio, a root of the equation $x^2 = x + 1$. Use these numbers as the weights for a system of number representation with digits 0 and 1, without adjacent 1's (because $(100)_\varphi = (11)_\varphi$).

Base Fibonacci Representation Theorem: every number from $0$ to $f_n - 1$ can be uniquely written as an $n$-bit number with no adjacent ones.

Existence. Basis: $n = 0$, $0 \le i \le 0$: $0$ is represented by the empty string $\varepsilon$. Induction: let $0 \le i < f_{n+1}$. If $i < f_n$, we are done by the induction hypothesis. Otherwise $f_n \le i < f_{n+1} = f_{n-1} + f_n$, so $0 \le i - f_n < f_{n-1}$, and it is representable as $i - f_n = (b_{n-2} \cdots b_0)_\varphi$ with $b_j \in \{0,1\}$ and no $b_j = b_{j+1} = 1$. Hence $i = (1\,0\,b_{n-2} \cdots b_0)_\varphi$, which also has no adjacent ones.

Uniqueness. Let $i$ be the smallest number $\ge 0$ with two distinct representations (no leading zeros): $i = (b_{n-1} \cdots b_0)_\varphi = (b'_{n-1} \cdots b'_0)_\varphi$. By the minimality of $i$, $b_{n-1} \ne b'_{n-1}$, so without loss of generality let $b_{n-1} = 1$ and $b'_{n-1} = 0$; this implies $(b'_{n-2} \cdots b'_0)_\varphi \ge f_{n-1}$, which cannot be true (the largest value representable in $n-1$ bits is $f_{n-1} - 1$).
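A sketch of the existence argument as code (the function names are ours): greedily subtracting the largest Fibonacci weight realizes the same peeling-off step as the proof, produces no adjacent ones, and counting distinct codes confirms uniqueness.

```python
def fib_weights(n):
    """Weights f_0 = 1, f_1 = 2, ..., f_{n-1}, with f_{k+1} = f_k + f_{k-1}."""
    f = [1, 2]
    while len(f) < n:
        f.append(f[-1] + f[-2])
    return f[:n]

def to_fib(i, n):
    """n-bit base-Fibonacci representation of i (0 <= i < f_n), MSB first."""
    bits = []
    for w in reversed(fib_weights(n)):
        if w <= i:
            bits.append(1)
            i -= w
        else:
            bits.append(0)
    return bits

n = 8
f_n = fib_weights(n + 1)[-1]        # f_8 = 55
reps = {tuple(to_fib(i, n)) for i in range(f_n)}
assert len(reps) == f_n             # every value 0..f_n - 1 gets a distinct code
assert all(not (b[k] and b[k + 1]) for b in reps for k in range(n - 1))
print(to_fib(19, n))                # 19 = 13 + 5 + 1 -> [0, 0, 1, 0, 1, 0, 0, 1]
```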
#### Base Fibonacci

The golden ratio $\varphi = (1+\sqrt 5)/2$ is a solution to $x^2 - x - 1 = 0$ and is equal to the limit of the ratio of adjacent Fibonacci numbers.

For comparison, a memoryless source emitting the $r$ digits $0, \dots, r-1$, each with probability $1/r$, has entropy $H_2 = \log_2 r$.

For base Fibonacci, model the "no adjacent 1's" constraint as a first-order Markov process: after a 0, emit 0 with probability $1/\varphi$ or 1 with probability $1/\varphi^2$ (note $1/\varphi + 1/\varphi^2 = 1$); after a 1, the next digit must be 0. Equivalently, think of the source as emitting the variable-length symbols 0 (with probability $1/\varphi$) and 10 (with probability $1/\varphi^2$).

[Figure: the two-state transition diagram with these probabilities.]

Taking the variable-length symbols into account,

$$\text{Entropy} = \frac{1}{\varphi} \log \varphi
 + \frac{1}{2} \cdot \frac{1}{\varphi^2} \log \varphi^2 = \log \varphi,$$

which is maximal.
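A numeric check that this first-order process does carry $\log_2 \varphi$ bits per binary digit (a sketch; it computes the chain's stationary distribution and entropy rate directly):

```python
import math

phi = (1 + math.sqrt(5)) / 2

# Transition probabilities of the first-order chain:
#   state 0: emit 0 w.p. 1/phi, emit 1 w.p. 1/phi^2;  state 1: emit 0 w.p. 1.
p0_0, p0_1 = 1 / phi, 1 / phi**2

# Stationary distribution: pi_1 = pi_0 * p0_1 and pi_0 + pi_1 = 1.
pi_0 = 1 / (1 + p0_1)
pi_1 = 1 - pi_0

# Entropy rate = sum over states of pi_s * H(next digit | state).
h_state0 = p0_0 * math.log2(1 / p0_0) + p0_1 * math.log2(1 / p0_1)
h_state1 = 0.0                        # the next digit is forced to be 0
rate = pi_0 * h_state0 + pi_1 * h_state1

print(rate, math.log2(phi))           # both ~0.6942 bits per digit
```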