### Chapter 6 Information Theory

```text
6.1 Mathematical models for information source

• Discrete source
  - Alphabet X = {x_1, x_2, ..., x_L}
  - Symbol probabilities p_k = P[X = x_k], with sum_{k=1}^{L} p_k = 1
• Discrete memoryless source (DMS)
  - Source outputs are independent random variables {X_i}, i = 1, 2, ..., N
• Discrete stationary source
  - Source outputs are statistically dependent
  - Stationary: the joint probabilities of (x_1, x_2, ..., x_n) and (x_{1+m}, x_{2+m}, ..., x_{n+m}) are identical for all shifts m
  - Characterized by the joint PMF p(x_1, x_2, ..., x_m)
6.2 Measure of information

• Entropy of a random variable X
  - A measure of the uncertainty or ambiguity in X
  - For X ∈ {x_1, x_2, ..., x_L}:
      H(X) = - sum_{k=1}^{L} P[X = x_k] log P[X = x_k]
  - A measure of the information required to specify X, i.e., the information content of X per symbol
  - Units: bits (log base 2) or nats (log base e) per symbol
  - We define 0 log 0 = 0
  - Entropy depends only on the probabilities of X, not on its values
Shannon's fundamental paper in 1948, "A Mathematical Theory of Communication"

Can we define a quantity which will measure how much information is "produced" by a process?

He wants this measure H(p_1, p_2, ..., p_n) to satisfy:
1) H should be continuous in the p_i
2) If all p_i are equal (p_i = 1/n), H should be monotonically increasing with n
3) If a choice can be broken down into two successive choices, the original H should be the weighted sum of the individual values of H
Example of assumption 3:

  H(1/2, 1/3, 1/6) = H(1/2, 1/2) + (1/2) H(2/3, 1/3)

First choose between two alternatives with probabilities (1/2, 1/2); then, half of the time, make a second choice with probabilities (2/3, 1/3).
The only H satisfying the three assumptions is of the form

  H = -K sum_{i=1}^{n} p_i log p_i

where K is a positive constant.
Binary entropy function

  H(X) = -p log p - (1 - p) log(1 - p),  where p = P[X = 1]

[Figure: H(p) versus probability p]
  H = 0: no uncertainty (p = 0 or 1)
  H = 1: maximum uncertainty (p = 1/2); 1 bit for binary information
Mutual information

• For two discrete random variables X and Y:
  I(X;Y) = sum_{x,y} P[X = x, Y = y] I(x; y)
         = sum_{x,y} P[X = x, Y = y] log( P[x | y] / P[x] )
         = sum_{x,y} P[X = x, Y = y] log( P[x, y] / (P[x] P[y]) )
• Measures the information that knowing either variable provides about the other
• What if X and Y are fully independent or fully dependent? (Independent: I(X;Y) = 0. Fully dependent, i.e., each determines the other: I(X;Y) = H(X) = H(Y).)
  I(X;Y) = H(X) - H(X|Y)
         = H(Y) - H(Y|X)
         = H(X) + H(Y) - H(X,Y)
Some properties

  I(X;Y) = I(Y;X)
  I(X;Y) ≥ 0
  I(X;X) = H(X)
  I(X;Y) ≤ min{ H(X), H(Y) }
  0 ≤ H(X) ≤ log|X|   (entropy is maximized when all probabilities are equal)
  If Y = g(X), then H(Y) ≤ H(X)
Joint and conditional entropy

• Joint entropy
  H(X,Y) = - sum_{x,y} P[X = x, Y = y] log P[X = x, Y = y]
• Conditional entropy of Y given X
  H(Y|X) = sum_x P[X = x] H(Y | X = x)
         = - sum_{x,y} P[X = x, Y = y] log P[Y = y | X = x]
• Chain rule for entropies
  H(X_1, X_2, ..., X_n) = H(X_1) + H(X_2 | X_1) + H(X_3 | X_1, X_2) + ... + H(X_n | X_1, X_2, ..., X_{n-1})
• Since conditioning cannot increase entropy,
  H(X_1, X_2, ..., X_n) ≤ sum_{i=1}^{n} H(X_i)
• If the X_i are i.i.d.,
  H(X_1, X_2, ..., X_n) = n H(X)
6.3 Lossless coding of information source

• Source sequence of length n (n assumed large):
  x = [X_1, X_2, ..., X_n],  each X_i ∈ {x_1, x_2, ..., x_L},  p_i = P[X = x_i]
• Without any source coding, we need log L bits per symbol
Lossless source coding

• Typical sequences
  - The number of occurrences of x_i in x is roughly n p_i
  - As n → ∞, almost every output x becomes "typical":
      log P[x] ≈ log prod_{i=1}^{L} (p_i)^{n p_i} = sum_{i=1}^{L} n p_i log p_i = -n H(X)
    so
      P[x] ≈ 2^{-n H(X)}
  - All typical sequences have the same probability when n → ∞
• Number of typical sequences ≈ 1 / P[x] = 2^{n H(X)}
• Since the typical sequences are almost certain to occur, it is sufficient to consider only these typical sequences for the source output
• How many bits per symbol do we need now?
  R = n H(X) / n = H(X) ≤ log L
Shannon's First Theorem - Lossless Source Coding

Let X denote a discrete memoryless source. There exists a lossless source code for X at rate R if

  R ≥ H(X)   [bits per transmission]
For a discrete stationary source, the same holds with the entropy rate:

  R ≥ H_∞(X) = lim_{k→∞} (1/k) H(X_1, X_2, ..., X_k)
             = lim_{k→∞} H(X_k | X_1, X_2, ..., X_{k-1})
Lossless source coding algorithms

• Variable-length coding
  - Symbols with higher probability are assigned shorter code words
  - Choose the code word lengths {n_k} to minimize R = sum_{k=1}^{L} n_k P(x_k)
  - E.g., Huffman coding
• Fixed-length coding
  - E.g., Lempel-Ziv coding
Huffman coding algorithm (example)

[Figure: Huffman tree built from P(x_1), ..., P(x_7)]

  Symbol  Codeword
  x1      00
  x2      01
  x3      10
  x4      110
  x5      1110
  x6      11110
  x7      11111

  H(X) = 2.11, R = 2.21 bits per symbol
6.5 Channel models and channel capacity

• Channel models
  - Input sequence x = (x_1, x_2, ..., x_n), output sequence y = (y_1, y_2, ..., y_n)
  - A channel is memoryless if
      P[y | x] = prod_{i=1}^{n} P[y_i | x_i]
Binary symmetric channel (BSC) model

[Diagram: Source data → Channel encoder → Binary modulator → Channel → Demodulator and detector → Channel decoder → Output data]

Everything between the channel encoder's output and the channel decoder's input forms a composite discrete-input, discrete-output channel.
[Diagram: BSC transition graph. Input 0 → output 0 with probability 1-p and → output 1 with probability p; input 1 → output 1 with probability 1-p and → output 0 with probability p]

  P[Y = 0 | X = 1] = P[Y = 1 | X = 0] = p
  P[Y = 1 | X = 1] = P[Y = 0 | X = 0] = 1 - p
Discrete memoryless channel (DMC)

[Diagram: inputs {X} = {x_0, x_1, ..., x_{M-1}} connected to outputs {Y} = {y_0, y_1, ..., y_{Q-1}}]

The transition probabilities P[y | x] can be arranged in a matrix.
Discrete-input, continuous-output channel

  Y = X + N

If N is additive white Gaussian noise with variance σ²:

  p(y | x) = (1 / sqrt(2πσ²)) e^{-(y - x)² / (2σ²)}

For a memoryless channel,

  p(y_1, y_2, ..., y_n | x_1, x_2, ..., x_n) = prod_{i=1}^{n} p(y_i | x_i)
Discrete-time AWGN channel

  y_i = x_i + n_i

• Power constraint: E[X²] ≤ P
• For an input sequence x = (x_1, x_2, ..., x_n) with large n,
  (1/n) sum_{i=1}^{n} x_i² = (1/n) ||x||² ≤ P
AWGN waveform channel

[Diagram: Source data → Channel encoder → Modulator → input waveform x(t) → Physical channel → output waveform y(t) → Demodulator and detector → Channel decoder → Output data]

• Assume the channel has bandwidth W, with frequency response C(f) = 1 for f ∈ [-W, +W]:
  y(t) = x(t) + n(t)
• Power constraint: E[X²(t)] ≤ P, i.e.,
  lim_{T→∞} (1/T) ∫_{-T/2}^{T/2} x²(t) dt ≤ P
• How do we define probabilities that characterize the channel?
  Expand all waveforms over an orthonormal basis {φ_j(t), j = 1, 2, ..., 2WT}:
    x(t) = sum_j x_j φ_j(t),   n(t) = sum_j n_j φ_j(t),   y(t) = sum_j y_j φ_j(t)
    y_j = x_j + n_j
  This is equivalent to 2W uses per second of a discrete-time channel.
• The power constraint becomes
  lim_{T→∞} (1/T) ∫_{-T/2}^{T/2} x²(t) dt = lim_{T→∞} (1/T) sum_{j=1}^{2WT} x_j²
    = lim_{T→∞} (1/T) · 2WT · E[X²] = 2W E[X²] ≤ P
• Hence,
  E[X²] ≤ P / (2W)
Channel capacity

• After source coding, we have a binary sequence of length n
• The channel causes bit errors with probability p
• As n → ∞, the number of sequences with np errors is
  C(n, np) = n! / ((np)! (n(1 - p))!) ≈ 2^{n H_b(p)}
• To reduce errors, we use only a subset of all possible sequences:
  M = 2^n / 2^{n H_b(p)} = 2^{n(1 - H_b(p))}
• Information rate [bits per transmission]:
  R = (1/n) log2 M = 1 - H_b(p)   ← capacity of the binary channel
  0 ≤ R = 1 - H_b(p) ≤ 1

We cannot transmit more than 1 bit per channel use.

Channel encoder: add redundancy
  - 2^n different binary sequences of length n contain the information
  - We use 2^m different binary sequences of length m (m > n) for transmission
• Capacity of an arbitrary discrete memoryless channel:
  C = max_p I(X;Y)
• Maximize the mutual information between input and output over all input distributions p = (p_1, p_2, ..., p_{|X|})
• Shannon's Second Theorem - noisy channel coding
  - R < C: reliable communication is possible
  - R > C: reliable communication is impossible
For the binary symmetric channel, the capacity-achieving input is P[X = 1] = P[X = 0] = 1/2, giving

  C = 1 + p log2 p + (1 - p) log2(1 - p) = 1 - H_b(p)
Discrete-time AWGN channel with input power constraint E[X²] ≤ P:

  Y = X + N,  N Gaussian with variance σ²

For large n,

  (1/n) ||y||² ≈ E[X²] + E[N²] = P + σ²
  (1/n) ||y - x||² = (1/n) ||n||² ≈ σ²
Maximum number of sequences that can be reliably distinguished (sphere packing):

  M = ( sqrt(n(P + σ²)) / sqrt(nσ²) )^n = (1 + P/σ²)^{n/2}

Transmission rate:

  R = (1/n) log2 M = (1/2) log2( 1 + P/σ² )

The same result can be obtained by directly maximizing I(X;Y) subject to the power constraint.
Band-limited waveform AWGN channel with input power constraint
  - Equivalent to 2W uses per second of the discrete-time channel, with per-use power P/(2W) and noise variance N_0/2:

  C = (1/2) log2( 1 + (P/(2W)) / (N_0/2) ) = (1/2) log2( 1 + P/(N_0 W) )   bits/channel use

  C = 2W · (1/2) log2( 1 + P/(N_0 W) ) = W log2( 1 + P/(N_0 W) )   bits/s
  C = W log2( 1 + P/(N_0 W) )

As W → ∞, capacity approaches a finite limit:

  C → (P/N_0) log2 e ≈ 1.44 P/N_0
• Bandwidth efficiency
  r = R/W ≤ log2( 1 + P/(N_0 W) )   [bits/s/Hz]
  Energy per bit: ε_b = P T_s / log2 M = P/R, so
  r = log2( 1 + ε_b R/(N_0 W) ) = log2( 1 + r ε_b/N_0 )
• Relation between bandwidth efficiency and power efficiency:
  ε_b/N_0 = (2^r - 1) / r
  As r → 0,  ε_b/N_0 → ln 2 = -1.6 dB   (the Shannon limit)
```