### Information and Coding Theory lecture slides

```Information and Coding Theory
Dr Costas Constantinou
School of Electronic, Electrical & Computer Engineering
University of Birmingham
W: www.eee.bham.ac.uk/ConstantinouCC/
E: [email protected]
Recommended textbook
• Simon Haykin and Michael Moher,
Communication Systems, 5th Edition, Wiley,
2010; ISBN:970-0-470-16996-4
Information theory
• Claude E Shannon (1948) A Mathematical Theory of
Communication, The Bell System Technical Journal,
Vol. 27, pp. 379–423, 623–656, July, October
– How to describe information sources
– How to represent information from a source
– How to transmit information over a channel
Information sources
• Consider the following forecast:
– Birmingham on 15 July at midday; cloudy
– Birmingham on 15 July at midday; sunny
• The forecast is an information source; the information has two
outcomes, cloudy and sunny
• Now consider the following forecast:
– Cairo on 15 July at midday; cloudy
– Cairo on 15 July at midday; sunny
• This forecast is another information source; the information also
has two outcomes, cloudy and sunny. Since we expect it to be
sunny, the latter outcome has little information since it is a highly
probable outcome; conversely the cloudy contains a lot of
information since it is highly improbable outcome
Information sources
• The information content, I, of an outcome (or
information symbol), xi, occurring with probability, pi,
must obey the following properties:
1.
2.
3.
4.
I  x i   0,
fo r all 0  p i  1
If P  x i , x j   p i p j
then
I  xi , x j   I  xi   I  x j 
I  x i  is a co n tin u o u s fu n ctio n o f 0  p i  1
I  xi   I  x j  ,
if 0  p i  p j  1
Information sources
•
•
A measure of the
information contained in
an outcome and satisfying
all of the earlier properties
was introduced by Hartley
in 1927
Hartley defined the
information (sometimes
called self-information)
contained in an outcome,
xi, as,
I  xi 
lo g 2 1 p i    lo g 2 p i
Properties of logarithms
log a b  c  a  b
log a a  1
log a x  log a y  log a  xy 
log a 1  0
lo g a x  lo g a y  lo g a  x y 
d ln x
c
dx
log a x  n log a x
n
log a
q
x 
log a x 
1
q
d log a x
log a x
log b x
log b a


ln x
ln a
dx
1
x
1

x ln a
 ln xdx  x ln x  x  constant
 log a xdx 
x ln x  x
ln a
 constant
Information sources
• The choice of the logarithm base 2, and the arbitrary
multiplicative constant of 1, simply give the unit of
information, in this case the bit
• Choosing a logarithm base of 10 gives the unit of Hartley
• Choosing a logarithm base of e (natural logarithms) gives the
unit of nat
• Hartley's measure defines the information in a single outcome
– we are interested in the information content of a source
Information sources
• The measure entropy, H(X), defines the information content
of the source X as a whole
• It is the mean information provided by the source per
outcome, or symbol
H X

I  xi 
 xi 


i
p i I  x i     p i log 2 p i
i
• It is measured in bits/symbol
• A binary symmetric source (BSS) is a source with two outputs
whose probabilities are p and 1 – p respectively
Information sources
• The weather forecast we
discussed earlier is a BSS
• The entropy of this information
source is:
H X
   p log 2
p
 1  p  log 2 1  p 
• When either outcome becomes
certain (p = 0 or p = 1), the
entropy takes the value 0
• The entropy becomes
maximum when p = 1 – p = ½
Information sources
• An equiprobable BSS has an entropy, or average information
content per symbol, of 1 bit per symbol
• Warning: Engineers use the word bit to describe both the
symbol, and its information content [confusing]
– A BSS whose outputs are 1 or 0 has an output we describe as a bit
– The entropy of the source is also measured in bits, so that we might
say the equi-probable BSS has an information rate of 1 bit/bit
– The numerator bit refers to the information content; the denominator
bit refers to the symbol 1 or 0
different
Information sources
• For a BSS, the entropy is maximised when both outcomes are
equally likely
• This property is generally true for any number of symbols
• If an information source X has k symbols, its maximum
entropy is log2k and this is obtained when all k outcomes are
equally likely
• Thus, for a k symbol source:
0 H
X 
lo g 2 k
• The redundancy of an information source is defined as,
R edundancy  1 
X 
H m ax  X 
H
1
H
X 
log 2 k
Information sources
• Redundancy – natural languages
– A natural language is that used by man, in spoken or
written form, such as, English, French, Chinese
– All natural languages have some redundancy to permit
understanding in difficult circumstances, such as a noisy
environment
– This redundancy can be at symbol (letter) level, or word
level
– Here we will calculate redundancy at symbol level
Information sources
• The maximum entropy occurs
when all symbols have equal
probability
• Thus in English, with 27 letters
plus a space,
H m ax  log 2 27 
ln 27
 4.75
ln 2
• Assuming the symbols/letters
in the English language are
independent and their
frequencies are given in the
table,
H  4.18
•

R edundancy  13%
Letter
Probability
Letter
Probability
A
0.082
N
0.067
B
0.015
O
0.075
C
0.028
P
0.019
D
0.043
Q
0.001
E
0.127
R
0.060
F
0.022
S
0.063
G
0.020
T
0.091
H
0.061
U
0.028
I
0.070
V
0.010
J
0.002
W
0.023
K
0.008
X
0.001
L
0.040
Y
0.020
M
0.024
Z
0.001
Note: H is really much smaller, between 0.6 and 1.3, as listeners can guess the
next letter in many cases
Source coding
• It is intuitively reasonable that an information source of
entropy H needs on average only H binary bits to represent
each symbol
• The equiprobable BSS generates on average 1 information
bit per symbol bit
• Consider the Birmingham weather forecast again:
– Suppose the probability of cloud (C) is 0.1 and that of sun (S) 0.9
– The earlier graph shows that this source has an entropy of 0.47
bits/symbol
– Suppose we identify cloud with 1 and sun with 0
– This representation uses 1 binary bit per symbol
– Thus we are using more binary bits per symbol than the entropy
suggests is necessary
Source coding
• The replacement of the symbols rain/no-rain with a binary
representation is called source coding
• In any coding operation we replace the symbol with a
codeword
• The purpose of source coding is to reduce the number of
bits required to convey the information provided by the
information source (reduce redundancy)
• Central to source coding is the use of sequences
• By this, we mean that codewords are not simply associated
to a single outcome, but to a sequence of outcomes
• Suppose we group the outcomes in threes (e.g. for 15, 16
and 17 July), according to their probability, and assign
binary codewords to these grouped outcomes
Source coding
• The table here shows such a
variable length code and
the probability of each
codeword occurring for our
weather forecasting
example
• It is easy to compute that
this code will on average
use 1.2 bits/sequence
• Each sequence contains
three symbols, so the code
uses 0.4 bits/symbol
Sequence
Probability
Codeword
SSS
0.729
0
SSC
0.081
1
SCS
0.081
01
CSS
0.081
10
CCS
0.009
11
CSC
0.009
00
SCC
0.009
000
CCC
0.001
111
Source coding
• Using sequences decreases the average number of bits per symbol
• We have found a code that has an average bit usage less than the
source entropy – how is this possible?
• BUT, there is a difficulty with the previous code:
– Before a codeword can be decoded, it must be parsed
– Parsing describes the activity of breaking a message string into its
component codewords
– After parsing, each codeword can be decoded into its symbol sequence
– An instantaneously parseable code is one that can be parsed as soon as
the last bit of a codeword is received
– An instantaneous code must satisfy the prefix condition: No codeword
may be a prefix of any other codeword
• This condition is not satisfied by our code
– The sequence 011 could be 0-11 (i.e. SSS followed by CCS) or 01-1 (i.e. SCS
followed by SSC) and is therefore not decodable unambiguously
Source coding
• The table here shows an
instantaneously parseable
variable length code and it
satisfies the prefix condition
• It is easy to compute that this
code uses on average 1.6
bits/sequence
• The code uses 0.53
bits/symbol
• This is a 47% improvement on
identifying each symbol with a
bit
• This is the Huffman code for
the sequence set
Sequence
Probability
Codeword
SSS
0.729
1
SSC
0.081
011
SCS
0.081
010
CSS
0.081
001
CCS
0.009
00011
CSC
0.009
00010
SCC
0.009
00001
CCC
0.001
00000
Source coding
• The code for each sequence is found by generating the
Huffman code tree for the sequence
• A Huffman code tree is an unbalanced binary tree
• The derivation of the Huffman code tree is shown in the next
slide
– STEP 1: Have the n symbols ordered according to non-increasing
values of their probabilities
– STEP 2: Group the last two symbols xn-1 and xn into an equivalent
“symbol” with probability pn-1 + pn
– STEP 3: Repeat steps 1 and 2 until there is only one “symbol” left
– STEP 4: Looking at the tree originated by the above steps (see next
slide), associate the binary symbols 0 and 1 to each pair of branches at
each intermediate node – the code word for each symbol can be read
as the binary sequence found when starting at the root of the tree and
reaching the terminal leaf node associated with that symbol
Source coding
Probability Sequence
(0.729)
0.729
SSC
0.081
SSC
(0.081)
0.081
SCS
(0.081)
0.081
CSS
(0.081)
0.009
CCS
0.009
CSC
0.009
SCC
0.001
CCC
STEPS 1, 2 and 3
Source coding
1
SSS
(1)
1
SSC
(011)
0
1
0
SCS
(010)
STEP 4
0
0
1
CSS
(001)
1
1
CCS
(00011)
0
CSC
(00010)
0
1
SCC
(00001)
0
CCC
(00000)
Source coding
• Huffman coding relies on the fact that both the transmitter and the
receiver know the sequence set (or data set) before communicating
and can build the code table
• The noiseless source coding theorem (also called Shannon’s first
theorem) states that an instantaneous code can be found that
encodes a source of entropy, H(X), with an average number of bits
per symbol, Bs, such that,
Bs  H
X 
• The longer the sequences of symbols, the closer Bs will be to H(X)
• The theorem does not tell us how to find the code
• The theorem is useful nevertheless: in our example, the Huffman
code uses 0.53 bits/symbol which is only 11% more than the best
we can do (0.47 bits/symbol)
Communication channels
• A discrete channel is
characterised by an input
alphabet, X, and output
alphabet, Y, and a set of
conditional probabilities,
pij = P(yj|xi)
• We assume that the
channel is memoryless, i.e.
for a binary channel (BC),
P(y1,y2| x1,x2) = p11p12p21p22
Communication channels
• The special case of the
binary symmetric channel
(BSC) is of particular
interest, i.e. p12 = p21 = p
• For the BSC, the probability
of error is,
P e
P  x1 , y 2   P  x 2 , y 1 
 P  x1  P  y 2 | x1   P  x 2  P  y1 | x 2 
 P  x1  p12  P  x 2  p 21
 p  P  x1   P  x 2    p
Communication channels
• We now have a number
of entropies to consider:
– Input entropy, which
measures the average
information content of
the input alphabet
– Output entropy, which
measures the average
information content of
the output alphabet
H
 X     P  x i  log 2 P  x i 
i
H Y
  
i
P  y i  lo g 2 P  y i 
Communication channels
– The joint entropy, which measures the average information
content of a pair of input and output symbols, or
equivalently the average uncertainty of the
communication system formed by the input alphabet,
channel and output alphabet as a whole
H  X ,Y
     P  x i , y j  log 2
i
j
P  xi , y j 
Communication channels
– The conditional entropy, H(Y|X), which measures the
average information quantity needed to specify the output
symbol when the input symbol is known
H Y | X
     P  x i , y j  log 2 P  y j
i
| xi 
j
– The conditional entropy, H(X|Y), which measures the
average information quantity needed to specify the input
symbol, when the output or received symbol is known
H X |Y
     P  x i , y j  log 2 P  x j
i
| yi 
j
– H(X|Y) is the average amount of information lost in the
channel and is called equivocation
Communication channels
• The following relationships between entropies hold
H
 X ,Y  
H
 X ,Y  
H
 X   H Y
|X

H Y   H
X
|Y
• The information flow through, or mutual information of the
channel is defined by,
I  X ;Y

H
X  H X
|Y

• Therefore,
I  X ;Y

H Y   H Y | X

H
 X   H Y   H  X , Y 
• The capacity of a channel is defined by
C
m ax I  X ; Y
P  X



Communication channels
• Making use of
P  xi  
 Px , y 
i
j
j
H X
    P  x i  log 2 P  x i      P  x i , y j  log 2 P  x i 
i
i
j
• Combining this with the expression for H(X|Y), gives,
I  X ;Y

H X  H X |Y  

i
j
 P  xi | y j  

P  x i , y j  log 2 
 P  x i  
Communication channels
• In the case of a BSC, I(X;Y) is maximised when H(X) is
maximised, i.e. when P(x1) = P(x2) = ½
C  I  X ;Y
 Px
0
  P  x1   1 2
• For a BSC, we have,
P  y 0 | x1   P  y1 | x 0   p
P  y 0 | x 0   P  y1 | x1   1  p
• Direct substitution of the above transition probabilities into
the last expression for I(X,Y), and making use of Baye’s
theorem, P(xi,yj) = P(yj|xi)P(xi) = P(xi|yj)P(yj), gives,
C  1  p lo g 2 p   1  p  lo g 2 1  p 
Communication channels
• We can now see how the BSC
capacity varies as a function of
the error or transition
probability, p
• When the error probability is
0, we have a maximum
channel capacity, equal to the
binary symmetric source
information content
• When the error probability is
0.5, we are receiving totally
random symbols and all
information is lost in the
channel
Channel coding
• Source coding represent the source information with the
minimum of symbols – minimises redundancy
• When a code is transmitted over an imperfect channel (e.g. in
the presence of noise), errors will occur
• Channel coding represents the source information in a
manner that minimises the error probability in decoding
Channel coding
• We confine our attention to block codes: we map each k-bit
block to a n-bit block, with n > k, e.g.:
–
–
–
–
00 to 000000
01 to 101010
10 to 010101
11 to 111111
• The Hamming distance between codewords is defined as the
number of bit reversals required to transform one codeword
into another
– The Hamming distance of the above code is 3
– If the probability of bit error (1 bit reversal) is 0.1, then when we
receive 000100 we know with a certainty of 99% that 000000 was sent
Channel coding
• The ratio r = k/n is called the code rate
• Can there exist a sophisticated channel coding scheme such
that the probability that a message bit will be in error is less
than an arbitrarily small positive number?
• The source emits one symbol per TS seconds, then its average
information rate is H/TS bits per second
• These symbols are transmitted through a discrete memoryless
channel of capacity C and the channel can be used once every
TC seconds
Channel coding
• Shannon’s channel coding theorem (2nd theorem) states that:
If H/TS  C/TC there exists a coding scheme with an arbitrarily small
probability of error (equality implies signalling at the critical rate)
b) If H/TS > C/TC it is not possible to transmit information over the
channel and reconstruct it with an arbitrarily small probability of
error
a)
• For a BSC,
– H=1
– r = k/n = TC/TS
• Shannon’s channel coding theorem can be restated as, if r  C
there exists a code with this rate capable of arbitrarily low
probability of error
Capacity of a Gaussian channel
• Band-limited, power-limited Gaussian channels
• Transmitted signal
– Signal is random process, X(t)
– Signal is sampled at Nyquist rate of 2B samples per second
– Let Xk, k = 1, 2, 3, ..., K denote the samples of the
transmitted signal
– Signal samples are transmitted over noisy channel in T
seconds
– Hence, K = 2BT
– Finally, we assume that the transmitted signal is power
limited, E[Xk2] = P, being the average transmitted power
Capacity of a Gaussian channel
• Noise
– Channel is subject to additive white Gaussian noise of
zero mean and power spectral density N0/2
– Noise is also band-limited to B Hz
– The noise sample Nk is therefore Gaussian with zero mean
and variance s 2 = N0B
– The samples of the received signal are Yk = Xk + Nk
– We assume that Yk are statistically independent
Capacity of a Gaussian channel
• The information capacity of this channel is

2
C  m ax I  X k ; Yk  : E  X k   P
fX (x)

k
• Performing the maximisation w.r.t. fXk(x), the probability
density function of Xk, (difficult calculation omitted in this
introductory module), gives,
P 

C  lo g 2  1  2 
2
s 

1
b its/ch an n el u se
Capacity of a Gaussian channel
• Since the channel is used K times for the transmission of K
samples of X(t) in T seconds, the information capacity per unit
time is (K/T) times the previous result, with K = 2BT

P 
C  B log 2  1 

N0B 

bits/second
• This is a fundamental limit on the rate of error-free
transmission for a power-limited, band-limited Gaussian
channel
Capacity of a Gaussian channel
• Implications of the information capacity theorem
– Assume we have an ideal system, defined as one that
transmits data at a bit rate Rb = C
– The average transmit power is P = EbC
– The we can re-write the information capacity equation as

Eb C 
 log 2  1 

B
N0 B 

C

Eb
N0

2
C B
1
C B
Capacity of a Gaussian channel
1. Infinite bandwidth case
C B
 Eb 
2
1
 ln 2

  Blim

C B
 N 0 
P
C   lim C 
log 2 e
B
N0
2. Capacity boundary divides
the region that can support
error-free transmission from
the region that cannot