Information Theory
Dan Fleck
CS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
What is Information?
The fundamental paradigm of information
theory is represented here: A sender
conveys information to a receiver across a
communication channel.
Sender
Channel
Information
is any content to be conveyed. It is sent in
the form of one or more messages.
Receiver
Information theory attempts to quantify the
content of messages and the capacity of
channels on which information may flow.
Some Questions
Information theory asks questions such as the following:
1. How much information is encoded in a particular message?
2. How efficiently can a given alphabet/language be
transmitted?
3. What is the maximum capacity of a given transmission
medium?
4. How is that capacity or efficiency reduced by
interference/noise?
These questions all have very precise and deep answers that are beyond our scope here. We’ll only scratch the surface.
Why Do We Care?
Information theory is very important in computer science. It
affects all communication, hardware design, protocol design,
cryptography, fault-tolerance, etc.
For example, in our current context it is useful to know how
much information can be transmitted over a specific covert
channel. This is the “bandwidth” of the channel.
A useful way of measuring bandwidth is bits per second.
Quantifying Information: Thought Experiment
How do you quantify the information content of a response to a
“yes” or “no” question?
What does that even mean? Maybe it means the following:
1. Sender has either a “yes” or a “no.”
2. Receiver knows that Sender has one of those two possibilities, but not which.
3. Sender wants to transmit enough data, but no more, to resolve entirely Receiver’s uncertainty.
4. How much data must Sender transmit?
In those terms the answer is pretty clear: One bit of data. But what
must be true for Receiver to interpret the answer?
Thought Experiment Lessons
This suggests:
1. In some cases, it is possible to quantify the information
content of a message.
2. Maybe an appropriate unit of information content is bits.
3. Sender and receiver must have some shared knowledge, including an agreed encoding scheme.
Lessons
• Information is any content that can be conveyed from a
sender to a receiver across a communication channel.
• Information theory attempts to quantify the amount of
information in a message and the capacity of the channel.
• Communication requires some shared knowledge.
Information Content
Dan Fleck
CS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Quantifying Information
How much information is contained in each
of the following messages? How would you
decide?
• An n-bit binary number.
• A single decimal digit.
• A two-digit decimal number.
• “The attack is at dawn.”
Sender
Channel
Receiver
Quantifying Information
Suppose Sender and Receiver have agreed that
Sender will transmit one of exactly 16 messages.
How many bits of information must be transmitted
to convey the message? Why?
Sender
Channel
Receiver
Binary Tree Representation
• Suppose we represent all possible choices in any decision as the leaves of a “complete” binary tree. The longest path is ⌈log₂(k)⌉ if there are k leaves on the tree.
• Each choice (bit) reduces the search space by half.
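For concreteness, here is a minimal Python sketch of that bound (the function name is illustrative):

from math import ceil, log2

def bits_needed(k):
    # Depth of a complete binary tree with k leaves: ceil(log2(k)).
    return ceil(log2(k))

print(bits_needed(2))    # 1 -- a yes/no answer
print(bits_needed(16))   # 4 -- one of 16 agreed messages
print(bits_needed(27))   # 5 -- e.g., 26 letters plus a space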
Encoding
That suggests that the information content of any message from a space of k messages should be something like ⌈log₂(k)⌉.
But to get a very efficient transmission:
• Sender and Receiver must know in advance the space of
possible transmissions.
• They must have agreed on an encoding.
How Much Information?
How much information is contained in the message: “The
attack is at dawn”?
Answer: It depends on the Receiver’s level of uncertainty.
• If the only uncertainty were whether at dawn or dusk: one bit.
• If the attack could have come anytime during the day: ? bits.
• If the day was uncertain...: ? bits.
“The word ‘information’ relates not so much to what you do say,
as to what you could say. ... That is, information is the measure
of your freedom of choice when you select a message.” –Warren
Weaver
Lessons
• The information content of a message is the amount of
uncertainty it resolves.
• In an ideal situation, each bit transmitted can reduce the
uncertainty by half.
• Very few circumstances are ideal in the required sense.
Exploring Encodings
Dan Fleck
CS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Quantifying Information
Information content is commonly measured in terms of bits.
“Bit” has two connotations, which are not the same:
bit₁: a binary digit (discrete);
bit₂: a quantity of information (continuous).
The information content of a message is measured in bit₂s and the capacity of a channel in bit₂s per second (bps).
In general, the best way of transmitting (encoding) a series of messages is the one that minimizes the number of bit₂s required, on average.
Finding an Encoding
Four bits are adequate to encode 16 possible messages, M0, . . ., M15 (just number them in binary, 0000 through 1111). Call this the naïve encoding.
• Can we do better for one message? What would that mean?
• How about transmitting n messages, each of which is one of 16
possible values?
Probability and Encoding
Suppose you need to send 1000 messages, each of which can be
one of 16 possibilities. But on average 99.5% will be message
10.
Question: Does it still require 4 × 1000 = 4000 bits to send your
1000 messages?
Answer: It is possible to come up with an encoding that will do
better on average than the naïve encoding.
Note, when we talk about sending 1000 messages, we’ve gone
from talking about the information content of a message to
talking about that of a language.
A Better Encoding
• Use an encoding along these lines: send message 10 as the single bit 0, and send each of the other 15 messages as 1 followed by its 4-bit naïve code (5 bits in all).
• Given 1000 messages, on average 995 of them will be message 10, and 5 will be other messages. This encoding takes 995 + (5 · 5) = 1020 bits, or 1.02 bits per message.
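A quick check of that arithmetic, as a small sketch (the probabilities and code lengths simply restate the example above; the function name is illustrative):

def expected_bits(code_lengths, probabilities):
    # Average number of bits per message under a given code.
    return sum(l * p for l, p in zip(code_lengths, probabilities))

# 16 messages; message 10 appears with probability 0.995, the other 15 share 0.005.
probs = [0.005 / 15] * 16
probs[10] = 0.995

naive_lengths = [4] * 16            # 4 bits for every message
skewed_lengths = [5] * 16           # 1 marker bit + 4-bit index for the rare messages
skewed_lengths[10] = 1              # a single bit for message 10

print(expected_bits(naive_lengths, probs))    # 4.0
print(expected_bits(skewed_lengths, probs))   # ~1.02, i.e. about 1020 bits per 1000 messages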
Some Observations
• Our encoding is pretty good, but can we do even better? Is
there a limit to how well we can do?
• Computing the number of bits per message depends on
knowing the prior probabilities—how often each message
appears in an arbitrarily long sequence of messages.
• The “on average” part is important; some sequences would be
less efficient under our encoding.
• We used the “naïve encoding” as our benchmark, but there
are much worse encodings.
• Is it possible to find an optimal encoding? What would that
mean?
Lessons
• “Bit” has two distinct meanings that are easily confused.
• For any language, one can find a naïve encoding that will
work, but it’s often possible to do better.
• “Doing better” means using fewer bits, on average, to
transmit messages in the language.
Languages & Encodings
Dan Fleck
CS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Languages
A language may refer to a natural language such as English or
Japanese. But it may also refer to any set of symbols used to
form “strings.”
Often, we’ll use “language” to refer to the results of a series of
experiments.
Example 1: Assume a two-sided coin. Symbols in the language
are “H” (heads) and “T” (tails). A string describes the result of
repeatedly flipping the coin. E.g., “HHTHTTTHTHTTHTT...”
Example 2: Assume a six-sided die. Symbols in the language are
“1”, “2”, “3”, “4”, “5”, “6”. A string describes the result of
repeatedly rolling the die. E.g., “425146261435261...”
Encoding Properties
For any language, our goal is a binary encoding, with the
following properties:
lossless: it must be possible to recover the entire original
sequence of symbols from the transmission;
uniquely decodable: for any encoded string, there must be only
one possible decoding;
streaming: there should be no breaks in the encoding.
Unique Decodability
Suppose you roll a 6-sided die repeatedly. The following are
some possible codes. Are they lossless, uniquely decodable,
streaming? Which is “best”?
Sufficient (but not necessary) for unique decodability is the
property of being prefix-free: That is, the string representing any
symbol cannot be an initial prefix of the string representing any
other symbol.
Unique Decodability
Imagine a language containing the symbols A, B, C. What’s wrong with the following encoding? A clue is that it is not prefix-free.

Sym  Code
A    1
B    0
C    10
An encoding can fail to have the prefix-free property, but still be
uniquely decodable.
Sym  Code
A    1
C    1111111110
Parsing such a language may require arbitrary “look-ahead.”
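Here is a minimal Python sketch of a prefix-freeness check (the function name is illustrative), run on the two codes above:

def is_prefix_free(code):
    # code maps symbols to bit strings; True if no codeword is a prefix of another.
    words = list(code.values())
    return not any(i != j and v.startswith(w)
                   for i, w in enumerate(words)
                   for j, v in enumerate(words))

print(is_prefix_free({"A": "1", "B": "0", "C": "10"}))      # False: "1" is a prefix of "10"
print(is_prefix_free({"A": "1", "C": "1111111110"}))        # False, yet uniquely decodable
print(is_prefix_free({"A": "0", "B": "10", "C": "110"}))    # True: a prefix-free code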
Finding a Coding
How do you come up with an efficient encoding? Use fewer bits
for the symbols that occur more frequently.
Samuel Morse knew this instinctively. But Morse code doesn’t
satisfy our criteria for encodings. Why not?
E  .
S  ...
T  -
R  .-.
M  --
Q  --.-
Huffman coding is guaranteed to find an efficient code for a
given language if you know the probabilities of symbols in the
language.
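As a minimal sketch of the idea, assuming the symbol probabilities are known (tie-breaking details vary, so an equally optimal code could assign different bit strings):

import heapq

def huffman_code(probs):
    # probs: dict of symbol -> probability. Returns a dict of symbol -> bit string.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                            # tie-breaker so dicts are never compared
    while len(heap) > 1:
        p1, _, codes1 = heapq.heappop(heap)        # the two least probable subtrees...
        p2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))   # ...are merged into one
        counter += 1
    return heap[0][2]

print(huffman_code({"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}))
# -> {'A': '0', 'B': '10', 'C': '110', 'D': '111'}: frequent symbols get short codes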
Lessons
• We use “language” for any scheme to generate a series of
symbols.
• For any language, we want an efficient binary encoding that is
lossless, uniquely decodable and streaming.
• Huffman encoding will provide such an encoding assuming we
know the probabilities of symbols in the language.
Entropy
Dan Fleck
CS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
An Encoding Example
Recall the following language:
Example 1: Assume a two-sided coin. Symbols in the language
are “H” (heads) and “T” (tails). A string describes the result of
repeatedly flipping the coin. E.g., “HHTHTTTHTHTTHTT...”
Suppose you know further that it’s a fair coin.
The following is the naïve encoding for this language. Is there a
better encoding? How would you know?
Result  Prob  Code
H       1/2   0
T       1/2   1
Entropy
Claim: If you know the probabilities of symbols in the language,
you can compute the average information content of a symbol in
the language.
For our fair coin example, the computed answer is 1 bit per
symbol.
Definition: The entropy of a language is a measure of the
information content of an average symbol in the language.
Computing Entropy
Entropy h is computed as follows. If pᵢ is the probability of the ith symbol in the language, then

h = −Σᵢ (pᵢ × log₂ pᵢ)

From now on, all logs are base 2.
Example: Consider our fair coin example. Find the entropy.
Result  Prob  Code
H       1/2   0
T       1/2   1
Solution: There are two symbols, each with probability 1/2, so:
h = −(1/2 × log 1/2 + 1/2 × log 1/2) = 1
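A small Python sketch of this formula (the function name is illustrative), checked against the coin and die examples in these slides:

from math import log2

def entropy(probs):
    # Bits of information per symbol, given the symbol probabilities.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/2]))    # 1.0    -- the fair coin
print(entropy([1/6] * 6))     # ~2.585 -- a fair die (the log n special case)
print(entropy([3/4, 1/4]))    # ~0.811 -- the unbalanced coin on a later slide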
What Does It Mean?
What does it mean to say that the entropy of this language is 1?
1. On average, there is one bit of information in each symbol
of the language.
2. It is impossible to find an encoding that uses less than one
bit per symbol, on average.
3. Any encoding that uses one bit per symbol, on average, is
optimal.
Therefore, for this example, the naïve encoding is the optimal
encoding.
Entropy Special Case
Whenever you have n symbols, all equally probable, the
probability of any of them is 1/n. The language has entropy:
h = −(log 1/n) = log n
For example, a fair die with six sides has entropy:
h = −(log 1/6) = log 6 ≈ 2.58
Hence, it requires 2.58 bits, on average, to transmit the result of
a roll.
Another Example
Suppose we have an unbalanced coin that is three times more likely to
yield a head than a tail. What is the entropy of this language?
Solution: Again there are two possible outcomes:
The entropy is computed as follows:
Result  Prob
H       3/4
T       1/4
h = −(3/4 × log 3/4 + 1/4 × log 1/4) ≈ 0.811
This says that it’s theoretically impossible to encode this language
using less than 0.811 bits (on average) to transmit the result of each
toss of the coin.
Lessons
• Given the probabilities of symbols in a language, you can
compute its entropy.
• Entropy measures the average information content of symbols
in the language.
• Entropy sets a lower limit on encoding efficiency.
Example Revisited
Given: an unbalanced coin that is three times more likely to
yield a head than a tail.
Solution: There are two possible outcomes:
The entropy is computed as follows:
Result  Prob
H       3/4
T       1/4
h = −(3/4 × log 3/4 + 1/4 × log 1/4) ≈ 0.811
Upshot: It’s theoretically impossible to encode this language
using less than 0.811 bits per symbol (on average).
But how would you ever do better than 1 bit / symbol?
Example Revisited
Instead of taking single flips as our “experiments,” let’s take
pairs (call them “2flips”) and code as follows:
Result  Prob  Code
HH      9/16  0
HT      3/16  10
TH      3/16  110
TT      1/16  111
Suppose we flip the coin 32 times; that’s 16 2flips. In an average
run, we’d expect: HH to appear 9 times, HT and TH to each
appear 3 times, and TT to appear once. Why?
Example Revisited
Given 32 flips (16 2flips), we could expect:
Result  Count  Code  Bits
HH      9      0     9
HT      3      10    6
TH      3      110   9
TT      1      111   3
Total: 27 bits
For the naïve encoding, using 1 bit / flip, we’d expect to use 32 bits.
Our efficiency is 27/32 ≈ 0.844, which is not a bad approximation of
the entropy (0.811).
Could we do better? Sure, just use 3flips, 4flips, etc. The entropy is the
limit of this process.
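A quick check of the 2flip code’s cost against the entropy bound (a small sketch restating the numbers above):

from math import log2

# Probability and code length for each 2flip, per the table above (P(H) = 3/4).
pairs = {
    "HH": (9/16, 1),    # code 0
    "HT": (3/16, 2),    # code 10
    "TH": (3/16, 3),    # code 110
    "TT": (1/16, 3),    # code 111
}

bits_per_pair = sum(p * length for p, length in pairs.values())
print(bits_per_pair / 2)                        # 0.84375 bits per flip (27/32)
print(-(3/4 * log2(3/4) + 1/4 * log2(1/4)))     # ~0.811, the entropy limit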
Test Your Understanding
Suppose you have a six-sided die that is unbalanced such that 1
and 2 are equally likely; 3 and 4 are equally likely; and 5 and 6
are equally likely. However, the die rolls 1 twice as often as 3,
and rolls 3 three times as often as 5.
1. What is the “naïve” encoding for this language?
2. What is the entropy of this language?
3. Find an encoding that is more efficient than the naïve
encoding.
4. Give a convincing argument that your encoding is more
efficient than the naive encoding.
Hint: There’s no need to encode sequences of rolls.
Lessons
• Computing the entropy of a language provides a bound on the
efficiency of any encoding.
• But, finding an efficient encoding requires ingenuity.
Fundamental Theorems
Dan Fleck
CS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Source Entropy
The entropy of a language is a measure of the most efficient
possible encoding of the language.
The entropy of a message source is the amount of information
content that the source can produce in a given period. This is
measured in bits per second.
Any channel can transmit an arbitrary amount of information,
given enough time and unbounded buffering capacity. But can a
given channel transmit the information in real time?
Channel Capacity
The capacity of a channel is the number of bits that can be sent
per second over the channel. This is a property of the
communication medium.
Fundamental Theorem of the Noiseless Channel (Shannon): If a language has entropy h (bits per symbol) and a channel can transmit C bits per second, then it is possible to encode the signal in such a way as to transmit at an average rate of (C/h) − ε symbols per second, where ε can be made arbitrarily small. It is impossible to transmit at an average rate greater than C/h.
What the Theorem Means
Suppose a channel can handle 100 bits / second and your language has entropy 5 bits per symbol.
Given a perfect encoding and a noiseless channel, you’d expect to be able to transmit 20 symbols / second through the channel, on average. Right?
But you may not have a perfect encoding. Doesn’t matter. You
can always find a better encoding to get within ε of that limit.
Noisy Channels
If the channel is noisy, the capacity is reduced by the noise. But
the following is true:
Fundamental Theorem of a Noisy Channel (Shannon): Let a
discrete channel have a capacity C and a discrete source an
entropy h (bits per second). If h < C there exists a coding system
such that the output of the source can be transmitted over the
channel with an arbitrarily small frequency of errors.
If the channel (with noise factored in) can physically handle the message traffic, then it is possible to transmit with arbitrarily small error rate.
What this Theorem Means
The upshot of this is that a message can be transmitted reliably
over even a very noisy channel by increasing the redundancy of
the coding scheme.
For example, covert channels in the system cannot be dismissed
with the argument that they are noisy and hence useless. You
can always get the message through by finding a more
redundant encoding.
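As a crude illustration of trading redundancy for reliability, here is a simple repetition code with majority voting (this is not Shannon’s construction; the 20% flip rate and repeat factor are just assumed numbers):

import random

def encode(bits, n=5):
    # Repeat every bit n times.
    return [b for b in bits for _ in range(n)]

def noisy_channel(bits, flip_prob=0.2):
    # Flip each transmitted bit independently with probability flip_prob.
    return [b ^ 1 if random.random() < flip_prob else b for b in bits]

def decode(bits, n=5):
    # Majority vote over each group of n received bits.
    return [1 if sum(bits[i:i + n]) > n // 2 else 0 for i in range(0, len(bits), n)]

message = [random.randint(0, 1) for _ in range(1000)]
received = decode(noisy_channel(encode(message)))
errors = sum(m != r for m, r in zip(message, received))
print(errors / len(message))   # roughly 6% for n = 5, far below the raw 20% flip rate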
Lessons
• Entropy provides a bound on coding efficiency.
• But Shannon’s theorems show that it is always possible to
approach that limit arbitrarily closely.
Entropy of English
Dan Fleck
CS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Entropy vs. Redundancy
Intuitively, the difference between the efficiency of the encoding
and the entropy is a measure of the redundancy in the
encoding.
If you find an encoding with efficiency matching the entropy, there is no redundancy.
The standard encoding for English contains a lot of redundancy.
Fr xmpl, y cn prbbly gss wht ths sntnc sys, vn wth ll f th vwls mssng.
Tht ndcts tht th nfrmtn cntnt cn b xtrctd frm th rmnng smbls.
The Effect of Redundancy
The following also illustrates the redundancy of English text:
Aoccdrnig to rscheearch at Cmabirgde Uinervtisy, it deosn’t
mttaer in waht oredr the ltteers in a wrod are, the olny
iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit
pclae. The rset can be a ttoal mses and you can sitll raed it
wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed
ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh?
Spammers count on the ability of humans to decipher such text, and the inability of computers to do so, to defeat anti-spam filters:
Care to order some Vi@gra or Vigara?
Entropy of English: Zero-Order Model
Suppose we want to transmit English text (26 letters and a
space). If we assume that all characters are equally likely, the
entropy is:
h = −(log 1/27) = log 27 ≈ 4.75
This is the zero-order model of English.
This gives an approximation to the entropy. But the underlying
assumption is clearly false.
Entropy of English: First-Order
In written or spoken English, some symbols occur much more
frequently than others.
Assuming that all symbols are independent of one another but occur with their relative English frequencies, the entropy is 4.219 bits per symbol. This is the “first-order” model of English.
Entropy of English: Higher-Order
The assumption of independence (zero memory) is also
incorrect. Some letters follow other letters frequently; others
not at all.
Tables of the most common digrams and trigrams in English illustrate this dependence.
Entropy of English: Higher-Order
We can compute tables of the likelihood of digrams (two-letter
combinations), trigrams, etc. Adding digrams to the
computation gives a second-order model; adding trigrams gives
a third-order model; etc.
A third-order model yields 2.77 bits per symbol. The actual
entropy is the “limit” of this process of taking higher and higher
order models.
Estimates by Shannon based on human experiments have
yielded values as low as 0.6 to 1.3 bits per symbol.
Lessons
• All natural languages contain significant redundancy.
• Computing the entropy of a natural language is difficult and
requires complex models.
Entropy Odds and Ends
Dan Fleck
CS 469: Security Engineering
These slides are modified with permission from Bill Young (Univ of Texas)
Entropy Aside
Recall that the information content of a message depends on the state of knowledge of the receiver. Hence, entropy is relative to a particular observer.
Consider the entropy of the contents of an envelope marked “Best
Picture” at the Academy Awards (assuming 5 nominees):
• If all were equally likely to win, the entropy would be log 5 ≈ 2.322.
• For everyone who knows that the odds aren’t even, it’s less, though
hard to compute.
• For the auditors who stuffed the envelope, it’s 0 since they have no
uncertainty.
Often, prior probabilities are impossible to compute.
Entropy and Randomness
Note that entropy can be used to measure the amount of
“redundancy” in the encoding. If the information content of a
message is equal to the length of the encoded message, there is
no redundancy.
Some sources define a random string as one that cannot be
represented any more efficiently. (I.e., no compression is
possible.)
Finding a Coding
Huffman coding is guaranteed to find an efficient code for a
given language assuming you know the probabilities of language
units.
In fact, it always uses less than one bit per symbol more than the
entropy, which is extremely efficient.
Lempel-Ziv is an “adaptive coding” algorithm used in many
commercial text compression utilities. It builds an encoding on
the fly according to the strings it encounters.
Lempel-Ziv is asymptotically optimal. That is, as the text length
tends to infinity, the compression approaches optimal.
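A rough LZ78-style sketch of the adaptive idea (illustrative only; commercial utilities use refined variants such as LZ77/LZW with bit-level output):

def lz78_compress(text):
    # Emit (dictionary index, next character) pairs, growing the dictionary as we go.
    dictionary = {"": 0}        # phrase -> index; index 0 is the empty phrase
    phrase = ""
    output = []
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                             # keep extending the current match
        else:
            output.append((dictionary[phrase], ch))  # longest known phrase + one new char
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                                       # flush any leftover phrase
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_compress("HHTHTTTHTHTTHTT"))   # pairs referencing earlier phrases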
Lessons
• The information content of a message is relative to the state
of knowledge of an observer.
• If an encoding’s efficiency matches the entropy, there is no
redundancy to compress out.
• Huffman coding and the Lempel-Ziv algorithms both give highly efficient codes.
