Information Theory Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas) What is Information? The fundamental paradigm of information theory is represented here: A sender conveys information to a receiver across a communication channel. Sender Channel Information is any content to be conveyed. It is sent in the form of one or more messages. Receiver Information theory attempts to quantify the content of messages and the capacity of channels on which information may ﬂow Some Questions Information theory asks questions such as the following: 1. How much information is encoded in a particular message? 2. How eﬃciently can a given alphabet/language be transmitted? 3. What is the maximum capacity of a given transmission medium? 4. How is that capacity or eﬃciency reduced by interference/noise? These questions all have very precise and deep answers, that are beyond our scope here. We’ll only scratch the surface. Why Do We Care? Information theory is very important in computer science. It aﬀects all communication, hardware design, protocol design, cryptography, fault-tolerance, etc. For example, in our current context it is useful to know how much information can be transmitted over a speciﬁc covert channel. This is the “bandwidth” of the channel. A useful way of measuring bandwidth is bits per second. Quantifying Information: Thought Experiment How do you quantify the information content of a response to a “yes” or “no” question? What does that even mean? Maybe it means the following: 1. 2. 3. 4. Sender has either a “yes” or a “no.” Receiver knows that Sender has one of those two possibilities, but not which. Sender wants to transmit enough data, but no more, to resolve entirely Receiver’s uncertainty. How much data must Sender transmit? In those terms the answer is pretty clear: One bit of data. But what must be true for Receiver to interpret the answer? Thought Experiment Lessons This suggests: 1. In some cases, it is possible to quantify the information content of a message. 2. Maybe an appropriate unit of information content is bits. 3. Sender and receiver must have some shared knowledge, included an agreed encoding scheme Lessons • Information is any content that can be conveyed from a sender to a receiver across a communication channel. • Information theory attempts to quantify the amount of information in a message and the capacity of the channel. • Communication requires some shared knowledge. Information Content Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas) Quantifying Information How much information is contained in each of the following messages? How would you decide? • • • • An n-bit binary number. A single decimal digit. A two digit decimal number. “The attack is at dawn.” Sender Channel Receiver Quantifying Information Suppose Sender and Receiver have agreed that Sender will transmit one of exactly 16 messages. How many bits of information must be transmitted to convey the message? Why? Sender Channel Receiver Binary Tree Representation • Suppose we represent all possible choices in any decision as the leaves of a “complete” binary tree. The longest path is éêlog2 (k) , if there areùú k leaves on the tree. • Each choice (bit) reduces the search space by half. Encoding That suggests that the information content of any message from a space of k messages should be something like: éêlog2 (k)ùú But to get a very eﬃcient transmission: • Sender and Receiver must know in advance the space of possible transmissions. • They must have agreed on an encoding. How Much Information? How much information is contained in the message: “The attack is at dawn”? Answer: It depends on the Receiver’s level of uncertainty. • If the only uncertainty were whether at dawn or dusk: one bit. • If the attack could have come anytime during the day: ? bits. • If the day was uncertain...: ? bits. “The word ‘information’ relates not so much to what you do say, as to what you could say. ... That is, information is the measure of your freedom of choice when you select a message.” –Warren Weaver Lessons • The information content of a message is the amount of uncertainty it resolves. • In an ideal situation, each bit transmitted can reduce the uncertainty by half. • Very few circumstances are ideal in the required sense Exploring Encodings Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas) Quantifying Information Information content is commonly measured in terms of bits. “Bit” has two connotations, which are not the same: bit1: a binary digit (discrete); bit2: a quantity of information (continuous). The information content of a message is measured in bit2s and the capacity of a channel in bit2s per second (bps). In general, the best way of transmitting (encoding) a series of messages is that way that minimizes the number of bit2s required, on average. Finding an Encoding Four bits are adequate to encode 16 possible messages: M0, . . ., M15: Call this the naïve encoding. • Can we do better for one message? What would that mean? • How about transmitting n messages, each of which is one of 16 possible values? Probability and Encoding Suppose you need to send 1000 messages, each of which can be one of 16 possibilities. But on average 99.5% will be message 10. Question: Does it still require 4 × 1000 = 4000 bits to send your 1000 messages? Answer: It is possible to come up with an encoding that will do better on average than the naïve encoding. Note, when we talk about sending 1000 messages, we’ve gone from talking about the information content of a message to talking about that of a language. A Better Encoding • Use the following encoding: • Given 1000 messages, on average 995 of them will be message 10, and 5 will be other messages. This encoding takes 995 + (5 · 5) = 1020 bits or 1.02 bits per message. Some Observations • Our encoding is pretty good, but can we do even better? Is there a limit to how well we can do? • Computing the number of bits per message depends on knowing the prior probabilities—how often each message appears in an arbitrarily long sequence of messages. • The “on average” part is important; some sequences would be less eﬃcient under our encoding. • We used the “naïve encoding” as our benchmark, but there are much worse encodings. • Is it possible to ﬁnd an optimal encoding? What would that mean? Lessons • “Bit” has two distinct meanings that are easily confused. • For any language, one can ﬁnd a naïve encoding that will work, but it’s often possible to do better. • “Doing better” means using fewer bits, on average, to transmit messages in the language. Languages & Encodings Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas) Languages A language may refer to a natural language such as English or Japanese. But it may also refer to any set of symbols used to form “strings.” Often, we’ll use “language” to refer to the results of a series of experiments. Example 1: Assume a two-sided coin. Symbols in the language are “H” (heads) and “T” (tails). A string describes the result of repeatedly ﬂipping the coin. E.g., “HHTHTTTHTHTTHTT...” Example 2: Assume a six-sided die. Symbols in the language are “1”, “2”, “3”, “4”, “5”, “6”. A string describes the result of repeatedly rolling the die. E.g., “425146261435261...” Encoding Properties For any language, our goal is a binary encoding, with the following properties: lossless: it must be possible to recover the entire original sequence of symbols from the transmission; uniquely decodable: for any encoded string, there must be only one possible decoding; streaming: there should be no breaks in the encoding. Unique Decodability Suppose you roll a 6-sided die repeatedly. The following are some possible codes. Are they lossless, uniquely decodable, streaming? Which is “best”? Suﬃcient (but not necessary) for unique decodability is the property of being preﬁx-free: That is, the string representing any symbol cannot be an initial preﬁx of the string representing any other symbol. Unique Decodability Imagine a language containing the symbols A, B, C. What’s wrong with the following encoding? A clue is that it is not preﬁxfree. Sym Code A 1 B 0 C 10 An encoding can fail to have the preﬁx-free property, but still be uniquely decodable. Sym Code A 1 C 1111111110 Parsing such a language may require arbitrary “look-ahead.” Finding a Coding How do you come up with an eﬃcient encoding? Use fewer bits for the symbols that occur more frequently. Samuel Morse knew this instinctively. But Morse code doesn’t satisfy our criteria for encodings. Why not? E . S … T - R .-. M -- Q --.- Huﬀman coding is guaranteed to ﬁnd an eﬃcient code for a given language if you know the probabilities of symbols in the language. Lessons • We use “language” for any scheme to generate a series of symbols. • For any language, we want an eﬃcient binary encoding that is lossless, uniquely decodable and streaming. • Huﬀman encoding will provide such an encoding assuming we know the probabilities of symbols in the language. Entropy Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas) An Encoding Example Recall the following language: Example 1: Assume a two-sided coin. Symbols in the language are “H” (heads) and “T” (tails). A string describes the result of repeatedly ﬂipping the coin. E.g., “HHTHTTTHTHTTHTT...” Suppose you know further that it’s a fair coin. The following is the naïve encoding for this language. Is there a better encoding? How would you know? Result Prob Code H 1/2 0 T 1/2 1 Entropy Claim: If you know the probabilities of symbols in the language, you can compute the average information content of a symbol in the language. For our fair coin example, the computed answer is 1 bit per symbol. Deﬁnition: The entropy of a language is a measure of the information content of an average symbol in the language. Computing Entropy Entropy H is computed as follows. If pi is the probability of the ith symbol in the language, then h = -(å pi log2 pi ) i From now on, all log’s are base 2. Example: Consider our fair coin example. Find the entropy. Result Prob Code H 1/2 0 T 1/2 1 Solution: There are two symbols, each with probability 1/2, so: h = −(1/2 × log 1/2 + 1/2 × log 1/2) = 1 What Does It Mean? What does it mean to say that the entropy of this language is 1? 1. On average, there is one bit of information in each symbol of the language. 2. It is impossible to ﬁnd an encoding that uses less than one bit per symbol, on average. 3. Any encoding that uses one bit per symbol, on average, is optimal. Therefore, for this example, the naïve encoding is the optimal encoding. Entropy Special Case Whenever you have n symbols, all equally probable, the probability of any of them is 1/n. The language has entropy: h = −(log 1/n) = log n For example, a fair die with six sides has entropy: h = −(log 1/6) = log 6 ≈ 2.58 Hence, it requires 2.58 bits, on average, to transmit the result of a roll. Another Example Suppose we have an unbalanced coin that is three times more likely to yield a head than a tail. What is the entropy of this language? Solution: Again there are two possible outcomes: The entropy is computed as follows: Result Prob H 3/4 T 1/4 h = −(3/4 × log 3/4 + 1/4 × log 1/4) ≈ 0.811 This says that it’s theoretically impossible to encode this language using less than 0.811 bits (on average) to transmit the result of each toss of the coin. Lessons • Given the probabilities of symbols in a language, you can compute its entropy. • Entropy measures the average information content of symbols in the language. • Entropy sets a lower limit on encoding eﬃciency. Example Revisited Given: an unbalanced coin that is three times more likely to yield a head than a tail. Solution: There are two possible outcomes: The entropy is computed as follows: Result Prob H 3/4 T 1/4 h = −(3/4 × log 3/4 + 1/4 × log 1/4) ≈ 0.811 Upshot: It’s theoretically impossible to encode this language using less than 0.811 bits per symbol (on average). But how would you ever do better than 1 bit / symbol? Example Revisited Instead of taking single ﬂips as our “experiments,” let’s take pairs (call them “2ﬂips”) and code as follows: Result Prob Code HH 9/16 0 HT 3/16 10 TH 3/16 110 TT 1/16 111 Suppose we ﬂip the coin 32 times; that’s 16 2ﬂips. In an average run, we’d expect: HH to appear 9 times, HT and TH to each appear 3 times, and TT to appear once. Why? Example Revisited Given 32 ﬂips (16 2ﬂips), we could expect: Result Count Code Bits HH 9 0 9 HT 3 10 6 TH 3 110 9 TT 1 111 3 Total: 27bits For the naïve encoding, using 1 bit / ﬂip, we’d expect to use 32 bits. Our eﬃciency is 27/32 ≈ 0.844, which is not a bad approximation of the entropy (0.811). Could we do better? Sure, just use 3ﬂips, 4ﬂips, etc. The entropy is the limit of this process. Test Your Understanding Suppose you have a six-sided die that is unbalanced such that 1 and 2 are equally likely; 3 and 4 are equally likely; and 5 and 6 are equally likely. However, the die rolls 1 twice as often as 3, and rolls 3 three times as often as 5. 1. What is the “naïve ” encoding for this language? 2. What is the entropy of this language? 3. Find an encoding that is more eﬃcient than the naïve encoding. 4. Give a convincing argument that your encoding is more eﬃcient than the naive encoding. Hint: There’s no need to encode sequences of rolls Lessons • Computing the entropy of a language provides a bound on the eﬃciency of any encoding. • But, ﬁnding an eﬃcient encoding requires ingenuity. Fundamental Theorems Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas) Source Entropy The entropy of a language is a measure of the most eﬃcient possible encoding of the language. The entropy of a message source is the amount of information content that the source can produce in a given period. This is measured in bits per second. Any channel can transmit an arbitrary amount of information, given enough time and unbounded buﬀering capacity. But can a given channel transmit the information in real time? Channel Capacity The capacity of a channel is the number of bits that can be sent per second over the channel. This is a property of the communication medium. Fundamental Theorem of the Noiseless Channel. (Shannon): If a language has entropy h (bits per symbol) and a channel can transmit C bits per second, then it is possible to encode the signal is such a way as to transmit at an average rate of (C/h) − ε symbols per second, where ε can be made arbitrarily small. It is impossible to transmit at an average rate greater than C/h. What the Theorem Means Suppose a channel can handle 100 bits / second and your language has entropy 5 (bit per symbol). Given a perfect encoding and a noiseless channel, you’d expect to be able to transmit 20 symbols / second though the channel, on average. Right? But you may not have a perfect encoding. Doesn’t matter. You can always ﬁnd a better encoding to get within ε of that limit. Noisy Channels If the channel is noisy, the capacity is reduced by the noise. But the following is true: Fundamental Theorem of a Noisy Channel (Shannon): Let a discrete channel have a capacity C and a discrete source an entropy h (bits per second). If h < C there exists a coding system such that the output of the source can be transmitted over the channel with an arbitrarily small frequency of errors. If the channel (with noise factored in) can physically handle the message traﬃc, then it is possible to transmit with arbitrarily small error rate What this Theorem Means The upshot of this is that a message can be transmitted reliably over even a very noisy channel by increasing the redundancy of the coding scheme. For example, covert channels in the system cannot be dismissed with the argument that they are noisy and hence useless. You can always get the message through by ﬁnding a more redundant encoding. Lessons • Entropy provides a bound on coding eﬃciency. • But Shannon’s theorems show that it is always possible to approach that limit arbitrarily closely. Entropy of English Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas) Entropy vs. Redundancy Intuitively, the diﬀerence between the eﬃciency of the encoding and the entropy is a measure of the redundancy in the encoding. If you ﬁnd an encoding with eﬃciency matching the entropy and there is no redundancy. The standard encoding for English contains a lot of redundancy. Fr xmpl, y cn prbbly gss wht ths sntnc sys, vn wth ll f th vwls mssng. Tht ndcts tht th nfrmtn cntnt cn b xtrctd frm th rmnng smbls. The Eﬀect of Redundancy The following also illustrates the redundancy of English text: Aoccdrnig to rscheearch at Cmabirgde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a ttoal mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. Amzanig huh? Spammers count on the ability of humans to decipher such text, and the inability of computers to do so to defeat anti-spam ﬁlters: Care to order some [email protected] or Vigara? Entropy of English: Zero-Order Model Suppose we want to transmit English text (26 letters and a space). If we assume that all characters are equally likely, the entropy is: h = −(log 1/27) = 4.75 This is the zero-order model of English. This gives an approximation to the entropy. But the underlying assumption is clearly false. Entropy of English: First-Order In written or spoken English, some symbols occur much more frequently than others. Assuming that all symbols are independent of one another, but follow the probabilities above, the entropy is 4.219 bits per symbol. This is the “ﬁrst-order” model of English. Entropy of English: Higher-Order The assumption of independence (zero memory) is also incorrect. Some letters follow other letters frequently; others not at all. The following shows the most common digrams and trigrams in English. Entropy of English: Higher-Order We can compute tables of the likelihood of digrams (two-letter combinations), trigrams, etc. Adding digrams to the computation gives a second-order model; adding trigrams gives a third-order model; etc. A third-order model yields 2.77 bits per symbol. The actual entropy is the “limit” of this process of taking higher and higher order models. Estimates by Shannon based on human experiments have yielded values as low as 0.6 to 1.3 bits per symbol. Lessons • All natural languages contain signiﬁcant redundancy. • Computing the entropy of a natural language is diﬃcult and requires complex models. Entropy odds and ends Dan Fleck CS 469: Security Engineering These slides are modified with permission from Bill Young (Univ of Texas) Entropy Aside Recall that information content of a message depends on the state of knowledge of the receiver. Hence, entropy is relative to a particular observer. Consider the entropy of the contents of an envelope marked “Best Picture” at the Academy Awards (assuming 5 nominees): • If all were equally likely to win, the entropy would be log 5 ≈ 2.322. • For everyone who knows that the odds aren’t even, it’s less, though hard to compute. • For the auditors who stuﬀed the envelope, it’s 0 since they have no uncertainty. Often, prior probabilities are impossible to compute. Entropy and Randomness Note that entropy can be used to measure the amount of “redundancy” in the encoding. If the information content of a message is equal to the length of the encoded message, there is no redundancy. Some sources deﬁne a random string as one that cannot be represented any more eﬃciently. (I.e., no compression is possible.) Finding a Coding Huﬀman coding is guaranteed to ﬁnd an eﬃcient code for a given language assuming you know the probabilities of language units. In fact, it always uses less than one bit per symbol more than the entropy, which is extremely eﬃcient. Lempel-Ziv is an “adaptive coding” algorithm used in many commercial text compression utilities. It builds an encoding on the ﬂy according to the strings it encounters. Lempel-Ziv is asymptotically optimal. That is, as the text length tends to inﬁnity, the compression approaches optimal. Lessons • The information content of a message is relative to the state of knowledge of an observer. • If an encoding’s eﬃciency matches the entropy, there is no redundancy to compress out. • Huﬀman coding and the Lempel Ziv algorithms both give highly eﬃcient codes.