Last modified 2017-09-25 00:25:11 CDT

Information Theory Primer, Part 1

Information theory [1], in its most practical uses, deals with compression and error correction of information sources for their efficient and reliable communication. It is essential to countless aspects of our modern digital world, and a remarkable example of where understanding seemingly abstract, non-physical mathematical manipulation yields powerful tools that make things like reliable deep-space communication [2] possible.

This is a short introduction to the basic quantities defined in information theory, some of their properties, what they tell us, and what they let us do.

First and foremost, “information” in the information theory context does not refer to the semantic value of a message. As Claude E. Shannon, the father of information theory, wrote succinctly in his famous “A Mathematical Theory of Communication” [3]:

The fundamental problem of communication is that of reproducing at one point
either exactly or approximately a message selected at another point. Frequently
the messages have meaning; that is they refer to or are correlated according to
some system with certain physical or conceptual entities. These semantic
aspects of communication are irrelevant to the engineering problem.

Instead, information is a measure of uncertainty in a probabilistic event, and that event is often the selection of a symbol to represent a message in a communication system.

Why a probabilistic event? Because the outcomes of an information source – a producer of symbols representing messages to be transmitted or stored, are modeled by discrete random variables. If the information source was deterministic, it would not actually have any information by definition, and nor would there be any issue in its communication, since its output is deterministically known.

Common information sources might be human speech, binary data files, ASCII text files, radio signals, scientific measurements, and so on.

Measure of Information

The first step is to quantify information as a measure of uncertainty for probabilistic events. For single events, this is called self-information. Ideally, a useful measure of information would be:

A function only dependent on the probability of the event, and not representation of the event itself – we don’t care what symbol the event is represented by, we just want to measure its uncertainty.
A decreasing function of the probability of the event, i.e. more likely events have less information than less likely events.
0 if the event occurs with certainty (probability = 1).
A continuous function of the probability of the event, e.g. no surprise jumps as probability of the event ranges from 0.0 to 1.0.
Additive for independent events, e.g. Information of Independent A and B = Information of A + Information of B.

A function of an event \( A \) that does meet all of these criteria is:

\[I(A) = \log(\frac{1}{P(A)}) = -\log(P(A))\]

This definition of \( I(A) \) is only dependent on the probability of the event, is a decreasing function of probability due to its inverse in the argument of the log, is zero for an event with certain probability of 1, is continuous, and is additive for independent events, by the properties of log:

\[I(A \text{ and } B) = -\log(P(A \text{ and } B)) = -\log(P(A)P(B))\] \[= -\log(P(A)) -\log(P(B)) = I(A) + I(B)\]

Also, notice that \( I(A) \geq 0 \), since we know it to be zero for probability 1.0, and it tends towards positive infinity, as probability tends to 0.0.

The base chosen for the logarithm is arbitrary, but determines the units of self-information. The difference between the self-information in one unit from another unit is just a constant scale-factor, and they are both still measurements of the same uncertainty in the event (e.g. miles/h and km/h are both measuring the same thing, speed). The natural logarithm, or base e logarithm, gives units named nats. The base 10 logarithm gives units named hartleys. The base 2 logarithm gives the familiar units named bits.

So what we’ve achieved so far, is a measure [4] of uncertainty. Instead of talking about small probability p or large probability p of an event, we can talk about a non-negative measure of uncertainty that is large with unlikely events and small with likely events.

Entropy of an Information Source

Now that we have a useful measurement of uncertainty, we can easily define a related quantity called entropy for a random variable. As previously mentioned, an information source is modeled by a random process, which outputs a sequence of random variables, which map probabilities to symbols representing messages. These symbols might eventually be communicated geographically by a wireline or wireless transmission, or communicated temporally to a permanent storage device like a hard drive. They might also be subject to useful pre-processing and post-processing by tools derived from information theory.

\[\boxed{\text{Information Source}}\longrightarrow \dots, X_{-1}, X_0, X_1, X_2, \dots\]

A particular model for an information source, the discrete memoryless source, produces a sequence of independent and identically distributed discrete random variables. The random symbol \( X_t \) at any discrete timestep t is specified by a discrete random variable that takes on symbol \( a_i \) from some alphabet of size M, \( \mathcal{A} = { a_1, a_2, … a_M } \) , according to some probability mass function \( p_X(x) \).

As an example, suppose we convinced someone to roll a weighted die all day, and write down the symbol \( a_1 \) through \( a_6 \) corresponding to the outcome of each roll. We could model each outcome of this discrete memoryless source with the random variable \( X \) taking on values:

\[X = \begin{cases} a_1 & \text{if roll is a 1} \\\ a_2 & \text{if roll is a 2} \\\ a_3 & \text{if roll is a 3} \\\ a_4 & \text{if roll is a 4} \\\ a_5 & \text{if roll is a 5} \\\ a_6 & \text{if roll is a 6} \\\ \end{cases}\]

according to the (unfair) probability mass function:

\[p_X(x) = \begin{cases} 0.3 & x = a_1 \\\ 0.3 & x = a_2 \\\ 0.1 & x = a_3 \\\ 0.1 & x = a_4 \\\ 0.1 & x = a_5 \\\ 0.1 & x = a_6 \\\ 0 & \text{otherwise} \\\ \end{cases}\]

While we could calculate the self-information of each separate event of a particular roll occurring, there is more useful characterization of information in this random variable that considers the whole set of its possible outcomes. It’s called entropy, and it’s the weighted average of the self-information of all the random variable’s possible outcomes.

\[H(X) = \sum\limits_{i=1}^{M}{ p_X(a_i) I(a_i) } = -\sum\limits_{i=1}^{M}{ p_X(a_i) \log(p_X(a_i))}\]

I’ve taken the minus sign in \( I(A) \) outside the sum. (There is a small footnote technicality: \( 0 \log(0) \triangleq 0 \) )

It’s also easy to see that this is a weighted average by writing it in terms of the expectation of the self-information function \( I(X) \):

\[H(X) = \sum\limits_{i=1}^{M}{ p_X(a_i) I(a_i) } = E[I(X)]\]

Entropy gives us the measure of average information content or average uncertainty in a random variable. If we choose base 2 for the logarithm in the calculation, its units will be bits. In that case, entropy is the number of bits needed on average, to encode all possible outcomes.

Going back to our weighted die example, we can calculate the entropy of \( X \) to be:

\[\begin{align} H(X) &= -\sum\limits_{i=1}^{M}{ p_X(a_i) \log(p_X(a_i))} \\\ &= -(0.3\log_2(0.3) + 0.3\log_2(0.3) + 0.1\log_2(0.1) + \\\ &\quad\quad 0.1\log_2(0.1) + 0.1\log_2(0.1) + 0.1\log_2(0.1)) \\\ &= 2.371 \text{ bits} \end{align}\]

Entropy gives us insight into the compressibility of a random variable. It’s the number of essential bits needed to represent the random variable on average, when you encode it in such a way to remove all of the redundancy. Probabilities tell us what is redundant and what isn’t. Encoding in such a way to strip unnecessary redundancy is called source coding. Adding useful redundancy to the encoding to help with error correction is called channel coding.

Entropy of a Binary Symmetric Source

Another discrete memoryless source, called the binary symmetric source, produces a sequence of independent and identically distributed random variables that map to a binary alphabet \( \mathcal{A} = {0,1} \), with probability mass function:

\[p_X(x) = \begin{cases} p & x = 1 \\\ 1-p & x = 0 \\\ \end{cases}\]

where p is a specifiable parameter. This can also be intuitively thought of as a random variable modeling the outcome of a coin flip that has weight probability p for heads.

Its entropy as a function of parameter p is:

\[H(X) = -p \log{p} - (1-p) \log(1-p) \triangleq H_b(p)\]

Plotting entropy as a function of p gives us this famous concave curve, also called the binary entropy function:

Notice that entropy goes to zero as p goes to 0.0 or 1.0, which corresponds to certainty in one of the outcomes. As we would expect, \( H(X) = 1 \), the maximum, for p=0.5. In other words, we need exactly 1 bit on average to represent the outcome of a fair coin.

Introductory Example of Source Coding

Consider the random variable \( Y \), which takes on symbol values \( \mathcal{A} = {\text{apple, banana, orange, papaya, guava}} \), representing fruit commonly purchased at a particular fruit stand, according to the probability mass function:

\[p_Y(y) = \begin{cases} 0.4 & y = \text{apple} \\\ 0.3 & y = \text{banana} \\\ 0.2 & y = \text{orange} \\\ 0.05 & y = \text{papaya} \\\ 0.05 & y = \text{guava} \\\ 0 & \text{otherwise} \\\ \end{cases}\]

Suppose we want to efficiently transmit fruit purchase information for counting purposes over a constrained channel, like avian carrier [5]. How would we go about doing this?

Well, we could just transmit a fixed number of bits representing the fruit choices. In that case, we would need at least 3 bits to represent the five possible choices:

\[\begin{align} \text{apple} &\mapsto \text{000} \\\ \text{banana} &\mapsto \text{001} \\\ \text{orange} &\mapsto \text{010} \\\ \text{papaya} &\mapsto \text{011} \\\ \text{guava} &\mapsto \text{100} \end{align}\]

This means we are using 3 bits per symbol.

But if we calculate the entropy of \( Y \), the average number of bits needed to represent the uncertainty of \( Y \), we find that:

\[H(Y) = 1.332 \text{ bits}\]

which is much less than 3 bits! This means that we could be more efficient with our representation of the symbols, and achieve a higher rate of communication.

This is also to be expected, though. We are hardly using all 3 bits effectively, since we made no mapping for representations 101, 110, 111. Also, we did not take into account the probability of the outcomes, which we could have used to encode more probable outcomes with shorter representations and less probable outcomes with longer representations, for example.

So, let’s consider a coding (actually Huffman coding [6]) that does take into account these two shortcomings of the previous fixed length code. I’ll provide the representation without going into the details of the algorithm for now (which is elegantly simple):

\[\begin{align} \text{apple} &\mapsto \text{0} \\\ \text{banana} &\mapsto \text{10} \\\ \text{orange} &\mapsto \text{110} \\\ \text{papaya} &\mapsto \text{1110} \\\ \text{guava} &\mapsto \text{1111} \end{align}\]

Can we even tell the variable-length representations apart if we received them? The answer is yes, because this code has the property of being a prefix-free code. This means that no code word is a prefix of another, so that the start and stop of a code word is distinct, and any reliably received sequence is always uniquely decodable (feel free to convince yourself). Next, you might be surprised by the length of the papaya and guava representations; they’re 4 bits! How can that possibly be better than just using a fixed 3-bit representation? We have to remember that the outcomes of the information source have different probabilities, so we will more rarely end up using those longer representations.

To get a comparison with the entropy \( H(Y) \), we can calculate the average length of the code by taking the weighted average of the code word lengths:

\[\bar{r} = \sum{n_i p_i} = 1 \cdot 0.4 + 2 \cdot 0.3 + 3 \cdot 0.2 + 4 \cdot 0.05 + 4 \cdot 0.05 = 2 \text{ bits}\]

So on average, this encoding will require 2 bits. We just beat the previous fixed length representation by 1 bit, and got closer to the entropy of \( H(Y) = 1.332 \text{ bits} \).

To get even closer to entropy, we can start encoding blocks of N source outcomes at a time, rather than encoding one outcome at a time. For example, if we define random variable \( Z \) to take on values of a pair of fruit (N=2), it will have probability mass function \( p_Z(z = (y_1, y_2)) = p_Y(y_1)p_Y(y_2) \). We can then design a code, taking advantage of the distribution \( p_Z(z) \), to efficiently represent each pair.

In fact, as our block length N gets larger, in the limit to infinity we can design a code that gets arbitrarily close to the entropy of the source in the code’s average bits per symbol. This is the result of Shannon’s source coding theorem [7].

In the case that information source statistics (probability mass functions of the random variables) are not known ahead of time, there are universal codes [8] that effectively “learn” the probability distribution of the information source as they operate. These are what are behind the scenes of the ZIP, gzip, bzip2, 7zip, etc. compression schemes.

You can probably already imagine other sources with low entropy / high redundancy. The English language itself is an example, and like many other source codes, Morse code takes advantage of the relative frequency of different letters to achieve an average code word length that is closer to entropy of the source. Although the English language is hardly memoryless (likelihood of future symbols very much depend on previous symbols), it can be modeled as memoryless, and determining the relative frequency of letters then can still help in designing a more efficient coded representation than an uncoded one.

Bounds on Entropy

We can say the following, in general, about the bounds of entropy \( H(X) \):

\[0 \leq H(X) \leq \log(M)\]

where M is the cardinality (size) of the alphabet that random variable \( X \) can take on. There is some insight behind the lower and upper bound equalities.

To prove the lower bound, we see that in the definition of entropy,

\[H(X) = -\sum\limits_{i=1}^{M}{ p_X(a_i) \log(p_X(a_i))}\]

\( p_X(a_i) \) will always be between 0.0 and 1.0, because it is a probability, and \( \log(p_X(a_i)) \) will correspondingly be between \( -\infty \) and \( 0 \). The negative sign outside the sum means that the entropy is always non-negative.

Since the probabilities of all outcomes must sum to one, \( \sum\limits_{i=1}^{M}{p_X(a_i)} = 1 \), there always needs to be some non-zero probability \( p_X(a_i) \) for a valid probability distribution \( p_X(x) \). So we can only achieve \( H(X) = 0 \) from \( \log(p_X(a_i)) = 0 \), which corresponds to a particular outcome \( a_i \) with \( p_X(a_i) = 1 \). In other words, a random variable with a certain outcome of probability 1 has zero entropy. This is because the random variable is effectively deterministic, with no uncertainty.

The upper bound requires a little more work, and there are different ways of proving it. One way is with this log inequality:

\[\ln(x) \leq x-1\]

which has equality of \( \ln(x) = x - 1 = 0 \), only at x=1:

(The proof behind this inequality is straightforward if you define \( f(x) = \ln(x) - (x - 1) \) and do first and second derivative tests.)

We can cleverly define this new sum (which is not entropy), and plug it into the log inequality:

\[-\sum\limits_{i=1}^{M} p_X(a_i) \log(M \cdot p_X(a_i)) \leq -\sum\limits_{i=1}^{M} p_X(a_i) (M \cdot p_X(a_i) - 1)\]

which has equality of 0 when \( p_X(a_i) = \frac{1}{M} \) for all \( a_i \). This inequality simplifies to:

\[\begin{align} -\sum\limits_{i=1}^{M} p_X(a_i) \log(M \cdot p_X(a_i)) &\leq 0 \\\ -\sum\limits_{i=1}^{M} p_X(a_i) \log(M) -\sum\limits_{i=1}^{M} p_X(a_i) \log(p_X(a_i)) &\leq 0 \\\ -\log(M) \sum\limits_{i=1}^{M} p_X(a_i) -\sum\limits_{i=1}^{M} p_X(a_i) \log(p_X(a_i)) &\leq 0 \\\ -\log(M) -\sum\limits_{i=1}^{M} p_X(a_i) \log(p_X(a_i)) &\leq 0 \\\ -\sum\limits_{i=1}^{M} p_X(a_i) \log(p_X(a_i)) &\leq \log(M) \\\ H(X) &\leq \log(M) \\\ \end{align}\]

Recall that the equality condition is \( p_X(a_i) = \frac{1}{M} \) for all \( a_i \), and that this corresponds to a uniform distribution. We now see that a uniformly distributed random variable has maximal entropy. Intuitively, we can’t take advantage of different probabilistic outcomes to design a code to more efficiently represent the random variable outcomes on average, since all outcomes are equally likely. The best we can do is the fixed length code of \( \log_{2}(M) \) bits.

Already, with just these basics of information theory, we have insight into the reason behind why good random number generators that are independent in time and uniformly distributed in value are incompressible.