Thursday, April 25, 2013

What is Information? (Part I: The Eye of the Beholder)


Information is a central concept in our daily life. We rely on information in order to make sense of the world: to make "informed" decisions. We use information technology in our daily interactions with people and machines. Even though most people are perfectly comfortable with their day-to-day understanding of information, the precise definition of information, along with its properties and consequences, is not always as well understood. I want to argue in this series of blog posts that a precise understanding of the concept of information is crucial to a number of scientific disciplines. Conversely, a vague understanding of the concept can lead to profound misunderstandings, both in daily life and in the technical scientific literature. My purpose is to introduce the concept of information, mathematically defined, to a broader audience, with the express intent of eliminating a number of common misconceptions that have plagued the progress of information science in different fields.

What is information? Simply put, information is that which allows you (who is in possession of that information) to make predictions with accuracy better than chance. Even though that sentence may appear glib, it captures the concept of information fairly succinctly. But the concepts introduced in it need to be clarified. What do I mean by prediction? What is "accuracy better than chance"? Predictions of what?

We all understand that information is useful. When was the last time you found information to be counterproductive? Perhaps it was the last time you watched the news. I will argue that, when you thought that the information you were given was not useful, then what you were exposed to was most likely not information. That stuff, instead, was mostly entropy (with a little bit of information thrown in here or there). Entropy, in case you have not yet come across the term, is just a word we use to quantify how much you don't know. Actually, how much anybody doesn't know. (I'm not just picking on you.)

But, isn't entropy the same as information?

One of the objectives of these posts is to make the distinction between the two as clear as I can. Information and entropy are two very different objects. They may have been used synonymously (even by Claude Shannon—the father of information theory—who is thus partly responsible for a persistent myth), but they are fundamentally different. If the only thing you take away from this article is an appreciation of the difference between entropy and information, then I will have succeeded.

But let us go back to our colloquial description of what information is, in terms of predictions. "Predictions of what?" you should ask. Well, in general, when we make predictions, it is about a system that we don't already know. In other words, an other system. This other system can be anything: the stock market, a book, the behavior of another person. But I've told you that we will make the concept of information mathematically precise. In that case, I have to specify this "other system" as precisely as I can. I have to specify, in particular, which states the system can take on. This is, in most cases, not particularly difficult. If I'm interested in quantifying how much I don't know about a phone book, say, I just need to tell you the number of phone numbers in it. Or let's take a more familiar example (as phone books may appeal, conceptually, only to the older crowd among us), such as the six-sided fair die. What you don't know about this system is which side is going to be up when you throw it next. What you do know is that it has six sides. How much don't you know about this die? The answer is not six. This is because information (or the lack thereof) is not defined in terms of the number of possible states. Rather, it is given by the logarithm of the number of possible states.

"Why on Earth introduce that complication?", you ask.

Well, think of it this way. Let's quantify your uncertainty (that is, how much you don't know) about a system (System One) by the number of states it can be in. Say this is \(N_1\). Imagine that there is another system (System Two), and that it can be in \(N_2\) different states. How many states can the joint system (System One and Two combined) be in? Well, for each state of System One, System Two can be in any of its \(N_2\) states. So the total number of states of the joint system must be \(N_1\times N_2\). But our uncertainty about the joint system is not \(N_1\times N_2\). Our uncertainty adds, it does not multiply. And fortunately the logarithm is precisely the function for which the log of a product of factors is the sum of the logs of the factors. So the uncertainty about the joint system, with its \(N_1\times N_2\) states, is the logarithm of that number of states:
$$H(N_1N_2)=\log(N_1N_2)=\log(N_1) + \log(N_2).$$
I had to assume here that you knew about the properties of the log function. If this is a problem for you, please consult Wikipedia and continue after you have digested that content.
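
As a quick numerical sanity check, here is a minimal Python sketch of my own (not part of the original post; the numbers of states are arbitrary, and the logs are taken in base 2, the unit discussed further below):

```python
import math

# Two hypothetical systems: System One with N1 possible states, System Two with N2.
N1, N2 = 6, 8

H1 = math.log2(N1)            # uncertainty about System One (base-2 units, i.e. bits)
H2 = math.log2(N2)            # uncertainty about System Two
H_joint = math.log2(N1 * N2)  # the joint system can be in N1 * N2 states

print(f"H1 = {H1:.3f}, H2 = {H2:.3f}")
print(f"H(joint) = {H_joint:.3f}")   # 5.585
print(f"H1 + H2  = {H1 + H2:.3f}")   # 5.585 as well: the uncertainties add
```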

Phew, I'm glad we got that out of the way. But we were talking about a six-sided die. You know, the type you've known all your life. What you don't know about the state of this die (your uncertainty) before throwing it is \(\log 6\). When you peek at the number that came up, you have reduced your uncertainty (about the outcome of this throw) to zero. This is because you made a perfect measurement. (In an imperfect measurement, you might only catch a glimpse of the surface that rules out a "1" and a "2", say.)
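
To put rough numbers on that parenthetical (my own worked example, not in the original post): a perfect peek leaves only one possible state, while the imperfect glimpse that rules out "1" and "2" still leaves four, so
$$H_{\text{before}}=\log 6,\qquad H_{\text{perfect peek}}=\log 1=0,\qquad H_{\text{glimpse}}=\log 4.$$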

What if the die wasn't fair? Well, that complicates things. Let us, for the sake of argument, assume that the die is so unfair that one of the six sides (say, the "six") can never be up. You might argue that the a priori uncertainty of the die (the uncertainty before measurement) should now be \(\log 5\), because only five of the states can be the outcome of the measurement. But how are you supposed to know this? You were not told that the die is unfair in this manner, so as far as you are concerned, your uncertainty is still \(\log 6\).

Absurd, you say? You say that the entropy of the die is whatever it is, and does not depend on the state of the observer? Well, I'm here to say that if you think that, then you are mistaken. Physical objects do not have an intrinsic uncertainty. I can easily convince you of that. You say the fair die has an entropy of \(\log 6\)? Let's look at an even simpler object: the fair coin. Its entropy is \(\log 2\), right? What if I told you that I'm playing a somewhat different game, one where I'm not just counting whether the coin comes up heads or tails, but am also counting the angle that the face makes with a line that points towards True North. And in my game, I allow four different quadrants (think of the circle around the coin's face divided into four equal sectors).


Suddenly, the coin has \(2\times4\) possible states, just because I told you that in my game the angle that the face makes with respect to a circle divided into four quadrants is interesting to me. It's the same coin, but I decided to measure something that is actually measurable (because the coin's face can be in different orientations, as opposed to, say, a coin with a plain face but two differently colored sides). And you immediately realize that I could have divided the circle into as many sectors as I can possibly resolve by eye.

Alright, fine, you say, so the entropy is \(\log(2\times N)\), where \(N\) is the number of resolvable angles. But you know, what is resolvable really depends on the measurement device you are going to use. If you use a microscope instead of your eyes, you could probably resolve many more states. Actually, let's follow this train of thought. Let's imagine I have a very sensitive thermometer that can sense the temperature of the coin. When the coin is thrown high, the energy it absorbs on hitting the surface will raise its temperature slightly, compared to one that was tossed gently. If I so choose, I could include this temperature as another characteristic, and now the entropy is \(\log(2\times N\times M)\), where \(M\) is the number of different temperatures that can be reliably measured by the device. And you can see that I could drive this to the point of absurdity, by deciding to also count the excitation states of the molecules that compose the coin, or of the atoms composing the molecules, or the nuclei, the nucleons, the quarks and gluons.
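
Here is a toy sketch of that idea (my own, with made-up numbers of resolvable states), showing how the entropy grows as you decide to resolve more degrees of freedom:

```python
import math

def coin_entropy(sides=2, angles=1, temperatures=1):
    """Uncertainty (in bits) about the coin, given how finely we choose to measure it."""
    return math.log2(sides * angles * temperatures)

print(coin_entropy())                              # heads/tails only: 1 bit
print(coin_entropy(angles=4))                      # plus the 4-quadrant game: 3 bits
print(coin_entropy(angles=360))                    # resolve the angle to 1 degree
print(coin_entropy(angles=360, temperatures=100))  # plus 100 resolvable temperatures
```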

The entropy of a physical object, it dawns on you, is not defined unless you tell me which degrees of freedom are important to you. In other words, it is defined by the number of states that can be resolved by the measurement that you are going to be using to determine the state of the physical object. If it is heads or tails that counts for you, then \(\log 2\) is your uncertainty. If you play the "4-quadrant" game, the entropy of the coin is \(\log 8\), and so on. Which brings us back to the six-sided die that has been mysteriously manipulated to never land on "six". You (who do not know about this mischievous machination) expect six possible states, so this dictates your uncertainty. Incidentally, how do you even know the die has six sides it can land on? You know this from experience with dice, and from having looked at the die you are about to throw. This knowledge allowed you to quantify your a priori uncertainty in the first place.

Now, you start throwing this weighted die, and after about twenty throws or so without a "six" turning up, you start to become suspicious. You write down the results of a longer set of trials, and note this curious pattern of "six" never showing up, while the other five outcomes appear with roughly equal frequency. What happens now is that you adjust your expectation. You now hypothesize that it is a weighted die with five equally likely outcomes, and one that never occurs. Now your expected uncertainty is \(\log 5\). (Of course, you can't be 100% sure.)
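
A minimal simulation sketch of that experiment (my own construction; the die below is rigged exactly as described, so a "six" simply cannot occur):

```python
import random
from collections import Counter

random.seed(0)  # make the run reproducible

# A hypothetical weighted die: faces 1-5 equally likely, "six" never shows up.
def weighted_die():
    return random.randint(1, 5)

throws = [weighted_die() for _ in range(1000)]
print(Counter(throws))  # roughly 200 each of faces 1-5, and no 6 at all
```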

But you did learn something through all these measurements. You gained information. How much? Easy! It's your uncertainty before you became suspicious, minus your uncertainty after it dawned on you. The information you gained is just \(\log 6-\log 5\). How much is that? Well, you can calculate it yourself. You didn't give me the base of the logarithm, you say?

Well, that's true. Without specifying the logarithm's base, the information gained is not specified. But it does not matter which base you choose: each base just gives different units to your information gain. It's kind of like asking how much you weigh. My weight is whatever it is; the number I give you depends on whether you want it in kilograms or pounds. Or stones, for all it matters.

If you choose the base of the logarithm to be 2, then your units will be called "bits" (which is what we all use in information-theory land). But you may choose Euler's number e as your base instead. That makes your logarithms "natural", but your units of information (or entropy, for that matter) will be called "nats". You can define other units (and we may get to that), but we'll leave it at that for the moment.

So, if you choose base 2 (bits), your information gain is \(\log_2(6/5)\approx 0.263\) bits. That may not sound like much, but in a Vegas-type setting this gain of information might be worth, well, a lot. Information that you have (and those you play with do not) can be moderately valuable (for example, in a stock market setting), or it could mean the difference between life and death (in a predator/prey setting). In any case, we should value information.  
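
To double-check that arithmetic, and the conversion between units, a quick Python snippet (mine, not from the post):

```python
import math

gain_bits = math.log2(6) - math.log2(5)  # = log2(6/5)
gain_nats = math.log(6) - math.log(5)    # the same gain, expressed in nats

print(f"{gain_bits:.3f} bits")                # about 0.263 bits
print(f"{gain_nats:.3f} nats")                # about 0.182 nats
print(f"{gain_nats / math.log(2):.3f} bits")  # dividing nats by ln(2) converts back to bits
```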

As an aside, this little example, in which we used a series of experiments to "inform" us that one of the six sides of the die will, in all likelihood, never show up, should have convinced you that we can never know the actual uncertainty that we have about any physical object, unless the statistics of the possible measurement outcomes of that physical object are for some reason known with infinite precision (which you cannot attain in a finite lifetime). It is for that reason that I suggest the reader give up thinking about the uncertainty of any physical object, and be concerned only with differences between uncertainties (before and after a measurement, for example).

The uncertainties themselves we call entropy. Differences between entropies (for example before and after a measurement) are called information. Information, you see, is real. Entropy on the other hand: in the eye of the beholder.

In this series on the nature of information, I expect the next posts to feature more conventional definitions of entropy and information (namely, those that Claude Shannon introduced), with some examples from physics and biology, and then to move on to communication and the concept of channel capacity.

Part 2: The Things We Know


4 comments:

  1. Thanks for doing this series. I'm an ex-physics and math major, and I myself never fully acquired a good understanding of information vs. entropy; I'm glad to see you trying to explain this in terms comprehensible to educated laypersons.

    I do have one quibble, though. You write: "But our uncertainty about the joint system is not N_1 x N_2. Our uncertainty adds, it does not multiply." Maybe it's just me being dense, but this doesn't strike me as a trivially obvious statement (though it does seem plausible if I think about it a bit). Clearly if one does accept it then taking the logarithm of the number of states follows naturally (at least for anyone with basic mathematical knowledge). But it would have been nice to have a little more explanation of why uncertainty (in the sense you've defined it) is additive in the first place.

  2. Dear Frank,

    You are quite right, it isn't trivially obvious. But think of it this way. Let's say I have two systems, with N states each. Then the total number of states that the combined system can be in is N squared: N states of the second for each of the N states of the first. The uncertainty of the combined system is the logarithm of the total number of states (which is N squared). The logarithm of N squared is two times the logarithm of N. So you can see that in this case of two variables with the same uncertainty, the uncertainty of the joint system is the sum of the uncertainties of each, because the log of x squared is two times the log of x.

  3. The "Our uncertainty adds" bit seemed a bit of a jump (or assertion) to me too. It might be better to say that we "really want uncertainties to add (and subtract)". The log formulation leads to easier and more intuitive equations later on, but I think everything could actually be reformulated without the logs.

    Maybe using your coin side+orientation example would help:

    If we care/measure the side and orientation of a single coin, it can be in 2x4 states. OK, N1=8

    But what if we measure the head/tail of one coin and the orientation of a second coin? N2=2 and N3=4

    If we want uncertainty in the first case (H1) to equal the sum of the uncertainties in the second case (H2+H3), then we just define H = log(N). So,
    H1 = H2 + H3
    log(8) = log(2)+log(4)
    taking log base 2, that is just 3=1+2

  4. Chris, thanks for your response. I take your point to be that if we consider the uncertainty to be a logarithmic function of the number of states then it follows naturally that if we have two systems S1 and S2 then our uncertainty regarding the joint system of both S1 and S2 is the sum of our uncertainties regarding the individual systems. More formally, if our uncertainty regarding a system S is considered to be log(N) where N is the number of states of S, then if S1 has N1 states and S2 has N2 states with uncertainties log(N1) and log(N2) respectively, the joint system S3 has N1 x N2 states and thus our uncertainty regarding S3 is log(N1 x N2) = log(N1) + log(N2).

    I consider Travc's comment to be coming from the opposite direction: If we find it natural to consider that our uncertainty regarding a system is a function of the number of possible states of the system, and that the uncertainty regarding a joint system is the sum of our uncertainties regarding the individual subsystems making up that joint system, then it is natural to define the uncertainty as the logarithm of the number of states.

    More formally, we are trying to find an expression H(N) for our uncertainty regarding a system with a number of states N. It's natural to conclude H is a monotonically increasing function of N, with H(N1) > H(N2) if N1 > N2. (In other words, the more states the more our uncertainty.) It's also natural to conclude that H(1) = 0. (In other words, if a system can have only one state as measured by us then our uncertainty regarding it is zero.)

    Based on the example of the coin for which we can measure either heads or tails or the orientation in one of four directions, it's also natural to conclude that H(8) = H(2) + H(4) -- we can consider this to be a case of a measurement against a single system with 8 states, or a measurement against a composite system consisting of two subsystems with 2 and 4 states respectively, with our overall uncertainty being the same in either case, and the uncertainty regarding the composite system being the sum of our uncertainties regarding the subsystems. Generalizing this argument, we'd conclude that H(N1 x N2) = H(N1) + H(N2) for all N1, N2.

    Taking this last condition together with the other conditions on H we'd conclude that H(N) = logb(N) for some base b. Since our argument didn't force a choice of base we'd conclude that the actual base can be chosen as whatever value we consider to be most convenient, for example b = 2.

    So the bottom line is that I think I'm unconfused now :-)
