This is the third part of the "What is Information?" series. Part 1 is here, and clicking this way gets you to Part 2.
We are rolling along here, without any indication of how many parts this series may take. Between you and me, I have no idea. We may be here for a while. Or you may run out of steam before me. Or just run.
Let me remind you why I am writing this series. (Perhaps I should have put this paragraph at the beginning of part 1, not part 3? No matter).
I believe that Shannon's theory of information is a profound addition to the canon of theoretical physics. Yes, I said theoretical physics. I can't get into the details of why I think this in this blog (but if you are wondering about this you can find my musings here). But if this theory is so fundamental (as I claim) then we should make an effort to understand the basic concepts in walks of life that are not strictly theoretical physics. I tried this for molecular biology here, and evolutionary biology here.
But even though the theory of information is so fundamental to several areas of science, I find that it is also one of the most misunderstood theories. It seems, almost, that because "everybody knows what information is", a significant number of people (including professional scientists) use the word, but do not bother to learn the concepts behind it.
But you really have to. You end up making terrible mistakes if you don't.
The theory of information, in the end, teaches you to think about knowledge, and prediction. I'll try to give you the entry ticket to all that. Here's the quick synopsis of what we have learned in the first two parts.
1.) It makes no sense to ask what the entropy of any physical system is. Because technically, it is infinite. It is only when you specify what questions you will be asking (by specifying the measurement device that you will use in order to determine the state of the random variable in question) that entropy (a.k.a. uncertainty) is finite, and defined.
2.) When you are asked to calculate the entropy of a mathematical (as opposed to physical) random variable, you are usually handed a bunch of information you didn't realize you have. Like, what's the number of possible states to expect, what those states are, and possibly even what the likelihood is of experiencing those states. But given those, your prime directive is to predict the state of the random variable as accurately as you can. And the more information you have, the better your prediction is going to be.
Now that we've got these preliminaries out of the way, it seems like high time that we get to the concept of information in earnest. I mean, how long can you dwell on the concept of entropy, really?
Actually, just a bit longer as it turns out.
I think I confused you a bit in the first two posts. One time, I write that the entropy is just \(\log N\), the logarithm of the number of states the system can take on, and later I write Shannon's formula for the entropy of random variable \(X\) that can take on states \(x_i\) with probability \(p_i\) as
\(H(X)=-\sum_{i=1}^N p_i\log p_i\) (1)
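(If you like your formulas as code, here is Eq. (1) as a few lines of Python, with log base 2 so the answer comes out in bits. The distribution I plug in is made up purely for illustration.)

```python
import math

def shannon_entropy(probs):
    """Shannon entropy of a distribution, Eq. (1), in bits (log base 2)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up distribution over N = 4 states, just for illustration.
p = [0.5, 0.25, 0.125, 0.125]
print(shannon_entropy(p))  # 1.75 bits
```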
And then I went on to tell you that the first was "just a special case" of the second. And because I yelled at one reader, you probably didn't question me on it. But I think I need to clear up what happened here.
In the second part, I talked about the fact that you really are given some information when a mathematician defines a random variable. Like, for example, in Eq. (1) above. If you know nothing about the random variable, you don't know the \(p_i\). You may not even know the range of \(i\). If that's the case, we are really up the creek, with paddle in absentia. Because you wouldn't even have any idea about how much you don't know. So in the following, let's assume that you know at least how many states to expect, that is, you know \(N\).
If you don't know anything else about a probability distribution, then you have to assume that each state appears with equal probability. Actually, this isn't a law or anything. I just don't know how you would assign probabilities to states if you have zero information. Nada. You just have to assume that your uncertainty is maximal in that case. And this happens to be a celebrated principle: the "maximum entropy principle". The uncertainty (1) is maximized if \(p_i=1/N\) for all \(i\). And if you plug in \(p_i=1/N\) in (1), you get
\(H_{\rm max}=\log N\). (2)
It's that simple. So let me recapitulate. If you don't know the probability distribution, the entropy is (2). If you do know it, it is (1). The difference between the entropies is knowledge. Uncertainty (2) does not depend on knowledge, but the entropy (1) does. One of them is conditional on knowledge, the other isn't.
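If you want to see the maximum entropy principle in action, here is a tiny numerical check in Python (the skewed distribution is one I invented just for the comparison): the uniform distribution over \(N=8\) states gives exactly \(\log_2 8=3\) bits, and a non-uniform one gives less.

```python
import math

def shannon_entropy(probs):
    """Eq. (1) in bits; zero-probability states contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

N = 8
uniform = [1 / N] * N
skewed = [0.5, 0.2, 0.1, 0.05, 0.05, 0.05, 0.03, 0.02]  # made-up, sums to 1

print(shannon_entropy(uniform))  # 3.0 bits, i.e., log2(8), as in Eq. (2)
print(shannon_entropy(skewed))   # about 2.21 bits, less than the maximum
```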
A technical note for the physicists out there is imminent. All you physics geeks, read on. Everybody else, cover your eyes and go: "La La La La!"
<geeky physics> Did you realize that Eq. (2) is really just the entropy of statistical physics in the microcanonical ensemble, while Eq. (1) is the Boltzmann-Gibbs entropy of the canonical ensemble, where \(p_i\) is given by the Boltzmann distribution? </geeky physics>
You can start reading again. If you had your fingers in your ears while going "La La La La!", don't worry: you're nearly normal, because reading silently means nothing to your brain.
Note, by the way, that I've been using the words "entropy" and "uncertainty" interchangeably. I did this on purpose, because they are one and the same thing here. You should feel free to use them interchangeably too.
So, getting back to the narrative, one of the entropies is conditional on knowledge. But, you think while scratching your head, wasn't there something in Shannon's work about "conditional entropies"?
Indeed, and those are the subject of this Part 3. The title kinda gave it away, I'm afraid.
To introduce conditional entropies more formally, and then connect to (1)--which completely innocently looks like an ordinary Shannon entropy--we first have to talk about conditional probabilities.
What's a conditional probability? I know, some of you groan "I've known what a conditional probability is since I've been seven!" But even you may learn something. After all, you learned something reading this blog even though you're somewhat of an expert? Right? Why else would you still be reading?
"Infinite patience", you say? Moving on.
A conditional probability characterizes the likelihood of an event, when another event has happened at the same time. So, for example, there is a (generally small) probability that you will crash your car. The probability that you will crash your car while you are texting at the same time is considerably higher. On the other hand, the probability that you will crash your car while it is Tuesday at the same time, is probably unchanged, that is, unconditional on the "Tuesday" variable. (Unless Tuesday is your texting day, that is.)
So, the probability of events depends on what else is going on at the same time. "Duh", you say. But while this is obvious, understanding how to quantify this dependence is key to understanding information.
In order to quantify the dependence between "two things that happen at the same time", we just need to look at two random variables. In the case I just discussed, one random variable describes whether you crash your car, and the other describes whether you are texting. The two are not always independent, you see. The problems occur when the two occur simultaneously.
You know, if this was another blog (like, the one where I veer off to discuss topics relevant only to theoretical physicists) I would now begin to remind you that the concept of simultaneity is totally relative, so that the concept of a conditional probability cannot even be unambiguously defined in relativistic physics. But this is not that blog, so I will just let it go. I didn't even warn you about geeky physics this time.
OK, here we go: \(X\) is one random variable (think: \(p_i\) is the likelihood that you crash your car while you conduct maneuver \(X=x_i\)). The other random variable is \(Y\). That variable has only two states: either you are texting (\(Y=1\)), or you are not (\(Y=0\)). And those two states have probabilities \(q_1\) (texting) and \(q_0\) (not texting) associated with them.
I can then write down the formula for the uncertainty of crashing your car while texting, using the probability distribution
\(P(X=x_i|Y=1)\) .
This you can read as "the probability that random variable \(X\) is in state \(x_i\) given that, at the same time, random variable \(Y\) is in state \(Y=1\)."
This vertical bar "|" is always read as "given".
So, let me write \(P(X=x_i|Y=1)=p(i|1)\). I can also define \(P(X=x_i|Y=0)=p(i|0)\). \(p(i|1)\) and \(p(i|0)\) are two probability distributions that may be different (but they don't have to be if my driving is unaffected by texting). Fat chance for the latter, by the way.
I can then write the entropy while texting as
\(H(X|{\rm texting})=-\sum_{i=1}^N p(i|1)\log p(i|1)\). (3)
On the other hand, the entropy of the driving variable while not texting is
\(H(X|{\rm not\ texting})=-\sum_{i=1}^N p(i|0)\log p(i|0)\). (4)
Now, compare Eqs. (3) and (4) to Eq. (1). The latter two are conditional entropies, conditional in this case on the co-occurrence of another event, here texting. They look just like the Shannon formula for entropy, which I told you was the one where "you already knew something", like the probability distribution. In the case of (3) and (4), you know exactly what it is that you know, namely whether the driver (that's random variable \(Y\)) is texting or not.
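To make Eqs. (3) and (4) concrete, here is a toy calculation in Python. The conditional distributions \(p(i|1)\) and \(p(i|0)\) below are numbers I invented for the sake of the example, not data about actual driving:

```python
import math

def entropy(probs):
    # Shannon entropy in bits, Eq. (1).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy conditional distributions over N = 3 maneuver outcomes of X
# (the numbers are invented purely for illustration).
p_given_texting     = [0.6, 0.3, 0.1]    # p(i|1): while texting, the outcome is quite uncertain
p_given_not_texting = [0.9, 0.08, 0.02]  # p(i|0): without texting, much more predictable

H_texting     = entropy(p_given_texting)      # Eq. (3): about 1.30 bits
H_not_texting = entropy(p_given_not_texting)  # Eq. (4): about 0.54 bits
print(H_texting, H_not_texting)
```

Knowing that the driver is not texting leaves you with a lot less uncertainty about what happens next, which is exactly the point.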
So here's the gestalt idea that I want to get across. Probability distributions are born being uniform. In that case, you know nothing about the variable, except perhaps the number of states it can take on. Because if you didn't know that, then you wouldn't even know how much you don't know. That would be the "unknown unknowns" that a certain political figure once injected into the national discourse.
These probability distributions become non-uniform (that is, some states are more likely than others) once you acquire information about the states. This information is manifested by conditional probabilities. You really only know that a state is more or less likely than the random expectation if you at the same time know something else (like in the case discussed, whether the driver is texting or not).
Put another way, what I'm trying to tell you here is that any probability distribution that is not uniform (same probability for all states) is necessarily conditional. When someone hands you such a probability distribution, you may not know what it is conditional on. But I assure you that it is conditional. I'll state it as a theorem:
All probability distributions that are not uniform are in fact conditional probability distributions.
This is not what your standard textbook will tell you, but it is the only interpretation of "what do we know" that makes sense to me. "Everything is conditional" thus, as the title of this blog post promised.
But let me leave you with one more definition, which we will need in the next post, when we finally get to define information.
Don't groan, I'm doing this for you!
We can write down what the average uncertainty for crashing your car is, given your texting status. It is simply the average of the uncertainty while texting and the uncertainty while not texting, weighted by the probability that you engage in any of the two behaviors. Thus, the conditional entropy \(H(X|Y)\), that is the uncertainty of crashing your car given your texting status, is
\(H(X|Y)=q_0H(X|Y=0)+q_1H(X|Y=1)\). (5)
That's obvious, right? \(q_1\) being the probability that you are texting (while executing any maneuver \(i\)), and \(q_0\) the probability that you are not.
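And here is Eq. (5) in the same toy Python setup as above, with an assumed texting probability \(q_1=0.2\) (again, a number I made up for illustration, not a statistic):

```python
import math

def entropy(probs):
    # Shannon entropy in bits, Eq. (1).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Same invented conditional distributions as in the earlier sketch.
p_given_texting     = [0.6, 0.3, 0.1]    # p(i|1)
p_given_not_texting = [0.9, 0.08, 0.02]  # p(i|0)

q1 = 0.2       # assumed probability of texting (illustrative only)
q0 = 1 - q1    # probability of not texting

# Eq. (5): the conditional entropy H(X|Y) is the weighted average
H_X_given_Y = q0 * entropy(p_given_not_texting) + q1 * entropy(p_given_texting)
print(H_X_given_Y)  # about 0.69 bits
```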
With this definition of the entropy of one random variable given another, we can now finally tackle information.
I am not so sure that I like the way you assign a fundamental significance to the uniform distribution. In the special case where you have a finite number of states for your system AND you know that there is a symmetry that makes all these states equivalent, then I agree with you. But if you do not have any information about such a symmetry, I see nothing special about the uniform distribution. There can not really be anything special about it because it is not invariant under a change of variables. So unless you have knowledge that one particular parametrisation (say momentum \(p\)) is more fundamental than another one (say kinetic energy \(p^2/(2m)\)) you have no reason to choose the uniform distribution in either case.
I think what you are trying to do, namely to find a totally uninformative distribution, can not be done.
Hi Gustav (we're Kommilitonen, that is, fellow students, from Stony Brook, by the way).
I should have prefaced that particular discussion by saying that it holds for probability theory only. Physics is, as you point out, a totally different thing altogether. First, the uniform distribution is special because of the maximum entropy principle. But this is true only if there are no other constraints on the system, such as imposed by symmetries, for example. The Boltzmann distribution, for example, is special in equilibrium thermodynamics. But the distribution is clearly conditional: it requires knowledge of E and T. In the absence of knowing the temperature, all you can assume is the microcanonical ensemble (you still know the energy), but all states are equally probable. So yes, if you have particular parameterizations (requiring knowledge, such as the momentum) and constraints and/or symmetries are at work, other distributions are special. But in their absence, the maximum entropy principle forces me to assume a uniform distribution.
Hi Chris, it is nice to again discuss with you, 26 years after leaving Stony Brook. I did not manage to get my point across yet. My point is that the maximum entropy principle is not a good principle for choosing a distribution, because it is dependent on a choice of parametrization of the observables. The state of least knowledge should not depend on such a choice of coordinates. There is a discussion of that point in chapter 12 of Jaynes' 2003 book (Jaynes, E. T. Probability Theory: The Logic of Science. Edited by G. Larry Bretthorst. Cambridge University Press, 2003.), although, for some reason, he believes that the reparametrization ambiguity exists only for continuous random variables.
Gustav, that's a very interesting point, which I'll try to answer after reading the chapter in Jaynes's book. Which curiously I haven't read, even though (after reading some of the stuff available online) it is generally very much in line with my views. But chapter 12 is only in the hardcover, which I've now ordered.