## Saturday, June 29, 2013

### What is Information? (Part 4: Information!)

This is the 4th (and last) installment of the "What is Information" series. The one where I finally get to define for you what information is. The first three parts can be found here: Part 1, Part 2, Part 3.

Let me quickly summarize the take-home points of parts 1-3, just in case you are reading this temporally removed from them.

1.) Entropy, also known as "uncertainty", is something that is mathematically defined for a "random variable". But physical objects aren't mathematical. They are messy complicated things. They become mathematical when observed through the looking glass of a measurement device that has a finite resolution. We then understand that a physical object does not "have an entropy". Rather, its entropy is defined by the measurement device I choose to examine it with.  Information theory is a theory of the relative state of measurement devices.

2.) Entropy, also known as uncertainty, quantifies how much you don't know about something (a random variable). But in order to quantify how much you don't know, you have to know something about the thing you don't know. These are the hidden assumptions in probability theory and information theory. These are the things you didn't know you knew.

3.) Shannon's entropy is written as "$-\sum p\log p$", but these $p$ are really conditional probabilities whenever they are not uniform (that is, not the same for all states). They are non-uniform given what else you know, like, for starters, the fact that they are not uniformly distributed. Duh.

Alright, now we're up to speed. So we have the unconditional entropy, which is the one where we know nothing about the system that the random variable describes. We call that $H_{\rm max}$, because an unconditional entropy must be maximal: it tells us how much we don't know if we don't know anything. Then there is the conditional entropy $H=-\sum_i p_i\log p_i$, where the $p_i$ are conditional probabilities. They are conditional on some knowledge. Thus, $H$ tells you what remains to be known. Information is "what you don't know minus what remains to be known given what you know". There it is. Clear?
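If the words haven't quite clicked yet, maybe a few lines of code will. Here is a minimal Python sketch (the biased-coin probabilities are numbers I made up for illustration) computing $I=H_{\rm max}-H$ for a two-state variable:

```python
from math import log2

def entropy(probs):
    """Shannon entropy H = -sum_i p_i log2(p_i), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# A coin has N = 2 states, so the unconditional entropy is maximal:
N = 2
H_max = log2(N)            # 1 bit: what you don't know, knowing nothing

# Suppose you know the coin is bent: heads 90% of the time.
H = entropy([0.9, 0.1])    # what remains to be known, given what you know

# Information = what you don't know minus what remains to be known.
I = H_max - H
print(round(H_max, 3), round(H, 3), round(I, 3))
```

Knowing the coin is bent is worth about half a bit: your uncertainty dropped from 1 bit to roughly 0.47 bits.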

"Hold on, hold on. Hold on for just a minute"

What?

"This is not what I've been reading in textbooks."

"Don't condescend. What I mean is that this is not what I would read in, say, Wikipedia, that you link to all the time. For example, in this link to `mutual information'. You're obviously wrong.
Because, you know, it's Wiki!"

So tell me what it is that you read.

"It says there that the mutual information is the difference between the entropy of random variable $X$,  $H(X)$, and the conditional entropy $H(X|Y)$, which is the conditional entropy of variable $X$ given you know the state of variable $Y$."

"Come to think of it, you yourself defined that conditional entropy at the end of your part 3. I think it is Equation (5) there!" And there is this Venn diagram on Wiki. It looks like this:"
[Venn diagram: the entropies $H(X)$ and $H(Y)$ as overlapping circles, with the shared entropy $I(X:Y)$ as the overlap. Source: Wikimedia]
Ah, yes. That's a good diagram. Two variables $X$ and $Y$. The red circle represents the entropy of $X$, the blue circle the entropy of $Y$. The purple thing in the middle is the shared entropy $I(X:Y)$, which is what $X$ knows about $Y$. Also what $Y$ knows about $X$. They are the same thing.

"You wrote $I(X:Y)$ but Wiki says $I(X;Y)$. Is your semicolon key broken?"

Actually, there are two notations for the shared entropy (a.k.a. information) in the literature. One uses the colon, the other the semicolon. Thanks for bringing this up; it confuses people. In fact, I wanted to bring up this other thing....

"Hold on again. You also keep on saying "shared entropy" when Wiki says "shared information". You really ought to pay more attention."

Ah, well. That's a bit of a pet peeve of mine. Just look at the diagram above. The thing in the middle, the purple area: it's a shared entropy. Information is shared entropy. "Shared information" would be, like, shared shared entropy. That's a bit ridiculous, don't you think?

"Well, if you put it like this. I see your point. But why do I read "shared information" everywhere?"

That is, dear reader, because people are confused about what to call entropy, and what to call information. A sizable fraction of the literature calls what we have been calling "entropy" (or uncertainty) "information". You can see this even in the book by Shannon and Weaver (which, come to think of it, was edited by Weaver, not Shannon). When you do this, then what is shared by the "informations" is "shared information". But that does not make any sense, right?

"I don't understand. Why would anybody call entropy information? Entropy is what you don't know, information is what you know. How could you possibly confuse the two?"

I'm with you there. Entropy is "potential information". It quantifies "what you could possibly know". But it is not what you actually know. I think, between you and me, that it was just sloppy writing at first, which then ballooned into a massive misunderstanding. Both entropy and information are measured in bits, and so people would just flippantly say: "a coin has two bits of information", when they mean to say "two bits of entropy". And it's all downhill from there.

I think I've made my point here, I hope. Being precise about entropy and information really matters. Colon vs. semicolon does not. Information is "unconditional entropy minus conditional entropy". When cast as a relationship between two random variables $X$ and $Y$, we can write it as

$I(X:Y)=H(X)-H(X|Y)$.

And because information is symmetric in the one who measures and the one who is being measured (remember: a theory of the relative state of measurement devices...) this can also be written as

$I(X:Y)=H(Y)-H(Y|X)$.

And both formulas can be verified by looking at the Venn diagram above.
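If you'd rather verify them numerically than by staring at circles, here is a small sketch (the joint distribution is invented for illustration) that computes both formulas from the same joint distribution and confirms they agree:

```python
from math import log2

# A made-up joint distribution p(x, y) for two binary variables X and Y.
# Rows index x, columns index y; entries sum to 1.
p = [[0.4, 0.1],
     [0.1, 0.4]]

def H(probs):
    """Shannon entropy in bits."""
    return -sum(q * log2(q) for q in probs if q > 0)

px = [sum(row) for row in p]                             # marginal of X
py = [sum(p[x][y] for x in range(2)) for y in range(2)]  # marginal of Y

# Conditional entropies H(X|Y) and H(Y|X), averaged over the known variable.
HXgY = sum(py[y] * H([p[x][y] / py[y] for x in range(2)]) for y in range(2))
HYgX = sum(px[x] * H([p[x][y] / px[x] for y in range(2)]) for x in range(2))

I1 = H(px) - HXgY   # I(X:Y) = H(X) - H(X|Y)
I2 = H(py) - HYgX   # I(X:Y) = H(Y) - H(Y|X)
print(round(I1, 6), round(I2, 6))  # the same number, as the Venn diagram promises
```

Swap the roles of $X$ and $Y$ and nothing changes: information really is symmetric in the one who measures and the one being measured.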

"OK, this is cool."

"Hold on, hold on!"

What is it again?

"I just remembered. This was all a discussion that came after I brought up Wikipedia, that says that information was $I(X:Y)=H(X)-H(X|Y)$, while you said it was $H_{\rm max}-H$, where the $H$ was clearly an entropy that you write as $H=-\sum_i p_i\log p_i$. All you have to do is scroll up, I'm not dreaming this!"

So you are saying that textbooks say

$I=H(X)-H(X|Y)$  (1)

while I write instead

$I=H_{\rm max}-H(X)$,   (2)

where $H(X)=-\sum_i p_i\log p_i$. Is that what you're objecting to?

"Yes. Yes it is."

Well, here it is in a nutshell. In (1), information is defined as the actual observed entropy of $X$ minus the actual observed entropy of $X$ given that I know the state of $Y$ (whatever that state may be).

In (2), information is defined as the difference between what I don't know about $X$ absent any of the things I may implicitly know already, and the actual uncertainty of $X$. This definition does not mention a system $Y$. It quantifies my knowledge of $X$ without stressing what it is that I know about $X$. If the probability distribution with which I describe $X$ is not uniform, then I know something about $X$. My $I$ in Eq. (2) quantifies that. Eq. (1) quantifies what I know about $X$ above and beyond what I already know via Eq. (2). It quantifies specifically the information that $Y$ conveys about $X$. So you could say that the total information that I have about $X$, given that I also know the state of $Y$, would be

$I_{\rm total}=H_{\rm max}-H(X) + H(X)-H(X|Y)=H_{\rm max}-H(X|Y)$.

So the difference between what I would write and what textbooks write is really only in the unconditional term: it should be maximal. But in the end, Eqs. (1) and (2) simply refer to different informations. (2) is information, but I may not be aware how I got into possession of that information. (1) tells me exactly the source of my information: the variable $Y$.

Isn't that clear?

"I'll have to get back to you on that. I'm still reading. I think I have to read it again. It sort of takes some getting used to."

I know what you mean. It took me a while to get to that place. But, as I hinted at in the introduction to this series, it pays off big time to have your perspective adjusted, so that you know what you are talking about when you say "information". I will be writing a good number of blog posts that reference "information", and many of those are a consequence of research that is only possible once you understand the concept precisely. I wrote a series on information in black holes already (starting here). That's just the beginning. There will be more to come, for example on information and cooperation. I mean, how you can only fruitfully engage in the latter if you have the former.

I know it sounds more like a threat than a promise. I really mean it to be a promise.

## Thursday, June 27, 2013

### Why NP can't possibly be P

Do you think P=NP? Or were you just curious what these letter combinations could possibly mean in an otherwise well-constructed sentence?

Whether or not P=NP is actually one of the "big" unsolved problems in mathematics. You can be a million bucks richer by proving it one way or the other. (It's one of the seven math problems deemed worthy of a million bucks by the Clay Mathematics Institute.)

So this P=NP (or not) thing is a pretty big deal. What's it all about?

A lot has been written about this problem, and if you want to get up to speed about it, the Wiki entry is not a bad place to start.  But I'll try my best to give you a non-technical introduction that will make you care. Because the question that was first asked by the mathematician and computer scientist Stephen Cook  is profound. I'm not kidding here: Mathematics, beauty, and Rachmaninov's second piano concerto may be involved. Well, at least, in this post.

I want you to think about two different things. Two seemingly completely unrelated things.

First thing: How long does it take to find the answer to difficult problems? Of course, putting it like this is not precise enough. So let's be more specific. How hard is it to find one rare thing in the midst of $N$ not so rare things?

"Well, it depends", you would answer, and you would be right. It depends on a lot of things. For example, suppose I want you to find a red dot in the middle of very many black dots. That's easy, you can do this visually in no time flat, because the red dot sticks out (left panel below). If the dot you are supposed to find is not so easy to distinguish, then it might take you a while longer (blue dot in the right panel below). If I asked you instead to tell me whether or not any of the dots are different at all (and I don't tell you what the difference is) you may have to look at every dot individually. And if there are $N$ dots, that's how long it would take you on average.
[Two panels of dots. Left: "Easy Search", a red dot among black dots. Right: "Less Easy Search", a blue dot among black dots.]
But problems of size $N$ (problems that take $N$ steps to solve) are still in the class of easy problems. A little harder are problems that take polynomial time to solve (like $N^2$ steps, or even $N^{10,000}$).

Problems of this type are said to belong to the class "P", where P stands for polynomial. They are (for the most part) easy for a computer. Things get problematic when the $N$ is not in the base, but in the exponent. Like, $2^N$ or $e^N$. The exponential function rises sooo much faster than any polynomial, that the difference between the two (polynomial versus exponential) is not one of degrees. It is one of substance.
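To get a feeling for why this is a difference of substance and not of degree, you can just tabulate the two (a quick sketch):

```python
# Steps needed for a problem of size N: polynomial (N^2) vs. exponential (2^N).
for N in (10, 20, 50, 100):
    print(f"N={N:3d}  N^2={N**2:6d}  2^N={2**N}")

# At N = 100, N^2 is 10,000 steps: a computer is done before you blink.
# 2^100 is about 1.3e30 steps: at a billion steps per second, that is
# on the order of 4e13 years, far longer than the age of the universe.
```

No amount of polynomial speedup catches up with that: multiply the computer's speed by a million and the exponential curve shrugs it off by the time $N$ grows by 20.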

Now, the Second thing: Suppose somebody hands you a sheet of paper and claims: "THIS is the solution to this extremely hard problem you have been talking about".

How long will it take you to confirm (or ridicule) said claim? You would think: "Pretty fast I gather", and you would be right (most of the time). In the example of the search above, if I tell you that the red dot is THAT ONE in the upper left corner, all you would have to say is: "OK, right, I see it". For the second panel, it's in the lower right corner (the blue dot). See, it's so much easier if I show you the solution!

There is a name for problems that are really hard, but for which any proposed solution can be checked really quickly. This class of problems is called "NP": the class of problems for which we can verify a solution in polynomial time, but may not be able to find it in polynomial time.

<nerdy comment> Strictly speaking, NP stands for "non-deterministic polynomial".  It means that a computer very unlike the one you read this on (a non-deterministic one) could solve the problem in polynomial time.  An unpredictable computer, so to speak. Maybe a quantum one. Maybe that has to remain the subject of a different post. One that speculates about whether it is obvious that there must be a classical algorithm that factors integers into primes in polynomial time. </nerdy comment>

Let's just think of the problems in the class NP as those that we have no idea how to solve in polynomial time. I'm oversimplifying here, but this is a blog, not a textbook.

So, I can imagine that you would immediately agree that checking a solution is a hell of a lot faster than finding a solution. Because that makes sense.

We indeed believe that's true. But we do not know.

Let me give you an example hard problem to illustrate all this. It's kind of the grand-daddy of hard problems, perhaps in part because it is the one that Stephen Cook used to formulate the problem. In fact, he showed that if you could solve that problem, then you could solve many many others right away, because this grand-daddy is part of a class of problems that can all be related to each other (called the NP-complete problems). This is cool because it means that if you've solved any of the problems in the set in polynomial time, then you've solved them all in polynomial time.

This grand-daddy of problems is the 3-SAT problem, and I urge you to be awake through my description of it. (All other descriptions would put you to sleep much faster anyway.) If you can get through this, I promise that there are symphonies ahead.

"Do I really have to endure a 30 seconds lecture about 3-SAT?"

You brush your teeth, don't you? That's 60 seconds right there! (If not, you should reconsider your brushing routine. Just sayin'.)

So here it goes. 30 seconds.

You know about logic already. If $A$ and $B$ are two variables that can be either TRUE or FALSE, then you know whether $A\vee B$ is TRUE or FALSE, depending on the truth values of $A$ and $B$. Now let's get $C$ into it. I can now ask, for example, whether $A\vee B\vee\neg C$ is true. This $\neg$ symbol, by the way, is just the logical NOT. So, if $A$ is TRUE, then $\neg A$ is FALSE. And $\vee$ stands for OR, of course, while $\wedge$ simply means AND.

Well, you say, obviously, this clause $A\vee B\vee\neg C$ is TRUE if any of $A$ or $B$ is true, or $C$ is false. Indeed, to check a single clause, that's easy. But what if we take two clauses (each made from three variables: three, get it, 3-SAT?) and ask whether the AND of these two clauses is true. Like, for example, $(A\vee B\vee\neg C)\wedge(\neg B\vee C\vee\neg D)$.

"Well OK", you say, now I have to determine whether each of the clauses is TRUE to figure out whether the AND of the two clauses is true. And I see what you did there: in any of the clauses, variables can reoccur. "And you're down 15 seconds, by the way. "

All right. Now imagine I have a given number of variables $A, B, C$, up to whatever. Like $N$ of them. And I make tons of clauses from them (each of them ORs of three variables), ANDed together. And I call the resulting expression a "statement". And I ask you the all-important question:

"Could the resulting statement ever be TRUE?"

It could look like this:

$(R\vee T\vee\neg C)\wedge(\neg B\vee C\vee\neg D)\wedge(\neg W\vee B\vee\neg M)\wedge(F\vee\neg G\vee D)\wedge(W\vee T\vee\neg R)$

Or it could be a lot longer. Asking whether the statement can ever be TRUE is asking whether the statement can be "satisfied". And because the statement is composed of clauses with three variables, it is the "3-Satisfiability" problem: 3-SAT.

Now, I don't expect you to be able to answer this question just by looking at the statement. And, I also realize that my 30 seconds are up. But you could of course just test it. If there are $N$ variables, you just have to test $2^N$ possible combinations of TRUE and FALSE for these variables, plug them into the statement, and check if it ever comes out TRUE. But you understand immediately that this will take a long time if $N$ is large. An exponential amount of time, because $2^N$ depends on $N$ exponentially.

(Really it does. Because $2^N=e^{\ln(2) N}$. If you are more comfortable with exponentials to the base e.)
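To make the brute-force approach concrete, here is a sketch that encodes the example statement above and tries all $2^N$ assignments. (Encoding a literal as a `(variable, negated?)` pair is just my own convenience, not standard notation.)

```python
from itertools import product

# The example statement from above:
# (R v T v ~C) ^ (~B v C v ~D) ^ (~W v B v ~M) ^ (F v ~G v D) ^ (W v T v ~R)
# Each clause is three literals; a literal is (variable_name, negated?).
statement = [
    [("R", False), ("T", False), ("C", True)],
    [("B", True),  ("C", False), ("D", True)],
    [("W", True),  ("B", False), ("M", True)],
    [("F", False), ("G", True),  ("D", False)],
    [("W", False), ("T", False), ("R", True)],
]

variables = sorted({v for clause in statement for v, _ in clause})

def evaluate(statement, assignment):
    """Checking ONE assignment is fast: polynomial in the statement size."""
    return all(any(assignment[v] != neg for v, neg in clause)
               for clause in statement)

def brute_force_sat(statement):
    """FINDING an assignment the dumb way: try all 2^N of them."""
    for values in product([True, False], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if evaluate(statement, assignment):
            return assignment
    return None

solution = brute_force_sat(statement)
print(solution is not None)  # this particular statement is satisfiable
```

Note the asymmetry: `evaluate` runs through the clauses once, while `brute_force_sat` may run through all $2^9 = 512$ assignments here, and $2^N$ in general.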

Now, let me imagine for a moment that you have been thinking about this for longer than the 30 seconds it took you to read through this simple example. Please, let me indulge in that thought just for the sake of it.

You may have come up with this simple thought:

"What if..... every variable appears only once in any clause? Shouldn't it be obvious that some combination will result in a TRUE statement?"

Yes. Yes indeed, that is quite right. I'm impressed.

"But wait, wait. What if there are only, like, two variables A and B, but a gazillion clauses. Isn't it obvious that there are bound to be contradictions between clauses so that not all clauses could possibly be right at the same time (kind of like A and $\neg A$ cannot be both be TRUE at the same time! Am I right? Am I?"

Gosh, you. You're, like, totally right. If that happens (a small number of variables, a large number of clauses) you can fairly quickly rule out that the statement can ever be TRUE. Indeed, figuring out whether a statement with many clauses but a small number of variables, or with a large number of variables and just a few clauses, is satisfiable is usually quick and easy.

For anything in between, however, we think it'll take checking essentially all $2^N$ combinations to figure out satisfiability. Sorry. Unless, of course, there is a trick nobody has yet thought of. An algorithm of unheard-of power.

Remember when I wrote that I want you to think about two things? Two very different things?

The first was: How long does it take to solve a really hard problem? In the case of 3-SAT, we think it'll take an "NP" amount of time, on average. NP: a very, very long time. Possibly as long as the universe has been around (for big $N$). But we can check any proposed solution quickly.

Just imagine: someone gives you a sequence of variables A,B,C, etc. and claims that

"A=TRUE, B=FALSE, C=FALSE.... will satisfy the statement!"

You can check that in no time at all, by just plugging it into the formula. Voilà! Either that someone was right, or not. The time to check any proposed solution is polynomial in the problem size, that is, it is P.

Is P (time to check a solution) equal to NP (time to find a solution)? That's preposterous, right? "No way!" you exclaim. Yet, nobody has ever been able to prove that this is not so. People have tried, and even claimed they succeeded.

So we have to entertain, at this point in time, the possibility that P=NP, as preposterous as that may sound. Because the people who have been trying to prove the opposite aren't exactly amateurs. There is, after all, a million bucks riding on it. So could it be that P=NP?  What would the world be like it if it was true?

First, let us think about this: Is anything like this (P=NP) even possible at all? It would mean that we can create rare things that are "TRUE" (by some measure) as easily as we can check that any thing is TRUE (by some measure). Let's go back to 3-SAT to see what this means. Let's imagine that we deal with a hard case of 3-SAT: that there is actually only a single lonesome assignment of the variables that makes the statement TRUE. In that case, the statement is satisfiable, but proving it will require finding the proverbial needle in a haystack. We know that we can check every proposition quickly. But if we don't have any way of guessing what a good solution is, then we have to check all of them. Checking each may be quick. Checking all is not, because there are an exponential number of candidates.

How could we find this one needle in a haystack? Is something like this even thinkable in principle?

It is indeed! Think of a different problem: creating a poignant and meaningful piece of poetry.

"That's ridiculous!", you exclaim, "How can you liken a creative process such as writing poetry to an automated procedure?"

Hold on there for a moment. Take your favorite piece of poetry. Think about it. Recite it in your head.

What, you don't have one? Really, you should spend a bit more time away from the computer, smell the roses, so to speak. What about this one, about the thing with feathers? No? What about Hamlet's soliloquy? Well, you should have a favorite piece of poetry, is all I'm saying. I have more, but I can tell you're getting tired of this discussion already.

Now, given the medium in which those pieces of poetry that I referred to are rendered (sequences of letters drawn from an alphabet of 26 symbols, plus the space), they are extraordinarily rare. I think I can even trust you to calculate how rare, if you imagine the "blind monkey" way of writing poetry on a typewriter. You know, the kind that (theoretically speaking) can reproduce all the works of Shakespeare by randomly hacking away at a keyboard.

The chance of producing even a Shakespearean sonnet by this method before the sun engulfs this planet is pretty much nil, you can imagine. That's if the monkey types completely randomly. Now, what if it's not completely random? What if, say, you write random 9-letter sequences, and keep all those that match any word (or combination of words) that Shakespeare ever wrote?

Well, that would be an algorithm that still runs for a long time, but it would eventually reconstitute entire poems. The one called "A Lover's Complaint", no less. How is this possible?

The answer is that any deviation from a random search is an algorithm, a heuristic. Heuristics can accelerate search at sometimes formidable rates. The one I described is, when you think about it, an evolutionary one. You keep the good stuff (even if it was produced by a random process), and you try, try, again.

Yes it is true, evolution is an algorithm--a heuristic--that can produce complexity from a diet of random chance. I'm pretty sure you're with me on this. But there are plenty of other algorithms. For example (for the problem of reconstituting Shakespeare's writings) you could bias the way letters are produced, for example by imposing blanks between letters at the rate dictated by the average length of words in English.
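Just to show how dramatically "keeping the good stuff" beats blind monkey-typing, here is a toy hill-climbing sketch. (The target phrase, the mutation scheme, and the random seed are my own choices for illustration; this is a Dawkins-style "weasel" program, not the 9-letter-sequence variant described above, and the apostrophe is dropped to keep the alphabet simple.)

```python
import random

random.seed(42)
TARGET = "SHALL I COMPARE THEE TO A SUMMERS DAY"
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "

def score(text):
    """Count positions that already match the target."""
    return sum(a == b for a, b in zip(text, TARGET))

# Start from pure monkey-typing...
current = "".join(random.choice(ALPHABET) for _ in TARGET)
steps = 0
while current != TARGET:
    steps += 1
    # ...but keep the good stuff: mutate one position, and keep the mutant
    # only if it scores at least as well as what we had.
    i = random.randrange(len(TARGET))
    mutant = current[:i] + random.choice(ALPHABET) + current[i + 1:]
    if score(mutant) >= score(current):
        current = mutant

print(steps)  # typically a few thousand steps, not the 27^37 a blind monkey needs
```

That single "keep the good stuff" rule collapses a search space of $27^{37}$ random strings into a walk of a few thousand mutations. That is what a heuristic buys you.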

"This is cheating!", you might exclaim. But that is precisely what an algorithm is: finding ways to cheat, so that you don't have to examine every possible solution.

In the light of this, we can now formulate the central problem of P=NP.

Are there ways to cheat so that we can find the truly hard stuff fairly quickly? For example, as quickly as we can check whether the proposed solution really is a solution?

Let's use the art metaphor one more time. There are "solutions" (time series of notes) that we call music, and that are generally acknowledged to be "good". I'm sure you're with me here: this is neither well-defined nor do I think it can be. (This is not stopping us from trying: stay tuned for my lab's efforts in this direction.) And not everybody agrees. If there were an obvious algorithm for good music, then we would not be paying successful composers the money they earn. "Not everybody can be a Mozart" is what we say.

Think about it: we (usually) can tell fairly quickly whether a piece of music is any good. Can we write such a piece as quickly?

Every fibre in our body says: "Hell no!"

Now, we don't usually think of our musical appreciation abilities as algorithmic, but perhaps we should. The generation of music can be algorithmic (and in fact the generation of music via evolutionary algorithms was the subject of a recent article in the Proceedings of the National Academy of Sciences). And yours truly wrote a commentary about that article here. But I digress. As is my wont.

If P=NP, then there must be an algorithm that creates beautiful music, beautiful poetry, beautiful fiction, you name it. Of course, just because we haven't found it (trust me, we haven't), that doesn't mean it does not exist. Can a machine be programmed to produce these kinds of rare pieces?

The answer to the latter question is, of course, "We don't know". All I know is that it is unlikely that a computational algorithm (the way we understand it) can produce Rachmaninov's second piano concerto, for example. I have a personal view about this, because I've worked on his 2nd (and to a lesser extent 3rd) concerto since 1982. But I digress as usual.

[Portrait: Sergej Rachmaninov (1873-1943)]

I'm not arguing here for a non-algorithmic view of human performance or ingenuity. What I am suggesting is that the origin of these rare achievements is found in the history of the machine that produced it (the personal life experience of the composer, here, Sergej). Mathematical algorithms do not take experiences into account: they produce solutions from given input data. But the human algorithm takes a lot more input into account: this input data could be essentially infinite (our experience of the world up to the present). Even more, the algorithm people use is not deterministic, in the sense that no two people are alike. We get a "solution" very very rarely. Some personal histories, blended with exquisite mental algorithms operating within that person's brain, can produce unmatched beauty, once in a while.

So where does this leave us? NP is probably not P, because if it was then there would be no place for genius in this world. Genius is an algorithm that takes advantage of history: the accumulated experiences (guided by curiosity) that make up a person's baggage. It's good that NP does not equal P. Life, the world, the universe, would be boring if NP=P. It's not a proof, and I don't expect to receive a check in the mail for it. But I'm fairly confident that I'm right about that.

## Tuesday, June 4, 2013

### What Is Information? (Part 3: Everything is Conditional)

This is the third part of the "What is Information?" series. Part one is here, and clicking this way gets you to Part 2.

We are rolling along here, without any indication of how many parts this series may take. Between you and me, I have no idea. We may be here for a while. Or you may run out of steam before me. Or just run.

Let me remind you why I am writing this series. (Perhaps I should have put this paragraph at the beginning of part 1, not part 3? No matter).

I believe that Shannon's theory of information is a profound addition to the canon of theoretical physics. Yes, I said theoretical physics. I can't get into the details of why I think this in this blog (but if you are wondering about this you can find my musings here). But if this theory is so fundamental (as I claim) then we should make an effort to understand the basic concepts in walks of life that are not strictly theoretical physics. I tried this for molecular biology here, and evolutionary biology here.

But even though the theory of information is so fundamental to several areas of science, I find that it is also one of the most misunderstood theories. It seems, almost, that because "everybody knows what information is", a significant number of people (including professional scientists) use the word, but do not bother to learn the concepts behind it.

But you really have to. You end up making terrible mistakes if you don't.

The theory of information, in the end, teaches you to think about knowledge, and prediction. I'll try to give you the entry ticket to all that. Here's the quick synopsis of what we have learned in the first two parts.

1.) It makes no sense to ask what the entropy of any physical system is. Because technically, it is infinite. It is only when you specify what questions you will be asking (by specifying the measurement device that you will use in order to determine the state of the random variable in question) that entropy (a.k.a. uncertainty) is finite, and defined.

2.) When you are asked to calculate the entropy of a mathematical (as opposed to physical) random variable, you are usually handed a bunch of information you didn't realize you have. Like, what's the number of possible states to expect, what those states are, and possibly even what the likelihood is of experiencing those states. But given those, your prime directive is to predict the state of the random variable as accurately as you can. And the more information you have, the better your prediction is going to be.

Now that we've got these preliminaries out of the way, it seems like high time that we get to the concept of information in earnest. I mean, how long can you dwell on the concept of entropy, really?

Actually, just a bit longer as it turns out.

I think I confused you a bit in the first two posts. One time, I write that the entropy is just $\log N$, the logarithm of the number of states the system can take on, and later I write Shannon's formula for the entropy of random variable $X$ that can take on states $x_i$ with probability $p_i$ as

$H(X)=-\sum_{i=1}^N p_i\log p_i$   (1)

And then I went on to tell you that the first was "just a special case" of the second. And because I yelled at one reader, you probably didn't question me on it. But I think I need to clear up what happened here.

In the second part, I talked about the fact that you really are given some information when a mathematician defines a random variable. Like, for example, in Eq. (1) above. If you know nothing about the random variable, you don't know the $p_i$. You may not even know the range of $i$. If that's the case, we are really up the creek, with paddle in absentia. Because you wouldn't even have any idea about how much you don't know. So in the following, let's assume that you know at least how many states to expect, that is, you know $N$.

If you don't know anything else about a probability distribution, then you have to assume that each state appears with equal probability. Actually, this isn't a law or anything. I just don't know how you would assign probabilities to states if you have zero information. Nada. You just have to assume that your uncertainty is maximal in that case. And this happens to be a celebrated principle: the "maximum entropy principle". The uncertainty (1) is maximized if $p_i=1/N$ for all $i$.  And if you plug in $p_i=1/N$ in (1), you get

$H_{\rm max}=\log N$.   (2)

It's that simple. So let me recapitulate. If you don't know the probability distribution, the entropy is (2). If you do know it, it is (1). The difference between the two entropies is knowledge. Uncertainty (2) does not depend on knowledge, but the entropy (1) does. One of them is conditional on knowledge; the other isn't.
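Here is the recapitulation in a few lines of code (the loaded-die probabilities are invented for illustration):

```python
from math import log2

def H(probs):
    """Shannon entropy, Eq. (1), in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

N = 6  # a die: six states
H_max = log2(N)                                # Eq. (2): knowing only N
H_fair = H([1/6] * 6)                          # uniform: entropy is maximal
H_loaded = H([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])   # loaded die: you know something

print(round(H_max, 4), round(H_fair, 4), round(H_loaded, 4))
# The difference H_max - H_loaded is knowledge, measured in bits.
```

The uniform distribution reproduces $\log N$ exactly, and any non-uniform distribution comes out strictly below it: that is the maximum entropy principle at work.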

A technical note for the physicists out there is imminent. All you physics geeks, read on. Everybody else, cover your eyes and go: "La La La La!"

<geeky physics> Did you realize that Eq. (2) is really just the entropy of statistical physics in the microcanonical ensemble, while Eq. (1) is the Boltzmann-Gibbs entropy in the canonical ensemble, where $p_i$ is given by the Boltzmann distribution? </geeky physics>

You can start reading again. If you had your fingers in your ears while going "La La La La!", don't worry: you're nearly normal, because reading silently means nothing to your brain.

Note, by the way, that I've been using the words "entropy" and "uncertainty" interchangeably. I did this on purpose, because they are one and the same thing here. Feel free to use either one yourself.

So, getting back to the narrative, one of the entropies is conditional on knowledge. But, you think while scratching your head, wasn't there something in Shannon's work about "conditional entropies"?

Indeed, and those are the subject of this part. The title kinda gave it away, I'm afraid.

To introduce conditional entropies more formally, and then connect to (1)--which completely innocently looks like an ordinary Shannon entropy--we first have to talk about conditional probabilities.

What's a conditional probability? I know, some of you groan "I've known what a conditional probability is since I was seven!" But even you may learn something. After all, you've learned something reading this blog even though you're somewhat of an expert, right? Why else would you still be reading?

"Infinite patience", you say? Moving on.

A conditional probability characterizes the likelihood of an event, when another event has happened at the same time. So, for example, there is a (generally small) probability that you will crash your car. The probability that you will crash your car while you are texting at the same time is considerably higher. On the other hand, the probability that you will crash your car while it is Tuesday at the same time, is probably unchanged, that is, unconditional on the "Tuesday" variable. (Unless Tuesday is your texting day, that is.)

So, the probability of events depends on what else is going on at the same time. "Duh", you say. But while this is obvious, understanding how to quantify this dependence is key to understanding information.

In order to quantify the dependence between "two things that happen at the same time", we just need to look at two random variables. In the case I just discussed, one random variable describes whether you crash your car, and the other whether you are texting. The two are not always independent, you see. The problems occur when the two occur simultaneously.
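You can read conditional probabilities straight off a joint probability table for the two variables. Here is a sketch with entirely made-up numbers (don't quote these crash statistics anywhere):

```python
# Hypothetical joint distribution P(crash status, texting status).
# All four numbers are invented for illustration; they sum to 1.
p_joint = {
    ("crash",    "texting"):     0.008,
    ("no crash", "texting"):     0.192,
    ("crash",    "not texting"): 0.002,
    ("no crash", "not texting"): 0.798,
}

# Marginal probabilities, obtained by summing out the other variable:
p_texting = sum(v for (x, y), v in p_joint.items() if y == "texting")  # 0.2
p_crash   = sum(v for (x, y), v in p_joint.items() if x == "crash")    # 0.01

# Conditional probability: P(crash | texting) = P(crash, texting) / P(texting)
p_crash_given_texting = p_joint[("crash", "texting")] / p_texting          # 0.04
p_crash_given_not     = p_joint[("crash", "not texting")] / (1 - p_texting)  # 0.0025
```

With these (invented) numbers, crashing is four times more likely than average if you are texting, and less likely than average if you are not: the two variables are dependent. Replace "texting" with "Tuesday" and make the two conditionals equal to the marginal, and you have independence.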

You know, if this was another blog (like, the one where I veer off to discuss topics relevant only to theoretical physicists) I would now begin to remind you that the concept of simultaneity is totally relative, so that the concept of a conditional probability cannot even be unambiguously defined in relativistic physics. But this is not that blog, so I will just let it go.  I didn't even warn you about geeky physics this time.

OK, here we go: $X$ is one random variable (think: $p_i$ is the likelihood that you crash your car while you conduct maneuver $X=x_i$). The other random variable is $Y$. That variable has only two states: either you are texting ($Y=1$), or you are not ($Y=0$). Those two states have probabilities $q_1$ (texting) and $q_0$ (not texting) associated with them.

I can then write down the formula for the uncertainty of crashing your car while texting, using the probability distribution

$P(X=x_i|Y=1)$ .

This you can read as "the probability that random variable $X$ is in state $x_i$ given that, at the same time, random variable $Y$ is in state $Y=1$."

This vertical bar "|" is always read as "given".

So, let me write  $P(X=x_i|Y=1)=p(i|1)$. I can also define $P(X=x_i|Y=0)=p(i|0)$. $p(i|1)$ and $p(i|0)$ are two probability distributions that may be different (but they don't have to be if my driving is unaffected by texting). Fat chance for the latter, by the way.

I can then write the entropy while texting as

$H(X|{\rm texting})=-\sum_{i=1}^N p(i|1)\log p(i|1)$.  (3)

On the other hand, the entropy of the driving variable while not texting is

$H(X|{\rm not\ texting})=-\sum_{i=1}^N p(i|0)\log p(i|0)$.  (4)

Now, compare Eqs. (3) and (4) to Eq. (1). The latter two are conditional entropies, conditional in this case on the co-occurrence of another event, here texting. They look just like the Shannon formula for entropy, which I told you was the one where "you already knew something", like the probability distribution. In the case of (3) and (4), you know exactly what it is that you know, namely whether the driver (random variable $Y$) is texting or not.
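Here is how (3) and (4) come out for two hypothetical conditional distributions over $N=4$ driving maneuvers. The numbers are invented: while texting, one maneuver (say, drifting out of your lane) dominates; while not texting, let's pretend all maneuvers are equally likely:

```python
import math

def entropy(p):
    """Shannon entropy in bits; 0 log 0 := 0."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Hypothetical conditional distributions p(i|1) and p(i|0) over N=4 maneuvers:
p_given_texting     = [0.7, 0.1, 0.1, 0.1]       # p(i|1): skewed
p_given_not_texting = [0.25, 0.25, 0.25, 0.25]   # p(i|0): uniform

H_texting     = entropy(p_given_texting)      # Eq. (3): about 1.36 bits
H_not_texting = entropy(p_given_not_texting)  # Eq. (4): log2(4) = 2 bits
```

Note that the uniform case reproduces $H_{\rm max}=\log N$: knowing that the driver is not texting (with these invented numbers) tells you nothing about which maneuver to expect, while knowing that she is texting reduces your uncertainty.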

So here's the gestalt idea that I want to get across. Probability distributions are born being uniform. In that case, you know nothing about the variable, except perhaps the number of states it can take on. Because if you didn't know that, then you wouldn't even know how much you don't know. That would be the "unknown unknowns" that a certain political figure once injected into the national discourse.

These probability distributions become non-uniform (that is, some states are more likely than others) once you acquire information about the states. This information is manifested by conditional probabilities. You really only know that a state is more or less likely than the random expectation if you at the same time know something else (like in the case discussed, whether the driver is texting or not).

Put another way, what I'm trying to tell you here is that any probability distribution that is not uniform (same probability for all states) is necessarily conditional. When someone hands you such a probability distribution, you may not know what it is conditional about. But I assure you that it is conditional. I'll state it as a theorem:

All probability distributions that are not uniform are in fact conditional probability distributions.

This is not what your standard textbook will tell you, but it is the only interpretation of "what do we know" that makes sense to me. "Everything is conditional" thus, as the title of this blog post promised.

But let me leave you with one more definition, which we will need in the next post, when we finally get to define information.

Don't groan, I'm doing this for you.

We can write down what the average uncertainty for crashing your car is, given your texting status. It is simply the average of the uncertainty while texting and the uncertainty while not texting, weighted by the probability that you engage in either of the two behaviors. Thus, the conditional entropy $H(X|Y)$, that is, the uncertainty of crashing your car given your texting status, is

$H(X|Y)=q_0H(X|Y=0)+q_1H(X|Y=1)$   (5).

That's obvious, right? $q_1$ being the probability that you are texting (while executing any maneuver $i$), and $q_0$ the probability that you are not (while executing any maneuver).
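Eq. (5) is then just a weighted sum. With hypothetical numbers (a made-up 20% chance of texting, and invented conditional distributions over $N=4$ maneuvers), the sketch looks like this:

```python
import math

def entropy(p):
    """Shannon entropy in bits; 0 log 0 := 0."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Invented numbers: q_1 = P(texting), q_0 = P(not texting) = 1 - q_1.
q1, q0 = 0.2, 0.8
p_given_texting     = [0.7, 0.1, 0.1, 0.1]       # hypothetical p(i|1)
p_given_not_texting = [0.25, 0.25, 0.25, 0.25]   # hypothetical p(i|0)

# Eq. (5): H(X|Y) = q_0 H(X|Y=0) + q_1 H(X|Y=1)
H_X_given_Y = q0 * entropy(p_given_not_texting) + q1 * entropy(p_given_texting)
```

With these numbers $H(X|Y)$ lands between the two conditional entropies and below $H_{\rm max}=\log_2 4=2$ bits: on average, knowing the texting status cannot increase your uncertainty about the maneuver.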

With this definition of the entropy of one random variable given another, we can now finally tackle information.

In the next installment of the "What is Information" series, of course!

Part 4: Information.