Saturday, December 26, 2015

Evolving Intelligence ... With a Little Help

The year 2015 may go down in history for a lot of things. Just this December saw a number of firsts: A movie about armed conflict among celestial bodies breaks all records, a rocket delivers a payload of satellites and returns back to Earth vertically, not to mention the politics of the election cycle. But just maybe, 2015 will also be remembered as the year that people started warning about the dangers of Artificial Intelligence (AI). In fact, none other than Elon Musk, the founder of SpaceX who accomplished the improbable feat of landing a rocket, is one of the more prominent voices warning of AI. After giving $10 million to the "Future of Life" Institute (whose mission is "safeguarding life and developing optimistic visions of the future", but mostly warns about the dangers of AI), he co-founded OpenAI, a non-profit research company that aims to promote and develop open-source "friendly AI". 

I wrote about my thoughts on this issue--that is, the dangers of AI--in a piece for the Huffington Post that you can read here. The synopsis (for those of you on a tight reading schedule) is that while I think that it is a reasonable thing to worry about such questions, the fears of a rising army of killing robots are almost certainly naive. The gist is that we are nowhere near creating the kind of intelligence that we should be afraid of. We cannot even design the intelligence displayed by small rodents, let alone creatures that think and plan our demise. 

When discussing AI, I often make the distinction between "Type 1" and "Type 2" intelligence. "Type 1" is the kind we are good at designing today: the Roomba, Deep Blue, Watson, and the algorithm driving the Google self-driving car. Even the Deep Neural Nets that have been one of the harbingers, it seems, of the newfound fears, belong squarely in this group. These machine intelligences are of Type 1 (those that you do not need to fear), because they aren't really intelligent. They don't actually have any concept of what it is they are doing: they are reacting appropriately to the input they are presented with. You know not to fear them, because you will not worry about Deep Blue driving Google's car, or Jeopardy-beating Watson recognizing cat videos on the internet. 

Type 2 intelligence is different. Type 2 has representations about the world in which it exists, and uses these representations (abstractions, toy models) to make decisions, plans, and to think about thinking. I have written about the importance of representations in another blog post, so I won't repeat this here. 

If you could design Type 2 intelligence, I would be scared too. But you can't. That is, essentially, my point when I tell people that their fears are naive. The reasons for the failure of the design approach are complex, and detailed in another blog post. You want a synopsis of that one too? Fine, here it is: That stuff ain't modular, and we can only design modular stuff. Type 2 intelligence integrates information at an unheard-of level, and this kind of non-modular integration is beyond our design capabilities, perhaps forever.

I advocate that you cannot design Type 2 intelligence, but you can evolve it. After all, it worked once, didn't it? And that is what my lab (as well as Jeff Clune's lab and now Arend Hintze's lab also) is trying to achieve.  

I know, I know. You are asking: "Why do you think that evolving AI should be less dangerous than designed AI?" This is precisely the question I will try to answer in this post. Along the way, I will shamelessly plug a recent publication where we introduce a new tool that will help us achieve the goal. The goal that we all are looking for--some with trepidation, some with determination and conviction. 

The answer to this question lies in the "How" of the evolutionary approach. To those not already familiar with the evolutionary approach (if this is you: my hat off to you for reading this far), this approach is firmly rooted in emulating the Darwinian process that has given rise to all the biological complexity you can see on our planet. The emulation is called the "Genetic Algorithm".

Here's the "Genetic Algorithm" (GA, for short) for you in a nutshell. Mind you, a nutshell is only large enough to hold the most basic stuff about GAs. But here we go. In a GA, a population of candidate "solutions" to a given problem is maintained. The candidates encode the solution in terms of symbolic strings (often called "genotypes"). The strings encode the solution in a way so that small changes to the string give rise to small changes to the solution (mostly). Changes (called mutations) are made to the genotypes randomly, and often strings are recombined by taking pieces of two different strings and merging them. After all the changes are done, the sequences in the new population are tested, and each is assigned a fitness. Those with high fitness are given relatively more offspring to place into the next generation, and those with less fitness ... well, less so. Because those types with "good genes" get more representation in the next generation (and those with bad genes barely leave any), over time fitness increases, and complex stuff ensues.
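For the programmers among you, the nutshell above can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the code used in my lab: the genotype is a bit string, the fitness simply counts the 1s, and the function names (`count_ones`, `mutate`, `recombine`, `evolve`) as well as all parameter values are made up for illustration.

```python
import random

def count_ones(genotype):
    """Toy fitness (an assumption for illustration): the number of 1s in the string."""
    return sum(genotype)

def mutate(genotype, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in genotype]

def recombine(a, b, rng):
    """One-point crossover: a prefix of one parent merged with a suffix of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def evolve(pop_size=50, length=32, generations=100, rate=0.01, seed=1):
    """Run the toy GA and return the best fitness seen in each generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        fitnesses = [count_ones(g) for g in population]
        history.append(max(fitnesses))
        # Fitter types get relatively more offspring; the +1 keeps every weight positive.
        weights = [f + 1 for f in fitnesses]
        next_population = []
        for _ in range(pop_size):
            mom, dad = rng.choices(population, weights=weights, k=2)
            next_population.append(mutate(recombine(mom, dad, rng), rate, rng))
        population = next_population
    return history
```

Calling `evolve()` returns one number per generation, so you can watch the best fitness in the population change over time. Real applications differ mainly in how the genotype is interpreted and how fitness is measured.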

Clearly (we can all attest to that), this is some powerful algorithm. You would not be here reading this without it, because it is the algorithm behind Darwinian evolution, which made you. But as powerful as it is, it also has an Achilles heel. The algorithm preferentially selects types that have higher fitness than the prevailing type. That's the whole point, of course, but what if the highest type is far away, and the path towards it must go through less fit types? In that case, the preference for fitter things is actually an impediment, because you can't tell the algorithm to "stop wanting higher fitness for a little while". 

This problem is known as the "valley-crossing" problem. Consider the fitness landscape in the figure below.
A schematic fitness landscape where elevation is synonymous with fitness.  Credit: Bjørn Østman.
This is known as a "rugged" fitness landscape (for obvious reasons). You are to think of the x and y coordinates of this picture as the genotype, and the z-axis as the fitness of that type. Of course, in realistic landscapes the type is specified by far more than two numbers, but it would not be as easily depicted. Think of the x and y coordinates as the most important numbers to characterize the type. In evolutionary biology, such "important characters" are called "traits". 

If a population occupies one of these peaks, an evolutionary process will have a hard time making it to another (higher) peak, as the series of changes that the type has to undergo to move to the new peak must lead through valleys. While it is in a valley, it is outcompeted by those types that are not attempting the "trip" to higher ground. Those types that are left behind and stick to the "old ways" of doing things are like reactionaries actively opposing progress. And in evolution, these forces are very strong.
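A one-dimensional caricature makes the valley-crossing problem concrete. In the sketch below (a made-up landscape and a deliberately greedy climber, both assumptions for illustration only), a searcher that only ever accepts fitter neighbors stops at whichever peak is closest, even though a higher one exists on the other side of the valley.

```python
def hill_climb(landscape, start):
    """Greedy climber: step to the better neighbor until no neighbor is fitter."""
    position = start
    while True:
        neighbors = [p for p in (position - 1, position + 1) if 0 <= p < len(landscape)]
        best = max(neighbors, key=lambda p: landscape[p])
        if landscape[best] <= landscape[position]:
            return position  # stuck: every move leads downhill (or sideways)
        position = best

# Hypothetical landscape: a low peak at index 4, a valley at index 8,
# and the highest peak at index 16.
landscape = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7, 6, 5]

print(hill_climb(landscape, 2))   # 4: trapped on the lower peak
print(hill_climb(landscape, 10))  # 16: reaches the higher peak only because it started past the valley
```

A real population is not a single greedy climber, of course, but the preference for fitter types has the same effect: the step into the valley is the one that selection punishes.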

What can you do to help the evolutionary algorithm see that it is OK to creep along at lower fitness for a little while? There are in fact many things that can be done, and there are literally hundreds, if not thousands of papers that have been written to address this problem, both in the world of evolutionary computation and in evolutionary biology. It is one of the hottest research fields in evolution.

I cannot here describe the different approaches that have been taken to increase evolvability in the computational realm, or to understand evolvability in the biological realm. There are books about this topic. I will describe here one way to address this problem, in the context of our attempts to evolve intelligent behavior. The trick is to exploit the fact that the landscape really has many more dimensions than the one you are either visualizing, or even the one you are using to calculate the fitness. Let me explain.

In evolutionary computation, you generally specify a way to calculate fitness from the genotype. This could be as simple as "count the number of 1s in the binary string". Such a fitness landscape is simple, non-deceptive (because all paths that lead upwards actually lead to the highest peak) and smooth (there is only one peak). Evolution stops once the string "1111...1111" is found. In the evolution of intelligence, it takes much more to calculate fitness. This is because the sequence, when interpreted, literally makes a brain. That brain must then be loaded onto an agent, who then has to "do stuff" in its simulated world. The better it "does stuff", the higher its score. The higher its score, the higher its fitness. The higher its fitness, the more offspring it will leave in the next generation. And because the offspring inherit the type, the more types in the next generation that can "do stuff". Which is a good thing, as now each one of those has a chance to find out (I mean, via mutations) how to "do even more stuff".

In one of the examples of the paper that I'm actually blogging about the agent has to catch some types of blocks that are raining down, and avoid others. Here's a picture of what that world looks like:

The agent's world. Credit: the authors.
The agent is the rectangular block on the bottom, and it can move left or right. It looks upwards using the four red triangles. Using these "eyes" it must determine whether the block raining down (diagonally, either left or right) is small or large. If it is small it should catch it, but if it is large it should avoid it. The problem is, the agent's vision is poor: it has a big blind spot between the sensors, so a small and a large block may look exactly the same, unless you move around, that is. That is why this classic task is called "active categorical perception": in order to perceive and classify the shape (which you do by either catching or avoiding), you have to move actively.

This is a difficult problem for the agent, as it takes a little while to determine what the object even is. Once you know what it is, you have to plan your move in such a way that the object will touch you (if it is small) or not touch you (if it is large). This means that you have to predict where it is going to land, and make your moves accordingly. And all that before the brick has hit the floor. You do need memory to pull this off, as without it you will not be able to determine the trajectory.

We have previously shown that you can evolve brains that can do this task perfectly. But this does not mean that every evolutionary trajectory reaches that point. Quite to the contrary: most of the time you get stuck at these intermediate peaks of decent, but not perfect, performance. We looked for ways to increase those odds, and here's what we came up with. What you want to do is reward things other than the actual performance. Things that you think might make a better brain, but that might not, just at this moment, make you better at the block-catching task. We call these things "neuro-correlates": characters that are correlated with good neurological processing in general. It is like selecting for good math ability when the task at hand is survival from being hunted by predators. Being good at math may not save you right then and there (while being fast would), but in the long run, being good at math will be huge because, for example, you can calculate the odds of any evasion strategy, and thus select the right one. Math could help you in a myriad of ways. Later on, in another hunt.

After all, the problem with the evolutionary algorithm is its short-sightedness: it cannot "see" the far-off peaks. Selecting for traits that you, the investigator, trust are "good for thinking in general" (the neuro-correlates) is like correcting for the short-sightedness of evolution. The mutations that increase the neuro-correlate traits would ordinarily not be rewarded (until they become important later on). By rewarding them early, you may be able to jump start evolution.

So that is what we tried, in the paper that I'm blogging about, and that appeared on Christmas Day 2015. We tried a litany of neuro-correlates (eight, to be exact). The neuro-correlates that we tested can roughly be divided into two categories: network-theory based, and information-theory based. Since the Markov brains that we evolve are networks of neurons, network-theory based measures make sense. As brains also process information, we should test information-processing measures as well.

The network-based measures are mostly your usual suspects: density of connection (in the expert parlance: mean degree), sparsity, length of longest shortest path, and a not so obvious one: length of genome encoding the network. The information-theoretic ones are perhaps less obvious: we chose information integration, representation, and two types of predictive information. If I attempted to describe these measures in detail (and why we chose them) I might as well repeat the paper. For the present purpose, let's just assume that they are well defined, and that they may or may not aid evolution.
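To give a flavor of the network-based measures without repeating the paper, here is a small sketch that computes two of them on a toy directed graph. Everything in it is an assumption made for illustration: the five-node `toy_brain`, the choice to count outgoing edges for the mean degree, and the choice to ignore unreachable node pairs when taking the longest shortest path. The paper's exact definitions may differ.

```python
from collections import deque

def mean_degree(graph):
    """Average number of outgoing connections per node."""
    return sum(len(targets) for targets in graph.values()) / len(graph)

def shortest_path_lengths(graph, source):
    """Breadth-first search: hop counts from `source` to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

def diameter(graph):
    """Longest of all shortest paths between node pairs (unreachable pairs ignored)."""
    return max(max(shortest_path_lengths(graph, node).values()) for node in graph)

# Hypothetical five-node network: two sensors, two hidden nodes, one motor.
toy_brain = {
    "sensor_a": ["hidden_1"],
    "sensor_b": ["hidden_1", "hidden_2"],
    "hidden_1": ["hidden_2"],
    "hidden_2": ["hidden_1", "motor"],
    "motor": [],
}

print(mean_degree(toy_brain))  # 1.2 (six connections over five nodes)
print(diameter(toy_brain))     # 3 (sensor_a -> hidden_1 -> hidden_2 -> motor)
```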

Which is exactly what we found empirically. Suppose, for example, that you reward (aside from rewarding the catching of the blocks) a measure that quantifies how well you integrate information. There is indeed such a measure: it is called $\Phi$ (Phi), and I blogged about that before.  You can imagine that information integration might be important for this task: the agent has to integrate the visual information from different time points along with other memories to make the decision. So the trick is that any mutation that increases information integration will have an increased chance of making it into the next generation, even though it may not be useful at that moment. So, in other words, we are helping evolution to look forward in time, by keeping certain mutations around even if they are not useful at the time that they occur. Doing this, what may have looked like a valley in the future, may not be a valley after all (because of the presence of a mutation that was integrated into the genome ahead of time).

So what should we reward? Easy, right? Reward those mutations that help the brain work well! Oh wait, we don't know how the brain works. So, we have to make guesses as to what things might make the brain work well. And then test which of these, as a matter of fact, do help in hindsight. Here are the eight that we tested:

Network-theory based:

1. Minimum Description Length (MDL) (which here you can think of as a proxy for "brain size")
2. Graph Diameter (Longest of all shortest paths between node pairs)
3. Connectivity (the mean degree per node)
4. Sparseness (kind of the opposite of connectivity)

Information-theory based:

5. Representation (having internal models of the world)
6. Information Integration (Phi, the "atomic variant")
7. Predictive Information (between sensor states)
8. Predictive Information (between sensor and actuator states)
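Of the information-theoretic measures, predictive information is the easiest to sketch. One common reading, which I assume here purely for illustration, is the mutual information between the sensor state now and the sensor state one time step later, estimated from observed frequencies. The function names and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(
        (count / n) * log2((count / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), count in joint.items()
    )

def predictive_information(states):
    """Mutual information between each state and the state that follows it."""
    return mutual_information(list(zip(states, states[1:])))

# A strictly alternating sensor: the current state fully predicts the next one.
print(predictive_information([0, 1, 0, 1, 0, 1, 0, 1, 0]))  # 1.0 bit

# A sensor that never changes carries no information about anything.
print(predictive_information([0, 0, 0, 0, 0]))  # 0.0 bits
```

The sensor-to-actuator variant would pair sensor states with the actuator states that follow them instead of pairing sensor states with themselves.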

Here's what we found: Graph diameter, Phi, and connectivity all three significantly help the evolutionary algorithm when the overall rewarded function is the fitness times the neuro-correlate. Sparseness, as well as the two predictive information measures, made things worse. This finding reinforces the suspicion that we really don't know what makes brains work well. In neuroscience, sparse coding is considered a cornerstone of neural coding theory, after all. But we should keep in mind that these findings can very well depend significantly on the type of task investigated, and that for other tasks the findings for what works might be reversed. For example, the block-catching task requires memory, and predictive information is maximized for purely reactive machines. If the task did not require memory, it is likely that predictive information is a good neuro-correlate.
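The "fitness times the neuro-correlate" recipe is as simple as it sounds. The sketch below shows only the idea; the candidate names and numbers are invented, and any normalization of the neuro-correlate that the actual experiments may apply is left out.

```python
def combined_score(task_fitness, neuro_correlate):
    """Selection score: task performance multiplied by the neuro-correlate value."""
    return task_fitness * neuro_correlate

# Hypothetical candidates: (name, task fitness, neuro-correlate value).
candidates = [
    ("plain", 0.80, 1.0),
    ("integrator", 0.75, 1.2),
]

ranked = sorted(candidates, key=lambda c: combined_score(c[1], c[2]), reverse=True)
print([name for name, _, _ in ranked])  # ['integrator', 'plain']
```

Here the "integrator" is slightly worse at the task right now, but the extra reward keeps its mutations in the population, which is exactly the kind of forward-looking nudge described above.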

To check how much the value of the neuro-correlate depends on the task chosen, we repeated the entire analysis for a very different task: one that does not even require the agent to have a body.

The alternate task we tested is the ability to generate random numbers using only deterministic rules. That this is a cognitively complex task has been known for some time: the ability to generate random (or, I should say, random-ish) numbers is often used to assess cognitive deficiencies in people. Indeed, if you (a person) were asked to do this, you would need to keep track of not only the last 5-7 numbers you generated (which you can do using short-term memory), but also of how often you have produced doubles, and triples, etc, and of what numbers. The more you think about this task, the more you appreciate its complexity. And you can easily imagine that different cognitive impairments might lead to different signature departures from randomness.

Of course this task is easy if you have access to a random number generator, but the Markov brains had none. So they had to figure out an algorithm to produce the numbers (which is also what we do in computers to produce pseudo-random numbers).
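To see what "random-ish numbers from deterministic rules" looks like in a computer, here is a linear congruential generator, one of the oldest textbook recipes. To be clear, this is not what the Markov brains evolved (they had to invent their own rule); it is only an illustration, with constants taken from the sample `rand()` implementation in the C standard.

```python
def lcg(seed, count, modulus=2**31, multiplier=1103515245, increment=12345):
    """Linear congruential generator: state -> (multiplier * state + increment) mod modulus."""
    state = seed
    values = []
    for _ in range(count):
        state = (multiplier * state + increment) % modulus
        values.append(state)
    return values

print(lcg(seed=0, count=1))  # [12345]
print(lcg(seed=42, count=3) == lcg(seed=42, count=3))  # True: same seed, same "random" sequence
```

The sequence looks haphazard, yet every number is fully determined by the one before it. That is the sense in which the task demands an algorithm rather than a source of noise.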

The results with the random number generation (RNG) task were roughly the same as with the block-catching task: Graph diameter, Phi, and connectivity scored well, while predictive information and sparseness scored negatively. Representation cannot be used as a neuro-correlate for this task, as there is no external world that the brain can create representations of. So while the individual results differ somewhat, there seems to be some universality in the results we found.

Of course, there are very likely better neuro-correlates out there that can boost the performance of the evolutionary algorithm much more. We don't know what these are, as we don't know what it is that makes brains work better. There are many suggestions, and we hope to try some in the future. We can think of

1. Other graph-based measures such as modularity
2. Novelty search (rewarding brains that see or do things they haven't seen or done before)
3. Conditional mutual information over different time intervals
4. Measures of information transfer
5. Dual total correlation

Of course, the list is endless. It is our intuition about what matters in computation in the brain that should guide our search for measures, and whether or not they matter is then found empirically. In this manner, evolutionary algorithms might also give us a clue about how all brains work, not just those in silico.

But I have not yet answered the question that I posed at the very beginning of this post. Most of you are forgiven for forgetting it as it is figuratively eons ago (or 36 paragraphs, which in blogging land is considered almost synonymous with eons). The straw man reader asked: "What makes you think that evolution (as opposed to design) will produce "nice" intelligences, that is, the kind that will not be bent on destruction of all of humanity?"

The answer is that we cannot (we firmly believe that) evolve intelligence in a vacuum. The world in which intelligence evolves must be complex, and difficult to predict. Thus, it must change in subtle ways, ways that take intelligence to forecast. The best world to achieve this is a world in which there are other agents, with complex brains. Then, prediction requires the prediction of behaviors of others, which is best achieved by understanding the other. Perhaps, by realizing that the other thinks like you think. When doing this, you generally also evolve empathy. In other words, as we evolve our agents to survive in groups of other agents, cooperative behavior should ultimately evolve at the same time.

Our robots, when they first open their eyes to the real world, will already know what cooperation and empathy are. These are not traits that human programmers are thinking of, but evolution has stumbled upon these adaptive traits over and over again. That is why we are optimistic: we will be evolving robots with empathic brains. And if they show signs of psychopathology in their adolescence? Well, we know where the off switch is.

The publication this blog post is based on is open access (gold):

J. Schossau, C. Adami, and A. Hintze, Information-Theoretic Neuro-Correlates Boost Evolution of Cognitive Systems. Entropy 18 (2016) e18010006.

Tuesday, September 29, 2015

Shadowgenes: The hidden lives of pseudogenes

Sometimes we are lulled into a belief that we actually know how biology works. Not that biology is simple, mind you, but after a while we "get the hang of it" and think we have it under control. Sure, we know that molecular biology is complicated: Genes have evolved for eons to be transcribed into mRNA that itself is translated into proteins, which do all the work to build and maintain a cell, and perform computations that lead to decisions that increase the chance of the organism to leave offspring and continue this gigantic experiment of natural selection. 
Image credit: Matthew Bowden, www.digitallyrefreshing.com
And then once in a while you find that it is not quite the way you thought. Like, you may discover that the expression of genes is influenced by things other than proteins that bind to DNA regulatory regions. When I was a postdoc at Caltech taking part in informal computational biology brainstorming sessions for example, I asked the assembled Caltech braintrust: “Couldn’t RNA bind to other RNA to influence expression levels?” and was laughed out of the room. “RNA does not bind to RNA, silly”, I was schooled, and I could almost feel the virtual hands petting the young physicist’s head that had not studied enough biology. 

This was 1997 when I was a Burroughs-Wellcome fellow in Computational Biology, quite a bit of time before the role of microRNAs in the regulation of gene expression was recognized. That was a "It's not quite that simple" moment. And of course, if you try to understand biology, these moments keep on coming. "Bacteria don't have adaptive immune systems" our wisdom would teach the students, and then someone somewhere discovers the CRISPR system, and the rest is again history and we find that it's not that simple. There are plenty more of these moments, and many more to come. I would like to tell you about one such moment that I was involved in recently, schooling me about the concept of pseudogenes. I'm not sure how much it will change the standing paradigm, but it sure made me rethink biology again: "It's not quite that simple".

Here's what we know about pseudogenes. They are dead genes. Genes that once were (expressed, that is). Standard lore says that pseudogenes either stopped being expressed or lost their ability to function even when expressed. Dead as a doornail, thus. Wikipedia tells us that much:

"Pseudogenes are dysfunctional relatives of genes that have lost their gene expression in the cell or their ability to code protein".

Dysfunctional. That's good. But there is an important element there: "Relatives of genes". What's that all about?

Think about it. How could a gene that is functional possibly die? It's there for a reason, no? Genes evolve to enhance the likelihood of survival of the organism. You mess with one, the organism dies or performs much worse. Genes can't die without ultimately taking the organism with them. How can there be pseudogenes at all?

The answer to this conundrum lies in the fantastic arsenal of molecular mechanisms that billions of years of evolution have brought us. The origin of most of evolutionary novelty today (and in the recent as well as not so recent past) can be traced back to gene duplications. The molecular mechanisms that mess with perfect inheritance aren't just point mutations. They include insertions, deletions, and wholesale duplications of genetic material. Never mind how that happens in detail. Let's focus on that duplication process. Once in a while, an organism ends up with two (or even more) copies of the same gene. That means you (the organism) have one good such gene, and a copy with unlimited potential. This gene has the right stuff, as well as the enviable luxury of freedom: keep what you have and become whatever you want! It is the privilege of the offspring of the well-heeled. 

So what is the fate that awaits these duplications? No point in imitating their parent's function: let's explore new frontiers! In the land of genetics, however, their possibilities of future advancement are highly constrained. They may not need to perform the parent's function (as the parents are ably doing that already) and can explore new functions via mutations. But the mutations that they incur aren't exactly helping. In fact, they are mostly not helpful AT ALL. If a mutation does not propel you (the duplicated gene) to new functions, it likely will abort the one you had, either by messing with the sequence so that the protein does not fold anymore or, even worse, by truncating the protein early (meaning that a stop codon was inserted in the middle of the sequence), making a short and most likely nonfunctional protein. There are plenty of examples of such deceased genes in the genome. Another way to kill a gene is to mess with its transcriptional logic (the on-off switch to make the protein, so to speak). If the on-off switch is mutated so that it is permanently in the off position, well then you have another pseudogene. 

According to this model, pseudogenes should play no active role in the functioning of a cell. A passive role has been described, where pseudogenes are thought to be a "reservoir" of potentially active sequences that can be thought of as an "evolutionary playground", from which new genes can arise like a Phoenix from its ashes, so to speak. 

But there have been tantalizing hints that pseudogenes also have an active role,  that they are only "mostly dead", to throw in a Princess Bride reference. In the last 15 years a mounting number of cases have been reported where pseudogenes are transcribed nonetheless, and contribute to the organism's function in one way or the other. Interestingly, many of these cases are linked to diseases, such as cancer. In fact, as the ENCODE collaboration has been insisting (and in so doing providing another one of the "it's not quite that simple" moments), almost all of our genome is transcribed at one point or other, whether functional or not. 

In my lab I have a project to study the evolution of drug resistance of HIV (the virus that causes AIDS), using cell cultures in which the virus is exposed to different conditions. The project is led by the very talented Dr. Aditi Gupta, who after a Ph.D. in computational biology joined my lab and taught herself the necessary bench science to pull off these kinds of experiments. Aditi generates lots of RNAseq data in this project. For those asleep in the last 10 years, RNAseq is a method to accurately measure the level of transcription of genes. It's quite a fantastic method that has all but replaced the microarrays that used to be the staple of molecular biology. I used to teach bioinformatic analysis of microarrays: you can believe me when I say that you should not trust microarray data as far as the next trash bin. 

But RNAseq is fantastically accurate, if you know how to analyze it. And fortunately Aditi does, so she decided on a whim to compare the RNAseq profile of cells infected with HIV to that of a control that was not infected. When you do this, you immediately notice the differential expression of genes that are a part of obvious pathways linked to infection. While reading papers about the possible role of pseudogenes in cancers, she started wondering whether pseudogenes also were expressed in her HIV samples. And sure enough, not only were pseudogenes expressed, but they were differentially expressed, meaning that they were either much more, or much less, expressed in the infected cells compared to the uninfected cells. 

What's up with that? So she applied extremely stringent criteria for differential expression (at least a four-fold difference), corrected for false-discovery, and finally arrived at a list of 21 pseudogenes that, somehow or other, seem to play an active functional role in HIV infection.  
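For the computationally inclined, the filtering logic (a four-fold change in either direction, plus a false-discovery correction) can be sketched as follows. The text above does not say which correction was used, so the Benjamini-Hochberg procedure here is an assumption, as are the 0.05 cutoff, the function names, and the four invented records. Real RNAseq pipelines do considerably more (normalization, replicates, dispersion estimates).

```python
from math import log2

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def differentially_expressed(records, min_fold=4.0, max_q=0.05):
    """records: (name, infected_level, control_level, p_value) tuples."""
    q_values = benjamini_hochberg([r[3] for r in records])
    hits = []
    for (name, infected, control, _), q in zip(records, q_values):
        log_fold = log2(infected / control)
        # Keep changes of at least `min_fold` up OR down that survive the correction.
        if abs(log_fold) >= log2(min_fold) and q <= max_q:
            hits.append(name)
    return hits

# Invented toy data, not measurements from the study.
toy_records = [
    ("pseudo_A", 80.0, 10.0, 0.001),   # 8-fold up, significant
    ("pseudo_B", 5.0, 40.0, 0.004),    # 8-fold down, significant
    ("pseudo_C", 30.0, 10.0, 0.002),   # only 3-fold up
    ("pseudo_D", 100.0, 10.0, 0.30),   # 10-fold up, but not significant
]

print(differentially_expressed(toy_records))  # ['pseudo_A', 'pseudo_B']
```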

So what do you do with that? Well, the first thing is you check out the parents of these pseudogenes. Because as you remember from what I wrote earlier (yes, I realize, figuratively eons ago) pseudogenes are "dysfunctional relatives". So each pseudogene has a parent from whence it sprang. Who are these pseudogenes' parents? What do they do for a living? 

First of all, about a third of the parent genes of the pseudogenes in our study are also differentially expressed in HIV infection, which is significant because--obviously--a random set of genes does not have a third of them implicated in HIV infection. Taken together, about half of the parent genes play a role in viral infections (for example, some are also active in influenza). Thus, most of the parents are one way or the other involved in fighting the infection. However, it is not true that parent and pseudogene are always both up-, or both down-regulated (we say these pairs have "synergistic expression").  There are plenty of examples of parent-up, pseudo-down, and vice versa (we call this "antagonistic expression"). 
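In code, the distinction between the two kinds of pairs is just a comparison of signs. The function below is a hypothetical helper written for this post, taking the log fold changes (infected versus control) of a parent gene and its pseudogene.

```python
def classify_pair(parent_log_fold, pseudogene_log_fold):
    """Label a parent/pseudogene pair by the direction of their expression changes."""
    if parent_log_fold == 0 or pseudogene_log_fold == 0:
        return "unchanged"
    same_direction = (parent_log_fold > 0) == (pseudogene_log_fold > 0)
    return "synergistic" if same_direction else "antagonistic"

print(classify_pair(2.5, 3.0))    # synergistic: both up-regulated
print(classify_pair(2.5, -3.0))   # antagonistic: parent up, pseudogene down
```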

A typical functional mechanism for synergistic expression is this. Imagine that a host gene, when expressed, is exposed to a microRNA (possibly activated by a virus) that attempts to silence this gene. Up-regulating the expression of pseudogene copies of that parent now would make sense for the cell, as the pseudogene RNA product could serve as a decoy--as a molecular sponge so to speak--to soak up the microRNAs triggered by the invader. Such an interaction has been seen with the protein PTEN and its pseudogene offspring PTENP1, involved in cancer [1]. We see many examples of such synergistic interactions in HIV as well (but of course we cannot be sure that the mechanism is similar).

An example of antagonistic interactions comes from a case when a pseudogene's mRNA product attracts the attention of a protein that helps stabilize the mRNA of the parent gene. By pulling away the stability-inducing protein, the pseudogene indirectly reduces the expression of the parent gene (because the parent's mRNA is now unstable and degrades). In this case the reduced expression gives rise to insulin resistance and Type 2 diabetes [2]. We have examples consistent with such a pattern in our list of pseudogene-gene pairs also, but again we do not have experimental evidence to support any particular mechanistic hypothesis. Indeed, we can also see patterns where a pseudogene transcript is normally expressed in uninfected cells, but suppressed under HIV infection. At the same time, the parent gene is up-regulated under infection. In general, the interaction between the pseudogene and the gene doesn't have to be direct, as in the examples I gave. It is even possible that the differential regulation of the pair has nothing to do with each other. But because of the similarity of the transcripts, we expect that the link is often fairly direct.

So we see that pseudogenes can have hidden lives. The ENCODE project suggests that one out of five pseudogenes is transcriptionally active, but given the opportunistic nature of evolution (and the increasing evidence that these interactions come to the fore in particular in disease states), the fraction might be much higher. Because these pseudogenes, it appears, were never dead to begin with, we should not call them "zombiegenes". Instead, we should simply call them "shadowgenes": they are a shadow of their parent, live hidden lives most of the time, but come out of hiding when coaxed out by an invader. Then, depending on whether they help defend the cell, or aid and abet the aggressor, they can be hero or villain. Which is quite appropriate for the shadow's denizens.

The manuscript [3] describing the results of the differential expression of pseudogenes in HIV was published in the journal Viruses.

[1] Poliseno, L. et al. A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature 465 (2010) 1033–1038.
[2] Chiefari, E. et al. Pseudogene-mediated posttranscriptional silencing of HMGA1 can result in insulin resistance and type 2 diabetes. Nat. Commun. 1 (2010) 40.
[3] A. Gupta, C.T. Brown, Y.-H. Zheng, and C. Adami.  Differentially-expressed pseudogenes in HIV-1 infection. Viruses 7 (2015)  5191-5205.

Tuesday, July 14, 2015

On quantum measurement (Part 6: The quantum eraser)

Here's to you, quantum measurement aficionado, who has found their way to the sixth installment, breathless (I hope), to learn of the fate of the famous cat, eponymous with one of the great ones of quantum mechanics. Does she live or die? Can she be both dead and alive? What did this kitten ever do to deserve such an ambiguous fate? And, how many times can you write a blog post teasing cat-revelations and still not talk about it after all? And talk about erasers instead? How many?

In my defense, I tried to talk about the cat in this post. I really did. But the quantum eraser, and in particular the double-slit experiment that I have to describe first, took so much room that it was just not in the cards. But rejoice then, this series will be even longer!


OK then. I think we are beyond the obligatory summary of previous posts now. This can get tedious very fast. ("In Part 27, we learned that,..."). Instead, I refer those who stumbled on this series to the first post here, which should be enough to get you on the road.

So where were we? In the last post, I ended with teasing you with the quantum eraser. This post will all be about the quantum description of these seemingly paradoxical situations: the two-slit experiment, and the quantum eraser. Post 7 will thus be the one concerned with Felis Schrödingerii, and it is Part 8 that will deal with even more obscure paradoxes such as the Zeno and Anti-Zeno effect. But that post will use these paradoxes as a ploy to investigate something altogether more important: namely the legacy of Hans Bethe's insight from the very first post of the series.

Foreshadowing aside, let's take a look at the infamous double-slit experiment (also known as "Young's experiment"). Richard Feynman (in his lectures) famously remarked that the experiment is "a phenomenon which is impossible […] to explain in any classical way, and which has in it the heart of quantum mechanics. In reality, it contains the only mystery."

Really, the only mystery? 

Well, once I get through all this (meaning this and the following posts), you might agree that perhaps this is not so far off the mark. If I can get you to slowly and knowingly nod about this statement (while thinking that there is also much more to QM than that), then I did my job.

As is my wont, I will be borrowing images from Wikimedia to illustrate stuff. (Because they explicitly encourage that sort of thing). Here, for example, is what they would show about the double-slit experiment:
Fig. 1: The double-slit experiment (source: Wikimedia)

Here's what happens in this experiment. You shoot electrons at a screen that has, yes you guessed it, two slits. It is important that these are electrons, not photons. (Actually it is not but you would not understand why I say this just yet. So let's forget I even uttered this). You shoot particles at a screen. Particles! (Imagine this in the voice of Seinfeld's Seinfeld). There is a screen behind the double slit. It records where the electrons that got through the slits land. What would you expect to find there?

Maybe something like this?
Fig. 2: The pattern expected at the screen if classical particles are sent in from the left.

If the world was classical, yes this is what you would see. But the world is not classical, and this is not what we observe. What you get (and what you are looking at below is data from an actual such experiment) is this:
Fig. 3: The intensity pattern on the screen if quantum particles are sent in. (Source: Wikimedia)
Now, don't focus on the fact that you have a big blob in the middle, and two smaller blobs left and right. I don't have time (and neither do you) to explain those. Focus on the blob in the middle instead. You thought you were going to see two intense blobs of light, one behind each slit. But you don't. You think something is wrong with your apparatus. But try as you might, you get this pattern every time.

What you should focus on are the dark lines that interrupt the blob. Where on earth do they come from? They look suspiciously like interference patterns, as if you had shone some light on the double slit, like so:
Fig. 3: Interference patterns from light shining on a double slit.
But you did not shine light on these slits. You sent particles. Particles! A particle cannot interfere with itself! Can it?

Here is where we pause. You certainly have to hand it to Feynman. It seems these electrons are not your ordinary classical particles.

The pattern on the screen does not tell you that an electron "went through one or the other slit". The electron's wavefunction interacts with the slits, and then the entangled wavefunction interacts with the screen, and then you look at the screen (not the quantum state, as you remember). Let's find out how that works, using the formalism that we just learned. (Of course you saw this coming, didn't you?)

Say you have your electron, poised to be sent onto the double slit. It is described by a wave function \(\Psi(x)\). All your standard quantum stuff applies to this wavefunction by the way, such as an uncertainty relation for position and momentum.  I could have written down the wavefunction of a wave packet traveling with momentum p just as well. None of these complications matter for what I'm doing here. 

When this wavefunction crosses the double slit, it becomes a superposition of having gone through the slit L and through the slit R, and I can write it like this (see Figure below).

                             \(|\Psi_1\rangle=\frac1{\sqrt 2}\left(|\Psi_L\rangle+|\Psi_R\rangle\right)\).  (1)

I have written \(|\Psi_1\rangle\) in terms of the "bra-ket" notation to remind you that the wavefunction is a vector, and I have identified only the position (because this is what we will measure later).
Fig. 4: Sketch of the double-slit experiment in terms of wavefunctions
Of course, you are free to roll your eyes and say "How could the electron possibly be in a superposition of having gone through the left slit and the right slit at the same time?" But that is precisely what you have to assume in order to get the mathematics to agree with the experimental result, which is the intensity pattern on the right in the figure above. 

What happens when we re-unite the two branches of the wavefunction at the spot where I write \(\Psi_2\) in the figure above and then measure the space-component of the wavefunction? We learned in Parts 1-5 how to make that measurement, so let's get cracking. 

I spare you the part where I introduce the measurement device, entangle it with the quantum system, and then trace over the quantum system, because we already know what this gives rise to:  Born's rule, which says that the likelihood to get a result at x is equal to the square of the wavefunction

                                                  \(P(x)=| \langle x|\Psi_2\rangle|^2=|\Psi_2(x)|^2\).     (2)

There, that's how simple it is. I remind you here that you can use the square of the wavefunction instead of the square of the measurement device's likelihood because they are one and the same. The quantum entanglement tells you so. The Venn diagram tells you so. If this is not immediately obvious to you, I suggest a revisiting of Parts 4 & 5.

Let's calculate it then. Plugging (1) into (2) we get

      \(P(x)=\frac12\left(|\Psi_L(x)|^2+|\Psi_R(x)|^2+ 2 {\rm Re}[\Psi_L^\star(x)\Psi_R(x)]\right)\) (3)

The first two terms in (3) do what you are used to in classical physics: they make a smudge on the screen at the locations of the two slits, \(x=L\) and \(x=R\). The interesting part is the third term, which is given by the real part of the product of the complex conjugate of \(\Psi_L(x)\) with \(\Psi_R(x)\). That's what the math says. And if you write down what these things are, you will find that this is the part that creates the "fringes", namely the interference pattern between \(\Psi_L\) and \(\Psi_R\). That's because that cross term can become negative, while the first two terms can never be negative. If you did not have a wavefunction split just like I showed in (1), then you would not get that cross term, and you would not be able to understand the experiment. And hence you would not understand how the world works.
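To make the role of the cross term concrete, here is a minimal numerical sketch of Eq. (3). The screen grid, the shared Gaussian envelope, and the fringe parameter `k` are illustrative assumptions of mine (a far-field caricature in which both branches overlap on the screen), not values from any experiment.

```python
import numpy as np

# Illustrative assumptions (not from the experiment): a screen coordinate grid,
# a common Gaussian envelope for both branches, and a linear relative phase.
x = np.linspace(-10.0, 10.0, 2001)
envelope = np.exp(-x**2 / 18.0)
k = 3.0  # hypothetical parameter that sets the fringe spacing

psi_L = envelope * np.exp(+0.5j * k * x)   # branch through the left slit
psi_R = envelope * np.exp(-0.5j * k * x)   # branch through the right slit

# Eq. (3), term by term
classical = 0.5 * (np.abs(psi_L)**2 + np.abs(psi_R)**2)   # first two terms
cross = np.real(np.conj(psi_L) * psi_R)                   # (1/2) * 2 Re[psi_L^* psi_R]
P = classical + cross

# The same pattern computed directly from Eqs. (1) and (2)
P_direct = np.abs((psi_L + psi_R) / np.sqrt(2))**2
```

Plotting `classical` and `P` against `x` shows the smooth blob and the fringed blob, respectively. The dark lines sit where `cross` is negative enough to cancel the first two terms.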

"But, but.." I can almost hear some of you object, "surely I can find out through which of the slits the electron actually went, no?"

Indeed you could try. Let's see what happens if you do. One way to do this is to put a device into the path of the electron that flips its spin (its helicity) in one of the branches (say the left, L), but not in the right. Then all I have to do is measure the helicity to know through which slit it went, right? But how would you explain, then, the interference pattern?

Well, let's try this (and ignore for a moment that this experiment is not at all easy to do with electrons, although it is very easy to do with photons, using polarization instead of helicity). So now the wavefunction has another tag, which I will call u (for "up" so I don't have to type \(\uparrow\) all the time), and d.

After the slits, the wavefunction becomes

                     \(|\Psi_1\rangle=\frac1{\sqrt2}(|\Psi_L\rangle|u\rangle+|\Psi_R\rangle|u\rangle) \)  (4)

The new identifier ("quantum number") is in a product state with respect to all the other quantum numbers. And if nothing else happened, these quantum numbers would not affect the interference patterns. (This is why I was able to ignore the momentum variable above, for example: it is in a product state). But now let's put the helicity flipper into the left pathway, like in the figure below:
Fig. 6: Double-slit interference experiments with location variable tagged by helicity
We then get the entangled state:

         \(|\Psi\rangle=\frac1{\sqrt2}\left(|\Psi_L\rangle|d\rangle+|\Psi_R\rangle |u\rangle\right)\) . (5)

All right, you managed to tag the location variable with helicity. Now we could measure the helicity to find out which slit the electron went through, right? But before we do that, let's take a look at the interference pattern, by calculating \(P(x)\) as before. Rather than (3), we now get

 \(P(x)=\frac12\left(|\Psi_L(x)|^2\langle d|d\rangle +|\Psi_R(x)|^2\langle u|u\rangle + \Psi_L^\star(x)\Psi_R(x)\langle d|u\rangle  + \Psi_R^\star(x)\Psi_L(x)\langle u|d\rangle\right)\)

Now, the helicity states \(|u\rangle\) and \(|d\rangle\) are orthogonal, of course. What that means is that \(\langle d|d\rangle=\langle u|u \rangle=1\) while \(\langle d|u\rangle=\langle u|d\rangle=0\), and \(P(x)\) simply becomes

            \(P(x)=\frac12\left(|\Psi_L(x)|^2+|\Psi_R(x)|^2\right)\).  (6)

and the interference term--is gone. What you get is the pattern that you see on the right hand side of the figure above: no fringes, just two peaks (one from the left slit, one from the right).

You note of course that we didn't even need to measure the helicity in order to destroy the pattern. Just making it possible to do so was sufficient.
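The disappearance of the cross term can be checked numerically. The sketch below is only an illustration: the branch wavefunctions (a shared Gaussian envelope with opposite linear phases, fringe parameter `k`) are my own assumptions, and the helper `screen_pattern` is a hypothetical name. It builds the two-component amplitude for Eq. (4) and for Eq. (5) and sums over the helicity, which is never looked at.

```python
import numpy as np

# Helicity basis states |u> and |d>
u = np.array([1.0, 0.0])
d = np.array([0.0, 1.0])

# Illustrative branch wavefunctions (assumed, not derived from a real setup)
x = np.linspace(-10.0, 10.0, 2001)
envelope = np.exp(-x**2 / 18.0)
k = 3.0
psi_L = envelope * np.exp(+0.5j * k * x)
psi_R = envelope * np.exp(-0.5j * k * x)

def screen_pattern(tag_L, tag_R):
    """P(x) for (psi_L|tag_L> + psi_R|tag_R>)/sqrt(2) when helicity is not observed."""
    # amplitude[i, s] = <x_i, s|Psi>; the helicity index s is summed over below
    amplitude = (np.outer(psi_L, tag_L) + np.outer(psi_R, tag_R)) / np.sqrt(2)
    return np.sum(np.abs(amplitude)**2, axis=1)

P_untagged = screen_pattern(u, u)   # Eq. (4): both branches carry |u>
P_tagged = screen_pattern(d, u)     # Eq. (5): left branch flipped to |d>
```

`P_untagged` still has fringes, while `P_tagged` reduces to the smooth sum of Eq. (6). Nothing in `screen_pattern` measures the helicity: orthogonality of the tags is enough.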

But guess what. We can undo this attempt at a measurement. Yes, we can "unlook" the location of the electron, and get the interference pattern back. That's what the quantum eraser does, and you'd be surprised how easy it is.

The trick here is to insert a filter just before measuring the re-united wavefunction. This filter measures the wavefunction not in the orthogonal basis \((u,d)\) but rather in a basis that is rotated by 45\(^\circ\) with respect to the \((u,d)\) system. The rotated basis states are 
 \(|U\rangle=\frac1{\sqrt2}(|u\rangle+|d\rangle)\) and \(|D\rangle=\frac1{\sqrt2}(|u\rangle-|d\rangle)\). Detectors that measure in such a rotated (or "diagonal") basis are easy to make for photon polarizations, and quite a bit harder for spins, but let that be an experimental quibble.

If we put such a detector just before the screen (as in Fig. 7 below), the wavefunction becomes

\(\frac1{\sqrt2}\left(|\Psi_L\rangle|d\rangle+|\Psi_R\rangle |u\rangle\right)\rightarrow \frac1{2}\left(|\Psi_L\rangle+|\Psi_R\rangle\right)\frac1{\sqrt2}(|u\rangle+|d\rangle)\),    (7)

that is, the space wavefunction and the spin wavefunction are disentangled; they are back to being a product. You lose half your initial amplitude, yes. (See the factor 1/2 in (7)?)
Fig. 7: Double-slit experiment with eraser that reverses the previous measurement. The interference pattern re-emerges.

How do you show that this is right? It's a simple application of the quantum measurement rules I showed you in Parts 1-5. To measure in the "diagonal" basis, you start with a measurement ancilla in one of the basis states (say \(|D\rangle\)), then write the L and R wavefunction in that basis, and then apply the measurement operator, which as you remember is the "attempted cloning" operator. Then finally you measure D (that is, you calculate \(\langle D|\Psi\rangle\)). I could show these three lines, or I could let you try them yourself.

I think I'll let you try it yourself, because you should really see how the disentanglement happens. Also, I'm really tired right now :-)
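If you want to check your pencil-and-paper answer, here is a small numerical sketch. It is a shortcut, not the ancilla-based calculation of Parts 1-5: it simply projects the tagged state (5) onto \(|U\rangle\) or \(|D\rangle\). The branch wavefunctions and the helper name `after_filter` are illustrative assumptions of mine.

```python
import numpy as np

# Helicity basis and the 45-degree rotated ("diagonal") basis
u = np.array([1.0, 0.0])
d = np.array([0.0, 1.0])
U = (u + d) / np.sqrt(2)
D = (u - d) / np.sqrt(2)

# Illustrative branch wavefunctions (assumed for this sketch)
x = np.linspace(-10.0, 10.0, 2001)
envelope = np.exp(-x**2 / 18.0)
k = 3.0
psi_L = envelope * np.exp(+0.5j * k * x)
psi_R = envelope * np.exp(-0.5j * k * x)

# Tagged state of Eq. (5): amplitude[i, s] = <x_i, s|Psi>
amplitude = (np.outer(psi_L, d) + np.outer(psi_R, u)) / np.sqrt(2)

def after_filter(basis_state):
    """Screen pattern behind a filter that passes only `basis_state`."""
    spatial = amplitude @ np.conj(basis_state)   # <basis_state|Psi> as a function of x
    return np.abs(spatial)**2

P_U = after_filter(U)   # spatial part (psi_L + psi_R)/2, as in Eq. (7)
P_D = after_filter(D)   # spatial part (psi_R - psi_L)/2
```

Behind the \(|U\rangle\) filter the fringes are back, at half the total intensity. Behind a \(|D\rangle\) filter you get the complementary "anti-fringes", and adding the two patterns returns the fringe-free result (6).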

That this really could work was first proposed by Marlan Scully and Kai Drühl in a now classic paper [1] (see also [2]). The experiment was carried out successfully multiple times, notably in [3]. You might object that they used photons and not electrons there. But the important point is the erasure of the which-path information provided by the "helicity-flipper" (which is a "polarization-rotator" if you're using photons), and that certainly does not depend on whether you are massive or not. Because you know, neither the photon nor the electron is a particle. Because there is no such thing as a particle. There is only the illusion of particles generated by measurement devices that go 'click'. But if you have read this far, then you already know not to trust the measurement devices: they lie.

And the best illustration of these lies is perhaps Schrödinger's cat. You will get to play with the equations describing her, I promise. And maybe, just maybe, you will also come to appreciate that quantum reality, and the reality we perceive via our classical measurement devices, are two very different things. 

Part 7 is here. Sorry, no cats.

[1] M.O. Scully and K. Drühl, Quantum eraser – a proposed photon-correlation experiment concerning observation and delayed choice in quantum mechanics, Phys. Rev. A 25, 2208 (1982).

[2] M.O. Scully, B.-G. Englert, H. Walther, Quantum optical tests of complementarity, Nature 351, 111 (1991).

[3]  S. P. Walborn, M. O. Terra Cunha, S. Padua, Double-slit quantum eraser. Phys. Rev. A 65, 033818 (2002).

Saturday, May 23, 2015

What happens to an evaporating black hole?

For years now I have written about the quantum physics of black holes, and each and every time I have pushed a single idea: that if black holes behave as (almost perfect) black bodies, then they should be described by the same laws as black bodies are. And that means that besides the obvious processes of absorption and reflection, there is the quantum process of spontaneous emission (discovered to occur in black holes by Hawking), and this other process, called stimulated emission (neglected by Hawking, but discovered by Einstein). The latter solves the problem of what happens to information that falls into the black hole, because stimulated emission makes sure that a copy of that information is always available outside of the black hole horizon (the stories are a bit different for classical vs. quantum information). These stories are told in a series of posts on this blog:

Oh these rascally black holes (Part I)
Oh these rascally black holes (Part II)
Oh these rascally black holes (Part III)
Black holes and the fate of quantum information
The quantum cloning wars revisited 

I barely ever thought about what happens to a black hole if nothing is falling in it. We all know (I mean, we have been told) that the black hole is evaporating. Slowly, but surely. Thermodynamic calculations can tell you how fast this evaporation process is: the rate of mass loss is inversely proportional to the square of the black hole mass. 
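To spell out that thermodynamic estimate: writing the proportionality constant as \(\alpha>0\) (a symbol I introduce here; its value is not given in this post) and the initial mass as \(M_0\), the rate law integrates to
$$\frac{dM}{dt}=-\frac{\alpha}{M^2}\ \ \ \Rightarrow\ \ \ M(t)=\left(M_0^3-3\alpha t\right)^{1/3},\ \ \ \ t_{\rm evap}=\frac{M_0^3}{3\alpha}$$
so the lifetime grows as the cube of the initial mass. This is just bookkeeping with the semi-classical emission rate. It is not a dynamical quantum calculation.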

But there is no calculation of the entropy (and hence the mass) of the black hole as a function of time!

Actually, I should not have said that. There are plenty of calculations of this sort. There is the CGHS Model, the JT Model, and several others. But these are models of quantum gravity in which the scalar field of standard curved space quantum field theory (CSQFT, the theory developed by Hawking and others to understand Hawking radiation) is coupled in one way or the other to another field (often the dilaton). You cannot calculate how the black hole loses its mass in standard CSQFT, because that theory is a free field theory! Those quantum fields interact with nothing!

The way you recover the Hawking effect in a free field theory is you consider not a mapping of the vacuum from time \(t=0\) to a finite time \(t\), you map from past infinity to future infinity. So time disappears in CSQFT! Wouldn't it be nice if we had a theory that in some limit just becomes CSQFT, but allows us to explicitly couple the black hole degrees of freedom to the radiation degrees of freedom, so that we could do a time-dependent calculation of the S-matrix? 

Well this post serves to announce that we may have found such a theory ("we" is my colleague Kamil Brádler and I). The link to the article will appear below, but before you sneak a peek, let me first put you in the right mind to appreciate what we have done.

In general, when you want to understand how a quantum state evolves forward in time, from time \(t_1\) to time \(t_2\), say, you write
$$|\Psi(t_2)\rangle=U(t_2,t_1)|\Psi(t_1)\rangle\ \ \     (1)$$
where \(U\) is the unitary time evolution operator
$$U(t_2,t_1)=Te^{-i\int_{t_1}^{t_2}H(t')dt'}\ \ \      (2)$$
The \(H\) is of course the interaction Hamiltonian, which describes the interaction between quantum fields. The \(T\) is Dyson's time-ordering operator, and assures that products of operators always appear in the right temporal sequence. But the interaction Hamiltonian \(H\) does not exist in free-field CSQFT.

In my previous papers with Brádler and with Ver Steeg, I hinted at something, though. There we write this mapping from past to future infinity in terms of a Hamiltonian (oh, the wrath that this incurred from staunch relativists!) like so:
$$|\Psi_{\rm out}\rangle=e^{-iH}|\Psi_{\rm in}\rangle\ \ \     (3)$$
where \(|\Psi_{\rm in}\rangle\) is the quantum state at past infinity, and \(|\Psi_{\rm out}\rangle\) is at future infinity. This mapping really connects creation and annihilation operators via a Bogoliubov transformation
$$A_k=e^{-iH}a_ke^{iH}\ \ \ (4)$$
where the \(a_k\) are defined on the past null infinity time slice, and the \(A_k\) at future null infinity, but writing it as (3) makes it almost look as if \(H\) is a Hamiltonian, doesn't it? Except there is no \(t\). The same \(H\) is in fact used in quantum optics a lot, and describes squeezing. I added to this a term that allows for scattering of radiation modes on the horizon in the 2014 article with Ver Steeg, and that can be seen as a beam splitter in quantum optics. But it is not an interaction operator between black holes and radiation. 

For the longest time, I didn't know how to make time evolution possible for black holes, because I did not know how to write the interaction. Then I became aware of a paper by Paul Alsing from the Air Force Research Laboratory, who had read my paper on the classical capacity of the quantum black hole channel, repeated all of my calculations (!), and realized that there exists, in quantum optics, an extension to the Hamiltonian that explicitly quantizes the black hole modes! (Paul's background is quantum optics, so he was perfectly positioned to realize this.)

Because you see, the CSQFT that everybody has been using since Hawking is really a semi-classical approximation to quantum gravity, where the black hole "field" is static. It is not quantized, and it does not change. It is a background field. That's why the black hole mass and entropy change cannot be calculated. There is no back-reaction from the Hawking radiation (or the stimulated radiation for that matter), on the black hole. In the parlance of quantum optics, this approximation is called the "undepletable pump"  scenario. What pump, you ask?

In quantum optics, "pumps" are used to create excited states of atoms. You can't have lasers, for example, without a pump that creates and re-creates the inversion necessary for lasing. The squeezing operation that I talked about above is, in quantum optics, performed via parametric downconversion, where a nonlinear crystal is used to split photons into pairs like so:
Fig. 1: Spontaneous downconversion of a pump beam into a "signal" and an "idler" beam. Source: Wikimedia
Splitting photons? How is that possible? Well it is possible because of stimulated emission! Basically, you are seeing the quantum copy machine at work here, and this quantum copy machine is "as good as it gets" (not perfect, in other words, because you remember of course that perfect quantum copying is impossible). So now you see why there is such a tantalizing equivalence between black holes and quantum optics: the mathematics describing spontaneous downconversion and black hole physics is the same: eqs (3) and (4). 

But these equations do not quantize the pump: it is "undepleted" and remains so. This means that in this description, the pump beam is maintained at the same intensity. But quantum opticians have learned how to quantize the pump mode as well! This is done using the so-called "tri-linear Hamiltonian": it has quantum fields not just for the signal and idler modes (think of these as the radiation behind and in front of the horizon), but for the pump mode as well. Basically, you start out with the pump in a mode with lots of photons in it, and as they get down-converted the pump slowly depletes, until nothing is left. This will be the model of black hole evaporation, and this is precisely the approach that Alsing took, in a paper that appeared in the journal "Classical and Quantum Gravity" in 2014.

"So Alsing solved it all", you are thinking, "why then this blog post?" 

Not so fast. Alsing brought us on the right track, to be sure, but his calculation of the quantum black hole entropy as a function of time displayed some weird features. The entropy appeared to oscillate rather than slowly decrease. What was going on here?

For you to appreciate what comes now, I need to write down the trilinear Hamiltonian:
$$H_{\rm tri}=r(ab^\dagger c^\dagger-a^\dagger bc)\ \ \ (5) $$
Here, the modes \(b\) and \(c\) are associated with radiation degrees in front of and behind the horizon, whereas \(a\) is the annihilation operator for black hole modes (the "pump" modes). Here's a pic so that you can keep track of these.
Fig. 2: Black hole and radiation modes \(b\) and \(c\).
In the semi-classical approximation, the \(a\) modes are replaced with their background-field expectation value, which morphs \(H_{\rm tri}\) into \(H\) in Eqs. (3) and (4), so that's wonderful: the trilinear Hamiltonian turns into the Hermitian operator implementing Hawking's Bogoliubov transformation in the semi-classical limit. 

But how do you use \(H_{\rm tri}\) to calculate the S-matrix I wrote down long ago, at the very beginning of this blog post? One thing you could do is to simply say, 
$$U_{\rm tri}=e^{-iH_{\rm tri}t}\ ,$$
and then the role of time is akin to a linearly increasing coefficient \(r\) in eq. (5). That's essentially what Alsing did (and Nation and Blencowe before him, see also Paul Nation's blog post about it) but that, it turns out, is only a rough approximation of the true dynamics, and does not give you the correct result, as we will see. 

Suppose you calculate \(|\Psi_{\rm out}\rangle=e^{-iH_{\rm tri}t}|\Psi_{\rm in}\rangle\), form the density matrix \(\rho_{\rm out}=|\Psi_{\rm out}\rangle \langle \Psi_{\rm out}|\), and trace out the radiation modes to obtain the black hole density matrix \(\rho_{\rm bh}={\rm Tr}_{bc}\,\rho_{\rm out}\). The von Neumann entropy of the black hole modes is then
$$S_{\rm bh}=-{\rm Tr} \rho_{\rm bh}\log \rho_{\rm bh}\ \ \ (6)$$
Note that this entropy is exactly equal to the entropy of the radiation modes \(b\) together with \(c\), because the joint state of black hole and radiation starts out pure (the initial black hole has zero entropy) and remains pure under unitary evolution. 

"How can a black hole that starts with zero entropy lose entropy", you ask? 

That's a good question. We begin at \(t=0\) with a black hole in a defined state of \(n\) modes (the state \(|\Psi_{\rm in}\rangle=|n\rangle\)) for convenience of calculation. We could instead start in a mixed state, but the results would not be qualitatively different after the black hole has evolved for some time, yet the calculation would be much harder. Indeed, after interacting with the radiation the black hole modes become mixed anyway, and so you should expect the entropy to start rising from zero quickly at first, and only after it approached its maximum value would it decay. That is a behavior that black hole folks are fairly used to, as a calculation performed by Don Page in 1993 shows essentially (but not exactly) this behavior. 

Page constructed an entirely abstract quantum information-theoretic scenario: suppose you have a pure bi-partite state (like we start out with here, where the black hole is one half of the bi-partite state and the radiation field \(bc\) is the other), and let the two systems interact via random unitaries. Basically he asked: "What is the entropy of a subsystem if the joint system was in a random state?" The answer, as a function of the (log of the) size of the dimension of the radiation subsystem is shown here:
Fig. 3: Page curve (from [1]) showing first the increase in entanglement entropy of the black hole, and then a decrease back to zero. 
People usually assume that the dimension of the radiation subsystem (dubbed by Page the "thermodynamic entropy", as opposed to the entanglement entropy) is just a proxy for time, so that what you see in this "Page curve" is how at first the entropy of the black hole increases with time, then turns around at the "Page time", until it vanishes.

This calculation (which has zero black hole physics in it) turned out to be extremely useful, as it showed that the amount of information from the black hole (defined as the maximum entropy minus the entanglement entropy) may take a long time to come out (namely at least the Page time), and it would be essentially impossible to determine from the radiation field that the joint field is actually in a pure state. But as I said, there is no black hole physics in it, as the random unitaries used in that calculation were, well, random.  
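For the curious, the shape of the Page curve is easy to reproduce. The sketch below uses Page's closed-form expression for the average entanglement entropy of an \(m\)-dimensional subsystem of a random pure state in an \(mn\)-dimensional space (with \(m\le n\)), namely \(S_{m,n}=\sum_{k=n+1}^{mn}\frac1k-\frac{m-1}{2n}\). Splitting a register of `total_qubits` qubits into "radiation" and "black hole" is purely my choice for illustration, as is the function name `page_entropy`.

```python
import numpy as np

def page_entropy(m, n):
    """Average entanglement entropy (in nats) of an m-dimensional subsystem
    of a random pure state on an (m*n)-dimensional space."""
    if m > n:
        m, n = n, m            # the formula is stated for m <= n
    k = np.arange(n + 1, m * n + 1)
    return np.sum(1.0 / k) - (m - 1) / (2.0 * n)

total_qubits = 12              # hypothetical total size, kept small on purpose
radiation_qubits = np.arange(total_qubits + 1)
curve = np.array([page_entropy(2**q, 2**(total_qubits - q))
                  for q in radiation_qubits])
```

Plotting `curve` against `radiation_qubits` gives the familiar rise and fall: the entropy climbs while the radiation subsystem is the smaller one, peaks when the two halves have equal size, and returns to zero.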

Say you use the \(U_{\rm tri}\) instead for the interaction? This is essentially the calculation that Alsing did, and it turns out to be fairly laborious, because as opposed to the bi-linear Hamiltonian that can be solved analytically, you can't do that with \(U_{\rm tri}\). Instead, you have to either expand \(H_{\rm tri}\) in \(r t\) (that really only works for very short times) or use other methods. Alsing used an approximate partial differential equation approach for the quantum amplitude \(c_n(t)=\langle n|e^{-iH_{\rm tri}t}|\Psi_{\rm in}\rangle\). The result shows the increase of the black hole entropy with time as expected, and then indeed a decrease:
Fig. 4: Black hole entropy using (6) for \(n=16\) as a function of \(t\)
Actually, the figure above is not from Alsing (but very similar to his), but rather is one that Kamil Brádler made, but using a very different method. Brádler figured out a method to calculate the action of \(U_{\rm tri}\) on a vacuum state using a sophisticated combinatorial approach involving something called a "Dyck path". You can find this work here. It reproduces the short-time result above, but allows him to go much further out in time, as shown here:
Fig. 5: Black hole entropy as in Fig. 4, at longer times. 
The calculations shown here are fairly intensive numerical affairs, as in order to get converging results, up to 500 terms in the Taylor expansion have to be summed. This result suggests that the black hole entropy is not monotonically decreasing, but rather is oscillating, as if the black hole was absorbing modes from the surrounding radiation, then losing them again. However, this is extremely unlikely physically, as the above calculation is performed in the limit of perfectly reflecting black holes. But as we will see shortly, this calculation does not capture the correct physics to begin with. 
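The single-exponential calculation described above is small enough to sketch in a few lines, because \(H_{\rm tri}\) only connects the states \(|n-k\rangle_a|k\rangle_b|k\rangle_c\) with \(k=0,\dots,n\). What follows is my own minimal version, not Alsing's PDE approach and not Brádler's Dyck-path method: it diagonalizes the \((n+1)\)-dimensional block directly. It also assumes the Hermitian convention \(H=ir(ab^\dagger c^\dagger-a^\dagger bc)\), with the explicit factor \(i\) being my assumption so that the evolution is unitary, and it sets \(r=1\). The function names are hypothetical.

```python
import numpy as np

def trilinear_block(m, r=1.0):
    """H_tri restricted to the states |m-k, k, k> (k = 0..m) of modes a, b, c.

    Assumes the Hermitian convention H = i r (a b^+ c^+ - a^+ b c).
    """
    k = np.arange(m)
    # <m-k-1, k+1, k+1| a b^+ c^+ |m-k, k, k> = sqrt(m-k) * (k+1)
    g = np.sqrt(m - k) * (k + 1)
    H = np.zeros((m + 1, m + 1), dtype=complex)
    H[k + 1, k] = 1j * r * g
    H[k, k + 1] = -1j * r * g
    return H

def amplitudes(m, t, r=1.0):
    """c_k(t) = <m-k, k, k| exp(-i H_tri t) |m, 0, 0>."""
    energies, vectors = np.linalg.eigh(trilinear_block(m, r))
    start = np.zeros(m + 1, dtype=complex)
    start[0] = 1.0
    return vectors @ (np.exp(-1j * energies * t) * (vectors.conj().T @ start))

def entropy(p):
    """Shannon/von Neumann entropy (nats) of a diagonal density matrix."""
    p = p[p > 1e-15]
    return float(-np.sum(p * np.log(p)))

# Tracing out b and c leaves the black hole diagonal with weights |c_k|^2
n = 16
times = np.linspace(0.0, 2.0, 201)
S_spa = np.array([entropy(np.abs(amplitudes(n, t))**2) for t in times])
```

Plotting `S_spa` against `times` (and extending the time range) is a quick way to look at the kind of behavior discussed above, with the caveat that the units of time depend on the assumed value of `r`.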

What is wrong with this calculation? Let us go back to the beginning of this post, the time evolution of the quantum state in Eqs. (1,2).  The evolution operator \(U(t_2,t_1)=Te^{-i\int_{t_1}^{t_2}H(t')dt'}\), applied to the initial state, gives rise to an integral over the state space: a path integral. How did that get replaced by just \(e^{-iHt}\)? 

We can start by discretizing the integral into a sum, so that \(\int_0^t H(t')dt'\approx\sum_{i=1}^NH(t_i)\Delta t\), where \(\Delta t\) is small, and \(N\Delta t=t\). And because that sum is in the exponent, \(U\) actually turns into a product:
$$U(0,t)\approx \prod_{i=1}^N e^{-i\Delta t H(t_i)}\ \ \ (7)$$
Because of the discretization, each Hamiltonian \(H(t_i)\) acts on a different Hilbert space, and the ground state that \(U\) acts on now takes the form of a product state of time slices
$$|0\rangle_{bc}=|0\rangle_1|0\rangle_2\times...\times |0\rangle_N$$
And because of the time-ordering operator, we are sure that the different terms of \(U(0,t)\) are applied in the right temporal order. If all this seems strange and foreign to you, let me assure you that this is a completely standard approximation of the path integral in quantum many-body physics. In my days as a nuclear theorist, that was how we calculated expectation values in the shell model describing heavy nuclei. I even blogged about this approach (the Monte Carlo Path Integral approach) in the post about nifty papers that nobody is reading. (Incidentally, nobody is reading those posts either).  

And now you can see why Alsing's calculation (and Brádler's initial recalculation of the same quantity with very different methods, confirming Alsing's result) was wrong: it represents an approximation of (7) using a single time-slice only (\(N\)=1). This approximation has a name in quantum many-body physics: it is called the "Static Path Approximation" (SPA). The SPA can be accurate in some cases, but it is generally only expected to be good at small times. At larger times, it ignores the self-consistent temporal fluctuations that the full path integral describes.

So now you know what we did, of course: we calculated the path integral of the S-matrix of the black hole interacting with the radiation field using many, many time slices. Kamil was able to do several thousand time slices, just to make sure that the integral converges. And the result looks very different from the SPA. Take a look at the figure below, where we calculated the black hole entropy as a function of the number of time slices (which is our discretized time).
Fig. 6: Black hole entropy as a function of time, for three different initial numbers of modes. Orange: n=5, Red: n=20, Pink: n=50. Note that the logarithm is taken to the base n+1, to fit all three curves on the same plot. Of course the n=50 entropy is much larger than the n=5 entropy. \(\Delta t=1/15\).
This plot shows that the entropy quickly increases as the pure state decoheres, and then starts to drop because of evaporation. Obviously, if we started with a mixed state rather than a pure state, the entropy would just drop. The rapid increase at early times is just a reflection of our short-cut to start with a pure state. It doesn't look exactly like Page's curves, but we cannot expect that, as our \(x\)-axis is indeed time, while Page's was thermodynamic entropy (which is expected to be linear in time). Note that Kamil repeated the calculation using an even smaller \(\Delta t=1/25\), and the results do not change.
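The choice of logarithm base in the figure is easy to illustrate. The sketch below computes a von Neumann entropy with the logarithm taken to an arbitrary base; the function name `entropy_in_base` and the use of an (n+1)-level density matrix as a stand-in are assumptions for illustration, not the code used for the plot. With base n+1, a maximally mixed state on n+1 levels has entropy exactly 1, which is why curves for very different n fit on one plot.

```python
import numpy as np

def entropy_in_base(rho, base):
    """Von Neumann entropy -Tr(rho log rho), with the logarithm taken to the given base."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]  # drop numerically zero eigenvalues (0 log 0 = 0)
    return float(-np.sum(evals * np.log(evals)) / np.log(base))

def maximally_mixed(dim):
    """Maximally mixed density matrix on a dim-level system."""
    return np.eye(dim) / dim

def pure_state(dim):
    """Density matrix of a pure state (the first basis vector)."""
    rho = np.zeros((dim, dim))
    rho[0, 0] = 1.0
    return rho

if __name__ == "__main__":
    for n in (5, 20, 50):
        dim = n + 1  # assumed number of levels for n initial modes
        print(n,
              entropy_in_base(pure_state(dim), dim),
              entropy_in_base(maximally_mixed(dim), dim),
              entropy_in_base(maximally_mixed(dim), np.e))
```

The last column (natural logarithm) grows with n, while the base-(n+1) values stay between 0 and 1.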

I want to throw out some caution here. The tri-linear Hamiltonian is not derived from first principles (that is, from a quantum theory of gravity). It is a "guess" at what the interaction term between quantized black hole modes and radiation modes might look like. The guess is good enough that it reproduces standard CSQFT in the semi-classical limit, but it is still a guess. But it is also very satisfying that such a guess allows you to perform a straightforward calculation of black hole entropy as a function of time, showing that the entropy can actually get back out. One of the big paradoxes of black hole physics was always that as the black hole mass shrank, all calculations implied that the entanglement entropy steadily increases and never turns over as in Page's calculation. This was not a tenable situation for a number of physical reasons (and this is such a long post that I will spare you these). We have now provided a way in which this can happen.

So now you have seen with your own eyes what may happen to a black hole as it evaporates. The entropy can indeed decrease, and within a simple "extended Hawking theory", all of it gets out. This entropy is not information, mind you, as there is no information in a black hole unless you throw some in it (see my series "What is Information?" if this is cryptic for you). But Steve Giddings convinced me (on the beach at Vieques no less, see photo below) that solving the information paradox was not enough: you've got to solve the entropy paradox also.

A quantum gravity session at the beach in Vieques, Puerto Rico (January 2014). Steve Giddings is in sunglasses watching me explain stimulated emission in black holes. 
I should also note that there is a lesson in this calculation for the firewall folks (who were quite vocal at the Vieques meeting). Because the entanglement between the black hole and radiation involves three entities rather than two, monogamy of entanglement can never be violated, so this provides another argument (I have shown you two others in earlier posts) against those silly firewalls.

[1] Don Page. Average entropy of a subsystem. Phys. Rev. Lett. 71 (1993) 1291.

Note added: The paper describing these results appeared in the journal Physical Review Letters:

K. Brádler and C. Adami: One-shot decoupling and Page curves from a dynamical model for black hole evaporation. Phys. Rev. Lett. 116 (2016) 101301.

You can also find it on arXiv: One-shot decoupling and Page curves from a dynamical model for black hole evaporation.