## Tuesday, December 6, 2016

### Can Life emerge spontaneously?

It would be nice if we knew where we came from. Sure, Darwin's insight that we are the product of an ongoing process that creates new and meaningful solutions to surviving in complex and unpredictable environments is great and all. But it requires three sine qua non ingredients: inheritance, variation, and differential selection. Three does not seem like much, and the last two are really stipulated semper ibi: There is going to be variation in a noisy world, and differences will make a difference in worlds where differences matter. Like all the worlds you and I know. So it is kind of the first ingredient that is a big deal: Inheritance.

Inheritance is indeed a bit more tricky. Actually, a lot more tricky. Inheritance means that an offspring carries the characters of the parent. Not an Earth-shattering concept per se, but in the land of statistical physics, inheritance is not exactly a given. Mark the "offspring" part of that statement. Is making offspring such a common thing?

Depends on how you define "offspring". The term has many meanings. Icebergs "calve" other icebergs, but the "daughter" icebergs are not really the same as the parent in any meaningful way. Crystals grow, and the "daughter" crystals do indeed have the same structure as the "parent" crystals. But this process (while not without interest to those interested in the origins of life) actually occurs while liberating energy (it is a first-order phase transition).

The replication of cells (or people, for that matter) is very different from the point of view of statistical physics, thermodynamics, and indeed probability theory. Here we are going to look at this process entirely from the point of view of the replication of the information inherent in the cell (or the person). The replication of this information (assuming it is stored in polymers of a particular alphabet) is not energetically favorable. Instead, it requires energy, which explains why cells only grow if there is some kind of food around.

Look, the energetics of molecular replication are complicated, messy, and depend crucially on what molecules are available in what environment, at what temperature, pressure, salt concentrations, etc. etc. My goal for this blog post is to evade all that. Instead, I'm just going to ask how likely it is in general for a molecule that encodes a specific amount of information to arise by chance. Unless the information stored in the sequence is specifically about how to speed up the formation of another such molecule, however unlikely the formation of the first molecule was, the formation of two of them would be twice as unlikely (actually, exponentially so, but we'll get to that).

So this is the trick then: We are not interested in the formation of any old information by chance: we need the spontaneous formation of information about how to make another one of those sequences. Because, if you think a little bit about it, you realize that it is the power of copying that renders the ridiculously rare ... conspicuously commonplace. Need some proof for that? Perhaps the most valuable postage stamp on Earth is the famed "Blue Mauritius", a stamp that has inspired legendary tales and shortened the breath of many a collector, as there are (most likely) only two handfuls of those stamps left in the universe today.

 Blue (left) and Red (right) Mauritius of 1847.  (Wikimedia).
But the original plate from which this stamp was printed still exists. Should someone endeavor to print a million of those, I doubt that they each would be worth the millions currently shelled out for one of those "most coveted scraps of paper in existence". (Of course experts would be able to tell apart the copies from the originals because of the sophistication of forensic methods deployed on such works and their forgeries.) But my point still stands: copying makes the rare valuable ... cheaply ordinary.

When the printing press (the molecular kind) has not yet been invented, what does it cost to obtain a piece of information? This blog post will provide the answer, and most importantly, provide pointers to how you could cheat your way to a copy of a piece of information that would be rare not just in this universe, but in a billion billion trillion more. Well, in principle.

How do you quantify rarity? Generally speaking, it is the number of things that you want, divided by the number of things there are. For the origin of life, let's imagine for a moment that replicators are sequences of linear heteropolymers. This just means that they are sequences of "letters" on a string, really. They don't have to self-replicate by themselves, but they have to encode the information necessary to ensure that they get replicated somehow. For the moment, let us restrict ourselves to sequences of a fixed length $L$. Trust me here, this is for your own good. I can write down a more general theory for arbitrary length sequences that does nothing to help you understand. On the contrary. It's not a big deal, so just go with it.

How many sequences are there of length $L$? Exactly $D^L$, of course (where $D$ is the size of the alphabet). How many self-replicators are there among those sequences? That is the big question, we all understand. It could be zero, of course. Let's imagine it is not, and that the number is $N_e$, where $N_e$ is not zero. If there is a process that randomly assembles polymers of length $L$, the likelihood $P$ that you get a replicator in that case is
$P=\frac{N_e}{D^L}$       (1)
So far so good. What we are going to do now is relate that probability to the amount of information contained in the self-replicating sequence.

That we should be able to do this is fairly obvious, right? If there is no information in a sequence, well then that sequence must be random. This means any sequence is just as good as any other, and $N_e=N$, where $N=D^L$ is the total number of sequences (all sequences are functional at the same level, namely not functional at all). And in that case, $P=1$ obviously. But now suppose that every single bit in the sequence is functional. That means you can't change anything in that sequence without destroying that function, and implies that there is only one such sequence. (If there were two, you could make at a minimum one change and still retain function.) In that case, $N_e=1$ and $P=1/N$.

What is a good formula for information content that gives you $P=1$ for zero information, and $1/N$ for full information? If $I$ is the amount of information (measured in units of monomers of the polymer), the answer is
$P=D^{-I}.$      (2)
Let's quickly check that. No information is $I=0$, and $D^0=1$ indeed.  Maximal information is $I=L$ (every monomer in the length $L$ sequence is information). And $D^{-L}=1/N$ indeed. (Scroll up to the sentence "How many sequences are there of length $L$", if this is not immediately obvious to you.)
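If you would rather let a computer do the checking, here is a minimal Python sketch of Eqs. (1) and (2). The function names and the toy values $D=4$, $L=10$ are my own choices for illustration, not anything taken from the references.

```python
import math

def probability_from_count(n_e: int, D: int, L: int) -> float:
    """Eq. (1): chance that a random length-L polymer is one of n_e replicators."""
    return n_e / D**L

def information_in_mers(p: float, D: int) -> float:
    """Eq. (2) inverted: I = log_D(1/P), measured in monomers ("mers")."""
    return math.log(1.0 / p, D)

D, L = 4, 10          # toy alphabet size and sequence length (assumed values)
N = D**L              # total number of sequences

info_none = information_in_mers(probability_from_count(N, D, L), D)  # every sequence works
info_full = information_in_mers(probability_from_count(1, D, L), D)  # exactly one works

print(info_none, info_full)
```

The two limiting cases come out as (numerically) zero information and $L$ mers of information, just as the argument above demands.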

The formula (2) can actually be derived, but let's not do this here. Let's just say we guessed it correctly. But this formula, at first sight, is a monstrosity. If it was true, it should shake you to the bones.

Not shaken yet? Let me help you out. Let us imagine for a moment that $D=4$ (yeah, nucleotides!). Things will not get any better, by the way, if you use any other base. How much information is necessary (in that base) to self-replicate? Actually, this question does not have an unambiguous answer. But there are some very good guesses at the lower bound. In the lab of Gerry Joyce at the Scripps Research Institute in San Diego, for example, hand-designed self-replicating RNAs can evolve [1]. How much information is contained in them?
 Prof. Gerald Joyce, Scripps Research Institute
We can only give an upper bound, because while it takes 84 bits to specify this particular RNA sequence, only 24 of those bits are actually evolvable. The 60 un-evolvable bits (they are un-evolvable because that is how the team set up the system) could, in principle, represent far less information than 60 bits. This may not be clear to you after reading this. But explaining this now would be distracting. I'll explain it further below instead.

Let's take this number (84 bits) at face value for the moment. How likely is it that such a piece of information emerged by chance? According to our formula (2), it is about
$P\approx7.7\times 10^{-25}$
That's a soberingly small likelihood. If you wanted to have a decent chance to find this sequence in a pool of RNA molecules of that length, you'd have to have about 27 kilograms of RNA. That's almost 60 pounds, for those of you that... Never mind.

The point is, wherever linear heteropolymers are assembled by chance, you're not gonna get 27 kilograms of that stuff. You might get significantly smaller amounts (billions of times smaller), but then you would have to wait a billion times longer. On Earth, there wasn't that much time (as Life apparently arose within half a billion years of the Earth's formation). Now, as I alluded to above, the Lincoln-Joyce self-replicator may actually code for fewer than the 84 bits it took to make it. But at the origin of this replicator was intelligent design. A randomly generated one may require fewer bits. We are left with the problem: can self-replicators emerge by chance at all?

This blog post is, really, about these two words: "by chance". What does this even mean?

When writing down formula (2), "by chance" has a very specific meaning. It means that every polymer to be "tried out" has an equal chance of occurring. "Occurring", in chemistry, also has a specific meaning. It means "to be assembled from existing monomers", and if each polymer has an equal chance to be found, then that means that the likelihood to produce any monomer is also equal.

For us, this is self-evident. If I want to calculate the likelihood that a random coin toss creates 10 heads in a row by chance, I take the likelihood of "heads" and take it to the power of ten. But what if your coin is biased? What if it is a coin that lands on heads 60% of the time? Well then: in that case, the likelihood to get ten heads in a row is not 1 in 1,024 anymore but rather $(0.6)^{10}$, a factor of about 6.2 larger. This is quite a gain given such a small change in likelihood for a single toss (from 0.5 to 0.6). But imagine that you are looking for 100 heads in a row. The same change in bias now buys you a factor of almost 83 million! And for a sequence of 1,000 heads in a row, you are looking at an enhancement factor of .... about $10^{79}$.
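The coin arithmetic is easy to reproduce. A minimal sketch (the function name `enhancement` is mine): the gain from bias is simply the ratio of the single-toss probabilities raised to the length of the run.

```python
def enhancement(p_biased: float, p_fair: float, n: int) -> float:
    """Factor by which a run of n heads becomes more likely with the biased coin."""
    return (p_biased / p_fair) ** n

# Runs of 10, 100 and 1,000 heads with a 60% coin versus a fair one.
gains = {n: enhancement(0.6, 0.5, n) for n in (10, 100, 1000)}

for n, gain in gains.items():
    print(f"{n:5d} heads in a row: enhancement {gain:.3g}")
```

The per-toss gain is a modest factor of 1.2. All the drama comes from the exponent.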

That is the power of bias on events with small probabilities. Mind you, getting 100 heads in a row is still a small probability, but gaining almost eight orders of magnitude is not peanuts. It might be the difference between impossible and... maybe-after-all-possible. Now, how can this be of use in the origin of life?

As I explained, formula (2) relies on assuming that all monomers are created equally likely, with probability $1/D$. When we think about the origin of life in terms of biochemistry, we begin by imagining a process that creates monomers, which are assembled into those linear heteropolymers, and then copied somehow. (In biochemical life on Earth, assembly is done in a template-directed manner, which means that assembly and copying are one and the same thing). But whether assembly is template-directed or not, how likely is it that all monomers occur spontaneously at the same rate? Any biochemist will tell you: extremely unlikely. Instead, some of the monomers are produced spontaneously at one rate, and others at a different rate. And these rates depend on local circumstances, like temperature, pH level, abundance of minerals, abundance of just about any element as it turns out. So, depending on where you are on a pre-biotic Earth, you might be faced with wildly different monomer production rates.

This uneven-ness of production can be viewed as a D-sided "coin" where each of the D sides has a different probability of occurring. We can quantify this uneven-ness by the entropy that a sequence of such "coin" tosses produces. (I put "coin" in quotes because a D-sided coin isn't a coin unless D=2. I'm just trying to avoid saying "random variable" here.) This entropy (as you can glean from the Information Theory tutorial that I've helpfully created for you, starting here) is equal to the length of the sequence if each monomer indeed occurs at rate 1/D (and we take logs to base D), but is smaller than the length if the probability distribution is biased. Let's call $H(a)$ the average entropy per monomer, as determined by the local biochemical constraints. And let's remember that if all monomers are created at the same exact rate, $H(a)=1$ (its maximal value), and Eq. (2) holds. If the distribution is uneven, then $H(a)<1$. The entropy of a spontaneously created sequence is then $L\times H(a)$, which is smaller than $L$. In a sense, it is not random anymore, if by random we understand "each sequence equally likely". How could this help increase the likelihood of spontaneous emergence of life?
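To make the per-monomer entropy concrete, here is a small Python sketch. It computes the Shannon entropy of a monomer distribution with logarithms taken to base $D$, so that the uniform case gives exactly 1 mer. The biased four-letter distribution is invented purely for illustration.

```python
import math

def entropy_per_monomer(probs):
    """Shannon entropy of a monomer distribution, in mers (logs to base D = len(probs))."""
    D = len(probs)
    return -sum(p * math.log(p, D) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # every monomer made at rate 1/D
biased = [0.4, 0.3, 0.2, 0.1]        # hypothetical uneven production rates

h_uniform = entropy_per_monomer(uniform)
h_biased = entropy_per_monomer(biased)

L = 100                               # assumed sequence length, for illustration
print(h_uniform, h_biased, L * h_biased)
```

The uniform "coin" has one mer of entropy per monomer, while the uneven one has less, so a sequence of length $L$ made with it has entropy below $L$.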

Well, let's take a closer look at the exponent in Eq. (2), the information $I$. Under certain conditions that I won't get into here, this information is given by the difference between sequence length $L$ and entropy $H$:
$I=L-H.$   (3)
That such a formula must hold is not very surprising. Let's look at the extreme cases. If a sequence is completely random, then $H(a)=1$, and therefore $H=L$, and therefore $I=0$. Thus, a random sequence has no information. On the opposite end, suppose there is only one sequence that can do the job, and any change to the sequence leads to the death of that sequence. Then, the entropy of the sequence (which is the logarithm of the number of ways you can do the job), must be zero. And thus in that case the sequence is all information: $I=L$.  While the correct formula (3) has plenty more terms that become important if there are correlations between sites, we are going to ignore them here.

So remember that the probability for spontaneous emergence of life is so small because $I$ is large, and it is in the exponent. But now we realize that the $L$ in (3) is really the entropy of a spontaneously created sequence, and if $H(a)<1$, then the first term is $L\times H(a)<L$. This can help a lot because it makes $I$ smaller. It helps a lot because the change is in the exponent. Let's look at some examples.

We could first look at English text. The linear heteropolymers of English are strings of the letters a-z (let's just stick with lower case letters and no punctuation for simplicity). What is the likelihood to find the word ${\tt origins}$ by chance? If we use an unbiased typewriter (our 26-sided coin), the likelihood is $26^{-7}$ (about 1 in 8 billion), as ${\tt origins}$ is a 7-mer, and each mer is information (there is only one way to spell the word ${\tt origins}$). Can we do better if our typewriter is biased towards English? Let's find out. If you analyze English text, you quickly notice that letters occur at different frequencies: e more often than t, which occurs more often than a, and so forth. The plot below is the distribution of letters that you would find.

 Letter distribution of English text
The entropy-per-letter of this distribution is 0.89 mers. Not very different from 1, but let's see how it changes the 1 in 8 billion odds. The biased-search chance is, according to this theory, $P_\star=26^{-7\times 0.89}$, which comes out about 1.5 per billion: an enhancement of more than a factor 12. Obviously, the enhancement is going to be more pronounced the longer the sequence. We can test this theory in a more appropriate system: self-replicating computer programs.
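Here is the ${\tt origins}$ calculation as a short Python sketch, using only the numbers quoted above (alphabet size 26, length 7, entropy 0.89 mers per letter). The variable names are mine.

```python
D, L, H_a = 26, 7, 0.89          # alphabet size, word length, entropy per letter (mers)

p_unbiased = D ** -L             # uniform typewriter: D^(-L)
p_biased = D ** (-L * H_a)       # biased typewriter: D^(-L * H(a))
gain = p_biased / p_unbiased     # equals D^(L * (1 - H(a)))

print(f"unbiased: {p_unbiased:.3g}")
print(f"biased:   {p_biased:.3g}")
print(f"gain:     {gain:.3g}")
```

Because the gain is $D^{L(1-H(a))}$, it grows exponentially with the length of the sequence you are looking for.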

That you can breed computer programs inside a computer is nothing new to those who have been following the field of Artificial Life. The form of Artificial Life that involves self-replicating programs is called "digital life" (I have written about the history of digital life on this blog), and here we use one platform in particular: the program Avida. For those who can't be bothered to look up what kind of life Avida makes, let's just focus on the fact that avidians are computer programs written in a language that has 26 instructions (conveniently abbreviated by the letters a-z), executed on a virtual CPU (you don't want digital critters to wreak havoc on your real CPU, do you?). The letters of these linear heteropolymers have specific meanings on that virtual CPU. For example the letter 'x' stands for ${\tt divide}$, which when executed will split the code into two pieces.

Here's a sketch of what this virtual CPU looks like (with a piece of code on it, being executed)
 Avidian CPU and code (from [2]).
When we use Avida to study evolution experimentally, we seed a population with a hand-written ancestral program. The reason we do this is because self-replicators are rare within the avidian "chemistry": you can't just make a random program and hope that it self-replicates! And that is, as I'm sure has dawned on the reader a while ago, where Avida's importance for studying the origin of life comes from. How rare is such a program?

The standard hand-written replicator is a 15-mer, but we are sure that not all 15 mers are information. If they were, then its likelihood would be $26^{-15}\approx 6\times 10^{-22}$, and it would be utterly hopeless to find it via a random (unbiased) search. It would take about 50,000 years if we tested a million strings a second, on one thousand computers in parallel. We can estimate the information content by sampling the ratio $\frac{N_e}{26^{15}}$, that is, instead of trying out all possible sequences, we try out a billion, and take the fraction of self-replicators to be representative of the overall fraction. (If we don't find any, try ten billion, and so forth).
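For the skeptics, the back-of-the-envelope numbers can be checked in a few lines of Python. The test rate and machine count are the ones quoted above; the length of a year in seconds (365.25 days) is my own choice.

```python
D, L = 26, 15
n_sequences = D ** L                      # size of the 15-mer search space
p_single = 1 / n_sequences                # chance of hitting one specific 15-mer

strings_per_second = 1_000_000 * 1_000    # a million per second on a thousand computers
seconds_per_year = 365.25 * 24 * 3600     # assumed year length

years_to_scan = n_sequences / strings_per_second / seconds_per_year

print(f"P = {p_single:.2g}")
print(f"exhaustive scan: about {years_to_scan:,.0f} years")
```

This is the time to try every 15-mer once, which is why sampling a billion of them and extrapolating is the only practical option.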

When we created 1 billion 15-mers using an unbiased distribution, we found 58 self-replicators. That was unexpectedly high, but it pins down the information content to be about
$I(15)=-\log_D(58\times 10^{-9})\approx 5.11 \pm 0.04$ mers.
The 15 in $I(15)$ reminds us that we were searching within 15 mer space only. But wait: about 5 mers encoded in a 15 mer? Could you write a self-replicator that is as short as 5 mers?

Sadly, no. We tried all 11,881,376 5-mers, and they are all as dead as doornails. (We test those sequences for life by dropping them into an empty world, and then checking whether they can form a colony.)

Perhaps 6-mers, then? Nope. We checked all 308,915,776 of them. No sign of life. We even checked all 7-mers (over 8 billion of them). No colonies. No life.

We did find life among 8-mers, though. We first sampled one billion of them, and found 6 unique sequences that would spontaneously form colonies [2]. That number immediately allows us to estimate the information content as
$I(8)=-\log_D(6\times 10^{-9})\approx 5.81 \pm 0.13$ mers,
which is curious.

It is curious because according to formula (2) waaay above, the likelihood of finding a self-replicator should only depend on the amount of information in it. How can that information depend on the length of sequence that this information is embedded in? Well it can, and you'll have to read the original reference [2] to find out how.

By the way, we later tested all sequences of length 8 [3], giving us the exact information content of 8-mer replicators as 5.91 mers. We even know the exact information content of 9-mer replicators, but I won't reveal that here. It took over 3 months of compute time to get this, and I'm saving it for a different post.
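The sampling estimates quoted above (58 hits per billion 15-mers, 6 hits per billion 8-mers) are easy to reproduce. In the sketch below, the error bar is my own assumption: I treat the number of hits as a Poisson count and propagate its relative error $1/\sqrt{k}$ through the logarithm. That happens to reproduce the quoted uncertainties, but see [2] for how the authors actually did it.

```python
import math

def information_estimate(hits: int, samples: int, D: int = 26):
    """Estimate I = -log_D(hits/samples) in mers, with an assumed Poisson error bar."""
    p = hits / samples
    info = -math.log(p, D)
    err = 1.0 / (math.sqrt(hits) * math.log(D))   # d(log_D p) for relative error 1/sqrt(hits)
    return info, err

i15, e15 = information_estimate(58, 10**9)   # 15-mers
i8, e8 = information_estimate(6, 10**9)      # 8-mers

print(f"I(15) = {i15:.2f} +/- {e15:.2f} mers")
print(f"I(8)  = {i8:.2f} +/- {e8:.2f} mers")
```

With only 6 hits the error bar is naturally wider, which is one reason the exhaustive scan of all 8-mers was worth the effort.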

But what about using a biased typewriter? Will this help in finding self-replicators? Let's find out!
We can start by using the measly 58 replicators found by scanning a billion 15-mers, and making a probability distribution out of them. It looks like this:
 Probability distribution of avidian instructions among 58 replicators of L=15. The vertical line is the unbiased expectation.
It's clear that some instructions are used a lot (b,f,g,v,w,x). If you look up what their function is, they are not exactly surprising. You may remember that 'x' means ${\tt divide}$. Obviously, without that instruction you're not going to form colonies.
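Building such a biased typewriter is mechanically simple: count how often each instruction appears in the replicators you have, normalize, and draw new sequences from that distribution. The Python sketch below shows the idea. The three "replicators" in it are made-up strings, not real avidian genomes, and this is not the code used in [2].

```python
import random
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def monomer_distribution(sequences):
    """Frequency of each letter across all given sequences (zero if never seen)."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {c: counts[c] / total for c in ALPHABET}

def sample_sequence(dist, length, rng):
    """Type a random sequence with the biased typewriter described by dist."""
    letters = list(dist)
    weights = [dist[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))

# Hypothetical stand-ins for replicators found in an unbiased search.
found = ["wzcagcbvfcaxgab", "rucavccfgxwbbyz", "wbvfgxcagcbvfca"]

dist = monomer_distribution(found)
rng = random.Random(0)            # fixed seed so the run is repeatable
candidate = sample_sequence(dist, 15, rng)
print(candidate)
```

One caveat of this naive version: an instruction that never shows up in the sample gets probability zero and can never be typed again, so with very few replicators you may want to smooth the counts.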

The distribution has an entropy of 0.91 mers. Not fantastically smaller than 1, but we saw earlier that small changes in the exponent can have large consequences. When we searched the space of 15 mers with this distribution instead of the uniform one, we found 14,495 replicators among a billion tried, an enhancement by a factor of about 250. Certainly not bad, and a solid piece of evidence that the "theory of the biased typewriter" actually works.  In fact, the theory underestimates the enhancement, as it predicts (based on the entropy 0.91 mers) an enhancement of about 80 [2].
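The predicted and observed enhancement factors can be compared directly. A minimal sketch, assuming the prediction is $D^{L(1-H(a))}$ (the same expression as in the English-text example) and using the counts quoted above:

```python
D, L, H_a = 26, 15, 0.91

predicted = D ** (L * (1 - H_a))   # gain expected from the entropy alone
observed = 14_495 / 58             # replicators per billion: biased vs. unbiased search

print(f"predicted enhancement: about {predicted:.0f}")
print(f"observed enhancement:  about {observed:.0f}")
```

The observed gain exceeds the prediction by roughly a factor of three, which is the underestimate mentioned above.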

We even tested whether taking the distribution generated by the 14,495 replicators, which certainly is a better estimate of a "good distribution", will net even more replicators. And it does indeed. Continuing like this allows your search to zero in on the "interesting" parts of genetic space in a more laser-like fashion, but the returns are, understandably, diminishing.

What we learn from all this is the following: do not be fooled by naive estimates of the likelihood of spontaneous emergence of life, even if they are based on information theory (and thus vastly superior to those who would claim that $P=D^{-L}$). Real biological systems search with a biased distribution. The bias will probably go "in the wrong direction" in most environments. (Imagine an avidian environment where 'x' is never made.) But among the zillions of environments that may exist on a prebiotic Earth, a handful might have a distribution that is close to the one we need. And in that case, life suddenly becomes possible.

How possible? We still don't know. But at the very least, the likelihood does not have to be astronomically small, as long as nature will use that one little trick: whip out that biased typewriter, to help you mumble more coherently.

[1] T. A. Lincoln and G. F. Joyce, Self-sustained replication of an RNA enzyme, Science 323, 1229–1232, 2009.
[2] C. Adami and T. LaBar, From entropy to information: Biased typewriters and the origin of life. In: “From Matter to Life: Information and Causality” (S.I. Walker, P.C.W. Davies, and G. Ellis, eds.) Cambridge University Press (2017), pp. 95-113. Also on arXiv
[3] Nitash C.G., T. LaBar, A. Hintze, and C. Adami, Origin of life in a digital microcosm. To appear.