Since the great bulk of both our hard disks is unused, we needn't feel insulted.
Related species of newt have much smaller genomes.
Why the Creator should have played fast and loose with the genome sizes of newts in such a capricious way is a problem that creationists might like to ponder.
From an evolutionary point of view the explanation is simple.*

*My suggestion (The Selfish Gene, 1976) that surplus DNA is parasitic was later taken up and developed by others under the catch-phrase 'Selfish DNA'. See The Selfish Gene, 2nd edn (Oxford University Press, 1989), pp. 44-5 and 275.
Evidently the total information capacity of genomes is very variable across the living kingdoms, and it must have changed greatly in evolution, presumably in both directions. Losses of genetic material are called deletions. New genes arise through various kinds of duplication. This is well illustrated by haemoglobin, the complex protein molecule that transports oxygen in the blood.
Human adult haemoglobin is actually a composite of four protein chains called globins, knotted around each other. Their detailed sequences show that the four globin chains are closely related to each other, but they are not identical. Two of them are called alpha globins (each a chain of 141 amino acids), and two are beta globins (each a chain of 146 amino acids). The genes coding for the alpha globins are on chromosome 16; those coding for the beta globins are on chromosome 11. On each of these chromosomes, there is a cluster of globin genes in a row, interspersed with some junk DNA. The alpha cluster, on chromosome 16, contains seven globin genes. Four of these are pseudogenes, versions of alpha disabled by faults in their sequence and not translated into proteins. Two are true alpha globins, used in the adult. The final one is called zeta and is used only in embryos. Similarly the beta cluster, on chromosome 11, has six genes, some of which are disabled, and one of which is used only in the embryo. Adult haemoglobin, as we've seen, contains two alpha and two beta chains.
Never mind all this complexity. Here's the fascinating point. Careful letter-by-letter analysis shows that these different kinds of globin genes are literally cousins of each other, literally members of a family. But these distant cousins still coexist inside our own genome, and that of all vertebrates. On the scale of whole organisms, all vertebrates are our cousins too. The tree of vertebrate evolution is the family tree we are all familiar with, its branch-points representing speciation events - the splitting of species into pairs of daughter species. But there is another family tree occupying the same timescale, whose branches represent not speciation events but gene duplication events within genomes.
The dozen or so different globins inside you are descended from an
ancient globin gene which, in a remote ancestor who lived about half a billion years ago, duplicated, after which both copies stayed in the genome. There were then two copies of it, in different parts of the genome of all descendant animals. One copy was destined to give rise to the alpha cluster (on what would eventually become chromosome 16 in our genome), the other to the beta cluster (on chromosome 11). As the aeons passed, there were further duplications (and doubtless some deletions as well). Around 400 million years ago the ancestral alpha gene duplicated again, but this time the two copies remained near neighbours of each other, in a cluster on the same chromosome. One of them was destined to become the zeta used by embryos, the other became the alpha globin genes used by adult humans (other branches gave rise to the nonfunctional pseudogenes I mentioned). It was a similar story along the beta branch of the family, but with duplications at other moments in geological history.
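The gene family tree can be made concrete in a few lines of code. The sketch below is illustrative only: the topology is heavily pruned, the alpha/beta and alpha/zeta dates are the rounded figures from the text, and the date on the beta-side duplication is an invented placeholder.

```python
# A pruned, toy rendering of the globin gene family tree.
# Each node: (name, age of duplication in millions of years, children).
TREE = ("ancestral globin", 500, [            # alpha/beta split, ~500 Mya
    ("alpha ancestor", 400, [                 # alpha/zeta split, ~400 Mya
        ("alpha", None, []),
        ("zeta", None, []),
    ]),
    ("beta ancestor", 200, [                  # invented date, for illustration
        ("beta", None, []),
        ("epsilon", None, []),
    ]),
])

def contains(node, gene):
    name, _, children = node
    return name == gene or any(contains(c, gene) for c in children)

def divergence(node, a, b):
    """Age of the last common ancestral gene of globins a and b."""
    _, age, children = node
    for child in children:
        if contains(child, a) and contains(child, b):
            return divergence(child, a, b)
    return age

print(divergence(TREE, "alpha", "beta"))      # 500: the ancient split
print(divergence(TREE, "alpha", "zeta"))      # 400: within the alpha cluster
```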
Now here's an equally fascinating point. Given that the split between the alpha cluster and the beta cluster took place 500 million years ago, it will of course not be just our human genomes that show the split - that is, possess alpha genes in a different part of the genome from beta genes. We should see the same within-genome split if we look at any other mammals, at birds, reptiles, amphibians and bony fish, for our common ancestor with all of them lived less than 500 million years ago. Wherever it has been investigated, this expectation has proved correct. Our greatest hope of finding a vertebrate that does not share with us the ancient alpha/beta split would be a jawless fish like a lamprey, for they are our most remote cousins among surviving vertebrates; they are the only surviving vertebrates whose common ancestor with the rest of the vertebrates is sufficiently ancient that it could have predated the alpha/beta split. Sure enough, these jawless fishes are the only known vertebrates that lack the alpha/beta divide.
Gene duplication, within the genome, has a similar historic impact to species duplication ('speciation') in phylogeny. It is responsible for gene diversity, in the same way as speciation is responsible for phyletic diversity. Beginning with a single universal ancestor, the magnificent diversity of life has come about through a series of branchings of new species, which eventually gave rise to the major branches of the living kingdoms and the hundreds of millions of separate species that have graced the Earth. A similar series of branchings, but this time within genomes - gene duplications - has spawned the large and diverse population of clusters of genes that constitutes the modern genome.
The story of the globins is just one among many. Gene duplications
and deletions have occurred from time to time throughout genomes. It is by these, and similar means, that genome sizes can increase in evolution. But remember the distinction between the total capacity of the whole genome, and the capacity of the portion that is actually used. Recall that not all the globin genes are used. Some of them, like theta in the alpha cluster of globin genes, are pseudogenes, recognizably kin to functional genes in the same genomes, but never actually translated into the action language of protein. What is true of globins is true of most other genes. Genomes are littered with nonfunctional pseudogenes, faulty duplicates of functional genes that do nothing, while their functional cousins (the word doesn't even need scare quotes) get on with their business in a different part of the same genome. And there's lots more DNA that doesn't even deserve the name pseudogene. It too is derived by duplication, but not duplication of functional genes. It consists of multiple copies of junk, 'tandem repeats', and other nonsense which may be useful for forensic detectives but which doesn't seem to be used in the body itself. Once again, creationists might spend some earnest time speculating on why the Creator should bother to litter genomes with untranslated pseudogenes and junk tandem repeat DNA.
Can we measure the information capacity of that portion of the genome which is actually used? We can at least estimate it. In the case of the human genome it is about 2 per cent - considerably less than the proportion of my hard disk that I have used since I bought it. Presumably the equivalent figure for the crested newt is even smaller, but I don't know if it has been measured. In any case, we mustn't run away with a chauvinistic idea that the human genome somehow ought to have the largest DNA database because we are so wonderful. The great evolutionary biologist George C. Williams has pointed out that animals with complicated life cycles need to code for the development of all stages in the life cycle, but they only have one genome with which to do so. A butterfly's genome has to hold the complete information needed for building a caterpillar as well as a butterfly. A sheep liver fluke has six distinct stages in its life cycle, each specialized for a different way of life. We shouldn't feel too insulted if liver flukes turned out to have bigger genomes than we have (actually they don't).
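To put rough numbers on 'total capacity' versus 'used capacity', here is a back-of-the-envelope sketch; the 3.1-gigabase genome size and the two-bits-per-letter encoding are standard round figures rather than figures from the text.

```python
# Back-of-the-envelope: information capacity of the human genome,
# total versus the 'about 2 per cent' that is actually used.
GENOME_BASES = 3.1e9     # ~3.1 gigabases (a standard round figure)
USED_FRACTION = 0.02     # 'about 2 per cent', as above
BITS_PER_BASE = 2        # four letters (A, C, G, T) carry 2 bits each

total_mb = GENOME_BASES * BITS_PER_BASE / 8 / 1e6
print(f"total capacity: ~{total_mb:.0f} MB")                  # ~775 MB
print(f"used portion:   ~{total_mb * USED_FRACTION:.0f} MB")  # ~16 MB
```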
Remember, too, that even the total capacity of genome that is actually used is still not the same thing as the true information content in Shannon's sense. The true information content is what's left when the redundancy has been compressed out of the message, by the theoretical equivalent of Stuffit. There are even some viruses that seem
to use a kind of Stuffit-like compression. They make use of the fact that the RNA (not DNA in these viruses, as it happens) code is read in triplets. There is a 'frame' which moves along the RNA sequence, reading off three letters at a time. Obviously, under normal conditions, if the frame starts reading in the wrong place (as in a so-called frame-shift mutation), it makes total nonsense: the 'triplets' that it reads are out of step with the meaningful ones. But these splendid viruses actually exploit frame-shifted reading. They get two messages for the price of one, by having a completely different message embedded in the very same series of letters when read frame-shifted. In principle you could even get three messages for the price of one, but I don't know of any examples.
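The frame trick is easy to demonstrate. A minimal sketch, with an invented sequence: the same string of letters read in frame 0 and in frame 1 yields two quite different streams of triplets, which is all the overlap trick requires.

```python
# One string, two triplet 'messages': read the same sequence in frame 0
# and in frame 1. The sequence here is invented purely for illustration.
seq = "ATGGCATTCGAA"

def codons(s, frame):
    """Split a sequence into triplets, starting at the given frame offset."""
    s = s[frame:]
    return [s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3)]

print(codons(seq, 0))  # ['ATG', 'GCA', 'TTC', 'GAA']
print(codons(seq, 1))  # ['TGG', 'CAT', 'TCG']
```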
It is one thing to estimate the total information capacity of a genome, and the amount of the genome that is actually used, but it's harder to estimate its true information content in the Shannon sense. The best we can do is probably to forget about the genome itself and look at its product, the 'phenotype', the working body of the animal or plant itself. In 1951, J. W. S. Pringle, who later became my Professor at Oxford, suggested using a Shannon-type information measure to estimate 'complexity'. Pringle wanted to express complexity mathematically in bits, but I have long found the following verbal form helpful in explaining his idea.
We have an intuitive sense that a lobster, say, is more complex (more 'advanced', some might even say more 'highly evolved') than another animal, perhaps a millipede. Can we measure something in order to confirm or deny our intuition? Without literally turning it into bits, we can make an approximate estimation of the information contents of the two bodies as follows. Imagine writing a book describing the lobster. Now write another book describing the millipede down to the same level of detail. Divide the word-count in one book by the word-count in the other, and you have an approximate estimate of the relative information content of lobster and millipede. It is important to specify that both books describe their respective animals 'down to the same level of detail'. Obviously, if we describe the millipede down to cellular detail, but stick to gross anatomical features in the case of the lobster, the millipede would come out ahead.
But if we do the test fairly, I'll bet the lobster book would come out longer than the millipede book. It's a simple plausibility argument, as follows. Both animals are made up of segments - modules of bodily architecture that are fundamentally similar to each other, arranged fore-and-aft like the trucks of a train. The millipede's segments are mostly identical to each other. The lobster's segments, though following the
same basic plan (each with a nervous ganglion, a pair of appendages, and so on) are mostly different from each other. The millipede book would consist of one chapter describing a typical segment, followed by the phrase 'Repeat N times', where N is the number of segments. The lobster book would need a different chapter for each segment. This isn't quite fair on the millipede, whose front and rear end segments are a bit different from the rest. But I'd still bet that, if anyone bothered to do the experiment, the estimate of lobster information content would come out substantially greater than the estimate of millipede information content.
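An ordinary compression routine can stand in for the two books. In the hedged sketch below, the 'segment descriptions' are invented stand-ins and zlib is only a crude proxy for Shannon information content, but the repetitive 'millipede' squeezes down far more than the varied 'lobster', just as the argument predicts.

```python
import random
import zlib

random.seed(1)
segment = "ganglion, pair of appendages, cuticle plate; "

# 'Millipede': the same segment description repeated N times.
millipede = segment * 60

# 'Lobster': 60 segment descriptions that all differ in detail.
lobster = "".join(
    f"ganglion type {random.randint(1, 9)}, appendage {random.choice('abcdef')}, plate {i}; "
    for i in range(60)
)

print(len(zlib.compress(millipede.encode())))  # small: redundancy squeezed out
print(len(zlib.compress(lobster.encode())))    # larger: each segment adds news
```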
It's not of direct evolutionary interest to compare a lobster with a millipede in this way, because nobody thinks lobsters evolved from millipedes. Obviously no modern animal evolved from any other modern animal. Instead, any pair of modern animals had a last common ancestor which lived at some (in principle) discoverable moment in geological history. Almost all of evolution happened way back in the past, which makes it hard to study details. But we can use the 'length of book' thought-experiment to agree upon what it would mean to ask the question whether information content increases over evolution, if only we had ancestral animals to look at.
The answer in practice is complicated and controversial, all bound up with a vigorous debate over whether evolution is, in general, progressive. I am one of those associated with a limited form of yes answer. My colleague Stephen Jay Gould tends towards a no answer.* I don't think anybody would deny that, by any method of measuring - whether bodily information content, total information capacity of genome, capacity of genome actually used, or true ('Stuffit compressed') information content of genome - there has been a broad overall trend towards increased information content during the course of human evolution from our remote bacterial ancestors. People might disagree, however, over two important questions: first, whether such a trend is to be found in all, or a majority of evolutionary lineages (for example, parasite evolution often shows a trend towards decreasing bodily complexity, because parasites are better off being simple); second, whether, even in lineages where there is a clear overall trend over the very long term, it is bucked by so many reversals and re-reversals in the shorter term as to undermine the very idea of progress. This is not the place to resolve this interesting controversy. There are distinguished biologists with good arguments on both sides.
*See 'Human Chauvinism and Evolutionary Progress' (pp. 206-17).
Supporters of 'intelligent design' guiding evolution, by the way, should be deeply committed to the view that information content increases during evolution. Even if the information comes from God, perhaps especially if it does, it should surely increase, and the increase should presumably show itself in the genome.
Perhaps the main lesson we should learn from Pringle is that the information content of a biological system is another name for its complexity. Therefore the creationist challenge with which we began is tantamount to the standard challenge to explain how biological complexity can evolve from simpler antecedents, one that I have devoted three books to answering, and I do not propose to repeat their contents here. The 'information challenge' turns out to be none other than our old friend: 'How could something as complex as an eye evolve?' It is just dressed up in fancy mathematical language - perhaps in an attempt to bamboozle. Or perhaps those who ask it have already bamboozled themselves, and don't realize that it is the same old - and thoroughly answered - question.
Let me turn, finally, to another way of looking at whether the information content of genomes increases in evolution. We now switch from the broad sweep of evolutionary history to the minutiae of natural selection. Natural selection itself, when you think about it, is a narrowing down from a wide initial field of possible alternatives, to the narrower field of the alternatives actually chosen. Random genetic error (mutation), sexual recombination and migratory mixing all provide a wide field of genetic variation: the available alternatives. Mutation is not an increase in true information content, rather the reverse, for mutation, in the Shannon analogy, contributes to increasing the prior uncertainty. But now we come to natural selection, which reduces the 'prior uncertainty' and therefore, in Shannon's sense, contributes information to the gene pool. In every generation, natural selection removes the less successful genes from the gene pool, so the remaining gene pool is a narrower subset. The narrowing is nonrandom, in the direction of improvement, where improvement is defined, in the Darwinian way, as improvement in fitness to survive and reproduce. Of course the total range of variation is topped up again in every generation by new mutation and other kinds of variation. But it still remains true that natural selection is a narrowing down from an initially wider field of possibilities, including mostly unsuccessful ones, to a narrower field of successful ones. This is analogous to the definition of information with which we began: information is what enables the narrowing down from prior uncertainty (the initial range of possibilities) to later certainty (the
'successful' choice among the prior probabilities). According to this analogy, natural selection is by definition a process whereby information is fed into the gene pool of the next generation.
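Shannon's measure makes the narrowing quantitative. A toy sketch, with invented allele frequencies and fitnesses: the entropy of the gene pool before selection, the entropy after one generation of selection, and the difference, which is the information (in bits) fed into the pool.

```python
from math import log2

def entropy(freqs):
    """Shannon entropy (bits) of a set of allele frequencies."""
    return -sum(p * log2(p) for p in freqs if p > 0)

# Invented example: four alleles, equally common before selection.
freqs = [0.25, 0.25, 0.25, 0.25]
fitness = [1.0, 0.5, 0.2, 0.1]        # relative survival/reproduction

# One generation of selection: reweight by fitness, renormalize.
w = [p * f for p, f in zip(freqs, fitness)]
after = [x / sum(w) for x in w]

print(f"before: {entropy(freqs):.2f} bits")    # 2.00 bits
print(f"after:  {entropy(after):.2f} bits")    # lower: uncertainty reduced
print(f"information gained: {entropy(freqs) - entropy(after):.2f} bits")
```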
If natural selection feeds information into gene pools, what is the information about? It is about how to survive. Strictly, it is about how to survive and reproduce, in the conditions that prevailed when previous generations were alive. To the extent that present day conditions are different from ancestral conditions, the ancestral genetic advice will be wrong. In extreme cases, the species may then go extinct. To the extent that conditions for the present generation are not too different from conditions for past generations, the information fed into present-day genomes from past generations is helpful information. Information from the ancestral past can be seen as a manual for surviving in the present: a family bible of ancestral 'advice' on how to survive today. We need only a little poetic licence to say that the information fed into modern genomes by natural selection is actually information about ancient environments in which ancestors survived.
This idea of information fed from ancestral generations into descendant gene pools is one of the themes of my book Unweaving the Rainbow. It takes a whole chapter, 'The Genetic Book of the Dead', to develop the notion, so I won't repeat it here except to say two things. First, it is the gene pool of the species as a whole, not the genome of any particular individual, which is best seen as the recipient of the ancestral information about how to survive. The genomes of particular individuals are random samples of the current gene pool, randomized by sexual recombination. Second, we are privileged to 'intercept' the information if we wish, and 'read' an animal's body, or even its genes, as a coded description of ancestral worlds. To quote from Unweaving the Rainbow:
And isn't it an arresting thought? We are digital archives of the African Pliocene, even of Devonian seas; walking repositories of wisdom out of the old days. You could spend a lifetime reading in this ancient library and die unsated by the wonder of it.
Genes Aren't Us
The bogey of genetic determinism needs to be laid to rest. The discovery of a so-called 'gay gene' is as good an opportunity as we'll get to lay it.
The facts are quickly stated. In the magazine Science, a team of researchers from the National Institutes of Health, in Bethesda, Maryland, reported the following pattern. Homosexual males are more likely than you'd expect by chance to have homosexual brothers. Revealingly, they are also more likely than you'd expect by chance to have homosexual maternal uncles and homosexual cousins on the mother's side, but not on the father's side. This pattern raises the immediate suspicion that at least one gene causing homosexuality in males is carried on the X chromosome.*
The Bethesda team went further. Modern technology made it possible for them to search for particular marker strings in the DNA code itself. In one region, called Xq28, near the tip of the X chromosome, they found five identical markers shared by a suggestively high percentage of homosexual brothers. These facts combine elegantly with one another to confirm earlier evidence of a hereditary component to male homosexuality.
So what? Are sociology's foundations trembling? Should theologians be wringing their hands with concern, and lawyers rubbing theirs with anticipation? Does this finding tell us anything new about 'blame' or 'responsibility'? Does it add anything, one way or the other, to arguments about whether homosexuality is a condition that could, or should, be 'cured'? Should it make individual homosexuals more or less proud, or ashamed, of their predilections? No to all these questions. If you are proud, you can stay proud. If you prefer to be guilty, stay guilty. Nothing has changed. In explaining what I mean,
I am less interested in this particular case than I am in using it to illustrate a more general point about genes and the bogey of genetic determinism.

*Because males have only one X chromosome, which they necessarily get from their mother. Females have two X chromosomes, one from each parent. A male shares X chromosome genes with his maternal, but not his paternal, uncle.
There is an important distinction between a blueprint and a recipe.*
A blueprint is a detailed, point-for-point specification of some end product like a house or a car. One diagnostic feature of a blueprint is that it is reversible. Give an engineer a car and he can reconstruct its blueprint. But offer to a chef a rival's pièce de résistance to taste and he will fail to reconstruct the recipe. There is a one-to-one mapping between components of a blueprint and components of the end product. This bit of the car corresponds to this bit of the blueprint. That bit of the car corresponds to that bit of the blueprint. There is no such one-to-one mapping in the case of a recipe. You can't isolate a particular blob of soufflé and seek one word of the recipe that 'determines' that blob. All the words of the recipe, taken together with all the ingredients, combine to form the whole soufflé.

*This distinction was also used in 'Darwin Triumphant' (p. 89).
Genes, in different aspects of their behaviour, are sometimes like blueprints and sometimes like recipes. It is important to keep the two aspects separate. Genes are digital, textual information, and they retain their hard, textual integrity as they change partners down the generations. Chromosomes - long strings of genes - are formally just like long computer tapes.
When a portion of genetic tape is read in a cell, the first thing that happens to the information is that it is translated from one code to another: from the DNA code to a related code that dictates the exact shape of a protein molecule. So far, the gene behaves like a blueprint. There really is a one-to-one mapping between bits of gene and bits of protein, and it really is deterministic.
It is in the next step of the process - the development of a whole body and its psychological predispositions - that things start to get more complicated and recipe-like. There is seldom a simple one-to-one mapping between particular genes and 'bits' of body. Rather, there is a mapping between genes and rates at which processes happen during embryonic development. The eventual effects on bodies and their behaviour are often multifarious and hard to unravel.
The recipe is a good metaphor but, as an even better one, think of the body as a blanket, suspended from the ceiling by 100,000 rubber bands, all tangled and twisted around one another. The shape of the blanket - the body - is determined by the tensions of all these rubber bands taken together. Some of the rubber bands represent genes, others
environmental factors. A change in a particular gene corresponds to a lengthening or shortening of one particular rubber band. But any one rubber band is linked to the blanket only indirectly via countless connections amid the welter of other rubber bands. If you cut one rubber band, or tighten it, there will be a distributed shift in tensions, and the effect on the shape of the blanket will be complex and hard to predict.
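The blanket model can even be caricatured numerically. In the sketch below every number is made up: each 'point' of the blanket depends weakly on every 'band', so lengthening one band shifts all the points a little rather than moving one point a lot, which is the statistical, distributed causation the metaphor is after.

```python
import random

random.seed(0)
N_BANDS, N_POINTS = 100, 10

# Each point on the blanket depends weakly on every band (made-up weights).
weights = [[random.uniform(-1, 1) for _ in range(N_BANDS)]
           for _ in range(N_POINTS)]
tension = [1.0] * N_BANDS

def shape(tension):
    """Blanket shape as a blend of all band tensions."""
    return [sum(w * t for w, t in zip(row, tension)) for row in weights]

before = shape(tension)
tension[42] *= 1.5            # 'mutate' one rubber band
after = shape(tension)

# Every point shifts a little; no single point 'belongs' to band 42.
print([round(b - a, 3) for a, b in zip(before, after)])
```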
In the same way, possession of a particular gene need not infallibly dictate that an individual will be homosexual. Far more probably the causal influence will be statistical. The effect of genes on bodies and behaviour is like the effect of cigarette smoke on lungs. If you smoke heavily, you increase the statistical odds that you'll get lung cancer. You won't infallibly give yourself lung cancer. Nor does refraining from smoking protect you infallibly from cancer. We live in a statistical world.
Imagine the following newspaper headline: 'Scientists discover that homosexuality is caused.' Obviously this is not news at all; it is trivial. Everything is caused. To say that homosexuality is caused by genes is more interesting, and it has the aesthetic merit of discomfiting politically-inspired bores, but it doesn't say more than my trivial headline does about the irrevocability of homosexuality.
Some genetic causes are hard to reverse. Others are easy. Some environmental causes are easy to reverse. Others are hard. Think how tenaciously we cling to the accent of childhood: an adult immigrant is labelled a foreigner for life. This is far more ineluctably deterministic than many genetic effects. It would be interesting to know the statistical likelihood that a child, subjected to a particular environmental influence such as religious indoctrination by nuns, will be able to escape the influence later on. It would similarly be interesting to know the statistical likelihood that a man possessing a particular gene in the Xq28 region of the X chromosome will turn out to be homosexual. The mere demonstration that there exists a gene 'for' homosexuality leaves the value of that likelihood almost totally open. Genes have no monopoly on determinism.
So, if you hate homosexuals or love them, if you want to lock them up or 'cure' them, your reasons had better have nothing to do with genes.
Son of Moore's Law
Great achievers who have gone far sometimes amuse themselves by then going too far. Peter Medawar knew what he was doing when he wrote, in his review of James D. Watson's The Double Helix,
It is simply not worth arguing with anyone so obtuse as not to realize that this complex of discoveries [molecular genetics] is the greatest achievement of science in the twentieth century.
Medawar, like the author of the book he was reviewing, could justify his arrogance in spades, but you don't have to be obtuse to dissent from his opinion. What about that earlier Anglo-American complex of discoveries known as the Neo-Darwinian Modern Synthesis? Physicists could make a good case for relativity or quantum mechanics, and cosmologists for the expanding universe. The 'greatest' anything is ultimately undecidable, but the molecular genetic revolution was undeniably one of the greatest achievements of science in the twentieth century - and that means of the human species, ever. Where shall we take it - or where will it take us - in the next fifty years? By mid-century, history may judge Medawar to have been closer to the truth than his contemporaries - or even he - allowed.
If asked to summarize molecular genetics in a word, I would choose 'digital'. Of course, Mendel's genetics was digital in being particulate with respect to the independent assortment of genes through pedigrees. But the interior of genes was unknown and they could still have been substances with continuously varying qualities, strengths and flavours, inextricably intertwined with their effects. Watson/Crick genetics is digital through and through, digital to its very backbone, the double helix itself. A genome's size can be measured in gigabases with exactly the same precision as a hard drive is sized up in gigabytes. Indeed, the two units are interconvertible by constant multiplication. Genetics today is pure information technology. This, precisely, is why an
antifreeze gene can be copied from an Arctic fish and pasted into a tomato.*

*See 'Science, Genetics and Ethics: Memo for Tony Blair' (p. 28).
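The gigabase-to-gigabyte conversion mentioned above is a one-line constant: two bits per letter, eight bits per byte. A trivial sketch (the genome sizes are rounded common figures, not taken from the text):

```python
# Two bits per base (A, C, G, T), eight bits per byte: the constant is 1/4.
def gigabases_to_gigabytes(gb):
    return gb * 2 / 8

for name, size in [("human", 3.1), ("mouse", 2.7)]:  # rounded gigabase counts
    print(f"{name}: {size} Gb = {gigabases_to_gigabytes(size):.2f} GB")
```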
The explosion sparked by Watson and Crick grew exponentially, as a good explosion should, during the half century since their famous joint publication. I think I mean that literally, and I'll support it by analogy with a better known explosion, this time from information technology as conventionally understood. Moore's Law states that computer power doubles every eighteen months. It is an empirical law without an agreed theoretical underpinning, though Nathan Myhrvold offers a wittily self-referential candidate: 'Nathan's Law' states that software grows faster than Moore's Law, and that is why we have Moore's Law. Whatever the underlying reason, or complex of reasons, Moore's Law has held true for nearly fifty years. Many analysts expect it to continue for as long again, with stunning effects upon human affairs - but that is not my concern in this essay.
Instead, is there something equivalent to Moore's Law for DNA information technology? The best measure would surely be an economic one, for money is a good composite index of man-hours and equipment costs. As the decades go by, what is the benchmark number of DNA kilobases that can be sequenced for a standard quantity of money? Does it increase exponentially, and if so what is its doubling time? Notice, by the way (it is another aspect of DNA science's being a branch of information technology) that it makes no difference which animal or plant provides the DNA. The sequencing techniques and the costs in any one decade are much the same. Indeed, unless you read the text message itself, it is impossible to tell whether DNA comes from a man, a mushroom or a microbe.
Having chosen my economic benchmark, I didn't know how to measure the costs in practice. Fortunately, I had the good sense to ask my colleague Jonathan Hodgkin, Professor of Genetics at Oxford University. I was delighted to discover that he had recently done the very thing while preparing a lecture for his old school, and he kindly sent me the following estimates of the cost, in pounds sterling, per base pair (that is, 'per letter' of the DNA code) sequenced. In 1965, it cost about £1000 per letter to sequence 5S ribosomal RNA from bacteria (not DNA, but RNA costs are similar). In 1975, to sequence DNA from the virus ΦX174 cost about £10 per letter. Hodgkin didn't find a good example for 1985, but in 1995 it cost £1 per letter to sequence the DNA
of Caenorhabditis elegans, the tiny nematode worm of which molecular
[Graph: kilobases sequenced per £1000 (log scale) plotted against year, 1960-2060; a linear regression fitted to the four data points, then extrapolated to 2050.]
biologists are so (rightly) enamoured that they call it 'the' nematode, or even 'the' worm.* By the time the Human Genome Project culminated around 2000, sequencing costs were about £0.10 per letter. To show the positive trend of growth, I inverted these figures to 'bangs for the buck' - that is, quantity of DNA that can be sequenced for a fixed amount of money, and I chose £1000, correcting for inflation. I have plotted the resulting kilobases per £1000 on a logarithmic scale, which is convenient because exponential growth shows up as a straight line. (See graph.)

*The absurdity of this can be gauged from an image I have never forgotten, quoted in one of the first zoology books I ever owned, Ralph Buchsbaum's Animals without Backbones (University of Chicago Press). 'If all the matter in the universe except the nematodes were swept away, our world would still be dimly recognizable . . . we should find its mountains, hills, vales, rivers, lakes, and oceans represented by a film of nematodes . . . Trees would still stand in ghostly rows representing our streets and highways. The location of the various plants and animals would still be decipherable, and, had we sufficient knowledge, in many cases even their species could be determined by an examination of their erstwhile nematode parasites.' There are probably more than half a million species of nematodes, hugely outnumbering the species in all the vertebrate classes put together.
I must emphasize, as Professor Hodgkin did to me, that the four data points are back-of-the-envelope calculations. Nevertheless, they do fall convincingly close to a straight line, suggesting that the increase in our
DNA sequencing power is exponential. The doubling time (or cost-halving time) is twenty-seven months, which may be compared with the eighteen months of Moore's Law. To the extent that DNA sequencing work depends upon computer power (quite a large extent), the new law we have discovered probably owes a great deal to Moore's Law itself, which justifies my facetious label, 'Son of Moore's Law'.
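The fit itself takes only a few lines. A hedged sketch: fed the rounded costs quoted above, an ordinary least-squares fit on the log scale gives a doubling time nearer three years than the twenty-seven months derived from Hodgkin's original estimates, so treat the printed numbers as illustrating the method rather than reproducing the figure in the text.

```python
from math import log10

# (year, kilobases sequenced per £1000), inverted from the rounded costs
# above: £1000, £10, £1 and £0.10 per letter in 1965, 1975, 1995, 2000.
points = [(1965, 0.001), (1975, 0.1), (1995, 1.0), (2000, 10.0)]

xs = [x for x, _ in points]
ys = [log10(y) for _, y in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)

# Least-squares slope, in log10 units per year.
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))

# With these rounded inputs the fit gives roughly 36 months; the text's
# twenty-seven months presumably reflects the original, unrounded estimates.
print(f"doubling time: {log10(2) / slope * 12:.0f} months")
print(f"kb per £1000 in 2050: about 10^{my + slope * (2050 - mx):.1f}")
```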
It is by no means to be expected that technological progress should advance in this exponential way. I haven't plotted the figures out, but I'd be surprised if, say, speed of aircraft, fuel economy of cars, or height of skyscrapers were found to advance exponentially. Rather than double and double again in a fixed time, I suspect that they advance by something closer to arithmetic addition. Indeed, the late Christopher Evans, as long ago as 1979, when Moore's Law had scarcely begun, wrote:
Today's car differs from those of the immediate postwar years on a number of counts . . . But suppose for a moment that the automobile industry had developed at the same rate as computers and over the same period: how much cheaper and more efficient would the current models be? . . . Today you would be able to buy a Rolls-Royce for £1.35*, it would do three million miles to the gallon, and it would deliver enough power to drive the Queen Elizabeth II. And if you were interested in miniaturization, you could place half a dozen of them on a pinhead.

*Two US dollars.
Space exploration also seemed to me a likely candidate for modest additive increase like motor cars. Then I remembered a fascinating speculation mentioned by Arthur C. Clarke, whose credentials as a prophet are not to be ignored. Imagine a future spacecraft heading off for a distant star. Even travelling at the highest speed allowed by the current state of the art, it would still take many centuries to reach its distant destination. And before it had completed half its journey it would be overtaken by a faster vessel, the product of a later century's technology. So, it might be said, the original ship should never have bothered to set out. By the same argument, even the second spaceship should not bother to set out, because its crew is fated to wave to their great-grandchildren as they zoom by in a third. And so on. One way to resolve the paradox is to point out that the technology to develop later spaceships would not become available without the research and development that went into their slower predecessors. I would give the same answer to anybody who suggested that since the entire Human Genome Project could now be started from scratch and completed in a
fraction of the years the actual project took, the original enterprise should have been postponed appropriately.
If our four data points are admittedly rough estimates, the extrapolation of the straight line out to the year 2050 is even more tentative. But by analogy with Moore's Law, and especially if Son of Moore's Law really does owe something to its parent, this straight line probably represents a defensible prognostication. Let's at least follow to see where it will take us. It suggests that in the year 2050 we shall be able to sequence a complete individual human genome for £100 at today's values (about $160). Instead of 'the' human genome project, every individual will be able to afford their own personal genome project. Population geneticists will have the ultimate data on human diversity. It will be possible to work out trees of cousinship linking any person in the world to any other person. It is a historian's wildest dream. They will use the geographic distribution of genes to reconstruct the great migrations and invasions of the centuries, track voyages of Viking longships, follow the American tribes by their genes down from Alaska to Tierra del Fuego and the Saxons across Britain, document the diaspora of the Jews, even identify the modern descendants of pillaging warlords like Genghis Khan.*

*DNA analysis is already making exciting contributions to historical research. See, for example, Bryan Sykes, The Seven Daughters of Eve (London, Bantam Press, 2001) and S. Wells, The Journey of Man: A Genetic Odyssey (London, Allen Lane, 2002).
Today, a chest X-ray will tell you whether you have lung cancer or tuberculosis. In 2050, for the price of a chest X-ray, you will be able to know the full text of every one of your genes. The doctor will hand you not the prescription recommended for an average person with your complaint but the prescription that precisely suits your genome. That is no doubt good, but your personal printout will also predict, with alarming precision, your natural end. Shall we want such knowledge? Even if we want it ourselves, shall we want our DNA printout to be read by insurance actuaries, paternity lawyers, governments? Even in a benign democracy, not everybody is happy with such a prospect. How some future Hitler might abuse this knowledge needs thinking about.
Weighty as such concerns may be, they are again not mine in this essay. I retreat to my ivory tower and more academic preoccupations. If £100 becomes the price of sequencing a human genome, the same money will buy the genome of any other mammal; all are about the same size, in the gigabase order of magnitude, as is true of all vertebrates. Even if we assume that Son of Moore's Law will flatten off before 2050, as many people believe Moore's Law will, we can still safely predict that it will become economically feasible to sequence the
genomes of hundreds of species per year. Having such a welter of information is one thing. What can we do with it? How shall we digest it, sift it, collate it, use it?
One relatively modest goal will be total and final knowledge of the phylogenetic tree. For there is, after all, one true tree of life, the unique pattern of evolutionary branching that actually happened. It exists. It is in principle knowable. We don't know it all yet. By 2050 we should - or if we do not, we shall have been defeated only at the terminal twigs, by the sheer number of species (a number that, as my colleague Robert May points out, is at present unknown to the nearest one or even two orders of magnitude).
My research assistant Yan Wong suggests that naturalists and ecologists in 2050 will carry a small field taxonomy kit, which will obviate the need to send specimens off to a museum expert for identification. A fine probe, hooked up to a portable computer, will be inserted into a tree, or a freshly trapped vole or grasshopper. Within minutes, the computer will chew over a few key segments of DNA, then spit out the species name and any other details that may be in its stored database.
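At its core, Yan Wong's imagined kit is a lookup of a few diagnostic DNA segments against a stored database. A minimal sketch of the matching step, with invented marker sequences; a real kit would match by sequence similarity over standard marker regions, tolerating errors, rather than by exact substring lookup.

```python
# Invented 'diagnostic segment' database; real markers would be far longer
# and matching would tolerate mismatches (alignment, not exact lookup).
MARKERS = {
    "ATGGCATTC": ("Talpa europaea", "European mole"),
    "GGATCCTTA": ("Chorthippus brunneus", "field grasshopper"),
    "TTACGGAGC": ("Myodes glareolus", "bank vole"),
}

def identify(read):
    """Scan a sequencing read for any known diagnostic segment."""
    for marker, species in MARKERS.items():
        if marker in read:
            return species
    return None

print(identify("CCGTTACGGAGCAT"))   # ('Myodes glareolus', 'bank vole')
```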
Already, DNA taxonomy has turned up some sharp surprises. My traditional zoologist's mind protests almost unendurably at being asked to believe that hippos are more closely related to whales than they are to pigs. This is still controversial. It will be settled, one way or the other, along with countless other such disputes, by 2050. It will be settled because the Hippo Genome Project, the Pig Genome Project, and the Whale (if our Japanese friends haven't eaten them all by then) Genome Project will have been completed. Actually, it will not be necessary to sequence entire genomes to dissolve taxonomic uncertainty forever.
A spin-off benefit, which will perhaps have its greatest impact in the United States, is that full knowledge of the tree of life will make it even harder to doubt the fact of evolution. Fossils will become by comparison irrelevant to the argument, as hundreds of separate genes, in as many surviving species as we can bear to sequence, are found to corroborate each other's accounts of the one true tree of life.
It has been said often enough to become a platitude but I had better say it again: to know the genome of an animal is not the same as to understand that animal. Following Sydney Brenner (the single individual regarding whom, more than any other, I have heard people wonder at the absence so far of a Nobel Prize*), I shall think in terms of three steps, of increasing difficulty, in 'computing' an animal from its
genome. Step 1 was hard but has now been completely solved. It is to compute the amino acid sequence of a protein from the nucleotide sequence of a gene. Step 2 is to compute the three-dimensional folding pattern of a protein from its one-dimensional sequence of amino acids. Physicists believe that in principle this can be done, but it is hard, and it may often be quicker to make the protein and see what happens. Step 3 is to compute the developing embryo from its genes and their interaction with their environment - which mostly consists of other genes. This is the hardest step, but the science of embryology (especially of the workings of Hox and similar genes) is advancing at such a rate that by 2050 it will probably be solved. In other words, I conjecture that an embryologist of 2050 will feed the genome of an unknown animal into a computer, and the computer will simulate an embryology that will culminate in a full rendering of the adult animal. This will not be a particularly useful accomplishment in itself, since a real embryo will always be a cheaper computer than an electronic one. But it will be a way of signifying the completeness of our understanding. And particular implementations of the technology will be useful. For instance, detectives finding a bloodstain may be able to issue a computer image of the face of a suspect - or rather, since genes don't mature with age, a series of faces from babyhood to dotage!

*Stop press: Sydney Brenner's Nobel Prize was announced while this book was in proof.
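Step 1 is mechanical enough to show in miniature. A toy sketch: a fragment of the standard genetic code (only the half-dozen codons the example needs) and the translation loop itself; computing a real protein from a real gene must also cope with introns, reading-frame choice and regulatory context.

```python
# A fragment of the standard genetic code: just enough codons for the demo.
CODE = {
    "ATG": "M", "GCA": "A", "TTC": "F",
    "GAA": "E", "TGG": "W", "TAA": "*",   # '*' = stop
}

def translate(dna):
    """Step 1: nucleotide triplets to an amino acid sequence (toy version)."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODE[dna[i:i + 3]]
        if aa == "*":                     # stop codon ends the protein
            break
        protein.append(aa)
    return "".join(protein)

print(translate("ATGGCATTCGAATGGTAA"))   # 'MAFEW'
```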
I also think that by 2050 my dream of the Genetic Book of the Dead will become a reality. Darwinian reasoning shows that the genes of a species must constitute a kind of description of the ancestral environments through which those genes have survived. The gene pool of a species is the clay which is shaped by natural selection. As I put it in Unweaving the Rainbow:
Like sandbluffs carved into fantastic shapes by the desert winds, like rocks shaped by ocean waves, camel DNA has been sculpted by survival in ancient deserts, and even more ancient seas, to yield modern camels. Camel DNA speaks - if only we could read the language - of the changing worlds of camel ancestors. If only we could read the language, the DNA of tuna and starfish would have 'sea' written into the text. The DNA of moles and earthworms would spell 'underground'.
Evidently the total information capacity of genomes is very variable across the living kingdoms, and it must have changed greatly in evolution, presumably in both directions. Losses of genetic material are called deletions. New genes arise through various kinds of duplication. This is well illustrated by haemoglobin, the complex protein molecule that transports oxygen in the blood.
Human adult haemoglobin is actually a composite of four protein chains called globins, knotted around each other. Their detailed sequences show that the four globin chains are closely related to each other, but they are not identical. Two of them are called alpha globins (each a chain of 141 amino acids), and two are beta globins (each a chain of 146 amino acids). The genes coding for the alpha globins are on chromosome 11; those coding for the beta globins are on chromo- some 16. On each of these chromosomes, there is a cluster of globin genes in a row, interspersed with some junk DNA. The alpha cluster, on chromosome 11, contains seven globin genes. Four of these are pseudogenes, versions of alpha disabled by faults in their sequence and not translated into proteins. Two are true alpha globins, used in the adult. The final one is called zeta and is used only in embryos. Similarly the beta cluster, on chromosome 16, has six genes, some of which are disabled, and one of which is used only in the embryo. Adult haemoglobin, as we've seen, contains two alpha and two beta chains.
Never mind all this complexity. Here's the fascinating point. Careful letter-by-letter analysis shows that these different kinds of globin genes are literally cousins of each other, literally members of a family. But these distant cousins still coexist inside our own genome, and that of all vertebrates. On the scale of whole organisms, all vertebrates are our cousins too. The tree of vertebrate evolution is the family tree we are all familiar with, its branch-points representing speciation events - the splitting of species into pairs of daughter species. But there is another family tree occupying the same timescale, whose branches represent not speciation events but gene duplication events within genomes.
The dozen or so different globins inside you are descended from an
*My suggestion (The Selfish Gene, 1976) that surplus DNA is parasitic was later taken up and developed by others under the catch-phrase 'Selfish DNA'. See The Selfish Gene, 2nd edn (Oxford University Press, 1989), pp. 44-5 and 275.
THE 'INFORMATION CHALLENGE'
97
? LIGHT WILL BE THROWN
ancient globin gene which, in a remote ancestor who lived about half a billion years ago, duplicated, after which both copies stayed in the genome. There were then two copies of it, in different parts of the genome of all descendant animals. One copy was destined to give rise to the alpha cluster (on what would eventually become chromosome 11 in our genome), the other to the beta cluster (on chromosome 16). As the aeons passed, there were further duplications (and doubtless some deletions as well). Around 400 million years ago the ancestral alpha gene duplicated again, but this time the two copies remained near neighbours of each other, in a cluster on the same chromosome. One of them was destined to become the zeta used by embryos, the other became the alpha globin genes used by adult humans (other branches gave rise to the nonfunctional pseudogenes I mentioned). It was a similar story along the beta branch of the family, but with duplications at other moments in geological history.
Now here's an equally fascinating point. Given that the split between the alpha cluster and the beta cluster took place 500 million years ago, it will of course not be just our human genomes that show the split - that is, possess alpha genes in a different part of the genome from beta genes. We should see the same within-genome split if we look at any other mammals, at birds, reptiles, amphibians and bony fish, for our common ancestor with all of them lived less than 500 million years ago. Wherever it has been investigated, this expectation has proved correct. Our greatest hope of finding a vertebrate that does not share with us the ancient alpha/beta split would be a jawless fish like a lamprey, for they are our most remote cousins among surviving verte- brates; they are the only surviving vertebrates whose common ancestor with the rest of the vertebrates is sufficiently ancient that it could have predated the alpha/beta split. Sure enough, these jawless fishes are the only known vertebrates that lack the alpha/beta divide.
Gene duplication, within the genome, has a similar historic impact to species duplication ('speciation') in phylogeny. It is responsible for gene diversity, in the same way as speciation is responsible for phyletic diversity. Beginning with a single universal ancestor, the magnificent diversity of life has come about through a series of branchings of new species, which eventually gave rise to the major branches of the living kingdoms and the hundreds of millions of separate species that have graced the Earth. A similar series of branchings, but this time within genomes - gene duplications - has spawned the large and diverse population of clusters of genes that constitutes the modern genome.
The story of the globins is just one among many. Gene duplications
98
? and deletions have occurred from time to time throughout genomes. It is by these, and similar means, that genome sizes can increase in evolution. But remember the distinction between the total capacity of the whole genome, and the capacity of the portion that is actually used. Recall that not all the globin genes are used. Some of them, like theta in the alpha cluster of globin genes, are pseudogenes, recognizably kin to functional genes in the same genomes, but never actually translated into the action language of protein. What is true of globins is true of most other genes. Genomes are littered with nonfunctional pseudo- genes, faulty duplicates of functional genes that do nothing, while their functional cousins (the word doesn't even need scare quotes) get on with their business in a different part of the same genome. And there's lots more DNA that doesn't even deserve the name pseudogene. It too is derived by duplication, but not duplication of functional genes. It consists of multiple copies of junk, 'tandem repeats', and other nonsense which may be useful for forensic detectives but which doesn't seem to be used in the body itself. Once again, creationists might spend some earnest time speculating on why the Creator should bother to litter genomes with untranslated pseudogenes and junk tandem repeat DNA.
Can we measure the information capacity of that portion of the genome which is actually used? We can at least estimate it. In the case of the human genome it is about 2 per cent - considerably less than the proportion of my hard disk that I have used since I bought it. Presumably the equivalent figure for the crested newt is even smaller, but I don't know if it has been measured. In any case, we mustn't run away with a chauvinistic idea that the human genome somehow ought to have the largest DNA database because we are so wonderful. The great evolutionary biologist George C. Williams has pointed out that animals with complicated life cycles need to code for the development of all stages in the life cycle, but they only have one genome with which to do so. A butterfly! s genome has to hold the complete informa- tion needed for building a caterpillar as well as a butterfly. A sheep liver fluke has six distinct stages in its life cycle, each specialized for a different way of life. We shouldn't feel too insulted if liver flukes turned out to have bigger genomes than we have (actually they don't).
Remember, too, that even the total capacity of genome that is actually used is still not the same thing as the true information content in Shannon's sense. The true information content is what's left when the redundancy has been compressed out of the message, by the theoretical equivalent of Stuffit. There are even some viruses that seem
THE -INFORMATION CHALLENGE'
99
? LIGHT WILL BE THROWN
to use a kind of Stuffit-like compression. They make use of the fact that the RNA (not DNA in these viruses, as it happens) code is read in triplets. There is a 'frame' which moves along the RNA sequence, reading off three letters at a time. Obviously, under normal conditions, if the frame starts reading in the wrong place (as in a so-called frame-shift mutation), it makes total nonsense: the 'triplets' that it reads are out of step with the meaningful ones. But these splendid viruses actually exploit frame- shifted reading. They get two messages for the price of one, by having a completely different message embedded in the very same series of letters when read frame-shifted. In principle you could even get three messages for the price of one, but I don't know of any examples.
It is one thing to estimate the total information capacity of a genome, and the amount of the genome that is actually used, but it's harder to estimate its true information content in the Shannon sense. The best we can do is probably to forget about the genome itself and look at its product, the 'phenotype', the working body of the animal or plant itself. In 1951, J. W. S. Pringle, who later became my Professor at Oxford, suggested using a Shannon-type information measure to estimate 'complexity'. Pringle wanted to express complexity mathematically in bits, but I have long found the following verbal form helpful in explaining his idea.
We have an intuitive sense that a lobster, say, is more complex (more 'advanced', some might even say more 'highly evolved') than another animal, perhaps a millipede. Can we measure something in order to confirm or deny our intuition? Without literally turning it into bits, we can make an approximate estimation of the information contents of the two bodies as follows. Imagine writing a book describing the lobster. Now write another book describing the millipede down to the same level of detail. Divide the word-count in one book by the word-count in the other, and you have an approximate estimate of the relative information content of lobster and millipede. It is important to specify that both books describe their respective animals 'down to the same level of detail'. Obviously, if we describe the millipede down to cellular detail, but stick to gross anatomical features in the case of the lobster, the millipede would come out ahead.
But if we do the test fairly, I'll bet the lobster book would come out longer than the millipede book. It's a simple plausibility argument, as follows. Both animals are made up of segments - modules of bodily architecture that are fundamentally similar to each other, arranged fore- and-aft like the trucks of a train. The millipede's segments are mostly identical to each other. The lobster's segments, though following the
100
? same basic plan (each with a nervous ganglion, a pair of appendages, and so on) are mostly different from each other. The millipede book would consist of one chapter describing a typical segment, followed by the phrase 'Repeat N times', where N is the number of segments. The lobster book would need a different chapter for each segment. This isn't quite fair on the millipede, whose front and rear end segments are a bit different from the rest. But I'd still bet that, if anyone bothered to do the experiment, the estimate of lobster information content would come out substantially greater than the estimate of millipede information content.
It's not of direct evolutionary interest to compare a lobster with a millipede in this way, because nobody thinks lobsters evolved from millipedes. Obviously no modern animal evolved from any other modern animal. Instead, any pair of modern animals had a last common ancestor which lived at some (in principle) discoverable moment in geological history. Almost all of evolution happened way back in the past, which makes it hard to study details. But we can use the 'length of book' thought-experiment to agree upon what it would mean to ask the question whether information content increases over evolution, if only we had ancestral animals to look at.
The answer in practice is complicated and controversial, all bound up with a vigorous debate over whether evolution is, in general, pro- gressive. I am one of those associated with a limited form of yes answer. My colleague Stephen Jay Gould tends towards a no answer. * I don't think anybody would deny that, by any method of measuring - whether bodily information content, total information capacity of genome, capacity of genome actually used, or true ('Stuffit compressed') information content of genome - there has been a broad overall trend towards increased information content during the course of human evolution from our remote bacterial ancestors. People might disagree, however, over two important questions: first, whether such a trend is to be found in all, or a majority of evolutionary lineages (for example, parasite evolution often shows a trend towards decreasing bodily complexity, because parasites are better off being simple); second, whether, even in lineages where there is a clear overall trend over the very long term, it is bucked by so many reversals and re-reversals in the shorter term as to undermine the very idea of progress. This is not the place to resolve this interesting controversy. There are distinguished biologists with good arguments on both sides.
*See 'Human Chauvinism and Evolutionary Progress' (pp. 206-17).
Supporters of 'intelligent design' guiding evolution, by the way, should be deeply committed to the view that information content increases during evolution. Even if the information comes from God, perhaps especially if it does, it should surely increase, and the increase should presumably show itself in the genome.
Perhaps the main lesson we should learn from Pringle is that the information content of a biological system is another name for its complexity. Therefore the creationist challenge with which we began is tantamount to the standard challenge to explain how biological complexity can evolve from simpler antecedents, one that I have devoted three books to answering, and I do not propose to repeat their contents here. The 'information challenge' turns out to be none other than our old friend: 'How could something as complex as an eye evolve?' It is just dressed up in fancy mathematical language - perhaps in an attempt to bamboozle. Or perhaps those who ask it have already bamboozled themselves, and don't realize that it is the same old - and thoroughly answered - question.
Let me turn, finally, to another way of looking at whether the information content of genomes increases in evolution. We now switch from the broad sweep of evolutionary history to the minutiae of natural selection. Natural selection itself, when you think about it, is a narrowing down from a wide initial field of possible alternatives, to the narrower field of the alternatives actually chosen. Random genetic error (mutation), sexual recombination and migratory mixing all provide a wide field of genetic variation: the available alternatives. Mutation is not an increase in true information content, rather the reverse, for mutation, in the Shannon analogy, contributes to increasing the prior uncertainty. But now we come to natural selection, which reduces the 'prior uncertainty' and therefore, in Shannon's sense, contributes information to the gene pool. In every generation, natural selection removes the less successful genes from the gene pool, so the remaining gene pool is a narrower subset. The narrowing is nonrandom, in the direction of improvement, where improvement is defined, in the Darwinian way, as improvement in fitness to survive and reproduce. Of course the total range of variation is topped up again in every generation by new mutation and other kinds of variation. But it still remains true that natural selection is a narrowing down from an initially wider field of possibilities, including mostly unsuccessful ones, to a narrower field of successful ones. This is analogous to the definition of information with which we began: information is what enables the narrowing down from prior uncertainty (the initial range of possibilities) to later certainty (the 'successful' choice among the prior probabilities). According to this analogy, natural selection is by definition a process whereby information is fed into the gene pool of the next generation.
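The narrowing-down can be put in figures. A minimal sketch (my numbers, assumed for illustration), treating all the available alternatives as equally probable, so that the information contributed by selection is simply the logarithm of the narrowing ratio:

```python
import math

def selection_information_bits(n_prior: int, n_posterior: int) -> float:
    """Bits gained when selection narrows n_prior equally probable
    alternatives down to n_posterior survivors."""
    return math.log2(n_prior / n_posterior)

# Toy figures: 1024 available alternatives narrowed to 8 successful ones
# contributes log2(1024/8) = 7 bits to the next generation's gene pool.
print(selection_information_bits(1024, 8))
```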
If natural selection feeds information into gene pools, what is the information about? It is about how to survive. Strictly, it is about how to survive and reproduce, in the conditions that prevailed when previous generations were alive. To the extent that present day conditions are different from ancestral conditions, the ancestral genetic advice will be wrong. In extreme cases, the species may then go extinct. To the extent that conditions for the present generation are not too different from conditions for past generations, the information fed into present-day genomes from past generations is helpful information. Information from the ancestral past can be seen as a manual for surviving in the present: a family bible of ancestral 'advice' on how to survive today. We need only a little poetic licence to say that the information fed into modern genomes by natural selection is actually information about ancient environments in which ancestors survived.
This idea of information fed from ancestral generations into descendant gene pools is one of the themes of my book Unweaving the Rainbow. It takes a whole chapter, 'The Genetic Book of the Dead', to develop the notion, so I won't repeat it here except to say two things. First, it is the gene pool of the species as a whole, not the genome of any particular individual, which is best seen as the recipient of the ancestral information about how to survive. The genomes of particular individuals are random samples of the current gene pool, randomized by sexual recombination. Second, we are privileged to 'intercept' the information if we wish, and 'read' an animal's body, or even its genes, as a coded description of ancestral worlds. To quote from Unweaving the Rainbow:
And isn't it an arresting thought? We are digital archives of the African Pliocene, even of Devonian seas; walking repositories of wisdom out of the old days. You could spend a lifetime reading in this ancient library and die unsated by the wonder of it.
Genes Aren't Us
The bogey of genetic determinism needs to be laid to rest. The discovery of a so-called 'gay gene' is as good an opportunity as we'll get to lay it.
The facts are quickly stated. In the magazine Science, a team of
researchers from the National Institutes of Health, in Bethesda, Maryland, reported the following pattern. Homosexual males are more likely than you'd expect by chance to have homosexual brothers. Revealingly, they are also more likely than you'd expect by chance to have homosexual maternal uncles and homosexual cousins on the mother's side, but not on the father's side. This pattern raises the immediate suspicion that at least one gene causing homosexuality in males is carried on the X chromosome.*
The Bethesda team went further. Modern technology made it possible for them to search for particular marker strings in the DNA code itself. In one region, called Xq28, near the tip of the X chromosome, they found five identical markers shared by a suggestively high percentage of homosexual brothers. These facts combine elegantly with one another to confirm earlier evidence of a hereditary component to male homosexuality.
So what? Are sociology's foundations trembling? Should theologians be wringing their hands with concern, and lawyers rubbing theirs with anticipation? Does this finding tell us anything new about 'blame' or 'responsibility'? Does it add anything, one way or the other, to arguments about whether homosexuality is a condition that could, or should, be 'cured'? Should it make individual homosexuals more or less proud, or ashamed, of their predilections? No to all these questions. If you are proud, you can stay proud. If you prefer to be guilty, stay guilty. Nothing has changed. In explaining what I mean,
*Because males have only one X chromosome, which they necessarily get from their mother. Females have two X chromosomes, one from each parent. A male shares X chromosome genes with his maternal, but not his paternal, uncle.
I am less interested in this particular case than I am in using it to illustrate a more general point about genes and the bogey of genetic determinism.
There is an important distinction between a blueprint and a recipe.*
A blueprint is a detailed, point-for-point specification of some end product like a house or a car. One diagnostic feature of a blueprint is that it is reversible. Give an engineer a car and he can reconstruct its blueprint. But offer to a chef a rival's pièce de résistance to taste and he will fail to reconstruct the recipe. There is a one-to-one mapping between components of a blueprint and components of the end product. This bit of the car corresponds to this bit of the blueprint. That bit of the car corresponds to that bit of the blueprint. There is no such one-to-one mapping in the case of a recipe. You can't isolate a particular blob of soufflé and seek one word of the recipe that 'determines' that blob. All the words of the recipe, taken together with all the ingredients, combine to form the whole soufflé.
Genes, in different aspects of their behaviour, are sometimes like blueprints and sometimes like recipes. It is important to keep the two aspects separate. Genes are digital, textual information, and they retain their hard, textual integrity as they change partners down the generations. Chromosomes - long strings of genes - are formally just like long computer tapes.
When a portion of genetic tape is read in a cell, the first thing that happens to the information is that it is translated from one code to another: from the DNA code to a related code that dictates the exact shape of a protein molecule. So far, the gene behaves like a blueprint. There really is a one-to-one mapping between bits of gene and bits of protein, and it really is deterministic.
It is in the next step of the process - the development of a whole body and its psychological predispositions - that things start to get more complicated and recipe-like. There is seldom a simple one-to-one mapping between particular genes and 'bits' of body. Rather, there is a mapping between genes and rates at which processes happen during embryonic development. The eventual effects on bodies and their behaviour are often multifarious and hard to unravel.
The recipe is a good metaphor but, as an even better one, think of the body as a blanket, suspended from the ceiling by 100,000 rubber bands, all tangled and twisted around one another. The shape of the blanket - the body - is determined by the tensions of all these rubber bands taken together. Some of the rubber bands represent genes, others
*This distinction was also used in 'Darwin Triumphant' (p. 89).
environmental factors. A change in a particular gene corresponds to a lengthening or shortening of one particular rubber band. But any one rubber band is linked to the blanket only indirectly via countless connections amid the welter of other rubber bands. If you cut one rubber band, or tighten it, there will be a distributed shift in tensions, and the effect on the shape of the blanket will be complex and hard to predict.
In the same way, possession of a particular gene need not infallibly dictate that an individual will be homosexual. Far more probably the causal influence will be statistical. The effect of genes on bodies and behaviour is like the effect of cigarette smoke on lungs. If you smoke heavily, you increase the statistical odds that you'll get lung cancer. You won't infallibly give yourself lung cancer. Nor does refraining from smoking protect you infallibly from cancer. We live in a statistical world.
Imagine the following newspaper headline: 'Scientists discover that homosexuality is caused.' Obviously this is not news at all; it is trivial. Everything is caused. To say that homosexuality is caused by genes is more interesting, and it has the aesthetic merit of discomfiting politically-inspired bores, but it doesn't say more than my trivial headline does about the irrevocability of homosexuality.
Some genetic causes are hard to reverse. Others are easy. Some environmental causes are easy to reverse. Others are hard. Think how tenaciously we cling to the accent of childhood: an adult immigrant is labelled a foreigner for life. This is far more ineluctably deterministic than many genetic effects. It would be interesting to know the statistical likelihood that a child, subjected to a particular environmental influence such as religious indoctrination by nuns, will be able to escape the influence later on. It would similarly be interesting to know the statistical likelihood that a man possessing a particular gene in the Xq28 region of the X chromosome will turn out to be homosexual. The mere demonstration that there exists a gene 'for' homosexuality leaves the value of that likelihood almost totally open. Genes have no monopoly on determinism.
So, if you hate homosexuals or love them, if you want to lock them up or 'cure' them, your reasons had better have nothing to do with genes.
Son of Moore's Law
Great achievers who have gone far sometimes amuse themselves by then going too far. Peter Medawar knew what he was doing when he wrote, in his review of James D. Watson's The Double Helix,
It is simply not worth arguing with anyone so obtuse as not to realize that this complex of discoveries [molecular genetics] is the greatest achievement of science in the twentieth century.
Medawar, like the author of the book he was reviewing, could justify his arrogance in spades, but you don't have to be obtuse to dissent from his opinion. What about that earlier Anglo-American complex of discoveries known as the Neo-Darwinian Modern Synthesis? Physicists could make a good case for relativity or quantum mechanics, and cosmologists for the expanding universe. The 'greatest' anything is ultimately undecidable, but the molecular genetic revolution was undeniably one of the greatest achievements of science in the twentieth century - and that means of the human species, ever. Where shall we take it - or where will it take us - in the next fifty years? By mid-century, history may judge Medawar to have been closer to the truth than his contemporaries - or even he - allowed.
If asked to summarize molecular genetics in a word, I would choose 'digital'. Of course, Mendel's genetics was digital in being particulate with respect to the independent assortment of genes through pedigrees. But the interior of genes was unknown and they could still have been substances with continuously varying qualities, strengths and flavours, inextricably intertwined with their effects. Watson/Crick genetics is digital through and through, digital to its very backbone, the double helix itself. A genome's size can be measured in gigabases with exactly the same precision as a hard drive is sized up in gigabytes. Indeed, the two units are interconvertible by constant multiplication. Genetics today is pure information technology. This, precisely, is why an
antifreeze gene can be copied from an Arctic fish and pasted into a tomato.*
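The constant multiplication is worth spelling out. DNA's alphabet has four letters, so each base carries log2 4 = 2 bits, a quarter of a byte; the genome size in the sketch below is a round figure of my own, not from the essay:

```python
# Four possible letters (A, T, C, G) at each position: 2 bits per base.
BITS_PER_BASE = 2           # log2(4)
BYTES_PER_BASE = BITS_PER_BASE / 8

human_genome_bases = 3.1e9  # roughly 3.1 gigabases (a round figure)
gigabytes = human_genome_bases * BYTES_PER_BASE / 1e9
print(f"{gigabytes:.2f} gigabytes")  # ~0.78 GB: divide gigabases by four
```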
The explosion sparked by Watson and Crick grew exponentially, as a good explosion should, during the half century since their famous joint publication. I think I mean that literally, and I'll support it by analogy with a better known explosion, this time from information technology as conventionally understood. Moore's Law states that computer power doubles every eighteen months. It is an empirical law without an agreed theoretical underpinning, though Nathan Myhrvold offers a wittily self-referential candidate: 'Nathan's Law' states that software grows faster than Moore's Law, and that is why we have Moore's Law. Whatever the underlying reason, or complex of reasons, Moore's Law has held true for nearly fifty years. Many analysts expect it to continue for as long again, with stunning effects upon human affairs - but that is not my concern in this essay.
Instead, is there something equivalent to Moore's Law for DNA information technology? The best measure would surely be an economic one, for money is a good composite index of man-hours and equipment costs. As the decades go by, what is the benchmark number of DNA kilobases that can be sequenced for a standard quantity of money? Does it increase exponentially, and if so what is its doubling time? Notice, by the way (it is another aspect of DNA science's being a branch of information technology) that it makes no difference which animal or plant provides the DNA. The sequencing techniques and the costs in any one decade are much the same. Indeed, unless you read the text message itself, it is impossible to tell whether DNA comes from a man, a mushroom or a microbe.
Having chosen my economic benchmark, I didn't know how to measure the costs in practice. Fortunately, I had the good sense to ask my colleague Jonathan Hodgkin, Professor of Genetics at Oxford University. I was delighted to discover that he had recently done the very thing while preparing a lecture for his old school, and he kindly sent me the following estimates of the cost, in pounds sterling, per base pair (that is, 'per letter' of the DNA code) sequenced. In 1965, it cost about £1000 per letter to sequence 5S ribosomal RNA from bacteria (not DNA, but RNA costs are similar). In 1975, to sequence DNA from the virus φX174 cost about £10 per letter. Hodgkin didn't find a good example for 1985, but in 1995 it cost £1 per letter to sequence the DNA
of Caenorhabditis elegans, the tiny nematode worm of which molecular
*See 'Science, Genetics and Ethics: Memo for Tony Blair' (p. 28).
[Graph: DNA sequenced (kilobases per £1000, logarithmic scale) plotted against year, 1960-2060. Linear regression fitted to four data points, then extrapolated to 2050.]
biologists are so (rightly) enamoured that they call it 'the' nematode, or even 'the' worm.* By the time the Human Genome Project culminated around 2000, sequencing costs were about £0.1 per letter. To show the positive trend of growth, I inverted these figures to 'bangs for the buck' - that is, quantity of DNA that can be sequenced for a fixed amount of money, and I chose £1000, correcting for inflation. I have plotted the resulting kilobases per £1000 on a logarithmic scale, which is convenient because exponential growth shows up as a straight line. (See graph.)
*The absurdity of this can be gauged from an image I have never forgotten, quoted in one of the first zoology books I ever owned, Ralph Buchsbaum's Animals without Backbones (University of Chicago Press). 'If all the matter in the universe except the nematodes were swept away, our world would still be dimly recognizable . . . we should find its mountains, hills, vales, rivers, lakes, and oceans represented by a film of nematodes . . . Trees would still stand in ghostly rows representing our streets and highways. The location of the various plants and animals would still be decipherable, and, had we sufficient knowledge, in many cases even their species could be determined by an examination of their erstwhile nematode parasites.' There are probably more than half a million species of nematodes, hugely outnumbering the species in all the vertebrate classes put together.
I must emphasize, as Professor Hodgkin did to me, that the four data points are back-of-the-envelope calculations. Nevertheless, they do fall convincingly close to a straight line, suggesting that the increase in our DNA sequencing power is exponential. The doubling time (or cost-halving time) is twenty-seven months, which may be compared with the eighteen months of Moore's Law. To the extent that DNA sequencing work depends upon computer power (quite a large extent), the new law we have discovered probably owes a great deal to Moore's Law itself, which justifies my facetious label, 'Son of Moore's Law'.
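The fit itself is elementary: an ordinary least-squares line through the logarithms of the kilobase figures. Here is a sketch of the calculation, with two caveats: the four data points below are the rounded powers of ten quoted above rather than Hodgkin's precise estimates (with these rounded figures the fitted doubling time comes out nearer three years than twenty-seven months), and the 2050 price check therefore uses the twenty-seven-month figure from the text.

```python
import math

# The four back-of-the-envelope points, converted as in the text to
# kilobases sequenced per £1000: £1000/letter in 1965 is 0.001 kb, etc.
# These are rounded powers of ten, not Hodgkin's precise estimates.
years = [1965, 1975, 1995, 2000]
kb_per_1000_pounds = [0.001, 0.1, 1.0, 10.0]

# Least-squares straight line through log2(kilobases) against year:
# the slope is the number of doublings per year.
logs = [math.log2(k) for k in kb_per_1000_pounds]
n = len(years)
mx, my = sum(years) / n, sum(logs) / n
slope = sum((x - mx) * (y - my) for x, y in zip(years, logs)) \
      / sum((x - mx) ** 2 for x in years)
print(f"fitted doubling time: {12 / slope:.0f} months")  # ~36 with rounded data

# Extrapolating the essay's 27-month doubling time from the year-2000
# point (10 kb per £1000) out to 2050, then pricing a ~3-gigabase genome:
kb_in_2050 = 10 * 2 ** ((2050 - 2000) * 12 / 27)
print(f"2050 genome price: ~£{3e6 / kb_in_2050 * 1000:.0f}")  # order of £100
```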
It is by no means to be expected that technological progress should advance in this exponential way. I haven't plotted the figures out, but I'd be surprised if, say, speed of aircraft, fuel economy of cars, or height of skyscrapers were found to advance exponentially. Rather than double and double again in a fixed time, I suspect that they advance by something closer to arithmetic addition. Indeed, the late Christopher Evans, as long ago as 1979, when Moore's Law had scarcely begun, wrote:
Today's car differs from those of the immediate postwar years on a number of counts . . . But suppose for a moment that the automobile industry had developed at the same rate as computers and over the same period: how much cheaper and more efficient would the current models be? . . . Today you would be able to buy a Rolls-Royce for £1.35*, it would do three million miles to the gallon, and it would deliver enough power to drive the Queen Elizabeth II. And if you were interested in miniaturization, you could place half a dozen of them on a pinhead.
Space exploration also seemed to me a likely candidate for modest additive increase like motor cars. Then I remembered a fascinating speculation mentioned by Arthur C. Clarke, whose credentials as a prophet are not to be ignored. Imagine a future spacecraft heading off for a distant star. Even travelling at the highest speed allowed by the current state of the art, it would still take many centuries to reach its distant destination. And before it had completed half its journey it would be overtaken by a faster vessel, the product of a later century's technology. So, it might be said, the original ship should never have bothered to set out. By the same argument, even the second spaceship should not bother to set out, because its crew is fated to wave to their great-grandchildren as they zoom by in a third. And so on. One way to resolve the paradox is to point out that the technology to develop later spaceships would not become available without the research and development that went into their slower predecessors. I would give the same answer to anybody who suggested that since the entire Human Genome Project could now be started from scratch and completed in a
*Two US dollars.
fraction of the years the actual project took, the original enterprise should have been postponed appropriately.
If our four data points are admittedly rough estimates, the extrapolation of the straight line out to the year 2050 is even more tentative. But by analogy with Moore's Law, and especially if Son of Moore's Law really does owe something to its parent, this straight line probably represents a defensible prognostication. Let's at least follow to see where it will take us. It suggests that in the year 2050 we shall be able to sequence a complete individual human genome for £100 at today's values (about $160). Instead of 'the' human genome project, every individual will be able to afford their own personal genome project. Population geneticists will have the ultimate data on human diversity. It will be possible to work out trees of cousinship linking any person in the world to any other person. It is a historian's wildest dream. They will use the geographic distribution of genes to reconstruct the great migrations and invasions of the centuries, track voyages of Viking longships, follow the American tribes by their genes down from Alaska to Tierra del Fuego and the Saxons across Britain, document the diaspora of the Jews, even identify the modern descendants of pillaging warlords like Genghis Khan.*
Today, a chest X-ray will tell you whether you have lung cancer or tuberculosis. In 2050, for the price of a chest X-ray, you will be able to know the full text of every one of your genes. The doctor will hand you not the prescription recommended for an average person with your complaint but the prescription that precisely suits your genome. That is no doubt good, but your personal printout will also predict, with alarming precision, your natural end. Shall we want such knowledge? Even if we want it ourselves, shall we want our DNA printout to be read by insurance actuaries, paternity lawyers, governments? Even in a benign democracy, not everybody is happy with such a prospect. How some future Hitler might abuse this knowledge needs thinking about.
Weighty as such concerns may be, they are again not mine in this essay. I retreat to my ivory tower and more academic preoccupations. If £100 becomes the price of sequencing a human genome, the same money will buy the genome of any other mammal; all are about the same size, in the gigabase order of magnitude, as is true of all vertebrates. Even if we assume that Son of Moore's Law will flatten off before 2050, as many people believe Moore's Law will, we can still safely predict that it will become economically feasible to sequence the
*DNA analysis is already making exciting contributions to historical research. See, for example, Bryan Sykes, The Seven Daughters of Eve (London, Bantam Press, 2001) and S. Wells, The Journey of Man: A Genetic Odyssey (London, Allen Lane, 2002).
genomes of hundreds of species per year. Having such a welter of information is one thing. What can we do with it? How shall we digest it, sift it, collate it, use it?
One relatively modest goal will be total and final knowledge of the phylogenetic tree. For there is, after all, one true tree of life, the unique pattern of evolutionary branching that actually happened. It exists. It is in principle knowable. We don't know it all yet. By 2050 we should - or if we do not, we shall have been defeated only at the terminal twigs, by the sheer number of species (a number that, as my colleague Robert May points out, is at present unknown to the nearest one or even two orders of magnitude).
My research assistant Yan Wong suggests that naturalists and ecologists in 2050 will carry a small field taxonomy kit, which will obviate the need to send specimens off to a museum expert for identification. A fine probe, hooked up to a portable computer, will be inserted into a tree, or a freshly trapped vole or grasshopper. Within minutes, the computer will chew over a few key segments of DNA, then spit out the species name and any other details that may be in its stored database.
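Purely by way of illustration, the identification step of such a kit might look something like the following sketch. Everything here is invented for the example - the marker sequences, the species list and the k-mer matching scheme are placeholders, not a description of any real instrument:

```python
# Hypothetical reference database: species -> a short DNA marker sequence.
# All sequences are made up for the example.
REFERENCE = {
    "Myodes glareolus (bank vole)":     "ATGCGAACGTTAGCATAGGCTAGG",
    "Apodemus sylvaticus (wood mouse)": "ATGCGTACGTTAGCCTAGGCTAGG",
    "Quercus robur (oak)":              "TTGCGTAAGTTCGCCTAGCCTAGG",
}

def kmers(seq: str, k: int = 6) -> set[str]:
    """All overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def identify(query: str) -> str:
    # Pick the species whose marker shares the most k-mers with the query.
    return max(REFERENCE, key=lambda sp: len(kmers(query) & kmers(REFERENCE[sp])))

# A freshly sequenced marker from the trapped vole:
print(identify("ATGCGAACGTTAGCATAGGCTAGG"))
```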
Already, DNA taxonomy has turned up some sharp surprises. My traditional zoologist's mind protests almost unendurably at being asked to believe that hippos are more closely related to whales than they are to pigs. This is still controversial. It will be settled, one way or the other, along with countless other such disputes, by 2050. It will be settled because the Hippo Genome Project, the Pig Genome Project, and the Whale (if our Japanese friends haven't eaten them all by then) Genome Project will have been completed. Actually, it will not be necessary to sequence entire genomes to dissolve taxonomic uncertainty forever.
A spin-off benefit, which will perhaps have its greatest impact in the United States, is that full knowledge of the tree of life will make it even harder to doubt the fact of evolution. Fossils will become by comparison irrelevant to the argument, as hundreds of separate genes, in as many surviving species as we can bear to sequence, are found to corroborate each other's accounts of the one true tree of life.
It has been said often enough to become a platitude but I had better say it again: to know the genome of an animal is not the same as to understand that animal. Following Sydney Brenner (the single individual regarding whom, more than any other, I have heard people wonder at the absence so far of a Nobel Prize*), I shall think in terms of three steps, of increasing difficulty, in 'computing' an animal from its
*Stop press: Sydney Brenner's Nobel Prize was announced while this book was in proof.
genome. Step 1 was hard but has now been completely solved. It is to compute the amino acid sequence of a protein from the nucleotide sequence of a gene. Step 2 is to compute the three-dimensional folding pattern of a protein from its one-dimensional sequence of amino acids. Physicists believe that in principle this can be done, but it is hard, and it may often be quicker to make the protein and see what happens. Step 3 is to compute the developing embryo from its genes and their interaction with their environment - which mostly consists of other genes. This is the hardest step, but the science of embryology (especially of the workings of Hox and similar genes) is advancing at such a rate that by 2050 it will probably be solved. In other words, I conjecture that an embryologist of 2050 will feed the genome of an unknown animal into a computer, and the computer will simulate an embryology that will culminate in a full rendering of the adult animal. This will not be a particularly useful accomplishment in itself, since a real embryo will always be a cheaper computer than an electronic one. But it will be a way of signifying the completeness of our understanding. And particular implementations of the technology will be useful. For instance, detectives finding a bloodstain may be able to issue a computer image of the face of a suspect - or rather, since genes don't mature with age, a series of faces from babyhood to dotage!
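Step 1, in fact, is now so routine that its core fits in a few lines. A minimal sketch, with a deliberately truncated codon table (the real genetic code maps all sixty-four triplets; only the handful used in the example is shown):

```python
# A fragment of the genetic code: DNA triplets (codons) -> amino acids.
CODON_TABLE = {
    "ATG": "Met", "TTT": "Phe", "GGC": "Gly", "AAA": "Lys",
    "TAA": "STOP", "TAG": "STOP", "TGA": "STOP",
}

def translate(dna: str) -> list[str]:
    """Read the sequence in non-overlapping triplets until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

print(translate("ATGTTTGGCAAATAA"))  # ['Met', 'Phe', 'Gly', 'Lys']
```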
I also think that by 2050 my dream of the Genetic Book of the Dead will become a reality. Darwinian reasoning shows that the genes of a species must constitute a kind of description of the ancestral environments through which those genes have survived. The gene pool of a species is the clay which is shaped by natural selection. As I put it in Unweaving the Rainbow:
Like sandbluffs carved into fantastic shapes by the desert winds, like rocks shaped by ocean waves, camel DNA has been sculpted by survival in ancient deserts, and even more ancient seas, to yield modern camels. Camel DNA speaks - if only we could read the language - of the changing worlds of camel ancestors. If only we could read the language, the DNA of tuna and starfish would have 'sea' written into the text. The DNA of moles and earthworms would spell 'underground'.
