Richard Dawkins, Unweaving the Rainbow
Most of us, including most jurors and lawyers, have an intuitive sense that there is something specially reliable about eye-witness evidence.
In this we are almost certainly wrong, but the error is a pardonable one.
It may even be built into us by millennia of evolutionary history in which eye-witness evidence really was the most reliable.
If I see a man in a red woolly hat climbing a drainpipe, you will have a hard time persuading me later that he was actually wearing a blue beret.
Our intuitive biases are such that eye-witness evidence trumps all other categories.
Yet numerous studies have shown that eye-witnesses, however convinced they may be, however sincere and well-meaning, frequently mis-remember even conspicuous details such as the colour of clothing and the number of assailants present.
When individual identification is important, for instance when a woman who has been raped is called upon to identify her attacker, courts perform a rudimentary statistical test known as the identity parade or line-up. The woman is led past a line of men, one of whom the police suspect on other grounds. The others have been pulled in off the streets or are out-of-work actors, or police officers dressed in plain clothes. If the woman picks out one of these stooges, her identification evidence is discounted. But if she picks out the man the police already suspect, her evidence is taken seriously.
Rightly so. Especially if the number of people in the identity parade is large. We are all statisticians enough to see why this is. The prior suspicion of the police must be open to doubt - otherwise there would be no point in seeking the woman's evidence at all. What impresses us is agreement between the woman's identification and the independent evidence offered by the police. If the identity parade contains only two men, the witness would have a 50 per cent chance of picking the man already suspected by the police, even if she chose at random - or if she were mistaken. Since the police might also be mistaken, this represents an unacceptably high risk of injustice. But if there are 20 men in the line, the woman has only a 1 in 20 chance of choosing, by guesswork or error, the man the police already suspect. The coincidence of her identification and the police's prior suspicion probably really means something. What
is going on here is the assessment of coincidence, or the odds that something might happen by chance alone. The probability of meaningless coincidence is even less if the identity parade has 100 men, because a 1 in 100 chance of error is noticeably less than a 1 in 20 chance of error. The longer the line-up, the more secure the eventual conviction.
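The line-up arithmetic above can be sketched in a few lines of Python. This is purely illustrative and not anything from the text; it assumes the witness, if guessing or mistaken, picks uniformly at random:

```python
# Chance that a guessing or mistaken witness picks out the police's
# suspect purely by coincidence, for line-ups of different sizes.

def chance_match(lineup_size: int) -> float:
    """Probability of fingering the suspect by pure guesswork."""
    return 1.0 / lineup_size

for n in (2, 20, 100):
    print(f"line-up of {n:3d}: {chance_match(n):.0%} chance of a coincidental match")
```

With two men the coincidence probability is 50 per cent, with twenty it is 5 per cent, with a hundred only 1 per cent, which is the sense in which the longer line-up makes the eventual conviction more secure.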
We also have an intuitive sense that the men chosen for the line-up must not look too obviously different from the suspect. If the woman originally told the police to look for a man with a beard, and the police have now arrested a bearded suspect, it is clearly unjust to stand him in a line with 19 clean-shaven men. He might as well be standing by himself. Even if the woman has said nothing about the appearance of her attacker, if the police have arrested a punk in a leather jacket it would be wrong to stand him in a line of suited accountants with furled umbrellas. In multiracial countries such considerations have added importance. Everyone understands that a black suspect should not be placed in an otherwise all-white line-up, or vice versa.
When we think about how we identify somebody, the face first leaps to mind. We are particularly good at distinguishing faces. As we shall see in another connection, we even seem to have evolved a special part of the brain set aside for the purpose, and certain kinds of brain damage
disable our face-recognition faculty while leaving the rest of vision intact. In any case, faces are good for recognition because they are so variable. With the well-known exception of identical twins, you seldom meet two people whose faces are confusable. It is not totally unknown, however, and an actor can be made up to look very like somebody else. Dictators often employ doubles to perform for them when they are too busy, or to draw the fire of assassins. It has been suggested that one reason charismatic leaders so often sport moustaches (Hitler, Stalin, Franco, Saddam Hussein, Oswald Mosley) is to make it easier for doubles to impersonate them. Mussolini's shaven head perhaps served the same purpose.
Apart from identical twins, ordinary close relatives are sometimes sufficiently alike to fool people who don't know them well. (Unfortunately the story that Doctor Spooner, when Warden of my college, once stopped an undergraduate and said, 'I never can remember, is it you or your brother who was killed in the war?' is probably not true, like most alleged Spoonerisms.) The resemblance of brothers and sisters, of fathers and sons, of grandparents and grandchildren, serves to remind us of the huge pool of facial variety in the general population of non-relatives.
But faces are only a special case. We are riddled with idiosyncrasies which, with sufficient training, can be used to identify individuals. I had
a school friend who claimed (and my spot checks confirmed it) that he could recognize any member of the 80-strong residence in which we lived purely by listening to their footsteps. I had another friend from Switzerland who claimed that when she walked into a room she could tell, by smell, which members of her circle of acquaintances had recently left the room. It is not that her colleagues didn't wash, just that she was unusually sensitive. That this is in principle possible is confirmed by the fact that police dogs can distinguish between any two human beings by smell alone, with the exception, yet again, of identical twins. As far as I know, the police haven't adopted the following technique, but I bet you could train bloodhounds to track down a kidnapped child after giving them a sample sniff of his brother. A way might even be found to use a jury of bloodhounds to decide paternity cases.
Voices are as idiosyncratic as faces, and various research teams are working on computer voice recognition systems for authenticating identity. It would be a great boon if, in the future, we could dispense with front door keys and rely on a voice-operated computer to obey our personal Open Sesame command. Handwriting is sufficiently individual for the written signature to be used as a guarantee of identity on bank cheques and important legal documents. Signatures are actually not particularly secure because they are too easily forged, but it is still impressive how recognizable handwriting can be. A promising newcomer
to the list of individual 'signatures' is the iris of the eye. At least one bank is experimenting with automated iris-scanning machines as a way of verifying identity. The customer stands in front of a camera which photographs the eye and digitizes the image into what a newspaper described as a 256-byte 'human barcode'. But none of these methods of verifying human identity even comes close to the potential of DNA fingerprinting, properly applied.
It is not surprising that police dogs can smell the difference between any two humans except identical twins. Our sweat contains a complicated cocktail of proteins, and the precise details of all proteins are minutely specified by the coded DNA instructions that are our genes. Unlike handwriting and faces, which vary continuously and grade smoothly into one another, genes are digital codes, much like those used in computers. Again with the exception of identical twins, we differ genetically from all other people in discrete, discontinuous ways: an exact number of ways that you could even count if you had the patience. The DNA in each one of my cells (give or take a tiny minority of mistakes, and not including red blood cells which have lost all their DNA, or reproductive cells which contain a random half of my genes) is identical to the DNA in all my other cells. It differs from the DNA in every one of your cells, not in some vague, impressionistic way but at a precise number of locations dotted along the billions of DNA letters that we both have.
It is almost impossible to exaggerate the importance of the digital revolution in molecular genetics. Before Watson and Crick's epochal announcement in 1953 of the structure of DNA, it was still possible to agree with the concluding words of Charles Singer's authoritative A Short History of Biology, published in 1931:
... despite interpretations to the contrary, the theory of the gene is not a 'mechanist' theory. The gene is no more comprehensible as a chemical or physical entity than is the cell or, for that matter, the organism itself. Further, though the theory speaks in terms of genes as the atomic theory speaks in terms of atoms, it must be remembered that there is a fundamental distinction between the two theories. Atoms exist independently, and their properties as such can be examined. They can even be isolated. Though we cannot see them, we can deal with them under various conditions and in various combinations. We can deal with them individually. Not so the gene. It exists only as a part of the chromosome, and the chromosome only as part of a cell. If I ask for a living chromosome, that is, for the only effective kind of chromosome, no one can give it to me except in its living surroundings any more than he can give me a living arm or leg. The doctrine of the relativity of functions is as true for the gene as it is for any of the organs of the body. They exist and function only in relation to other organs. Thus the last of the
biological theories leaves us where the first started, in the presence of a power called life or psyche which is not only of its own kind but unique in each and all of its exhibitions.
This is dramatically, profoundly, hugely wrong. And it really matters. Following Watson and Crick and the revolution that they sparked, a gene can be isolated. It can be purified, bottled, crystallized, read as digitally coded information, printed on a page, fed into a computer, read out again into a test tube and reinserted into an organism where it works exactly
as it did before. When the Human Genome Project, which set out to work out the complete gene sequence of a human being, is completed,
probably by the year 2005, the full genome will fit comfortably on two standard CD ROM discs, leaving enough space for a textbook of
molecular embryology. These two discs could then be sent into outer space, and the human race could go extinct secure in the knowledge that there is now a chance that at some future time and in some distant place, a sufficiently advanced civilization would be able to reconstitute a human being. Meanwhile, back on earth, it is because DNA is deeply and fundamentally digital - because the differences between individuals and between species can be precisely counted, not vaguely and impressionistically measured - that DNA fingerprinting is potentially so powerful.
I assert the uniqueness of each individual's DNA with confidence, but even this is only a statistical judgement. Theoretically, the sexual lottery could throw up the same genetic sequence twice. An 'identical twin' of Isaac Newton could be born tomorrow. But the number of people that would have to be born in order to make this event at all likely would be larger than the number of atoms in the universe. Unlike our face, voice
or handwriting, the DNA in most of our cells stays the same from babyhood to old age, and it cannot be altered by training or cosmetic surgery. Our DNA text has such a huge number of letters that we can precisely quantify the expected number shared by, say, brothers or first cousins as opposed to, say, second cousins or random pairs chosen from the population at large. This makes it useful not only for labelling individuals uniquely and matching them to traces such as blood or semen, but for establishing paternity and other genetic relationships. British law allows people to immigrate if they can prove that their
parents are already British citizens. A number of children from the
Indian subcontinent have been arrested by sceptical immigration officials. Before the advent of DNA fingerprinting it was often impossible for these unfortunate people to prove their parentage. Now it is easy. All you do is take a sample of blood from the putative parents and compare a particular set of genes with the corresponding set of genes from the child. The verdict is clear and unequivocal, with none of the doubt or fuzziness
that creates a need for qualitative judgements. Several young people in Britain today owe their citizenship to DNA technology.
A similar method was used to identify skeletons discovered in Yekaterinburg and suspected of belonging to the executed Russian royal family. Prince Philip, Duke of Edinburgh, whose exact relationship to the Romanovs is known, graciously gave blood, and from this it was possible to establish that the skeletons were indeed those of the Tsar's family. In a more macabre case, a skeleton exhumed in South America was proved to belong to Doctor Josef Mengele, the Nazi war criminal known as the 'Angel of Death'. DNA taken from the bones was compared with blood from Mengele's still-living son, and the identity of the skeleton proved. More recently, a corpse dug up in Berlin has been proved, by the same method, to be that of Martin Bormann, Hitler's deputy, whose disappearance had led to endless legends and rumours and more than 6,000 'sightings' around the world.
Despite the name 'fingerprinting', our DNA, being digital, is even more individually characteristic than the patterns of whorls on our fingers. The name is appropriate because, like true fingerprints, DNA evidence is often inadvertently left behind after a person has departed the scene. DNA can be extracted from a bloodstain on a carpet, from semen inside a rape victim, from a crust of dried nasal mucus on a handkerchief, from sweat or from shed hairs. The DNA in the sample can then be compared with that in the blood taken from a suspect. It is possible to assess, to almost any desired level of probability, whether the sample belongs to a particular person or not.
So, what are the snags? Why is DNA evidence controversial? What is it about this important kind of evidence that makes it possible for lawyers to bamboozle juries into misinterpreting or ignoring it? Why have some courts been moved to the despairing extreme of ruling out this evidence altogether?
There are three major classes of potential problem, one simple, one sophisticated and one silly. I'll come to the silly problem and the more sophisticated difficulties later but first, as with any kind of evidence, there is the simple - and very important - possibility of human error. Possibilities, rather, for there are plenty of opportunities for mistakes and even sabotage. A tube of blood may be mislabelled, either by accident or in a deliberate attempt to frame somebody. A sample from the scene of a crime may be contaminated by sweat from a lab technician or a police officer. The danger of contamination is especially great in those cases where an ingenious technique of amplification called PCR (polymerase chain reaction) is used.
You can easily see why amplification might be desirable. A tiny smear of sweat on a gun butt contains precious little DNA. Sensitive though DNA analysis can be, it needs a certain minimum quantity of material to work on. The technique of PCR, invented in 1983 by the American biochemist Kary B. Mullis, is the dramatically successful answer. PCR takes what little DNA there is and produces millions of copies, multiplying again and again whatever code sequences are there. But, as always with amplification, errors are amplified along with the true signal. Stray scraps of DNA contamination from a technician's sweat are amplified as effectively as the specimen from the scene of the crime, with obvious possibilities for injustice.
But human error is not peculiar to DNA evidence. All kinds of evidence are vulnerable to bungling and sabotage, and must be handled with scrupulous care. The files in a conventional fingerprint library may be mislabelled. The murder weapon may have been touched by innocent people as well as the murderer, and their fingerprints have to be taken, along with the suspect's, for elimination purposes. Courts of law are already accustomed to the need to take all possible precautions against mistakes and they still, sometimes tragically, happen. DNA evidence is not immune to human bungling but nor is it particularly vulnerable, except in so far as PCR amplifies error. If all DNA evidence were to be thrown out because of occasional mistakes, the precedent should rule out most other kinds of evidence, too. We have to suppose that codes of practice and rigorous precautions can be developed to guard against human error in the presentation of all kinds of legal evidence.
The more sophisticated difficulties that bedevil DNA evidence will take longer to explain. They, too, have their precedents in conventional types of evidence, although this point often does not seem to be understood in law courts.
Where identification evidence of any kind is concerned, there are two types of error which correspond to the two types of error in any statistical evidence. In another chapter, we shall call them Type 1 and Type 2 errors, but it is easier to think of them as false positive and false negative. A guilty suspect may escape, through not being recognized - false negative. And - false positive (which most people would see as the more dangerous error) - an innocent suspect may be convicted because he happens, by ill luck, to resemble the genuinely guilty party. In the case of ordinary eye-witness identification, an innocent bystander who happens to look a bit like the real criminal could consequently be arrested - false positive. Identity parades are designed to make this less probable. The chance of a miscarriage of justice is inversely related to the number of people standing in the line-up. The danger can be increased in the ways we have
already considered - the line-up being unfairly stacked with clean-shaven men for example.
In the case of DNA evidence the danger of a false positive conviction is theoretically very low indeed. We have a blood sample from a suspect, and we have a specimen from the scene of the crime. If the entire set of genes in both these samples could be written down, the probability of a false conviction is one in billions and billions. Identical twins apart, the chance that any two humans would match all their DNA is tantamount to zero. But unfortunately it is not practical to work out the complete gene sequence of a human being. Even after the Human Genome Project is completed, to attempt the equivalent in the solution of each crime is unrealistic. In practice, forensic detectives concentrate on small sections of the genome, preferably sections that are known to vary in the population. And now our fear must be that, although we could safely rule out mis-identification if the whole genome were considered, there might be a danger of two individuals' being identical with respect to the small portion of DNA that we have time to analyse.
The probability that this would happen ought to be measurable for any particular section of the genome; we could then decide whether it was an acceptable risk. The larger the section of DNA, the smaller the probability of error, just as, in an identity parade, the longer the line-up the safer the conviction. The difference is that an identity parade, in order to compete with the DNA equivalent, would need to contain not a couple of dozen people but thousands, millions or even billions in the line. Apart from this quantitative difference, the analogy with the identity parade continues. We shall see that there is a DNA equivalent of our hypothetical line-up of clean-shaven men with one bearded suspect. But first, a little more background on DNA fingerprinting.
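The way the risk shrinks with the amount of DNA examined can be sketched in code. This is a hedged illustration, assuming the examined loci are statistically independent and that each matches between unrelated people with some probability; both the assumption and the figures are invented here, not taken from the text:

```python
# If each variable locus matches by coincidence with probability p, and
# the loci are independent (an idealization), all k examined loci match
# by coincidence with probability p**k. The reciprocal is the size of
# identity parade that would give equivalent security.

def false_match_probability(p_per_locus: float, loci: int) -> float:
    """Chance that every examined locus matches by coincidence."""
    return p_per_locus ** loci

def equivalent_lineup_size(p_per_locus: float, loci: int) -> float:
    """The line-up length giving the same chance of coincidental match."""
    return 1.0 / false_match_probability(p_per_locus, loci)

# Ten loci, each shared by chance with one unrelated person in ten:
print(false_match_probability(0.1, 10))   # about 1e-10
print(equivalent_lineup_size(0.1, 10))    # a line-up of roughly ten billion
```

A handful of suitably variable loci thus stands in for a line-up containing more people than are alive on earth.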
Obviously we sample the equivalent parts of the genome in both suspect and specimen. These parts of the genome are chosen for their tendency to vary widely in the population. A Darwinian would note that the parts that don't vary are often the parts that have an important role to play in the survival of the organism. Any substantial variations in these important genes are likely to have been removed from the population by the death of their possessors - Darwinian natural selection. But there are other parts of the genome that are very variable, perhaps because they are not important for survival. This isn't the whole story because in fact some useful genes are quite variable. The reasons for this are controversial. It's a bit of a digression but . . . What is this life if, full of stress, we have no freedom to digress?
The 'neutralist' school of thought, associated with the distinguished Japanese geneticist Motoo Kimura, believes that useful genes are equally
useful in a variety of different forms. This emphatically does not mean that they are useless, only that the different forms are equally good at what they do. If you think of genes as writing out their recipes in words, the alternative forms of a gene can be thought of as the very same words written in different typefaces: the meaning is the same, and the product of the recipe will come out the same. Genetic changes, 'mutations', that make no difference are not 'seen' by natural selection. They aren't mutations at all, for all the difference they make to the life of the animal, but they are potentially useful mutations from the point of view of the forensic scientist. The population ends up with lots of variety at such a locus (position in a chromosome), and this kind of variety could in principle be used for fingerprinting.
The other theory of variation, opposed to Kimura's neutral theory, believes that the different versions of the genes really do different things and that there is some special reason why both are preserved by natural selection in the population. For example, there might be two alternative forms of a blood protein, α and β, which are susceptible to two infectious diseases called alfluenza and betaccosis respectively, each being immune to the other disease. Typically, an infectious disease needs a critical density of susceptible victims in a population, otherwise an epidemic can't get going. In a population dominated by α types, there are frequent epidemics of alfluenza but not of betaccosis. So natural selection favours the β types who are immune to alfluenza. It favours them so much that after a while they come to dominate the population. Now the tables are turned. There are epidemics of betaccosis, but not of alfluenza. The α types now are favoured by natural selection because they are immune to betaccosis. The population may keep oscillating between α dominance and β dominance, or it may settle down to an intermediate mixture, an 'equilibrium'. Either way, we'll see plenty of variation at the gene locus concerned, and this is good news for the fingerprinters. The phenomenon is called 'frequency-dependent selection' and it is one suggested reason for high levels of genetic variation in the population. There are others.
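The oscillate-or-settle behaviour can be mimicked with a toy model. The fitness rule below is an invented stand-in for the epidemics, simply making each form suffer in proportion to its own frequency; nothing in the text prescribes these exact dynamics:

```python
# Toy frequency-dependent selection: the commoner a form becomes, the
# more "its" disease spreads, and the lower its fitness. (Assumed
# dynamics, for illustration only.)

def step(p_alpha: float, s: float = 0.5) -> float:
    """One generation; p_alpha is the frequency of the alpha form."""
    w_alpha = 1.0 - s * p_alpha           # alpha suffers when alpha is common
    w_beta = 1.0 - s * (1.0 - p_alpha)    # beta suffers when beta is common
    mean_w = p_alpha * w_alpha + (1.0 - p_alpha) * w_beta
    return p_alpha * w_alpha / mean_w     # next generation's alpha frequency

p = 0.9                                    # start with alpha dominant
for _ in range(200):
    p = step(p)
print(round(p, 3))                         # settles near the 0.5 equilibrium
```

Under these assumed dynamics the population converges on an intermediate mixture, which is exactly the persistent variation that pleases the fingerprinters.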
However, for our forensic purposes, it matters only that there are variable sections of the genome. Whatever the verdict in the controversy over whether the useful bits of the genome are variable, there are in any case lots of other regions of the genome which are never even read, or never translated into their protein equivalents. Indeed, an astonishingly high proportion of our genes seem to be doing nothing whatsoever. They are therefore free to vary, which makes them excellent DNA fingerprinting material.
As if to confirm the fact that a great deal of DNA is doing nothing useful, the sheer quantity of DNA in the cells of different kinds of organisms is
wildly variable.
Since DNA information is digital, we can measure it in
the same kind of units as we measure computer information. One bit of information is enough to specify one yes/no decision: a 1 or a 0, a true or a false. The computer on which I am writing this has 256 megabits (32 megabytes) of core memory. (The first computer that I owned was a
bigger box but had less than one five thousandth of the memory capacity.) The equivalent fundamental unit in DNA is the nucleotide base. Since there are 4 possible bases, the information content of each base is equivalent to 2 bits. The common gut bacterium Escherichia coli has a genome of 4 mega-bases or 8 megabits. The crested newt, Triturus cristatus, has 40,000 megabits. The 5,000-fold ratio between crested
newt and bacterium is about the same as that between my present computer and my first one. We humans have 3,000 mega-bases or 6,000 megabits. This is 750 times as great as the bacterium (which satisfies
our vanity), but what are we to make of the newt trumping us sixfold? We'd like to think that genome size is not strictly proportional to what it does: presumably quite a lot of that newt DNA isn't doing anything. This is certainly true. It is also true of most of our DNA. We know from other evidence that, of the 3,000 mega-base human genome, only about 2 per cent is actually used for coding protein synthesis. The rest is often called junk DNA. Presumably the crested newt has an even higher percentage
of junk DNA. Other newts have not.
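The base-to-bit arithmetic in this passage reduces to a one-line conversion. The sketch below merely restates the text's own round figures:

```python
# Each DNA base has four possibilities, so it carries 2 bits of
# information; genome sizes convert directly into computer-style units.

def megabases_to_megabits(megabases: float) -> float:
    return megabases * 2          # 4 possible bases = 2 bits per base

e_coli = megabases_to_megabits(4)      # 8 megabits
human = megabases_to_megabits(3000)    # 6,000 megabits
newt = 40_000                          # megabits, from the text

print(human / e_coli)   # 750.0 : humans vs. the bacterium
print(newt / e_coli)    # 5000.0: newt vs. the bacterium
```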
The surplus of unused DNA falls into various categories. Some of it looks like real genetic information, and probably represents old, defunct genes, or out-of-date copies of genes that are still in use. These pseudo-genes would make sense if they were read and translated. But they are not read and translated. Hard disks on computers usually contain comparable junk: old copies of work in progress, scratchpad space used by the computer for interim operations, and so on. We users don't see this junk, because our computers only show us those parts of the disk that we need to know about. But if you get right down and read the actual information on the disk, byte by byte, you'll see the junk, and much of it will make some sort of sense. There are probably dozens of disjointed fragments of this very chapter peppered around my hard disk at present, although there is only one 'official' copy that the computer tells me about (plus a prudent back-up).
In addition to the junk DNA which could be read but isn't, there is plenty of junk DNA which not only isn't read but wouldn't make any sense if it were. There are huge stretches of repeated nonsense, perhaps repeats of one base, or alternations of the same two bases, or repeats of a more complicated pattern. Unlike the other class of junk DNA, we cannot account for these 'tandem repeats' as outdated copies of useful genes. This repetitive DNA has never been decoded, and presumably has never been of any use. (Never useful for the animal's survival, anyway. From
the point of view of the selfish gene, as I explained in another book, we could say that any kind of junk DNA is 'useful' to itself if it just keeps surviving and making more copies of itself. This suggestion has come to be known by the catch-phrase 'selfish DNA', although this is a little unfortunate because, in my original sense, working DNA is selfish too. For this reason, some people have taken to calling it 'ultra-selfish DNA'.)
Anyway, whatever the reason, junk DNA is there, and there in prodigious quantities. Because it is not used, it is free to vary. Useful genes, as we have seen, are severely constrained in their freedom to change. Most changes (mutations) make a gene work less effectively, the animal dies and the change is not passed on. This is what Darwinian natural selection is all about. But mutations in junk DNA (mostly changes in the number of repeats in a given region) are not noticed by natural selection. So, as we look around the population, we find most of the variation that is useful for fingerprinting in the junk regions. As we shall now see, tandem repeats are particularly useful because they vary with respect to number of repeats, a gross feature which is easy to measure.
If it wasn't for this, the forensic geneticist would need to look at the exact sequence of bases in our sample region. This can be done, but sequencing DNA is time-consuming. The tandem repeats allow us to use cunning short-cuts, as discovered by Alec Jeffreys of the University of Leicester, rightly regarded as the father of DNA fingerprinting (and now Sir Alec). Different people have different numbers of tandem repeats in particular places. I might have 147 repeats of a particular piece of nonsense, where you have 84 repeats of the same piece of nonsense in the corresponding place in your genome. In another region, I might have 24 repeats of a particular piece of nonsense to your 38 repeats. Each of us has a characteristic fingerprint consisting of a set of numbers. Each of these numbers in our fingerprint is the number of times a particular piece of nonsense is repeated in our genome.
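A fingerprint of this kind is, in effect, just a small table of repeat counts, one per examined region. A minimal sketch in Python (the locus labels are invented; the counts echo the figures used in the text):

```python
# Each person's profile: how many times each piece of nonsense is
# repeated at each examined region of the genome.

mine = {"locusA": 147, "locusB": 24}
yours = {"locusA": 84, "locusB": 38}

def same_profile(fp1: dict, fp2: dict) -> bool:
    """Profiles match only if every examined locus agrees."""
    return all(fp1[locus] == fp2[locus] for locus in fp1)

print(same_profile(mine, mine))    # True
print(same_profile(mine, yours))   # False: the repeat counts differ
```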
We get our tandem repeats from our parents. We each have 46 chromosomes, 23 from our father and 23 homologous, or corresponding, chromosomes from our mother. These chromosomes come complete with tandem repeats. Your father got his 46 chromosomes from your paternal grandparents, but he didn't pass them on to you in their entirety. Each of his mother's chromosomes was lined up with its paternal opposite number and bits were exchanged before a composite chromosome was put into the sperm that helped to make you. Every sperm and every egg is unique because it is a different mix of maternal and paternal chromosomes. The mixing process affects the tandem repeat sections as well as the meaningful sections of the chromosomes. So our characteristic numbers of tandem repeats are inherited, in much the same way as our eye colour and hair curliness are inherited. With the difference that, whereas our eye colour results from some kind of joint
verdict of our paternal and our maternal genes, our tandem repeat numbers are properties of the chromosomes themselves and can therefore be measured separately for paternal and maternal chromosomes. At any particular tandem repeat region, each of us has two readings: a paternal chromosome repeat number and a maternal chromosome repeat number. From time to time, chromosomes mutate - suffer a random change - in their tandem repeat numbers. Or a particular tandem region may be split by chromosomal crossing over. This is why there is variation in tandem repeat numbers in the population. The beauty of tandem repeat numbers is that they are easy to measure. You don't have to get embroiled in detailed sequencing of coded DNA bases. You do something a bit like weighing them. Or, to take another equally apt analogy, you spread them out like coloured bands from a prism. I'll explain one way of doing this.
First you need to make some preparations. You make a so-called DNA probe, which is a short sequence of DNA that exactly matches the nonsense sequence in question - up to about 20 nucleotide bases long. This is not difficult to do nowadays. There are several methods. You can even buy a machine off the shelf which makes short DNA sequences to any specification, just as you can buy a keyboard to punch any desired string of letters on a paper tape. By supplying the synthesizing machine with radioactive raw materials, you make the probes themselves radioactive, and so 'label' them. This makes the probes easy to find again later, as natural DNA is not radioactive, and so the two are readily distinguishable from each other.
Radioactive probes are a tool of the trade, which you must have ready before you start a Jeffreys fingerprinting exercise. Another essential tool is the 'restriction enzyme'. Restriction enzymes are chemical tools that specialize in cutting DNA, but cutting it only in particular places. For example, one restriction enzyme may search the length of a chromosome until it finds the sequence GAATTC (G, C, T and A are the four letters of the DNA alphabet; all genes, from all species on earth, differ only in consisting of different sequences of these four letters). Another restriction enzyme cuts the DNA wherever it can find the sequence GCGGCCGC. A number of different restriction enzymes are available in the toolbox of the molecular biologist. They originate from bacteria, who use them for their own defensive purposes. Each restriction enzyme has its own unique search string which it homes in on and cuts.
Now, the trick is to choose a restriction enzyme whose specific search string is completely absent from the tandem repeat we are interested in. The whole length of DNA is therefore chopped into short stretches, bounded by the characteristic search string of the restriction enzyme. Of course, not all the stretches will consist of the tandem repeat we are
looking for. All sorts of other stretches of DNA will happen to be bounded by the favoured search string of the restriction enzyme scissors. But some of them will consist of tandem repeats and the length of each scissored stretch will be largely determined by the number of tandem repeats in it. If I have 147 repeats of a particular piece of DNA nonsense, where you have only 85, my snipped fragments will be correspondingly longer than your snipped fragments.
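The chopping-and-measuring logic can be sketched in a few lines of code. The repeat unit, flanking sequences and helper names below are invented for illustration; only the search string GAATTC comes from the text.

```python
# Sketch of the chopping step: split a DNA string wherever the
# restriction enzyme's search string occurs, then measure fragments.
# Repeat unit and flanking sequences are invented for illustration.

SITE = "GAATTC"  # the search string of one restriction enzyme (from the text)

def fragment_lengths(dna: str, site: str = SITE) -> list[int]:
    """Lengths of the fragments left after cutting at every site."""
    return [len(fragment) for fragment in dna.split(site)]

def tandem_stretch(repeats: int, unit: str = "CACA") -> str:
    """A tandem repeat region bounded by two cut sites."""
    return SITE + unit * repeats + SITE

mine = "TTAGC" + tandem_stretch(147) + "GGTAC"   # my 147 repeats
yours = "TTAGC" + tandem_stretch(85) + "GGTAC"   # your 85 repeats

print(fragment_lengths(mine))   # [5, 588, 5] - middle fragment is 4 x 147
print(fragment_lengths(yours))  # [5, 340, 5] - correspondingly shorter
```

The middle fragment's length is fixed by the repeat count, which is the whole basis of the measurement.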
We can measure these characteristic lengths using a technique that has been around in molecular biology for quite a while. This is the bit that is rather like spreading them out with a prism, as Newton did for white light. The standard DNA 'prism' is a gel electrophoresis column, that is, a long tube filled with jelly through which an electric current is passed. A solution containing the scissored stretches of DNA, all jumbled together, is poured into one end of the tube. The DNA fragments, being negatively charged, are all electrically attracted to the positive electrode at the other end of the tube, and they move steadily through the jelly. But they don't all move at the same rate. Like light of low vibration frequency moving through glass, small fragments of DNA move faster than large ones. The result is that, if you switch the current off after a suitable interval, the fragments have spread themselves out along the column, just as Newton's colours spread themselves out because light from the blue end of the spectrum is more readily slowed down by glass than light from the red end.
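The sorting effect of the gel can be sketched as follows. The rule that migration distance falls off roughly with the logarithm of fragment length is a standard approximation, not something from the text, and the constant is arbitrary.

```python
import math

# Sketch of the gel 'prism': after a fixed time, each fragment has
# migrated a distance that falls off with its size. Inverse-log
# migration is a standard approximation; the constant k is arbitrary.

def migration_distance(length: int, k: float = 100.0) -> float:
    """How far down the jelly a fragment of this length travels."""
    return k / math.log10(length)

jumbled = [588, 1200, 90, 340]   # scissored stretches, all mixed up
spread_out = sorted(jumbled, key=migration_distance, reverse=True)
print(spread_out)  # [90, 340, 588, 1200] - smallest travels farthest
```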
But so far we can't see the fragments. The jelly column looks uniform all the way down. There is nothing to show that DNA fragments of different size are lurking in discrete bands along its length, and nothing to show which bands contain which variety of tandem repeat. How do we make them visible? This is where the radioactive probes come in.
To make them visible you can use another cunning technique, the Southern blot, named after its inventor, Edwin Southern. (Slightly confusingly, there are other techniques called the Northern blot and the Western blot, but no Mr Northern or Mr Western.) The jelly column is removed from the tube and laid out on blotting paper. The liquid in the jelly, including the DNA fragments, seeps out of the jelly into the blotting paper. The blotting paper has previously been laced with quantities of the radioactive probe for the particular tandem repeat that we are interested in. The probe molecules line up along the blotting paper, pairing precisely, by the ordinary rules of DNA, with their opposite numbers in the tandem repeats. Surplus probe molecules are washed away. Now the only radioactive probe molecules left in the blotting paper are those bound to their exact opposite numbers that seeped out of the jelly. The blotting paper is now placed on a piece of X-ray film, which is then marked by the radioactivity. So, what you see when you develop the
film is a set of dark bands - another barcode. The final barcode pattern that we read on the Southern blot is a fingerprint for a person, in very much the same way as the Fraunhofer lines are a fingerprint for a star, or the formant lines are the fingerprint for a vowel sound. Indeed, the barcode from the blood looks very like Fraunhofer lines or formant lines.
The details of DNA fingerprinting techniques get quite complicated and I won't go much further. For instance, one strategy is to hit the DNA with lots of probes all at the same time. What you get then is a mixed bag of barcode stripes simultaneously. In extreme cases, the stripes merge into each other and all you get is one big smear with all possible sizes of DNA fragment represented somewhere in the genome. This is no good for identification purposes. At the other extreme, people use only one probe at a time looking at one genetic 'locus'. This 'single-locus fingerprinting' gives you nice clean bars like Fraunhofer lines. But only one or two bars per person. Even so, the chances of confusing people are small. This is because the characteristics we are talking about are not like 'brown eyes versus blue eyes', in which case lots of people would be the same. The characteristics we are measuring, remember, are lengths of tandem repeat fragments. The number of possible lengths is very large, so even single-locus fingerprinting is pretty good for identification purposes. Not quite good enough, however, so in practice forensic DNA fingerprinters usually use half a dozen separate probes. Now the chances of error are very low indeed. But we still need to talk about exactly how low, because people's lives or liberties might depend upon it.
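The reason half a dozen probes are so much better than one can be sketched numerically. Assuming that matches at different loci are independent (an assumption forensic statisticians must justify, as the subgroup problem below shows), the per-locus match probabilities simply multiply; the figure of 1 in 100 per locus is invented.

```python
from math import prod

# Sketch: if matches at different loci are independent, the per-locus
# match probabilities multiply. The 1-in-100 figure is invented.

def chance_of_coincidental_match(per_locus: list[float]) -> float:
    """Probability that an unrelated person matches at every locus."""
    return prod(per_locus)

one_probe = chance_of_coincidental_match([0.01])       # 1 in 100
six_probes = chance_of_coincidental_match([0.01] * 6)  # about 1 in a million million
print(one_probe, six_probes)
```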
First, we must return to our distinction between false positives and false negatives. DNA evidence can be used to clear an innocent suspect, or it can be made to point the finger at a guilty one. Suppose semen is recovered from the vagina of a rape victim. Circumstantial evidence leads the police to arrest a man, suspect A. Suspect A gives a blood sample and it is compared to the semen sample, using a single DNA probe to look at one tandem repeat locus. If the two are different, suspect A is in the clear. We don't even need to look at a second locus.
But what if suspect A's blood matches the semen sample at this locus? Suppose they both share the same barcode pattern, which we shall call pattern P. This is compatible with the suspect's being guilty, but it doesn't prove it. He could just happen to share pattern P with the real rapist. We must now look at some more loci. If the samples still match, what are the odds against such a match being coincidental - a false positive mis-identification? This is where we have to start thinking statistically about the population at large. In theory, by taking blood from a sample of men in the population at large, we should be able to calculate the likelihood that any two men will be identical at each locus
concerned. But from which section of the population do we draw our sample?
Remember our lone bearded man in the old-fashioned line-up identity parade? Here's the molecular equivalent. Suppose that, in the world at large, only one in a million men has pattern P. Does this mean that there is a million to one chance against a wrongful conviction of suspect A? No. Suspect A may belong to a minority group of people whose ancestors immigrated from a particular part of the world. Local populations often share genetic peculiarities, for the simple reason that they are descended from the same ancestors. Of the 2.5 million South African Dutch, or Afrikaners, most are descended from one shipload of immigrants who arrived from the Netherlands in 1652. As an indicator of the narrowness of this genetic bottleneck, about a million still bear the surnames of 20 of these original settlers. The Afrikaners have a much higher frequency of certain genetic diseases than the population of the world in general. According to one estimate, about 8,000 (one in 300) have the blood condition porphyria variegata, which is much rarer in the rest of the world. This is apparently because they are descended from one particular couple on the ship, Gerrit Jansz and Ariaantje Jacobs, although it is not known which one was the carrier of the (dominant) gene for the condition. (She was one of eight Rotterdam orphanage girls put on the ship to provide wives for the settlers.) In fact, the condition wasn't noticed at all before modern medicine, because its most marked symptom is a lethal reaction to certain modern anaesthetics (South African hospitals now routinely test for the gene before administering anaesthetic). Other populations often have locally high frequencies of other particular genes, for the same kind of reason. If, to return to our hypothetical court case, suspect A and the real criminal both belong to the same minority group, the likelihood of chance confusion could be dramatically greater than you'd think if you based your estimates on the population at large.
The point is that the frequency of pattern P in humans at large is no longer relevant. We need to know the frequency of pattern P in the group to which the suspect belongs.
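A toy calculation makes the point. Both frequencies below are invented: pattern P is rare in the world at large but, thanks to shared ancestry, much commoner in the suspect's hypothetical group.

```python
# Toy numbers for the background-population point: pattern P is rare
# in the world at large but, through shared ancestry, much commoner
# in the suspect's own group. Both frequencies are invented.

frequency_of_P = {
    "world at large": 1 / 1_000_000,
    "suspect's minority group": 1 / 500,
}

for population, freq in frequency_of_P.items():
    print(f"sampling from the {population}: "
          f"a 1 in {round(1 / freq):,} chance of a coincidental match")
```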
This need is nothing new. We've already seen the equivalent danger in an ordinary line-up identity parade. If the prime suspect is Chinese, it doesn't do to stand him in a line-up largely consisting of westerners. And the same kind of statistical reasoning about the background population is needed in identifying stolen goods, as well as individual suspects. I have already mentioned my jury service in the Oxford Court. In one of the three cases I sat on, a man was accused of stealing three coins from a rival numismatist. The accused had been caught with three coins in his possession which matched those lost. Counsel for the prosecution was eloquent.
Ladies and gentlemen of the jury, are we really supposed to believe that three coins, of exactly the same type as the three missing coins, would just happen to be present in the house of a rival collector? I put it to you that such a coincidence is too much to stomach.
Jurymen are not permitted to cross-examine. That was the duty of counsel for the defence, and he, though doubtless learned in the law and also eloquent, had no more clue about probability theory than the prosecutor. I wish he'd said something like this:
M'Lud, we don't know whether the coincidence is too much to stomach, because m'learned friend has not presented us with any evidence at all as to the rarity or commonness of these three coins in the population at large. If these coins are so rare that only one in a hundred collectors in the country has any one of them, the prosecution has a good case, since the defendant was caught with three of them. If, on the other hand, these coins are as common as dirt, there is not enough evidence to convict. (To push to the extreme, three coins that I have in my pocket today, all current legal tender, are very probably the same as three coins in Your Lordship's pocket.)
My point is that it simply never occurred to any of the legally trained minds in the court that it was relevant even to ask how rare these three coins were in the population at large. Lawyers can certainly add up (I once received a lawyer's bill, the last item of which was 'Time spent making out this bill') but probability theory is another matter.
I expect the coins were actually rare. If they hadn't been, the theft would not have been such a serious matter, and the prosecution presumably would never have been brought. But the jury should have been told explicitly. I remember that the question came up in the jury room, and we wished that we were allowed to go back into the court to seek clarification. The equivalent question is equally relevant in the case of DNA evidence, and it is most certainly being asked. Fortunately, provided a sufficient number of separate genetic loci are examined, the chances of mis-identification - even among members of minority groups, even among family members (except identical twins) - can be reduced to genuinely very small levels, far smaller than can be achieved by any other method of identification, including eye-witness evidence.
Exactly how small the residual possibility of error is may still be open to dispute. And this is where we come to the third category of objection to DNA evidence, the just plain silly. Lawyers are accustomed to pouncing when expert witnesses seem to disagree. If two geneticists are summoned to the stand and are asked to estimate the probability of a mis-identification with DNA evidence, the first may say 1,000,000 to one while the second may say only 100,000 to one. Pounce. 'Aha! AHA! The experts disagree! Ladies and gentlemen of the jury, what confidence can we place in a scientific method if the experts themselves can't get within a factor of ten of one another? Obviously the only thing to do is throw the entire evidence out, lock, stock and barrel.'
But, in these cases, although geneticists may be inclined to give different weightings to imponderables such as the racial subgroup effect, any disagreement between them is only over whether the odds against a wrongful identification are hyper-mega-astronomical or just plain astronomical. The odds cannot normally be lower than thousands to one, and they may well be up in the billions. Even on the most conservative estimate, the odds against wrongful identification are hugely greater than they are in an ordinary identity parade. 'M'lud, an identity parade of only 30 men is grossly unfair on my client. I demand a line-up of at least a million men!'
What is going on here is the assessment of coincidence, or the odds that something might happen by chance alone. The probability of meaningless coincidence is even less if the identity parade has 100 men, because a 1 in 100 chance of error is noticeably less than a 1 in 20 chance of error. The longer the line-up, the more secure the eventual conviction.
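The arithmetic is as simple as it sounds: a witness choosing at random has a one-in-n chance of picking the man the police already suspect.

```python
# The arithmetic of the line-up: a witness choosing at random has a
# 1-in-n chance of picking the man the police already suspect.

def chance_of_meaningless_match(lineup_size: int) -> float:
    return 1 / lineup_size

for n in (2, 20, 100):
    print(f"line-up of {n:>3}: {chance_of_meaningless_match(n):.0%}")
```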
We also have an intuitive sense that the men chosen for the line-up must not look too obviously different from the suspect. If the woman originally told the police to look for a man with a beard, and the police have now arrested a bearded suspect, it is clearly unjust to stand him in a line with 19 clean-shaven men. He might as well be standing by himself. Even if the woman has said nothing about the appearance of her attacker, if the police have arrested a punk in a leather jacket it would be wrong to stand him in a line of suited accountants with furled umbrellas. In multiracial countries such considerations have added importance. Everyone understands that a black suspect should not be placed in an otherwise all-white line-up, or vice versa.
When we think about how we identify somebody, the face first leaps to mind. We are particularly good at distinguishing faces. As we shall see in another connection, we even seem to have evolved a special part of the brain set aside for the purpose, and certain kinds of brain damage
disable our face-recognition faculty while leaving the rest of vision intact. In any case, faces are good for recognition because they are so variable. With the well-known exception of identical twins, you seldom meet two people whose faces are confusable. It is not totally unknown, however, and an actor can be made up to look very like somebody else. Dictators often employ doubles to perform for them when they are too busy, or to draw the fire of assassins. It has been suggested that one reason charismatic leaders so often sport moustaches (Hitler, Stalin, Franco, Saddam Hussein, Oswald Mosley) is to make it easier for doubles to impersonate them. Mussolini's shaven head perhaps served the same purpose.
Apart from identical twins, ordinary close relatives are sometimes sufficiently alike to fool people who don't know them well. (Unfortunately the story that Doctor Spooner, when Warden of my college, once stopped an undergraduate and said, 'I never can remember is it you or your brother was killed in the war?' is probably not true, like most alleged Spoonerisms.) The resemblance of brothers and sisters, of fathers and sons, of grandparents and grandchildren, serves to remind us of the huge pool of facial variety in the general population of non-relatives.
But faces are only a special case. We are riddled with idiosyncrasies which, with sufficient training, can be used to identify individuals. I had
a school friend who claimed (and my spot checks confirmed it) that he could recognize any member of the 80-strong residence in which we lived purely by listening to their footsteps. I had another friend from Switzerland who claimed that when she walked into a room she could tell, by smell, which members of her circle of acquaintances had recently left the room. It is not that her colleagues didn't wash, just that she was unusually sensitive. That this is in principle possible is confirmed by the fact that police dogs can distinguish between any two human beings by smell alone, with the exception, yet again, of identical twins. As far as I know, the police haven't adopted the following technique, but I bet you could train bloodhounds to track down a kidnapped child after giving them a sample sniff of his brother. A way might even be found to use a jury of bloodhounds to decide paternity cases.
Voices are as idiosyncratic as faces, and various research teams are working on computer voice recognition systems for authenticating identity. It would be a great boon if, in the future, we could dispense with front door keys and rely on a voice-operated computer to obey our personal Open Sesame command. Handwriting is sufficiently individual for the written signature to be used as a guarantee of identity on bank cheques and important legal documents. Signatures are actually not particularly secure because they are too easily forged, but it is still impressive how recognizable handwriting can be. A promising newcomer
to the list of individual 'signatures' is the iris of the eye. At least one bank is experimenting with automated iris-scanning machines as a way of verifying identity. The customer stands in front of a camera which photographs the eye and digitizes the image into what a newspaper described as 'a 256-byte human barcode'. But none of these methods of verifying human identity even comes close to the potential of DNA fingerprinting, properly applied.
It is not surprising that police dogs can smell the difference between any two humans except identical twins. Our sweat contains a complicated cocktail of proteins, and the precise details of all proteins are minutely specified by the coded DNA instructions that are our genes. Unlike handwriting and faces, which vary continuously and grade smoothly into one another, genes are digital codes, much like those used in computers. Again with the exception of identical twins, we differ genetically from all other people in discrete, discontinuous ways: an exact number of ways that you could even count if you had the patience. The DNA in each one of my cells (give or take a tiny minority of mistakes, and not including red blood cells which have lost all their DNA, or reproductive cells which contain a random half of my genes) is identical to the DNA in all my other cells. It differs from the DNA in every one of your cells, not in some vague, impressionistic way but at a precise number of locations dotted along the billions of DNA letters that we both have.
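Because the text is digital, 'differing at a precise number of locations' is literally a count, as this sketch shows. The two short sequences are invented stand-ins for genomes billions of letters long.

```python
# Because DNA is digital, the difference between two people's DNA is
# an exact count, not an impression. These short sequences are
# invented stand-ins for genomes billions of letters long.

def count_differences(seq_a: str, seq_b: str) -> int:
    """Number of positions at which two equal-length DNA texts differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

mine = "GATTACAGATTACA"
yours = "GATTCCAGATAACA"
print(count_differences(mine, yours))  # 2 - a precise number of locations
```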
It is almost impossible to exaggerate the importance of the digital revolution in molecular genetics. Before Watson and Crick's epochal announcement in 1953 of the structure of DNA, it was still possible to agree with the concluding words of Charles Singer's authoritative A Short History of Biology, published in 1931:
. . . despite interpretations to the contrary, the theory of the gene is not a 'mechanist' theory. The gene is no more comprehensible as a chemical or physical entity than is the cell or, for that matter, the organism itself. Further, though the theory speaks in terms of genes as the atomic theory speaks in terms of atoms, it must be remembered that there is a fundamental distinction between the two theories. Atoms exist independently, and their properties as such can be examined. They can even be isolated. Though we cannot see them, we can deal with them under various conditions and in various combinations. We can deal with them individually. Not so the gene. It exists only as a part of the chromosome, and the chromosome only as part of a cell. If I ask for a living chromosome, that is, for the only effective kind of chromosome, no one can give it to me except in its living surroundings any more than he can give me a living arm or leg. The doctrine of the relativity of functions is as true for the gene as it is for any of the organs of the body. They exist and function only in relation to other organs. Thus the last of the
biological theories leaves us where the first started, in the presence of a power called life or psyche which is not only of its own kind but unique in each and all of its exhibitions.
This is dramatically, profoundly, hugely wrong. And it really matters. Following Watson and Crick and the revolution that they sparked, a gene can be isolated. It can be purified, bottled, crystallized, read as digitally coded information, printed on a page, fed into a computer, read out again into a test tube and reinserted into an organism where it works exactly
as it did before. When the Human Genome Project, which set out to work out the complete gene sequence of a human being, is completed,
probably by the year 2005, the full genome will fit comfortably on two standard CD ROM discs, leaving enough space for a textbook of
molecular embryology. These two discs could then be sent into outer space, and the human race could go extinct secure in the knowledge that there is now a chance that at some future time and in some distant place, a sufficiently advanced civilization would be able to reconstitute a human being. Meanwhile, back on earth, it is because DNA is deeply and fundamentally digital - because the differences between individuals and between species can be precisely counted, not vaguely and impressionistically measured - that DNA fingerprinting is potentially so powerful.
I assert the uniqueness of each individual's DNA with confidence, but even this is only a statistical judgement. Theoretically, the sexual lottery could throw up the same genetic sequence twice. An 'identical twin' of Isaac Newton could be born tomorrow. But the number of people that would have to be born in order to make this event at all likely would be larger than the number of atoms in the universe. Unlike our face, voice
or handwriting, the DNA in most of our cells stays the same from babyhood to old age, and it cannot be altered by training or cosmetic surgery. Our DNA text has such a huge number of letters that we can precisely quantify the expected number shared by, say, brothers or first cousins as opposed to, say, second cousins or random pairs chosen from the population at large. This makes it useful not only for labelling individuals uniquely and matching them to traces such as blood or semen, but for establishing paternity and other genetic relationships. British law allows people to immigrate if they can prove that their
parents are already British citizens. A number of children from the
Indian subcontinent have been arrested by sceptical immigration officials. Before the advent of DNA fingerprinting it was often impossible for these unfortunate people to prove their parentage. Now it is easy. All you do is take a sample of blood from the putative parents and compare a particular set of genes with the corresponding set of genes from the child. The verdict is clear and unequivocal, with none of the doubt or fuzziness
that creates a need for qualitative judgements. Several young people in Britain today owe their citizenship to DNA technology.
A similar method was used to identify skeletons discovered in Yekaterinburg and suspected of belonging to the executed Russian royal family. Prince Philip, Duke of Edinburgh, whose exact relationship to the Romanovs is known, graciously gave blood, and from this it was possible to establish that the skeletons were indeed those of the Tsar's family. In a more macabre case, a skeleton exhumed in South America was proved to belong to Doctor Josef Mengele, the Nazi war criminal known as the 'Angel of Death'. DNA taken from the bones was compared with blood from Mengele's still-living son, and the identity of the skeleton proved. More recently, a corpse dug up in Berlin has been proved, by the same method, to be that of Martin Bormann, Hitler's deputy, whose disappearance had led to endless legends and rumours and more than 6,000 'sightings' around the world.
Despite the name 'fingerprinting', our DNA, being digital, is even more individually characteristic than the patterns of whorls on our fingers. The name is appropriate because, like true fingerprints, DNA evidence is often inadvertently left behind after a person has departed the scene. DNA can be extracted from a bloodstain on a carpet, from semen inside a rape victim, from a crust of dried nasal mucus on a handkerchief, from sweat or from shed hairs. The DNA in the sample can then be compared with that in the blood taken from a suspect. It is possible to assess, to almost any desired level of probability, whether the sample belongs to a particular person or not.
So, what are the snags? Why is DNA evidence controversial? What is it about this important kind of evidence that makes it possible for lawyers to bamboozle juries into misinterpreting or ignoring it? Why have some courts been moved to the despairing extreme of ruling out this evidence altogether?
There are three major classes of potential problem, one simple, one sophisticated and one silly. I'll come to the silly problem and the more sophisticated difficulties later but first, as with any kind of evidence, there is the simple - and very important - possibility of human error. Possibilities, rather, for there are plenty of opportunities for mistakes and even sabotage. A tube of blood may be mislabelled, either by accident or in a deliberate attempt to frame somebody. A sample from the scene of a crime may be contaminated by sweat from a lab technician or a police officer. The danger of contamination is especially great in those cases where an ingenious technique of amplification called PCR (polymerase chain reaction) is used.
You can easily see why amplification might be desirable. A tiny smear of sweat on a gun butt contains precious little DNA. Sensitive though DNA analysis can be, it needs a certain minimum quantity of material to work on. The technique of PCR, invented in 1983 by the American biochemist Kary B. Mullis, is the dramatically successful answer. PCR takes what little DNA there is and produces millions of copies, multiplying again and again whatever code sequences are there. But, as always with amplification, errors are amplified along with the true signal. Stray scraps of DNA contamination from a technician's sweat are amplified as effectively as the specimen from the scene of the crime, with obvious possibilities for injustice.
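The power of the doubling, and the reason contamination matters so much, is easy to sketch. Perfect doubling every cycle is an idealization (real reactions fall somewhat short), and the starting quantities are invented.

```python
# The arithmetic of PCR: each cycle roughly doubles every template
# molecule present - contaminants included. Perfect doubling is an
# idealization; the starting quantities here are invented.

def copies_after_pcr(initial_copies: int, cycles: int) -> int:
    return initial_copies * 2 ** cycles

trace_from_gun_butt = copies_after_pcr(10, 30)  # a tiny smear of sweat
stray_contaminant = copies_after_pcr(1, 30)     # one molecule from a technician
print(trace_from_gun_butt)  # 10737418240
print(stray_contaminant)    # 1073741824 - amplified just as faithfully
```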
But human error is not peculiar to DNA evidence. All kinds of evidence are vulnerable to bungling and sabotage, and must be handled with scrupulous care. The files in a conventional fingerprint library may be mislabelled. The murder weapon may have been touched by innocent people as well as the murderer, and their fingerprints have to be taken, along with the suspect's, for elimination purposes. Courts of law are already accustomed to the need to take all possible precautions against mistakes and they still, sometimes tragically, happen. DNA evidence is not immune to human bungling but nor is it particularly vulnerable, except in so far as PCR amplifies error. If all DNA evidence were to be thrown out because of occasional mistakes, the precedent should rule out most other kinds of evidence, too. We have to suppose that codes of practice and rigorous precautions can be developed to guard against human error in the presentation of all kinds of legal evidence.
The more sophisticated difficulties that bedevil DNA evidence will take longer to explain. They, too, have their precedents in conventional types of evidence, although this point often does not seem to be understood in law courts.
Where identification evidence of any kind is concerned, there are two types of error which correspond to the two types of error in any statistical evidence. In another chapter, we shall call them Type 1 and Type 2 errors, but it is easier to think of them as false positive and false negative. A guilty suspect may escape, through not being recognized - false negative. And - false positive (which most people would see as the more dangerous error) - an innocent suspect may be convicted because he happens, by ill luck, to resemble the genuinely guilty party. In the case of ordinary eye-witness identification, an innocent bystander who happens to look a bit like the real criminal could consequently be arrested - false positive. Identity parades are designed to make this less probable. The chance of a miscarriage of justice is inversely related to the number of people standing in the line-up. The danger can be increased in the ways we have
already considered - the line-up being unfairly stacked with clean-shaven men for example.
In the case of DNA evidence the danger of a false positive conviction is theoretically very low indeed. We have a blood sample from a suspect, and we have a specimen from the scene of the crime. If the entire set of genes in both these samples could be written down, the probability of a false conviction is one in billions and billions. Identical twins apart, the chance that any two humans would match all their DNA is tantamount to zero. But unfortunately it is not practical to work out the complete gene sequence of a human being. Even after the Human Genome Project is completed, to attempt the equivalent in the solution of each crime is unrealistic. In practice, forensic detectives concentrate on small sections of the genome, preferably sections that are known to vary in the population. And now our fear must be that, although we could safely rule out mis-identification if the whole genome were considered, there might be a danger of two individuals' being identical with respect to the small portion of DNA that we have time to analyse.
The probability that this would happen ought to be measurable for any particular section of the genome; we could then decide whether it was an acceptable risk. The larger the section of DNA, the smaller the probability of error, just as, in an identity parade, the longer the line-up the safer the conviction. The difference is that an identity parade, in order to compete with the DNA equivalent, would need to contain not a couple of dozen people but thousands, millions or even billions in the line. Apart from this quantitative difference, the analogy with the identity parade continues. We shall see that there is a DNA equivalent of our hypothetical line-up of clean-shaven men with one bearded suspect. But first, a little more background on DNA fingerprinting.
Obviously we sample the equivalent parts of the genome in both suspect and specimen. These parts of the genome are chosen for their tendency to vary widely in the population. A Darwinian would note that the parts that don't vary are often the parts that have an important role to play in the survival of the organism. Any substantial variations in these important genes are likely to have been removed from the population by the death of their possessors - Darwinian natural selection. But there are other parts of the genome that are very variable, perhaps because they are not important for survival. This isn't the whole story because in fact some useful genes are quite variable. The reasons for this are controversial. It's a bit of a digression but . . . What is this life if, full of stress, we have no freedom to digress?
The 'neutralist' school of thought, associated with the distinguished Japanese geneticist Motoo Kimura, believes that useful genes are equally
useful in a variety of different forms. This emphatically does not mean that they are useless, only that the different forms are equally good at what they do. If you think of genes as writing out their recipes in words, the alternative forms of a gene can be thought of as the very same words written in different typefaces: the meaning is the same, and the product of the recipe will come out the same. Genetic changes, 'mutations', that make no difference are not 'seen' by natural selection. They aren't mutations at all, for all the difference they make to the life of the animal, but they are potentially useful mutations from the point of view of the forensic scientist. The population ends up with lots of variety at such a locus (position in a chromosome), and this kind of variety could in principle be used for fingerprinting.
The other theory of variation, opposed to Kimura's neutral theory, believes that the different versions of the genes really do different things and that there is some special reason why both are preserved by natural selection in the population. For example, there might be two alternative forms of a blood protein, α and β, which are susceptible to two infectious diseases called alfluenza and betaccosis respectively, each being immune to the other disease. Typically, an infectious disease needs a critical density of susceptible victims in a population, otherwise an epidemic can't get going. In a population dominated by α types, there are frequent epidemics of alfluenza but not of betaccosis. So natural selection favours the β types who are immune to alfluenza. It favours them so much that after a while they come to dominate the population. Now the tables are turned. There are epidemics of betaccosis, but not of alfluenza. The α types now are favoured by natural selection because they are immune to betaccosis. The population may keep oscillating between α dominance and β dominance, or it may settle down to an intermediate mixture, an 'equilibrium'. Either way, we'll see plenty of variation at the gene locus concerned, and this is good news for the fingerprinters. The phenomenon is called 'frequency dependent selection' and it is one suggested reason for high levels of genetic variation in the population. There are others.
However, for our forensic purposes, it matters only that there are variable sections of the genome. Whatever the verdict in the controversy over whether the useful bits of the genome are variable, there are in any case lots of other regions of the genome which are never even read, or never translated into their protein equivalents. Indeed, an astonishingly high proportion of our genes seem to be doing nothing whatsoever. They are therefore free to vary, which makes them excellent DNA fingerprinting material.
As if to confirm the fact that a great deal of DNA is doing nothing useful, the sheer quantity of DNA in the cells of different kinds of organisms is
wildly variable.
Since DNA information is digital, we can measure it in
the same kind of units as we measure computer information. One bit of information is enough to specify one yes/no decision: a 1 or a 0, a true or a false. The computer on which I am writing this has 256 megabits (32 megabytes) of core memory. (The first computer that I owned was a
bigger box but had less than one five-thousandth of the memory capacity.) The equivalent fundamental unit in DNA is the nucleotide base. Since there are 4 possible bases, the information content of each base is equivalent to 2 bits. The common gut bacterium Escherichia coli has a genome of 4 mega-bases or 8 megabits. The crested newt, Triturus cristatus, has 40,000 megabits. The 5,000-fold ratio between crested
newt and bacterium is about the same as that between my present computer and my first one. We humans have 3,000 mega-bases or 6,000 megabits. This is 750 times as great as the bacterium (which satisfies
our vanity), but what are we to make of the newt trumping us sixfold? We'd like to think that genome size is not strictly proportional to the complexity of what it builds: presumably quite a lot of that newt DNA isn't doing anything. This is certainly true. It is also true of most of our DNA. We know from other evidence that, of the 3,000 mega-base human genome, only about 2 per cent is actually used for coding protein synthesis. The rest is often called junk DNA. Presumably the crested newt has an even higher percentage
of junk DNA than we have; other species of newt do not.
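The arithmetic here is easy to check for yourself. A short Python sketch, using the same round genome figures quoted above (they are order-of-magnitude figures, not precise measurements):

```python
# Back-of-envelope genome arithmetic. With four possible bases,
# each base carries log2(4) = 2 bits of information.
import math

BITS_PER_BASE = math.log2(4)  # exactly 2.0

# Round genome sizes, in bases, as quoted in the text.
genomes = {
    "E. coli": 4e6,        # 4 mega-bases
    "human": 3e9,          # 3,000 mega-bases
    "crested newt": 2e10,  # 20,000 mega-bases, i.e. 40,000 megabits
}

# Convert each genome to megabits.
megabits = {name: bases * BITS_PER_BASE / 1e6 for name, bases in genomes.items()}
# E. coli: 8; human: 6,000; crested newt: 40,000 megabits.

# The ratios quoted in the text:
human_vs_bacterium = genomes["human"] / genomes["E. coli"]        # 750-fold
newt_vs_bacterium = genomes["crested newt"] / genomes["E. coli"]  # 5,000-fold
newt_vs_human = genomes["crested newt"] / genomes["human"]        # roughly sixfold
```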
The surplus of unused DNA falls into various categories. Some of it looks like real genetic information, and probably represents old, defunct genes, or out-of-date copies of genes that are still in use. These pseudo-genes would make sense if they were read and translated. But they are not read and translated. Hard disks on computers usually contain comparable junk: old copies of work in progress, scratchpad space used by the computer for interim operations, and so on. We users don't see this junk, because our computers only show us those parts of the disk that we need to know about. But if you get right down and read the actual information on the disk, byte by byte, you'll see the junk, and much of it will make some sort of sense. There are probably dozens of disjointed fragments of this very chapter peppered around my hard disk at present, although there is only one 'official' copy that the computer tells me about (plus a prudent back-up).
In addition to the junk DNA which could be read but isn't, there is plenty of junk DNA which not only isn't read but wouldn't make any sense if it were. There are huge stretches of repeated nonsense, perhaps repeats of one base, or alternations of the same two bases, or repeats of a more complicated pattern. Unlike the other class of junk DNA, we cannot account for these 'tandem repeats' as outdated copies of useful genes. This repetitive DNA has never been decoded, and presumably has never been of any use. (Never useful for the animal's survival, anyway. From
the point of view of the selfish gene, as I explained in another book, we could say that any kind of junk DNA is 'useful' to itself if it just keeps surviving and making more copies of itself. This suggestion has come to be known by the catch-phrase 'selfish DNA', although this is a little unfortunate because, in my original sense, working DNA is selfish too. For this reason, some people have taken to calling it 'ultra-selfish DNA'.)
Anyway, whatever the reason, junk DNA is there, and there in prodigious quantities. Because it is not used, it is free to vary. Useful genes, as we have seen, are severely constrained in their freedom to change. Most changes (mutations) make a gene work less effectively, the animal dies and the change is not passed on. This is what Darwinian natural selection is all about. But mutations in junk DNA (mostly changes in the number of repeats in a given region) are not noticed by natural selection. So, as we look around the population, we find most of the variation that is useful for fingerprinting in the junk regions. As we shall now see, tandem repeats are particularly useful because they vary with respect to number of repeats, a gross feature which is easy to measure.
If it wasn't for this, the forensic geneticist would need to look at the exact sequence of bases in our sample region. This can be done, but sequencing DNA is time-consuming. The tandem repeats allow us to use cunning short-cuts, as discovered by Alec Jeffreys of the University of Leicester, rightly regarded as the father of DNA fingerprinting (and now Sir Alec). Different people have different numbers of tandem repeats in particular places. I might have 147 repeats of a particular piece of nonsense, where you have 84 repeats of the same piece of nonsense in the corresponding place in your genome. In another region, I might have 24 repeats of a particular piece of nonsense to your 38 repeats. Each of us has a characteristic fingerprint consisting of a set of numbers. Each of these numbers in our fingerprint is the number of times a particular piece of nonsense is repeated in our genome.
We get our tandem repeats from our parents. We each have 46 chromosomes, 23 from our father and 23 homologous, or corresponding, chromosomes from our mother. These chromosomes come complete with tandem repeats. Your father got his 46 chromosomes from your paternal grandparents, but he didn't pass them on to you in their entirety. Each of his mother's chromosomes was lined up with its paternal opposite number and bits were exchanged before a composite chromosome was put into the sperm that helped to make you. Every sperm and every egg is unique because it is a different mix of maternal and paternal chromosomes. The mixing process affects the tandem repeat sections as well as the meaningful sections of the chromosomes. So our characteristic numbers of tandem repeats are inherited, in much the same way as our eye colour and hair curliness are inherited. With the difference that, whereas our eye colour results from some kind of joint
verdict of our paternal and our maternal genes, our tandem repeat numbers are properties of the chromosomes themselves and can therefore be measured separately for paternal and maternal chromosomes. At any particular tandem repeat region, each of us has two readings: a paternal chromosome repeat number and a maternal chromosome repeat number. From time to time, chromosomes mutate - suffer a random change - in their tandem repeat numbers. Or a particular tandem region may be split by chromosomal crossing over. This is why there is variation in tandem repeat numbers in the population. The beauty of tandem repeat numbers is that they are easy to measure. You don't have to get embroiled in detailed sequencing of coded DNA bases. You do something a bit like weighing them. Or, to take another equally apt analogy, you spread them out like coloured bands from a prism. I'll explain one way of doing this.
First you need to make some preparations. You make a so-called DNA probe, which is a short sequence of DNA that exactly matches the nonsense sequence in question - up to about 20 nucleotide bases long. This is not difficult to do nowadays. There are several methods. You can even buy a machine off the shelf which makes short DNA sequences to any specification, just as you can buy a keyboard to punch any desired string of letters on a paper tape. By supplying the synthesizing machine with radioactive raw materials, you make the probes themselves radioactive, and so 'label' them. This makes the probes easy to find again later, as natural DNA is not radioactive, and so the two are readily distinguishable from each other.
Radioactive probes are a tool of the trade, which you must have ready before you start a Jeffreys fingerprinting exercise. Another essential tool is the 'restriction enzyme'. Restriction enzymes are chemical tools that specialize in cutting DNA, but cutting it only in particular places. For example, one restriction enzyme may search the length of a chromosome until it finds the sequence GAATTC (G, C, T and A are the four letters of the DNA alphabet; all genes, from all species on earth, differ only in consisting of different sequences of these four letters). Another restriction enzyme cuts the DNA wherever it can find the sequence GCGGCCGC. A number of different restriction enzymes are available in the toolbox of the molecular biologist. They originate from bacteria, who use them for their own defensive purposes. Each restriction enzyme has its own unique search string which it homes in on and cuts.
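A restriction enzyme's behaviour is, in effect, a string search-and-cut, and can be caricatured in a few lines of code. The DNA sequence below is invented for illustration, and a real enzyme cuts at a fixed offset within its recognition site, a detail this toy version ignores:

```python
# A restriction enzyme caricatured as string search-and-cut.
# EcoRI's recognition site really is GAATTC; here we simply split
# the sequence at each occurrence of the site.

def digest(dna, site):
    """Cut a DNA string at every occurrence of the recognition site."""
    return dna.split(site)

# A made-up stretch of DNA containing two GAATTC sites.
dna = "CCTAGGAATTCAAATTTGGGCCCGAATTCTTAA"
fragments = digest(dna, "GAATTC")
print(fragments)  # ['CCTAG', 'AAATTTGGGCCC', 'TTAA'] - three fragments
```

A sequence with no copy of the search string comes through uncut, which is exactly the property exploited in the next step: choosing an enzyme whose site never occurs inside the tandem repeat of interest.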
Now, the trick is to choose a restriction enzyme whose specific search string is completely absent from the tandem repeat we are interested in. The whole length of DNA is therefore chopped into short stretches, bounded by the characteristic search string of the restriction enzyme. Of course, not all the stretches will consist of the tandem repeat we are
looking for. All sorts of other stretches of DNA will happen to be bounded by the favoured search string of the restriction enzyme scissors. But some of them will consist of tandem repeats and the length of each scissored stretch will be largely determined by the number of tandem repeats in it. If I have 147 repeats of a particular piece of DNA nonsense, where you have only 84, my snipped fragments will be correspondingly longer than your snipped fragments.
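The relationship between repeat count and fragment length is simple arithmetic: the cut-out stretch is the flanking DNA plus the repeat count times the length of one repeat unit. The unit length and flanking size below are made up purely for illustration:

```python
# Why fragment length reveals repeat number. Both constants are
# illustrative, not real measurements of any actual locus.
REPEAT_UNIT = 16  # bases per copy of the repeated "nonsense" motif (assumed)
FLANKING = 200    # bases between the repeats and the nearest cut sites (assumed)

def fragment_length(n_repeats):
    return FLANKING + n_repeats * REPEAT_UNIT

mine = fragment_length(147)  # my 147 repeats -> 2552 bases
yours = fragment_length(84)  # your 84 repeats -> 1544 bases
```

My longer fragment travels more slowly through the gel, so the two of us end up as bands at different positions along the column.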
We can measure these characteristic lengths using a technique that has been around in molecular biology for quite a while. This is the bit that is rather like spreading them out with a prism, as Newton did for white light. The standard DNA 'prism' is a gel electrophoresis column, that is, a long tube filled with jelly through which an electric current is passed. A solution containing the scissored stretches of DNA, all jumbled together, is poured into one end of the tube. The DNA fragments, being negatively charged, are all electrically attracted to the positive electrode at the other end of the tube, and they move steadily through the jelly. But they don't all move at the same rate. Like light of low vibration frequency moving through glass, small fragments of DNA move faster than large ones. The result is that, if you switch the current off after a suitable interval, the fragments have spread themselves out along the column, just as Newton's colours spread themselves out because light from the blue end of the spectrum is more readily slowed down by glass than light from the red end.
But so far we can't see the fragments. The jelly column looks uniform all the way down. There is nothing to show that DNA fragments of different size are lurking in discrete bands along its length, and nothing to show which bands contain which variety of tandem repeat. How do we make them visible? This is where the radioactive probes come in.
To make them visible you can use another cunning technique, the Southern blot, named after its inventor, Edward Southern. (Slightly confusingly, there are other techniques called the Northern blot and the Western blot, but no Mr Northern or Mr Western.) The jelly column is removed from the tube and laid out on blotting paper. The liquid in the jelly, including the DNA fragments, seeps out of the jelly into the blotting paper. The blotting paper has previously been laced with quantities of the radioactive probe for the particular tandem repeat that we are interested in. The probe molecules line up along the blotting paper, pairing precisely, by the ordinary rules of DNA, with their opposite numbers in the tandem repeats. Surplus probe molecules are washed away. Now the only radioactive probe molecules left in the blotting paper are those bound to their exact opposite numbers that seeped out of the jelly. The blotting paper is now placed on a piece of X-ray film, which is then marked by the radioactivity. So, what you see when you develop the
film is a set of dark bands - another barcode. The final barcode pattern that we read on the Southern blot is a fingerprint for a person, in very much the same way as the Fraunhofer lines are a fingerprint for a star, or the formant lines are the fingerprint for a vowel sound. Indeed, the barcode from the blood looks very like Fraunhofer lines or formant lines.
The details of DNA fingerprinting techniques get quite complicated and I won't go much further. For instance, one strategy is to hit the DNA with lots of probes all at the same time. What you get then is a mixed bag of barcode stripes simultaneously. In extreme cases, the stripes merge into each other and all you get is one big smear with all possible sizes of DNA fragment represented somewhere in the genome. This is no good for identification purposes. At the other extreme, people use only one probe at a time looking at one genetic 'locus'. This 'single-locus fingerprinting' gives you nice clean bars like Fraunhofer lines. But only one or two bars per person. Even so, the chances of confusing people are small. This is because the characteristics we are talking about are not like 'brown eyes versus blue eyes', in which case lots of people would be the same. The characteristics we are measuring, remember, are lengths of tandem repeat fragments. The number of possible lengths is very large, so even single-locus fingerprinting is pretty good for identification purposes. Not quite good enough, however, so in practice forensic DNA fingerprinters usually use half a dozen separate probes. Now the chances of error are very low indeed. But we still need to talk about exactly how low, because people's lives or liberties might depend upon it.
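The reason half a dozen probes perform so much better than one is that, if the loci are inherited independently, the per-locus match probabilities multiply. A sketch with invented frequencies (real per-locus figures vary, and independence itself is an assumption the next section will complicate):

```python
# Chance that two unrelated people match at every one of several
# independent loci = product of per-locus match probabilities.
# These six frequencies are invented for illustration only.
per_locus_match = [0.05, 0.08, 0.03, 0.06, 0.04, 0.07]

combined = 1.0
for p in per_locus_match:
    combined *= p

# combined is roughly 2e-8: odds of about 50 million to one against
# a coincidental match at all six loci, versus at best 20 to 1 for
# any single locus on its own.
```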
First, we must return to our distinction between false positives and false negatives. DNA evidence can be used to clear an innocent suspect, or it can be made to point the finger at a guilty one. Suppose semen is recovered from the vagina of a rape victim. Circumstantial evidence leads the police to arrest a man, suspect A. Suspect A gives a blood sample and it is compared to the semen sample, using a single DNA probe to look at one tandem repeat locus. If the two are different, suspect A is in the clear. We don't even need to look at a second locus.
But what if suspect A's blood matches the semen sample at this locus? Suppose they both share the same barcode pattern, which we shall call pattern P. This is compatible with the suspect's being guilty, but it doesn't prove it. He could just happen to share pattern P with the real rapist. We must now look at some more loci. If the samples still match, what are the odds against such a match being coincidental - a false positive mis-identification? This is where we have to start thinking statistically about the population at large. In theory, by taking blood from a sample of men in the population at large, we should be able to calculate the likelihood that any two men will be identical at each locus
concerned. But from which section of the population do we draw our sample?
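The compare-then-exclude procedure just described has a simple logical shape: one mismatching locus clears the suspect outright, and only agreement at every locus examined leaves him in the frame. A sketch, with invented repeat numbers:

```python
# Exclusion logic of forensic matching: a fingerprint is a list of
# tandem-repeat numbers, one per locus. All figures are invented.

def compare(crime_scene, suspect):
    """Compare two fingerprints locus by locus."""
    for locus, (a, b) in enumerate(zip(crime_scene, suspect)):
        if a != b:
            return f"excluded at locus {locus}"  # suspect is in the clear
    return "match at all loci examined"  # now weigh the coincidence odds

scene = [147, 24, 38, 91]
suspect_a = [147, 24, 31, 91]  # differs at the third locus
suspect_b = [147, 24, 38, 91]  # matches everywhere

result_a = compare(scene, suspect_a)  # "excluded at locus 2"
result_b = compare(scene, suspect_b)  # "match at all loci examined"
```

Note the asymmetry: a mismatch is conclusive (barring laboratory error), while a full match is only ever statistical evidence, whose strength depends on the population question raised above.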
Remember our lone bearded man in the old-fashioned line-up identity parade? Here's the molecular equivalent. Suppose that, in the world at large, only one in a million men has pattern P. Does this mean that there is a million to one chance against a wrongful conviction of suspect A? No. Suspect A may belong to a minority group of people whose ancestors immigrated from a particular part of the world. Local populations often share genetic peculiarities, for the simple reason that they are descended from the same ancestors. Of the 2.5 million South African Dutch, or Afrikaners, most are descended from one shipload of immigrants who arrived from the Netherlands in 1652. As an indicator of the narrowness of this genetic bottleneck, about a million still bear the surnames of 20 of these original settlers. The Afrikaners have a much higher frequency of certain genetic diseases than the population of the world in general. According to one estimate, about 8,000 (one in 300) have the blood condition porphyria variegata, which is much rarer in the rest of the world. This is apparently because they are descended from one particular couple on the ship, Gerrit Jansz and Ariaantje Jacobs, although it is not known which one was the carrier of the (dominant) gene for the condition. (She was one of eight Rotterdam orphanage girls put on the ship to provide wives for the settlers.) In fact, the condition wasn't noticed at all before modern medicine, because its most marked symptom is a lethal reaction to certain modern anaesthetics (South African hospitals now routinely test for the gene before administering anaesthetic). Other populations often have locally high frequencies of other particular genes, for the same kind of reason. If, to return to our hypothetical court case, suspect A and the real criminal both belong to the same minority group, the likelihood of chance confusion could be dramatically greater than you'd think if you based your estimates on the population at large.
The point is that the frequency of pattern P in humans at large is no longer relevant. We need to know the frequency of pattern P in the group to which the suspect belongs.
This need is nothing new. We've already seen the equivalent danger in an ordinary line-up identity parade. If the prime suspect is Chinese, it doesn't do to stand him in a line-up largely consisting of westerners. And the same kind of statistical reasoning about the background population is needed in identifying stolen goods, as well as individual suspects. I have already mentioned my jury service in the Oxford Court. In one of the three cases I sat on, a man was accused of stealing three coins from a rival numismatist. The accused had been caught with three coins in his possession which matched those lost. Counsel for the prosecution was eloquent.
Ladies and gentlemen of the jury, are we really supposed to believe that three coins, of exactly the same type as the three missing coins, would just happen to be present in the house of a rival collector? I put it to you that such a coincidence is too much to stomach.
Jurymen are not permitted to cross-examine. That was the duty of counsel for the defence, and he, though doubtless learned in the law and also eloquent, had no more clue about probability theory than the prosecutor. I wish he'd said something like this:
M'Lud, we don't know whether the coincidence is too much to stomach, because m'learned friend has not presented us with any evidence at all as to the rarity or commonness of these three coins in the population at large. If these coins are so rare that only one in a hundred collectors in the country has any one of them, the prosecution has a good case, since the defendant was caught with three of them. If, on the other hand, these coins are as common as dirt, there is not enough evidence to convict. (To push to the extreme, three coins that I have in my pocket today, all current legal tender, are very probably the same as three coins in Your Lordship's pocket.)
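The hypothetical defence's argument can be put in numbers. If a fraction f of collectors owns a given coin type, and ownership of the three types is independent (an assumption, as with the genetic loci earlier), an innocent collector holds all three with probability f cubed. Both frequencies below are invented:

```python
# The defence's point in numbers: the coincidence is only damning
# if the coins are rare. Both frequencies are hypothetical.

def p_all_three(f):
    """Probability an innocent collector owns all three coin types,
    assuming each type is owned by a fraction f of collectors,
    independently."""
    return f ** 3

rare = p_all_three(0.01)    # 1 in a million - hard to stomach as coincidence
common = p_all_three(0.5)   # 1 in 8 - no evidence worth the name
```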
My point is that it simply never occurred to any of the legally trained minds in the court that it was relevant even to ask how rare these three coins were in the population at large. Lawyers can certainly add up (I once received a lawyer's bill, the last item of which was 'Time spent making out this bill') but probability theory is another matter.
I expect the coins were actually rare. If they hadn't been, the theft would not have been such a serious matter, and the prosecution presumably would never have been brought. But the jury should have been told explicitly. I remember that the question came up in the jury room, and we wished that we were allowed to go back into the court to seek clarification. The equivalent question is equally relevant in the case of DNA evidence, and it is most certainly being asked. Fortunately, provided a sufficient number of separate genetic loci are examined, the chances of mis-identification - even among members of minority groups, even among family members (except identical twins) - can be reduced to genuinely very small levels, far smaller than can be achieved by any other method of identification, including eye-witness evidence.
Exactly how small the residual possibility of error is may still be open to dispute. And this is where we come to the third category of objection to DNA evidence, the just plain silly. Lawyers are accustomed to pouncing when expert witnesses seem to disagree. If two geneticists are summoned to the stand and are asked to estimate the probability of a mis-
identification with DNA evidence, the first may say 1,000,000 to one while the second may say only 100,000 to one. Pounce. 'Aha! AHA! The experts disagree! Ladies and gentlemen of the jury, what confidence can we place in a scientific method if the experts themselves can't get within a factor of ten of one another? Obviously the only thing to do is throw the entire evidence out, lock, stock and barrel.'
But, in these cases, although geneticists may be inclined to give different weightings to imponderables such as the racial subgroup effect, any disagreement between them is only over whether the odds against a wrongful identification are hyper-mega-astronomical or just plain astronomical. The odds cannot normally be lower than thousands to one, and they may well be up in the billions. Even on the most conservative estimate, the odds against wrongful identification are hugely greater than they are in an ordinary identity parade. 'M'lud, an identity parade of only 30 men is grossly unfair on my client.