Once again, we have an exercise in calculating the odds of coincidence.
M. A. B. were the initial letters of my wife's mother's maiden name. Her married initials of M. A. W. would have seemed just as impressive had they been found on the watch.
Surnames beginning with W are nearly as common in the telephone book as those beginning with B.
This consideration approximately doubles the petwhac, by doubling the number of people in the country who would have been deemed, by a coincidence hunter, to have 'the same initials' as my wife's mother.
Moreover, if somebody bought a watch and found it to be engraved not with her mother's initials but with her own, she might consider it an even greater coincidence and more worthy to be embraced within the (ever-growing) petwhac.
The late Arthur Koestler, as I have already mentioned, was a great enthusiast of coincidences. Among the stories that he recounts in The Roots of Coincidence (1972) are several originally collected by his hero, the Austrian biologist Paul Kammerer (famous for publishing a faked experiment purportedly demonstrating the 'inheritance of acquired characteristics' in the midwife toad). Here is a typical Kammerer story quoted by Koestler:
On September 18, 1916, my wife, while waiting for her turn in the consulting rooms of Prof. Dr J. v. H., reads the magazine Die Kunst; she is impressed by some reproductions of pictures by a painter named Schwalbach, and makes a mental note to remember his name because she would like to see the originals. At that moment the door opens and the receptionist calls out to the patients: 'Is Frau Schwalbach here? She is wanted on the telephone.'
It probably isn't worth trying to estimate the odds against this coincidence, but we can at least write down some of the things that we'd need to know. 'At that moment the door opens' is a little vague. Did the door open one second after she made the mental note to look up Schwalbach's paintings or 20 minutes? How long could the interval have been, leaving her still impressed by the coincidence? The frequency of the name Schwalbach is obviously relevant: we'd be less impressed if it had been Schmidt or Strauss; more impressed if it had been Twistleton-Wykeham-Fiennes or Knatchbull-Huguesson. My local library doesn't
have the Vienna telephone book, but a quick look in another large Germanic telephone directory, the Berlin one, yields half a dozen Schwalbachs: the name is not particularly common, therefore, and it is understandable that the lady was impressed. But we need to think further about the size of the petwhac. Similar coincidences could have happened to people in other doctors' waiting rooms; and in dentists' waiting rooms, government offices and so on; and not just in Vienna but anywhere else. The quantity to keep bearing in mind is the number of opportunities for coincidence that would have been thought, if they had occurred, just as remarkable as the one that actually did occur.
Now let's take another kind of coincidence, where it is even harder to know how to start calculating odds. Consider the often-quoted experience of dreaming of an old acquaintance for the first time in years and then getting a letter from him, out of the blue, the next day. Or of learning that he died in the night. Or of learning that he didn't die in the night but his father did. Or that his father didn't die but won the football pools. See how the petwhac grows out of control when we relax our vigilance?
Often, these coincidence stories are gathered together from a large field. The correspondence columns of popular newspapers contain letters sent in by individual readers who would not have written but for the amazing coincidence that had happened to them. In order to decide whether we should be impressed, we need to know the circulation figure for the newspaper. If it is 4 million, it would be surprising if we did not read daily of some stunning coincidence, since a coincidence only has to happen to one of the 4 million in order for us to have a good chance of being told about it in the paper. It is hard to calculate the probability of a particular coincidence happening to one person, say a long-forgotten old friend dying during the night we happen to dream about him. But whatever this probability is, it is surely far greater than one in 4 million.
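The arithmetic behind this can be made concrete with a toy calculation (the one-in-100,000 figure is my own invented assumption, not the author's; only the 4 million circulation comes from the text):

```python
# Hypothetical figure: suppose some striking coincidence has a
# one-in-100,000 chance of happening to any given reader on any given day.
readers = 4_000_000
p_per_reader_per_day = 1 / 100_000

# Expected number of such stories available to the paper each day
expected_stories = readers * p_per_reader_per_day
print(expected_stories)   # 40.0
```

Even at those long individual odds, the paper could print several fresh coincidence letters every single day.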
So, there really is no reason for us to be impressed when we read in the newspaper of a coincidence that has happened to one of the readers, or to somebody, somewhere in the world. This argument against being impressed is entirely valid. Nevertheless, there may be something lurking here that still bothers us. You may be happy to agree that, from the point of view of a reader of a mass-circulation newspaper, we have no right to be impressed at a coincidence that happens to another of the millions of readers of the same newspaper who bothers to write in. But it is much harder to shake the feeling of spine-chilled awe when the coincidence happens to you yourself. This is not just personal bias. One can make a serious case for it. The feeling occurs to almost everybody I meet; if you ask anybody at random, there is a good chance that they will have at least one pretty uncanny story of coincidence to relate. On the face of it,
this undermines the sceptic's point about newspaper stories having been culled from a millions-strong readership - a huge catchment of opportunity.
Actually it doesn't undermine it, for the following reason. Each one of us, though only a single person, none the less amounts to a very large population of opportunities for coincidence. Each ordinary day that you or I live through is an unbroken sequence of events, or incidents, any of which is potentially a coincidence. I am now looking at a picture on my wall of a deep-sea fish with a fascinatingly alien face. It is possible that, at this very moment, the telephone will ring and the caller will identify himself as a Mr Fish. I'm waiting . . .
The telephone didn't ring. My point is that, whatever you may be doing in any given minute of the day, there probably is some other event - a phone call, say - which, if it were to happen, would with hindsight be rated an eerie coincidence. There are so many minutes in every individual's lifetime that it would be quite surprising to find an individual who had never experienced a startling coincidence. During this particular minute, my thoughts have strayed to a schoolfellow called Haviland (I don't remember his Christian name, nor what he looked like) whom I haven't seen or thought of for 45 years. If, at this moment, an aeroplane manufactured by the de Havilland company were to fly past the window, I'd have a coincidence on my hands. In fact I have to report that no such plane has been forthcoming, but I have now moved on to think about something else, which gives yet another opportunity for coincidence. And so the opportunities for coincidence go on throughout the day and every day. But the negative occurrences, the failures to coincide, are not noticed and not reported.
Our propensity to see significance and pattern in coincidence, whether or not there is any real significance there, is part of a more general tendency to seek patterns. This tendency is laudable and useful. Many events and features in the world really are patterned in a non-random way and it is helpful to us, and to animals generally, to detect these patterns. The difficulty is to navigate between the Scylla of detecting apparent pattern when there isn't any, and the Charybdis of failing to detect pattern when there is. The science of statistics is quite largely concerned with steering this difficult course. But long before statistical methods were formalized, humans and indeed other animals were reasonably good intuitive statisticians. It is easy to make mistakes, however, in both directions.
Here are some true statistical patterns in nature which are not totally obvious, and which humans have not always known.
True pattern
Sexual intercourse is statistically followed by birth about 266 days later
Reason difficult to detect
The exact interval varies around the average of 266 days. Intercourse more often than not fails to result in conception. Intercourse is often frequent anyway, so it is not obvious that conception results from that rather than from, say, eating, which is also frequent.
True pattern
Conception is relatively probable in the middle of a woman's cycle, and relatively improbable near menstruation
Reason difficult to detect
See above. In addition, women who don't menstruate don't conceive. This is a spurious correlation which gets in the way and even, to a naive mind, suggests the opposite of the truth.
True pattern
Smoking causes lung cancer
Reason difficult to detect
Plenty of people who smoke don't get lung cancer. Many people get lung cancer who never smoked.
True pattern
In a time of bubonic plague, proximity to rats, and especially their fleas, tends to lead to infection
Reason difficult to detect
Lots of rats and fleas around anyway. Rats and fleas are associated with so many other things, such as dirt and 'bad air', that it is hard to know which of the many correlated factors is the important one. That is, again, there are spurious correlations that get in the way.
Now here are some false patterns which humans have mistakenly thought they detected.
False pattern
Droughts can be brought to an end by a rain dance (or human sacrifice, or sprinkling goats' blood on a ferret's kidneys, or whatever arbitrary custom the particular theology lays down)
Reason easy to be misled
Occasionally, rains do chance to follow upon a rain dance (etc.), and these rare lucky strikes lodge in the memory. When the rain dance, say, is not followed by rain, it is assumed that some detail went wrong with the ceremony, or that the gods are angry for some other reason: it is always easy enough to find a sufficiently plausible excuse.
False pattern
Comets and other astronomical events portend crises in human affairs
Reason easy to be misled
See above. Also, it is in the interests of astrologers to foster the myth, just as it is no doubt in the interests of priests and witch-doctors to foster the myths about rain dances and ferrets' kidneys.
False pattern
After a run of ill-luck, good luck becomes more likely
Reason easy to be misled
If bad luck persists, we assume that the run of bad luck hasn't ended yet, and we look forward all the more to its eventual end. If bad luck does not persist, the prophecy is seen as fulfilled. We subconsciously define a
'run' of bad luck in terms of its end. Therefore it obviously has to be followed by good luck.
We are not the only animals to seek statistical patterns of non-randomness in nature, and we are not the only animals to make mistakes of the kind that might be called superstitious. Both these facts are neatly demonstrated in the apparatus called the Skinner box, after the famous American psychologist B. F. Skinner. A Skinner box is a simple but versatile piece of equipment for studying the psychology of, usually, a rat or a pigeon. It is a box with a switch or switches let into one wall which the pigeon (say) can operate by pecking. There is also an electrically operated feeding (or other rewarding) apparatus. The two are connected in such a way that pecking by the pigeon has some influence on the feeding apparatus. In the simplest case, every time the pigeon pecks the key it gets food. Pigeons readily learn the task. So do rats and, in suitably enlarged and reinforced Skinner boxes, so do pigs.
We know that the causal link between key peck and food is provided by electrical apparatus, but the pigeon doesn't. As far as the pigeon is concerned, pecking a key might as well be a rain dance. Moreover, the link can be quite a weak, statistical one. The apparatus may be set up so that, far from every peck being rewarded, only one in 10 pecks is rewarded. This can mean literally every tenth peck. Or, with a different setting of the apparatus, it can mean that one in 10 pecks on average is rewarded, but on any particular occasion the exact number of pecks
required is determined at random. Or there may be a clock which determines that one tenth of the time, on average, a peck will yield reward, but it is impossible to tell which tenth of the time. Pigeons and rats learn to press keys even when, one might think, you'd need to be quite a good statistician to detect the cause-effect relationship. They can be worked up to a schedule in which only a very small proportion of pecks is rewarded. Interestingly, habits learned when pecks are only occasionally rewarded are more durable than habits learned when all pecks are rewarded: the pigeon is less swiftly discouraged when the rewarding mechanism is switched off altogether. This makes intuitive sense if you think about it.
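The difference between the two reward schedules just described can be sketched in a few lines (a simulation of my own, in Python, not a description of Skinner's actual apparatus):

```python
import random

def fixed_ratio_rewards(pecks, n):
    """Literally every nth peck is rewarded."""
    return sum(1 for i in range(1, pecks + 1) if i % n == 0)

def variable_ratio_rewards(pecks, n, rng):
    """One peck in n is rewarded on average, but which pecks is random."""
    return sum(1 for _ in range(pecks) if rng.random() < 1 / n)

rng = random.Random(0)
print(fixed_ratio_rewards(10_000, 10))          # exactly 1000
print(variable_ratio_rewards(10_000, 10, rng))  # near 1000, but not exact
```

Over many pecks the two schedules pay out at the same average rate, which is why an animal needs to be a decent intuitive statistician to detect the cause-effect relationship at all.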
Pigeons and rats, then, are quite good statisticians, able to pick up slight, statistical laws of patterning in their world. Presumably this ability
serves them in nature as well as in the Skinner box. Life out there is rich in pattern; the world is a big, complicated Skinner box. Actions by a wild animal are frequently followed by rewards or punishments or other important events. The relationship between cause and effect is frequently not absolute but statistical. If a curlew probes mud with its long, curved bill, there is a certain probability that it will strike a worm. The relationship between probe events and worm events is statistical but real. A whole school of research on animals has grown up around so-called Optimal Foraging Theory. Wild birds show quite sophisticated abilities to assess, statistically, the relative food-richness of different areas and they switch their time between the areas accordingly.
Back in the laboratory, Skinner founded a large school of research using Skinner boxes for all kinds of detailed purposes. Then, in 1948, he tried
a brilliant variant on the standard technique. He completely severed the causal link between behaviour and reward. He set up the apparatus to 'reward' the pigeon from time to time no matter what the bird did. Now
all that the birds actually needed to do was sit back and wait for the reward. But in fact this is not what they did. Instead, in six out of eight cases, they built up - exactly as though they were learning a rewarded habit - what Skinner called 'superstitious' behaviour. Precisely what this consisted of varied from pigeon to pigeon. One bird spun itself round like a top, two or three turns anticlockwise, between 'rewards'. Another bird repeatedly thrust its head towards one particular upper corner of the box. A third bird showed 'tossing' behaviour, as if lifting an invisible curtain with its head. Two birds independently developed the habit of rhythmic, side-to-side 'pendulum swinging' of the head and body. This last habit, incidentally, must have looked rather like the courtship dance of some birds of paradise. Skinner used the word superstition because the birds behaved as if they thought that their habitual movement had a causal influence on the reward mechanism, when actually it didn't. It was the pigeon equivalent of a rain dance.
A superstitious habit, once established, might persist for hours, long after the reward mechanism had been switched off. The habits did not, however, remain unchanged in form. They drifted, like the progressive improvisations of an organist. In one typical case the pigeon's superstitious habit began as a sharp movement of the head from the middle position towards the left. As time went by, the movement became more energetic. Eventually the whole body moved in the same direction and a step or two would be taken with the legs. After many hours of 'topographic drift', this leftward stepping movement became the predominant feature of the habit. The superstitious habits themselves may have been derived from the species' natural repertoire, but it is still fair to say that performing them in this context, and performing them repeatedly, is unnatural for pigeons.
Skinner's superstitious pigeons were behaving like statisticians, but statisticians who have got it wrong. They were alert to the possibility of links between events in their world, especially links between rewards that they wanted and actions that it was in their power to take. A habit, such as shoving the head up into the corner of the cage, began by chance. The bird just happened to do it at the moment before the reward mechanism was due to clunk into action. Understandably enough, the bird developed the tentative hypothesis that there was a link between the two events. So it shoved its head into the corner again. Sure enough, by the luck of Skinner's timing mechanism, the reward came again. If the bird had tried the experiment of not shoving its head into the corner, it would have found that the reward came anyway. But it would have needed to be a better and more sceptical statistician than many of us humans are in order to try this experiment.
Skinner makes the comparison with human gamblers developing little lucky 'tics' when playing cards. This kind of behaviour is also a familiar spectacle on bowling greens. Once the 'wood' (ball) has left the bowler's hand there is nothing more he can do to encourage it to move towards the 'jack' (target ball). Nevertheless, expert bowlers nearly always trot after their wood, often still in the stooped position, twisting and turning their bodies as if to impart desperate instructions to the now indifferent ball, and often speaking futile words of encouragement to it. A one-arm bandit in Las Vegas is nothing more nor less than a human Skinner box. 'Key-pecking' is represented not just by pulling the lever but also, of course, by putting money in the slot. It really is a fool's game because the odds are known to be stacked in favour of the casino - how else would the casino pay its huge electricity bills? Whether or not a given lever pull will deliver a jackpot is determined at random. It is a perfect recipe for superstitious habits. Sure enough, if you watch gambling addicts in Las Vegas you see movements highly reminiscent of Skinner's superstitious
pigeons. Some talk to the machine. Others make funny signs to it with their fingers, or stroke it or pat it with their hands. They once patted it and won the jackpot and they've never forgotten it. I have watched computer addicts, impatient for a server to respond, behaving in a similar way, say, knocking the terminal with their knuckles.
My informant about Las Vegas has also made an informal study of London betting shops. She reports that one particular gambler habitually runs, after placing his bet, to a certain tile in the floor, where he stands on one leg while watching the race on the bookmaker's television. Presumably he once won while standing on this tile and conceived the notion that there was a causal link. Now, if somebody else stands on 'his' lucky tile (some other sportsmen do this deliberately, perhaps to try to hijack some of his 'luck' or just to annoy him) he dances around it, desperately trying to get a foot on the tile before the race ends. Other gamblers refuse to change their shirt, or to cut their hair, while they are 'on a lucky streak'. In contrast one Irish punter, who had a fine head of hair, shaved it completely bald in a desperate effort to change his luck. His hypothesis was that he was having rotten luck on the horses and he had lots of hair. Perhaps the two were connected somehow; perhaps these facts were all part of a meaningful pattern! Before we feel too superior, let us remember that large numbers of us were brought up to believe that Samson's fortunes changed utterly after Delilah cut off his hair.
How can we tell which apparent patterns are genuine, which random and meaningless? Methods exist, and they belong in the science of statistics and experimental design. I want to spend a little more time explaining a few of the principles, though not the details, of statistics. Statistics can largely be seen as the art of distinguishing pattern from randomness. Randomness means lack of pattern. There are various ways of explaining the ideas of randomness and pattern. Suppose I claim that I can tell girls' handwriting from boys'. If I am right, this would have to mean that there is a real pattern relating sex to handwriting. A sceptic might doubt this, agreeing that handwriting varies from person to person but denying that there is a sex-related pattern to this variation. How should you decide whether my claim, or the sceptic's, is right? It is no use just accepting my word for it. Like a superstitious Las Vegas gambler, I could easily have mistaken a lucky streak for a real, repeatable skill. In any case, you have every right to demand evidence. What evidence should satisfy you? The answer is evidence that is publicly recorded, and properly analysed.
The claim is, in any case, only a statistical claim. I do not maintain (in this hypothetical example - in reality I am not claiming anything) that I can infallibly judge the sex of the author of a given piece of handwriting. I claim only that among the great variation that exists among handwriting,
some component of that variation correlates with sex. Therefore, even though I shall often make mistakes, if you give me, say, 100 samples of handwriting I should be able to sort them into boys and girls more accurately than could be achieved purely by guessing at random. It follows that, in order to assess any claim, you are going to have to calculate how likely it is that a given result could have been achieved by guessing at random.
Once again, we have an exercise in calculating the odds of coincidence.
Before we get to the statistics, there are some precautions you need to take in designing the experiment. The pattern - the non-randomness we seek - is a pattern relating sex to handwriting. It is important not to confound the issue with extraneous variables. The handwriting samples that you give me should not, for instance, be personal letters. It would be too easy for me to guess the sex of the writer from the content of the letter rather than from the handwriting. Don't choose all the girls from one school and all the boys from another. The pupils from one school might share aspects of their handwriting, learning either from each other or from a teacher. These could result in real differences in handwriting, and they might even be interesting, but they could be representative of different schools, and only incidentally of different sexes. And don't ask the children to write out a passage from a favourite book. I should be influenced by a choice of Black Beauty or Biggles (readers whose childhood culture is different from mine will substitute examples of their own).
Obviously, it is important that the children should all be strangers to me, otherwise I'd recognize their individual writing and hence know their sex. When you hand me the papers they must not have the children's names on them, but you must have some means of keeping track of whose is which. Put secret codes on them for your own benefit, but be careful how you choose the codes. Don't put a green mark on the boys' papers and a yellow mark on the girls'. Admittedly, I won't know which is which, but I'll guess that yellow denotes one sex and green the other, and that would be a big help. It would be a good idea to give every paper a code number. But don't give the boys the numbers 1 to 10 and the girls 11 to 20; that would be just like the yellow and green marks all over again. So would giving the boys odd numbers and the girls even. Instead, give the papers random numbers and keep the crib list locked up where I cannot find it. These precautions are those named 'double blind' in the literature of medical trials.
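A minimal sketch of the coding scheme just described (the three-pupil roster and its labels are hypothetical; a real experiment would have 20):

```python
import random

# Hypothetical roster kept by the experimenter, never shown to the judge
pupils = [("pupil A", "girl"), ("pupil B", "boy"), ("pupil C", "girl")]

rng = random.Random()   # deliberately unseeded: the key must stay secret
codes = rng.sample(range(100, 1000), k=len(pupils))   # random code numbers
crib = dict(zip(codes, pupils))                       # the locked-up crib list

# The judge sees only the code numbers; even sorting them reveals nothing
# about sex, because the numbers carry no pattern.
for code in sorted(crib):
    print(code)
```

Because the codes are drawn at random rather than assigned in blocks (boys 1 to 10, girls 11 to 20), no ordering or parity trick can leak information to the judge.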
Let's assume that all the proper double blind precautions have been taken, and that you have assembled 20 anonymous samples of handwriting, shuffled into random order. I go through the papers, sorting them into two piles for suspected boys and suspected girls. I may have
some 'don't knows', but let's assume that you compel me to make the best guess I can in such cases. At the end of the experiment I have made two piles and you look through to see how accurate I have been.
Now the statistics. You'd expect me to guess right quite often even if I was guessing purely at random. But how often? If my claim to be able to sex handwriting is unjustified, my guessing rate should be no better than somebody tossing a coin. The question is whether my actual performance is sufficiently different from a coin-tosser's to be impressive. Here is how to set about answering the question.
Think about all possible ways in which I could have guessed the sex of the 20 writers. List them in order of impressiveness, beginning with all 20 correct and going down to completely random (all 20 exactly wrong is nearly as impressive as all 20 exactly right, because it shows that I can discriminate, even though I perversely reverse the sign). Then look at the actual way I sorted them and count up the percentage of all possible sortings that would have been as impressive as the actual one, or more. Here's how to think about all possible sortings. First, note that there is only one way of being 100 per cent right, and one way of being 100 per cent wrong, but there are lots of ways of being 50 per cent right. One could be right on the first paper, wrong on the second, wrong on the third, right on the fourth . . . There are somewhat fewer ways of being 60 per cent right. Fewer ways still of being 70 per cent right, and so on. The number of ways of making a single mistake is sufficiently few that we can write them all down. There were 20 scripts. The mistake could have been made on the first one, or on the second one, or on the third one . . . or on the twentieth one. That is, there are exactly 20 ways of making a single mistake. It is more tedious to write down all the ways of making two mistakes, but we can calculate how many ways there are, easily enough, and it comes to 190. It is harder still to count the ways of making three mistakes, but you can see that it could be done. And so on.
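These counts can be checked mechanically; Python's standard library exposes the relevant 'number of ways' function (the language is my choice, merely as a check on the arithmetic in the text):

```python
from math import comb

# number of distinct ways of misclassifying exactly k of the 20 scripts
print(comb(20, 0))   # 1   (no mistakes: only one way)
print(comb(20, 1))   # 20  (the single mistake can fall on any of 20 scripts)
print(comb(20, 2))   # 190 (two mistakes)
print(comb(20, 3))   # 1140 (three mistakes: tedious by hand, easy to compute)
```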
Suppose, in this hypothetical experiment, two mistakes is actually what I did make. We want to know how good my score was, on a spectrum of all possible ways of guessing. What we need to know is how many possible ways of choosing are as good as, or better than, my score. The number as good as my score is 190. The number better than my score is 20 (one mistake) plus 1 (no mistakes). So, the total number as good as or better than my score is 211. It is important to add in the ways of scoring better than my actual score because they properly belong in the petwhac, along with the 190 ways of scoring exactly as well as I did.
We have to set 211 against the total number of ways in which the 20 scripts could have been classified by penny-tossers. This is not difficult to calculate. The first script could have been boy or girl; that is two
possibilities. The second script also could have been boy or girl. So, for each of the two possibilities for the first script, there were two possibilities for the second. That is 2 x 2 = 4 possibilities for the first two scripts. The possibilities for the first three scripts are 2 x 2 x 2 = 8. And the possible ways of classifying all 20 scripts are 2 x 2 x 2 . . . 20 times, or 2 to the power 20. This is a pretty big number, 1,048,576.
So, of all possible ways of guessing, the proportion of ways that are as good as or better than my actual score is 211 divided by 1,048,576, which is approximately 0.0002, or 0.02 per cent. To put it another way, if 10,000 people sorted the scripts entirely by tossing pennies, you'd expect only two of them to score as well as I actually did. This means that my score is pretty impressive and, if I performed as well as this, it would be strong evidence that boys and girls differ systematically in their handwriting. Let me repeat that this is all hypothetical. As far as I know, I have no such ability to sex handwriting. I should also add that, even if there was good evidence for a sex difference in handwriting, this would say nothing about whether the difference is innate or learned. The evidence, at least if it came from the kind of experiment just described, would be equally compatible with the idea that girls are systematically taught a different handwriting from boys - perhaps a more 'ladylike' and less 'assertive' fist.
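The whole calculation fits in a few lines (again a sketch in Python, following the numbers given in the text):

```python
from math import comb

# ways of doing as well as, or better than, two mistakes out of 20
as_good_or_better = comb(20, 2) + comb(20, 1) + comb(20, 0)   # 190 + 20 + 1
total_sortings = 2 ** 20                                       # 1,048,576
p_value = as_good_or_better / total_sortings

print(as_good_or_better)        # 211
print(round(p_value, 6))        # 0.000201, i.e. about 0.02 per cent
print(round(p_value * 10_000))  # about 2 of 10,000 penny-tossers do as well
```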
We have just performed what is technically called a test of statistical significance. We reasoned from first principles, which made it rather tedious. In practice, research workers can call upon tables of probabilities and distributions that have been previously calculated. We therefore don't literally have to write down all possible ways in which things could have happened. But the underlying theory, the basis upon which the tables were calculated, depends, in essence, upon the same fundamental procedure. Take the events that could have been obtained and throw them down repeatedly at random. Look at the actual way the events occurred and measure how extreme it is, on the spectrum of all possible ways in which they could have been thrown down.
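The 'throw them down repeatedly at random' procedure can also be run literally, as a Monte Carlo version of the handwriting test (the trial count and seed here are my arbitrary choices, not anything from the text):

```python
import random

rng = random.Random(1)
trials = 200_000
hits = 0
for _ in range(trials):
    # one imaginary penny-tosser guessing the sex of all 20 scripts
    correct = sum(rng.random() < 0.5 for _ in range(20))
    if correct >= 18:        # as good as, or better than, two mistakes
        hits += 1

print(hits / trials)         # should come out near 0.0002
```

This is exactly what the pre-computed tables spare us: instead of simulating hundreds of thousands of coin-tossers, we look up the answer they would give.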
Notice that a test of statistical significance does not prove anything conclusively. It can't rule out luck as the generator of the result that we observe. The best it can do is place the observed result on a par with a specified amount of luck. In our particular hypothetical example, it was on a par with two out of 10,000 random guessers. When we say that an effect is statistically significant, we must always specify a so-called p-value. This is the probability that a purely random process would have generated a result at least as impressive as the actual result. A p-value of 2 in 10,000 is pretty impressive, but it is still possible that there is no genuine pattern there. The beauty of doing a proper statistical test is that we know how probable it is that there is no genuine pattern there.
Conventionally, scientists allow themselves to be swayed by p-values of 1 in 100, or even as high as 1 in 20: far less impressive than 2 in 10,000. What p-value you accept depends upon how important the result is, and upon what decisions might follow from it. If all you are trying to decide is whether it is worth repeating the experiment with a larger sample, a p-value of 0.05, or 1 in 20, is quite acceptable. Even though there is a 1 in 20 chance that your interesting result would have happened anyway by chance, not much is at stake: the error is not a costly one. If the decision is a life and death matter, as in some medical research, a much lower p-value than 1 in 20 should be sought. The same is true of experiments that purport to show highly controversial results, such as telepathy or 'paranormal' effects.
As we briefly saw in connection with DNA fingerprinting, statisticians distinguish false positive from false negative errors, sometimes called
type 1 and type 2 errors respectively. A type 2 error, or false negative, is
a failure to detect an effect when there really is one. A type 1 error, or false positive, is the opposite: concluding that there really is something going on when actually there is nothing but randomness. The p-value is the measure of the probability that you have made a type 1 error. Statistical judgement means steering a middle course between the two kinds of error. There is a type 3 error in which your mind goes totally blank whenever you try to remember which is which of type 1 and type 2. I still look them up after a lifetime of use. Where it matters, therefore, I shall use the more easily remembered names, false positive and false negative. I also, by the way, frequently make mistakes in arithmetic. In practice I should never dream of doing a statistical test from first principles as I did for the hypothetical handwriting case. I'd always look up in a table that somebody else - preferably a computer - had calculated.
Skinner's superstitious pigeons made false positive errors. There was in fact no pattern in their world that truly connected their actions to the deliveries of the reward mechanism. But they behaved as if they had detected such a pattern. One pigeon 'thought' (or behaved as if it thought) that left stepping caused the reward mechanism to deliver. Another 'thought' that thrusting its head into the corner had the same beneficial effect. Both were making false positive errors. A false negative error is made by a pigeon in a Skinner box who never notices that a peck at the key yields food if the red light is on, but that a peck when the blue light
is on punishes by switching the mechanism off for ten minutes. There is a genuine pattern waiting to be detected in the little world of this Skinner box, but our hypothetical pigeon does not detect it. It pecks indiscriminately to both colours, and therefore gets a reward less frequently than it could.
A false positive error is made by a farmer who thinks that sacrificing to the gods brings longed-for rain. In fact, I presume (although I haven't investigated the matter experimentally), there is no such pattern in his world, but he does not discover this and persists in his useless and wasteful sacrifices. A false negative error is made by a farmer who fails to notice that there is a pattern in the world relating manuring of a field to the subsequent crop yield of that field. Good farmers steer a middle way between type 1 and type 2 errors.
It is my thesis that all animals, to a greater or lesser extent, behave as intuitive statisticians, choosing a middle course between type 1 and type 2 errors. Natural selection penalizes both type 1 and type 2 errors, but the penalties are not symmetrical and no doubt vary with the different ways of life of species. A stick caterpillar looks so like the twig it is sitting on that we cannot doubt that natural selection has shaped it to resemble a twig. Many caterpillars died to produce this beautiful result. They died because they did not sufficiently resemble a twig. Birds, or other predators, found them out. Even some very good twig mimics must have been found out. How else did natural selection push evolution towards the pitch of perfection that we see? But, equally, birds must many times have missed caterpillars because they resembled twigs, in some cases only slightly. Any prey animal, no matter how well camouflaged, can be detected by predators under ideal seeing conditions. Equally, any prey animal, no matter how poorly camouflaged, can be missed by predators under bad seeing conditions. Seeing conditions can vary with angle (a predator may spot a well-camouflaged animal when looking straight at it, but will miss a poorly camouflaged animal out of the corner of its eye). They can vary with light intensity (a prey may be overlooked at twilight, whereas it would be seen at noon). They can vary with distance (a prey which would be seen at six inches range may be overlooked at a range of 100 yards).
Imagine a bird cruising around a wood, looking for prey. It is surrounded by twigs, a very few of which might be edible caterpillars. The problem is to decide. We can assume that the bird could guarantee to tell whether an apparent twig was actually a caterpillar if it approached the twig really close and subjected it to a minute, concentrated examination in a good light. But there isn't time to do that for all twigs. Small birds with high turnover metabolism have to find food alarmingly often in order to stay alive. Any bird that scanned every individual twig with the equivalent of a magnifying glass would die of starvation before it found its first caterpillar. Efficient searching demands a faster, more cursory and rapid scanning, even though this carries a risk of missing some food. The bird has to strike a balance. Too cursory and it will never find anything. Too detailed and it will detect every caterpillar it looks at, but it will look at too few, and starve.
It is easy to apply the language of type 1 and type 2 errors. A false negative is committed by a bird that sails by a caterpillar without giving it a closer look. A false positive is committed by a bird that zooms in on a suspected caterpillar, only to discover that it is really a twig. The penalty for a false positive is the time and energy wasted flying in for the close inspection: not serious on any one occasion, but it could mount up fatally. The penalty for a false negative is missing a meal. No bird outside Cloud Cuckooland can hope to be free of all type 1 and type 2 errors. Individual birds will be programmed by natural selection to adopt some compromise policy calculated to achieve an optimum intermediate level of false positives and false negatives. Some birds may be biased towards type 1 errors, others towards the opposite extreme. There will be some intermediate setting which is best, and natural selection will steer evolution towards it.
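The bird's compromise can be caricatured in a few lines of code. This is a toy model with invented numbers - one caterpillar per 200 twigs, and a noisy 'suspicion' score that runs slightly higher for caterpillars than for twigs - not a claim about real foraging data. Sliding the inspection threshold trades false positives against false negatives, exactly as in the text:

```python
import random

def scan_wood(threshold, n=100_000, seed=1):
    """Count false positives and false negatives at a given suspicion threshold."""
    rng = random.Random(seed)
    false_pos = false_neg = 0
    for _ in range(n):
        is_caterpillar = rng.random() < 1 / 200
        # Caterpillars score around 1.0, twigs around 0.0, with equal noise.
        suspicion = rng.gauss(1.0 if is_caterpillar else 0.0, 1.0)
        if suspicion > threshold:
            if not is_caterpillar:
                false_pos += 1   # wasted a close inspection on a mere twig
        elif is_caterpillar:
            false_neg += 1       # sailed past a meal
    return false_pos, false_neg

for threshold in (-1.0, 0.5, 2.0):
    print(threshold, scan_wood(threshold))
```

A lax threshold produces tens of thousands of wasted inspections but misses almost no caterpillars; a strict one reverses the balance. Natural selection's job is to find the setting that minimizes the combined cost.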
Which intermediate setting is best will vary from species to species. In
our example it will also depend upon conditions in the wood, for example, the size of the caterpillar population in relation to the number of twigs. These conditions may change from week to week. Or they may vary from wood to wood. Birds may be programmed to learn to adjust their policy
as a result of their statistical experience. Whether they learn or not, successfully hunting animals must usually behave as if they are good statisticians. (I hope it is not necessary, by the way, to plod through the usual disclaimer: No, no, the birds aren't consciously working it out with calculator and probability tables. They are behaving as if they were calculating p-values. They are no more aware of what a p-value means than you are aware of the equation for a parabolic trajectory when you catch a cricket ball or baseball in the outfield.)
Angler fish take advantage of the gullibility of little fish such as gobies. But that is an unfairly value-laden way of putting it. It would be better not to speak of gullibility and say that they exploit the inevitable difficulty the little fish have in steering between type 1 and type 2 errors. The little fish themselves need to eat. What they eat varies, but it often includes small wriggling objects such as worms or shrimps. Their eyes and nervous systems are tuned to wriggling things. They look for wriggling movement and if they see it they pounce. The angler fish exploits this tendency. It has a long fishing rod, evolved from a modified spine, commandeered by natural selection from its original location at the front of the dorsal fin. The angler fish itself is highly camouflaged and it sits motionless on the sea bottom for hours at a time, blending perfectly with the weeds and rocks. The only part of it which is conspicuous is a 'bait', which looks like a worm, a shrimp or a small fish, at the end of its fishing rod. In some deep-sea species the bait is even luminous. In any case, it seems to wriggle like something worth eating
when the angler waves its rod. A possible prey fish, say a goby, is attracted. The angler 'plays' its prey for a little while to hook its attention, then casts the bait down into the still unsuspected region in front of its own invisible mouth, and the little fish often follows. Suddenly that huge mouth is invisible no longer. It gapes massively, there is a violent inrushing of water, engulfing every floating object in the vicinity, and the little fish has pursued its last worm.
From the point of view of a hunting goby, any worm may be overlooked or it may be seen. Once the worm has been detected, it may turn out to be a real worm or an angler fish's lure, and the unfortunate fish is faced with a dilemma. A false negative error would be to refrain from attacking a perfectly good worm for fear that it might be an angler fish lure. A false positive error would be to attack a worm, only to discover that it is really a lure. Once again, it is impracticable in the real world to get it right all the time. A fish that is too risk-averse will starve because it never attacks worms. A fish that is too foolhardy won't starve but it may be eaten. The optimum in this case may not be halfway between. More surprisingly, the optimum may be one of the extremes. It is possible that angler fish are sufficiently rare that natural selection favours the extreme policy of attacking all apparent worms. I am fond of a remark of the philosopher and psychologist William James on human angling:
There are more worms unattached to hooks than impaled upon them; therefore, on the whole, says Nature to her fishy children, bite at every worm and take your chances. (1910)
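William James's advice can be put into a one-line expected-value calculation. The payoffs here are invented for illustration (a meal worth +1, being engulfed worth -1000); the point is only the shape of the argument, that when lures are rare enough, 'bite at every worm' beats the alternative:

```python
def expected_gain_from_biting(lure_fraction, meal=1.0, death=-1000.0):
    """Expected payoff of attacking an apparent worm, given the odds it is a lure."""
    return (1 - lure_fraction) * meal + lure_fraction * death

# Break-even: biting pays whenever lures are rarer than meal / (meal - death).
break_even = 1.0 / 1001.0
for q in (0.0, 0.0005, 0.002):
    print(q, expected_gain_from_biting(q))
```

With these made-up numbers the extreme policy is optimal so long as fewer than about one 'worm' in a thousand is an angler fish's lure, which is the sense in which Nature's advice to her fishy children can be sound.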
Like all other animals, and even plants, humans can and must behave as intuitive statisticians. The difference with us is that we can do our calculations twice over. The first time intuitively, as though we were birds or fish. And then again explicitly, with pencil and paper or computer. It is tempting to say that the pencil and paper way gets the right answer, so long as we don't make some publicly detectable blunder like adding in the date, whereas the intuitive way may yield the wrong answer.
have the Vienna telephone book, but a quick look in another large Germanic telephone directory, the Berlin one, yields half a dozen Schwalbachs: the name is not particularly common, therefore, and it is understandable that the lady was impressed. But we need to think further about the size of the petwhac. Similar coincidences could have happened to people in other doctors' waiting rooms; and in dentists' waiting rooms, government offices and so on; and not just in Vienna but anywhere else. The quantity to keep bearing in mind is the number of opportunities for coincidence that would have been thought, if they had occurred, just as remarkable as the one that actually did occur.
Now let's take another kind of coincidence, where it is even harder to know how to start calculating odds. Consider the often-quoted experience of dreaming of an old acquaintance for the first time in years and then getting a letter from him, out of the blue, the next day. Or of learning that he died in the night. Or of learning that he didn't die in the night but his father did. Or that his father didn't die but won the football pools. See how the petwhac grows out of control when we relax our vigilance?
Often, these coincidence stories are gathered together from a large field. The correspondence columns of popular newspapers contain letters sent in by individual readers who would not have written but for the amazing coincidence that had happened to them. In order to decide whether we should be impressed, we need to know the circulation figure for the newspaper. If it is 4 million, it would be surprising if we did not read daily of some stunning coincidence, since a coincidence only has to happen to one of the 4 million in order for us to have a good chance of being told about it in the paper. It is hard to calculate the probability of a particular coincidence happening to one person, say a long-forgotten old friend dying during the night we happen to dream about him. But whatever this probability is, it is surely far greater than one in 4 million.
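The arithmetic behind that argument is worth making explicit. Suppose, purely for illustration, that the per-reader daily probability of a striking coincidence is 1 in 10,000 (the text says only that it is far greater than 1 in 4 million). Then across 4 million readers:

```python
readers = 4_000_000
p_one = 1 / 10_000   # assumed per-reader chance of a striking coincidence in a day

# Probability that at least one reader experiences it.
p_at_least_one = 1 - (1 - p_one) ** readers
print(p_at_least_one)   # indistinguishable from certainty
```

Even with a far smaller assumed per-reader probability, say 1 in a million, the chance that nobody in the readership has a story to send in is essentially nil.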
So, there really is no reason for us to be impressed when we read in the newspaper of a coincidence that has happened to one of the readers, or to somebody, somewhere in the world. This argument against being impressed is entirely valid. Nevertheless, there may be something lurking here that still bothers us. You may be happy to agree that, from the point of view of a reader of a mass-circulation newspaper, we have no right to be impressed at a coincidence that happens to another of the millions of readers of the same newspaper who bothers to write in. But it is much harder to shake the feeling of spine-chilled awe when the coincidence happens to you yourself. This is not just personal bias. One can make a serious case for it. The feeling occurs to almost everybody I meet; if you ask anybody at random, there is a good chance that they will have at least one pretty uncanny story of coincidence to relate. On the face of it,
this undermines the sceptic's point about newspaper stories having been culled from a millions-strong readership - a huge catchment of opportunity.
Actually it doesn't undermine it, for the following reason. Each one of us, though only a single person, none the less amounts to a very large population of opportunities for coincidence. Each ordinary day that you or I live through is an unbroken sequence of events, or incidents, any of which is potentially a coincidence. I am now looking at a picture on my wall of a deep-sea fish with a fascinatingly alien face. It is possible that, at this very moment, the telephone will ring and the caller will identify himself as a Mr Fish. I'm waiting . . .
The telephone didn't ring. My point is that, whatever you may be doing in any given minute of the day, there probably is some other event - a phone call, say - which, if it were to happen, would with hindsight be rated an eerie coincidence. There are so many minutes in every individual's lifetime that it would be quite surprising to find an individual who had never experienced a startling coincidence. During this particular minute, my thoughts have strayed to a schoolfellow called Haviland (I don't remember his Christian name, nor what he looked like) whom I haven't seen or thought of for 45 years. If, at this moment, an aeroplane manufactured by the de Haviland company were to fly past the window, I'd have a coincidence on my hands. In fact I have to report that no such plane has been forthcoming, but I have now moved on to think about something else, which gives yet another opportunity for coincidence. And so the opportunities for coincidence go on throughout the day and every day. But the negative occurrences, the failures to coincide, are not noticed and not reported.
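The same calculation works for a single lifetime of minutes. The per-minute odds below are an invented placeholder - one in a million - since nobody knows the true figure; the point is how fast the minutes pile up:

```python
minutes = 70 * 365 * 24 * 60    # roughly 36.8 million minutes in a 70-year life
p_per_minute = 1 / 1_000_000    # assumed chance of an 'eerie' match in any one minute

# Probability of getting through a whole lifetime with no startling coincidence.
p_never = (1 - p_per_minute) ** minutes
print(minutes, p_never)
```

On these assumptions the chance of a coincidence-free life is around 10 to the minus 16: it would be genuinely astonishing to meet someone with no uncanny story to tell.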
Our propensity to see significance and pattern in coincidence, whether or not there is any real significance there, is part of a more general tendency to seek patterns. This tendency is laudable and useful. Many events and features in the world really are patterned in a non-random way and it is helpful to us, and to animals generally, to detect these patterns. The difficulty is to navigate between the Scylla of detecting apparent pattern when there isn't any, and the Charybdis of failing to detect pattern when there is. The science of statistics is quite largely concerned with steering this difficult course. But long before statistical methods were formalized, humans and indeed other animals were reasonably good intuitive statisticians. It is easy to make mistakes, however, in both directions.
Here are some true statistical patterns in nature which are not totally obvious, and which humans have not always known.
True pattern: Sexual intercourse is statistically followed by birth about 266 days later.
Reason difficult to detect: The exact interval varies around the average of 266 days. Intercourse more often than not fails to result in conception. Intercourse is often frequent anyway, so it is not obvious that conception results from that rather than from, say, eating, which is also frequent.
True pattern: Conception is relatively probable in the middle of a woman's cycle, and relatively improbable near menstruation.
Reason difficult to detect: See above. In addition, women who don't menstruate don't conceive. This is a spurious correlation which gets in the way and even, to a naive mind, suggests the opposite of the truth.
True pattern: Smoking causes lung cancer.
Reason difficult to detect: Plenty of people who smoke don't get lung cancer. Many people get lung cancer who never smoked.
True pattern: In a time of bubonic plague, proximity to rats, and especially their fleas, tends to lead to infection.
Reason difficult to detect: Lots of rats and fleas around anyway. Rats and fleas are associated with so many other things, such as dirt and 'bad air', that it is hard to know which of the many correlated factors is the important one, i.e. again, there are spurious correlations that get in the way.
Now here are some false patterns which humans have mistakenly thought they detected.
False pattern: Droughts can be brought to an end by a rain dance (or human sacrifice, or sprinkling goats' blood on a ferret's kidneys, or whatever arbitrary custom the particular theology lays down).
Reason easy to be misled: Occasionally, rains do chance to follow upon a rain dance (etc.), and these rare lucky strikes lodge in the memory. When the rain dance, say, is not followed by rain, it is assumed that some detail went wrong with the ceremony, or that the gods are angry for some other reason: it is always easy enough to find a sufficiently plausible excuse.
False pattern: Comets and other astronomical events portend crises in human affairs.
Reason easy to be misled: See above. Also, it is in the interests of astrologers to foster the myth, just as it is no doubt in the interests of priests and witch-doctors to foster the myths about rain dances and ferrets' kidneys.
False pattern: After a run of ill-luck, good luck becomes more likely.
Reason easy to be misled: If bad luck persists, we assume that the run of bad luck hasn't ended yet, and we look forward all the more to its eventual end. If bad luck does not persist, the prophecy is seen as fulfilled. We subconsciously define a 'run' of bad luck in terms of its end. Therefore it obviously has to be followed by good luck.
We are not the only animals to seek statistical patterns of non- randomness in nature, and we are not the only animals to make mistakes of the kind that might be called superstitious. Both these facts are neatly demonstrated in the apparatus called the Skinner box, after the famous American psychologist B. F. Skinner. A Skinner box is a simple but versatile piece of equipment for studying the psychology of, usually, a rat or a pigeon. It is a box with a switch or switches let into one wall which the pigeon (say) can operate by pecking. There is also an electrically operated feeding (or other rewarding) apparatus. The two are connected in such a way that pecking by the pigeon has some influence on the feeding apparatus. In the simplest case, every time the pigeon pecks the key it gets food. Pigeons readily learn the task. So do rats and, in suitably enlarged and reinforced Skinner boxes, so do pigs.
We know that the causal link between key peck and food is provided by electrical apparatus, but the pigeon doesn't. As far as the pigeon is concerned, pecking a key might as well be a rain dance. Moreover, the link can be quite a weak, statistical one. The apparatus may be set up so that, far from every peck being rewarded, only one in 10 pecks is rewarded. This can mean literally every tenth peck. Or, with a different setting of the apparatus, it can mean that one in 10 pecks on average is rewarded, but on any particular occasion the exact number of pecks
required is determined at random. Or there may be a clock which determines that one tenth of the time, on average, a peck will yield reward, but it is impossible to tell which tenth of the time. Pigeons and rats learn to press keys even when, one might think, you'd need to be quite a good statistician to detect the cause-effect relationship. They can be worked up to a schedule in which only a very small proportion of pecks is rewarded. Interestingly, habits learned when pecks are only occasionally rewarded are more durable than habits learned when all pecks are rewarded: the pigeon is less swiftly discouraged when the rewarding mechanism is switched off altogether. This makes intuitive sense if you think about it.
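The two reward schedules just described - literally every tenth peck versus one peck in ten on average, chosen at random - are easy to mimic in code. This is a sketch with invented numbers, not a model of any actual Skinner-box apparatus:

```python
import random

rng = random.Random(42)   # fixed seed so the simulation is repeatable

pecks = 10_000

# Fixed-ratio schedule: literally every tenth peck is rewarded.
fixed_rewards = sum(1 for i in range(1, pecks + 1) if i % 10 == 0)

# Variable-ratio schedule: each peck independently has a 1-in-10 chance.
variable_rewards = sum(1 for _ in range(pecks) if rng.random() < 0.1)

print(fixed_rewards, variable_rewards)
```

Both schedules pay out about 1,000 rewards in 10,000 pecks, but under the variable schedule no individual reward is predictable, which is why the habit it builds is so hard to extinguish: a long dry spell looks no different from ordinary bad luck.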
Pigeons and rats, then, are quite good statisticians, able to pick up slight, statistical laws of patterning in their world. Presumably this ability
serves them in nature as well as in the Skinner box. Life out there is rich in pattern; the world is a big, complicated Skinner box. Actions by a wild animal are frequently followed by rewards or punishments or other important events. The relationship between cause and effect is frequently not absolute but statistical. If a curlew probes mud with its long, curved bill, there is a certain probability that it will strike a worm. The relationship between probe events and worm events is statistical but real. A whole school of research on animals has grown up around so-called Optimal Foraging Theory. Wild birds show quite sophisticated abilities to assess, statistically, the relative food-richness of different areas and they switch their time between the areas accordingly.
Back in the laboratory, Skinner founded a large school of research using Skinner boxes for all kinds of detailed purposes. Then, in 1948, he tried
a brilliant variant on the standard technique. He completely severed the causal link between behaviour and reward. He set up the apparatus to 'reward' the pigeon from time to time no matter what the bird did. Now
all that the birds actually needed to do was sit back and wait for the reward. But in fact this is not what they did. Instead, in six out of eight cases, they built up - exactly as though they were learning a rewarded habit - what Skinner called 'superstitious' behaviour. Precisely what this consisted of varied from pigeon to pigeon. One bird spun itself round like a top, two or three turns anticlockwise, between 'rewards'. Another bird repeatedly thrust its head towards one particular upper corner of the box. A third bird showed 'tossing' behaviour, as if lifting an invisible curtain with its head. Two birds independently developed the habit of rhythmic, side-to-side 'pendulum swinging' of the head and body. This last habit, incidentally, must have looked rather like the courtship dance of some birds of paradise. Skinner used the word superstition because the birds behaved as if they thought that their habitual movement had a causal influence on the reward mechanism, when actually it didn't. It was the pigeon equivalent of a rain dance.
A superstitious habit, once established, might persist for hours, long after the reward mechanism had been switched off. The habits did not, however, remain unchanged in form. They drifted, like the progressive improvisations of an organist. In one typical case the pigeon's superstitious habit began as a sharp movement of the head from the middle position towards the left. As time went by, the movement became more energetic. Eventually the whole body moved in the same direction and a step or two would be taken with the legs. After many hours of 'topographic drift', this leftward stepping movement became the predominant feature of the habit. The superstitious habits themselves may have been derived from the species' natural repertoire, but it is still fair to say that performing them in this context, and performing them repeatedly, is unnatural for pigeons.
Skinner's superstitious pigeons were behaving like statisticians, but statisticians who have got it wrong. They were alert to the possibility of links between events in their world, especially links between rewards that they wanted and actions that it was in their power to take. A habit, such as shoving the head up into the corner of the cage, began by chance. The bird just happened to do it at the moment before the reward mechanism was due to clunk into action. Understandably enough, the bird developed the tentative hypothesis that there was a link between the two events. So it shoved its head into the corner again. Sure enough, by the luck of Skinner's timing mechanism, the reward came again. If the bird had tried the experiment of not shoving its head into the corner, it would have found that the reward came anyway. But it would have needed to be a better and more sceptical statistician than many of us humans are in order to try this experiment.
Skinner makes the comparison with human gamblers developing little lucky 'tics' when playing cards. This kind of behaviour is also a familiar spectacle on bowling greens. Once the 'wood' (ball) has left the bowler's hand there is nothing more he can do to encourage it to move towards the 'jack' (target ball). Nevertheless, expert bowlers nearly always trot after their wood, often still in the stooped position, twisting and turning their bodies as if to impart desperate instructions to the now indifferent ball, and often speaking futile words of encouragement to it. A one-arm bandit in Las Vegas is nothing more nor less than a human Skinner box. 'Key-pecking' is represented not just by pulling the lever but also, of course, by putting money in the slot. It really is a fool's game because the odds are known to be stacked in favour of the casino - how else would the casino pay its huge electricity bills? Whether or not a given lever pull will deliver a jackpot is determined at random. It is a perfect recipe for superstitious habits. Sure enough, if you watch gambling addicts in Las Vegas you see movements highly reminiscent of Skinner's superstitious
pigeons. Some talk to the machine. Others make funny signs to it with their fingers, or stroke it or pat it with their hands. They once patted it and won the jackpot and they've never forgotten it. I have watched computer addicts, impatient for a server to respond, behaving in a similar way, say, knocking the terminal with their knuckles.
My informant about Las Vegas has also made an informal study of London betting shops. She reports that one particular gambler habitually runs, after placing his bet, to a certain tile in the floor, where he stands on one leg while watching the race on the bookmaker's television. Presumably he once won while standing on this tile and conceived the notion that there was a causal link. Now, if somebody else stands on 'his' lucky tile (some other sportsmen do this deliberately, perhaps to try to hijack some of his 'luck' or just to annoy him) he dances around it, desperately trying to get a foot on the tile before the race ends. Other gamblers refuse to change their shirt, or to cut their hair, while they are 'on a lucky streak'. In contrast one Irish punter, who had a fine head of hair, shaved it completely bald in a desperate effort to change his luck. His hypothesis was that he was having rotten luck on the horses and he had lots of hair. Perhaps the two were connected somehow; perhaps these facts were all part of a meaningful pattern! Before we feel too superior, let us remember that large numbers of us were brought up to believe that Samson's fortunes changed utterly after Delilah cut off his hair.
How can we tell which apparent patterns are genuine, which random and meaningless? Methods exist, and they belong in the science of statistics and experimental design. I want to spend a little more time explaining a few of the principles, though not the details, of statistics. Statistics can largely be seen as the art of distinguishing pattern from randomness. Randomness means lack of pattern. There are various ways of explaining the ideas of randomness and pattern. Suppose I claim that I can tell girls' handwriting from boys'. If I am right, this would have to mean that there is a real pattern relating sex to handwriting. A sceptic might doubt this, agreeing that handwriting varies from person to person but denying that there is a sex-related pattern to this variation. How should you decide whether my claim, or the sceptic's, is right? It is no use just accepting my word for it. Like a superstitious Las Vegas gambler, I could easily have mistaken a lucky streak for a real, repeatable skill. In any case, you have every right to demand evidence. What evidence should satisfy you? The answer is evidence that is publicly recorded, and properly analysed.
The claim is, in any case, only a statistical claim. I do not maintain (in this hypothetical example - in reality I am not claiming anything) that I can infallibly judge the sex of the author of a given piece of handwriting. I claim only that among the great variation that exists among handwriting,
some component of that variation correlates with sex. Therefore, even though I shall often make mistakes, if you give me, say, 100 samples of handwriting I should be able to sort them into boys and girls more accurately than could be achieved purely by guessing at random. It follows that, in order to assess any claim, you are going to have to calculate how likely it is that a given result could have been achieved by guessing at random.
Once again, we have an exercise in calculating the odds of coincidence.
Before we get to the statistics, there are some precautions you need to take in designing the experiment. The pattern - the non-randomness we seek - is a pattern relating sex to handwriting. It is important not to confound the issue with extraneous variables. The handwriting samples that you give me should not, for instance, be personal letters. It would be too easy for me to guess the sex of the writer from the content of the letter rather than from the handwriting. Don't choose all the girls from one school and all the boys from another. The pupils from one school might share aspects of their handwriting, learning either from each other or from a teacher. These could result in real differences in handwriting, and they might even be interesting, but they could be representative of different schools, and only incidentally of different sexes. And don't ask the children to write out a passage from a favourite book. I should be influenced by a choice of Black Beauty or Biggles (readers whose childhood culture is different from mine will substitute examples of their own).
Obviously, it is important that the children should all be strangers to me, otherwise I'd recognize their individual writing and hence know their sex. When you hand me the papers they must not have the children's names on them, but you must have some means of keeping track of whose is which. Put secret codes on them for your own benefit, but be careful how you choose the codes. Don't put a green mark on the boys' papers and a yellow mark on the girls'. Admittedly, I won't know which is which, but I'll guess that yellow denotes one sex and green the other, and that would be a big help. It would be a good idea to give every paper a code number. But don't give the boys the numbers 1 to 10 and the girls 11 to 20; that would be just like the yellow and green marks all over again. So would giving the boys odd numbers and the girls even. Instead, give the papers random numbers and keep the crib list locked up where I cannot find it. These precautions are those named 'double blind' in the literature of medical trials.
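The coding precaution above amounts to a simple rule: the labels handed to the judge must carry no information at all about sex. A minimal sketch of how one might generate such labels (the roster of 10 boys and 10 girls is hypothetical):

```python
import random

rng = random.Random()   # deliberately left unseeded: the codes must be unguessable

# Hypothetical roster of 20 anonymous scripts.
papers = [("boy", i) for i in range(10)] + [("girl", i) for i in range(10)]

# Arbitrary, non-systematic codes: no parity, range, or colour scheme to exploit.
codes = rng.sample(range(100, 1000), len(papers))
crib = dict(zip(codes, papers))   # the crib list: this is what stays locked up

# The judge receives the scripts labelled only with their codes, in shuffled order.
handed_over = list(crib)
rng.shuffle(handed_over)
print(handed_over)
```

Because the codes are drawn at random from one common pool and the order is shuffled, nothing the judge can see correlates with the writer's sex, which is the whole of what 'double blind' coding requires.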
Let's assume that all the proper double blind precautions have been taken, and that you have assembled 20 anonymous samples of handwriting, shuffled into random order. I go through the papers, sorting them into two piles for suspected boys and suspected girls. I may have
some 'don't knows', but let's assume that you compel me to make the best guess I can in such cases. At the end of the experiment I have made two piles and you look through to see how accurate I have been.
Now the statistics. You'd expect me to guess right quite often even if I was guessing purely at random. But how often? If my claim to be able to sex handwriting is unjustified, my guessing rate should be no better than somebody tossing a coin. The question is whether my actual performance is sufficiently different from a coin-tosser's to be impressive. Here is how to set about answering the question.
Think about all possible ways in which I could have guessed the sex of the 20 writers. List them in order of impressiveness, beginning with all 20 correct and going down to completely random (all 20 exactly wrong is nearly as impressive as all 20 exactly right, because it shows that I can discriminate, even though I perversely reverse the sign). Then look at the actual way I sorted them and count up the percentage of all possible sortings that would have been as impressive as the actual one, or more. Here's how to think about all possible sortings. First, note that there is only one way of being 100 per cent right, and one way of being 100 per cent wrong, but there are lots of ways of being 50 per cent right. One could be right on the first paper, wrong on the second, wrong on the third, right on the fourth . . . There are somewhat fewer ways of being 60 per cent right, fewer ways still of being 70 per cent right, and so on. The number of ways of making a single mistake is sufficiently few that we can write them all down. There were 20 scripts. The mistake could have been made on the first one, or on the second one, or on the third one . . . or on the twentieth one. That is, there are exactly 20 ways of making a single mistake. It is more tedious to write down all the ways of making two mistakes, but we can calculate how many ways there are easily enough: the first mistake could fall on any of the 20 scripts and the second on any of the remaining 19, and, since it doesn't matter which of the two we call the first mistake, we halve the product, giving 20 x 19 / 2 = 190. It is harder still to count the ways of making three mistakes, but you can see that it could be done. And so on.
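The counts in this paragraph are binomial coefficients: the number of ways of misclassifying exactly k of the 20 scripts is '20 choose k', which Python's standard library computes directly.

```python
from math import comb

# Number of ways of making exactly k mistakes among 20 scripts.
print(comb(20, 0))   # 1: only one way of being 100 per cent right
print(comb(20, 1))   # 20: one mistake, on any one of the 20 scripts
print(comb(20, 2))   # 190: two mistakes, as counted in the text
print(comb(20, 3))   # 1140: three mistakes, tedious to list by hand
```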
Suppose, in this hypothetical experiment, two mistakes is actually what I did make. We want to know how good my score was, on a spectrum of all possible ways of guessing. What we need to know is how many possible ways of choosing are as good as, or better than, my score. The number as good as my score is 190. The number better than my score is 20 (one mistake) plus 1 (no mistakes). So, the total number as good as or better than my score is 211. It is important to add in the ways of scoring better than my actual score because they properly belong in the petwhac, along with the 190 ways of scoring exactly as well as I did.
We have to set 211 against the total number of ways in which the 20 scripts could have been classified by penny-tossers. This is not difficult to calculate. The first script could have been boy or girl; that is two
possibilities. The second script also could have been boy or girl. So, for each of the two possibilities for the first script, there were two possibilities for the second. That is 2 x 2 = 4 possibilities for the first two scripts. The possibilities for the first three scripts are 2 x 2 x 2 = 8. And the possible ways of classifying all 20 scripts are 2 x 2 x 2 . . . x 2, 20 times over, or 2 to the power 20. This is a pretty big number, 1,048,576.
So, of all possible ways of guessing, the proportion of ways that are as good as or better than my actual score is 211 divided by 1,048,576, which is approximately 0.0002, or 0.02 per cent. To put it another way, if 10,000 people sorted the scripts entirely by tossing pennies, you'd expect only two of them to score as well as I actually did. This means that my score is pretty impressive and, if I performed as well as this, it would be strong evidence that boys and girls differ systematically in their handwriting. Let me repeat that this is all hypothetical. As far as I know, I have no such ability to sex handwriting. I should also add that, even if there was good evidence for a sex difference in handwriting, this would say nothing about whether the difference is innate or learned. The evidence, at least if it came from the kind of experiment just described, would be equally compatible with the idea that girls are systematically taught a different handwriting from boys - perhaps a more 'ladylike' and less 'assertive' fist.
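The arithmetic of the hypothetical experiment can be checked in full: two mistakes out of 20, scored against every possible coin-tossed sorting.

```python
from math import comb

# Ways of doing as well as, or better than, two mistakes out of 20.
as_good_or_better = comb(20, 0) + comb(20, 1) + comb(20, 2)  # 1 + 20 + 190

# Every possible boy/girl classification of 20 scripts.
total_sortings = 2 ** 20

p_value = as_good_or_better / total_sortings

print(as_good_or_better)    # 211
print(total_sortings)       # 1048576
print(round(p_value, 4))    # 0.0002, i.e. about 2 in 10,000
```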
We have just performed what is technically called a test of statistical significance. We reasoned from first principles, which made it rather tedious. In practice, research workers can call upon tables of probabilities and distributions that have been previously calculated. We therefore don't literally have to write down all possible ways in which things could have happened. But the underlying theory, the basis upon which the tables were calculated, depends, in essence, upon the same fundamental procedure. Take the events that could have been obtained and throw them down repeatedly at random. Look at the actual way the events occurred and measure how extreme it is, on the spectrum of all possible ways in which they could have been thrown down.
Notice that a test of statistical significance does not prove anything conclusively. It can't rule out luck as the generator of the result that we observe. The best it can do is place the observed result on a par with a specified amount of luck. In our particular hypothetical example, it was on a par with two out of 10,000 random guessers. When we say that an effect is statistically significant, we must always specify a so-called p-value. This is the probability that a purely random process would have generated a result at least as impressive as the actual result. A p-value of 2 in 10,000 is pretty impressive, but it is still possible that there is no genuine pattern there. The beauty of doing a proper statistical test is that we know how probable it is that there is no genuine pattern there.
Conventionally, scientists allow themselves to be swayed by p-values of 1 in 100, or even as high as 1 in 20: far less impressive than 2 in 10,000. What p-value you accept depends upon how important the result is, and upon what decisions might follow from it. If all you are trying to decide is whether it is worth repeating the experiment with a larger sample, a p-value of 0.05, or 1 in 20, is quite acceptable. Even though there is a 1 in 20 chance that your interesting result would have happened anyway by chance, not much is at stake: the error is not a costly one. If the decision is a life and death matter, as in some medical research, a much lower p-value than 1 in 20 should be sought. The same is true of experiments that purport to show highly controversial results, such as telepathy or 'paranormal' effects.
As we briefly saw in connection with DNA fingerprinting, statisticians distinguish false positive from false negative errors, sometimes called
type 1 and type 2 errors respectively. A type 2 error, or false negative, is
a failure to detect an effect when there really is one. A type 1 error, or false positive, is the opposite: concluding that there really is something going on when actually there is nothing but randomness. The p-value is the measure of the probability that you have made a type 1 error. Statistical judgement means steering a middle course between the two kinds of error. There is a type 3 error in which your mind goes totally blank whenever you try to remember which is which of type 1 and type 2. I still look them up after a lifetime of use. Where it matters, therefore, I shall use the more easily remembered names, false positive and false negative. I also, by the way, frequently make mistakes in arithmetic. In practice I should never dream of doing a statistical test from first principles as I did for the hypothetical handwriting case. I'd always look it up in a table that somebody else - preferably a computer - had calculated.
Skinner's superstitious pigeons made false positive errors. There was in fact no pattern in their world that truly connected their actions to the deliveries of the reward mechanism. But they behaved as if they had detected such a pattern. One pigeon 'thought' (or behaved as if it thought) that left stepping caused the reward mechanism to deliver. Another 'thought' that thrusting its head into the corner had the same beneficial effect. Both were making false positive errors. A false negative error is made by a pigeon in a Skinner box who never notices that a peck at the key yields food if the red light is on, but that a peck when the blue light
is on punishes by switching the mechanism off for ten minutes. There is a genuine pattern waiting to be detected in the little world of this Skinner box, but our hypothetical pigeon does not detect it. It pecks indiscriminately to both colours, and therefore gets a reward less frequently than it could.
A false positive error is made by a farmer who thinks that sacrificing to the gods brings longed-for rain. In fact, I presume (although I haven't investigated the matter experimentally), there is no such pattern in his world, but he does not discover this and persists in his useless and wasteful sacrifices. A false negative error is made by a farmer who fails to notice that there is a pattern in the world relating manuring of a field to the subsequent crop yield of that field. Good farmers steer a middle way between type 1 and type 2 errors.
It is my thesis that all animals, to a greater or lesser extent, behave as intuitive statisticians, choosing a middle course between type 1 and type 2 errors. Natural selection penalizes both type 1 and type 2 errors, but the penalties are not symmetrical and no doubt vary with the different ways of life of species. A stick caterpillar looks so like the twig it is sitting on that we cannot doubt that natural selection has shaped it to resemble a twig. Many caterpillars died to produce this beautiful result. They died because they did not sufficiently resemble a twig. Birds, or other predators, found them out. Even some very good twig mimics must have been found out. How else did natural selection push evolution towards the pitch of perfection that we see? But, equally, birds must many times have missed caterpillars because they resembled twigs, in some cases only slightly. Any prey animal, no matter how well camouflaged, can be detected by predators under ideal seeing conditions. Equally, any prey animal, no matter how poorly camouflaged, can be missed by predators under bad seeing conditions. Seeing conditions can vary with angle (a predator may spot a well-camouflaged animal when looking straight at it, but will miss a poorly camouflaged animal out of the corner of its eye). They can vary with light intensity (a prey may be overlooked at twilight, whereas it would be seen at noon). They can vary with distance (a prey which would be seen at six inches range may be overlooked at a range of 100 yards).
Imagine a bird cruising around a wood, looking for prey. It is surrounded by twigs, a very few of which might be edible caterpillars. The problem is to decide. We can assume that the bird could guarantee to tell whether an apparent twig was actually a caterpillar if it approached the twig really close and subjected it to a minute, concentrated examination in a good light. But there isn't time to do that for all twigs. Small birds with high turnover metabolism have to find food alarmingly often in order to stay alive. Any bird that scanned every individual twig with the equivalent of a magnifying glass would die of starvation before it found its first caterpillar. Efficient searching demands a faster, more cursory and rapid scanning, even though this carries a risk of missing some food. The bird has to strike a balance. Too cursory and it will never find anything. Too detailed and it will detect every caterpillar it looks at, but it will look at too few, and starve.
It is easy to apply the language of type 1 and type 2 errors. A false negative is committed by a bird that sails by a caterpillar without giving it a closer look. A false positive is committed by a bird that zooms in on a suspected caterpillar, only to discover that it is really a twig. The penalty for a false positive is the time and energy wasted flying in for the close inspection: not serious on any one occasion, but it could mount up fatally. The penalty for a false negative is missing a meal. No bird outside Cloud Cuckooland can hope to be free of all type 1 and type 2 errors. Individual birds will be programmed by natural selection to adopt some compromise policy calculated to achieve an optimum intermediate level of false positives and false negatives. Some birds may be biased towards type 1 errors, others towards the opposite extreme. There will be some intermediate setting which is best, and natural selection will steer evolution towards it.
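The bird's compromise can be put into a toy model with invented numbers. Suppose each object the bird glances at yields a quick 'caterpillar-likeness' score: twigs score low on average, caterpillars higher, with overlap. The bird flies in to inspect anything above a threshold; every inspection costs energy, and a found caterpillar yields a meal. All the distributions, costs and counts below are hypothetical, chosen only to show that both extreme policies do badly and some intermediate threshold is best.

```python
import random

random.seed(2)

# A wood containing mostly twigs and a few caterpillars. Twig scores
# cluster around 0, caterpillar scores around 2, with overlap.
objects = ([("twig", random.gauss(0.0, 1.0)) for _ in range(5000)] +
           [("caterpillar", random.gauss(2.0, 1.0)) for _ in range(50)])

MEAL = 10.0             # energy gained from one caterpillar (invented)
INSPECTION_COST = 0.2   # energy spent flying in for a close look (invented)

def payoff(threshold):
    """Net energy for a bird that inspects every object scoring
    above the threshold. A bird that inspects nothing gains nothing
    and starves; a bird that inspects everything wastes energy on
    thousands of twigs (false positives)."""
    energy = 0.0
    for kind, score in objects:
        if score > threshold:
            energy -= INSPECTION_COST
            if kind == "caterpillar":
                energy += MEAL
    return energy

for t in (-1.0, 0.0, 1.0, 2.0, 3.0, 4.0):
    print(t, round(payoff(t), 1))
```

Running the sweep shows net energy peaking at an intermediate threshold: too low and the inspection costs swamp the meals, too high and the bird overlooks nearly every caterpillar (false negatives).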
Which intermediate setting is best will vary from species to species. In
our example it will also depend upon conditions in the wood, for example, the size of the caterpillar population in relation to the number of twigs. These conditions may change from week to week. Or they may vary from wood to wood. Birds may be programmed to learn to adjust their policy
as a result of their statistical experience. Whether they learn or not, successfully hunting animals must usually behave as if they are good statisticians. (I hope it is not necessary, by the way, to plod through the usual disclaimer: no, no, the birds aren't consciously working it out with calculator and probability tables. They are behaving as if they were calculating p-values. They are no more aware of what a p-value means than you are aware of the equation for a parabolic trajectory when you catch a cricket ball or baseball in the outfield.)
Angler fish take advantage of the gullibility of little fish such as gobies. But that is an unfairly value-laden way of putting it. It would be better not to speak of gullibility and say that they exploit the inevitable difficulty the little fish have in steering between type 1 and type 2 errors. The little fish themselves need to eat. What they eat varies, but it often includes small wriggling objects such as worms or shrimps. Their eyes and nervous systems are tuned to wriggling things. They look for wriggling movement and if they see it they pounce. The angler fish exploits this tendency. It has a long fishing rod, evolved from a modified spine, commandeered by natural selection from its original location at the front of the dorsal fin. The angler fish itself is highly camouflaged and it sits motionless on the sea bottom for hours at a time, blending perfectly with the weeds and rocks. The only part of it which is conspicuous is a 'bait', which looks like a worm, a shrimp or a small fish, at the end of its fishing rod. In some deep-sea species the bait is even luminous. In any case, it seems to wriggle like something worth eating
when the angler waves its rod. A possible prey fish, say a goby, is attracted. The angler 'plays' its prey for a little while to hook its attention, then casts the bait down into the still unsuspected region in front of its own invisible mouth, and the little fish often follows. Suddenly that huge mouth is invisible no longer. It gapes massively, there is a violent inrushing of water, engulfing every floating object in the vicinity, and the little fish has pursued its last worm.
From the point of view of a hunting goby, any worm may be overlooked or it may be seen. Once the worm has been detected, it may turn out to be a real worm or an angler fish's lure, and the unfortunate fish is faced with a dilemma. A false negative error would be to refrain from attacking a perfectly good worm for fear that it might be an angler fish lure. A false positive error would be to attack a worm, only to discover that it is really a lure. Once again, it is impracticable in the real world to get it right all the time. A fish that is too risk-averse will starve because it never attacks worms. A fish that is too foolhardy won't starve but it may be eaten. The optimum in this case may not be halfway between. More surprisingly, the optimum may be one of the extremes. It is possible that angler fish are sufficiently rare that natural selection favours the extreme policy of attacking all apparent worms. I am fond of a remark of the philosopher and psychologist William James on human angling:
There are more worms unattached to hooks than impaled upon them; therefore, on the whole, says Nature to her fishy children, bite at every worm and take your chances. (1910)
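James's advice can be put into numbers (hypothetical ones). Attacking an apparent worm yields a meal if it is real, but risks death if it is a lure. If lures are rare enough, the expected value of attacking every worm is positive even with a very large death penalty, which is why the optimum can sit at an extreme rather than halfway between.

```python
MEAL = 1.0         # energy from a real worm (invented units)
DEATH_COST = 100.0 # penalty for attacking a lure (invented units)

def expected_value_of_attack(lure_fraction):
    """Expected payoff of attacking an apparent worm, given the
    fraction of apparent worms that are really anglers' lures."""
    return (1 - lure_fraction) * MEAL - lure_fraction * DEATH_COST

print(expected_value_of_attack(0.001))  # lures very rare: attack pays
print(expected_value_of_attack(0.05))   # lures common: caution pays
```

With these invented figures, attacking every worm is worthwhile whenever fewer than about 1 in 100 apparent worms is a lure, so 'bite at every worm and take your chances' can indeed be the best policy Nature offers her fishy children.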
Like all other animals, and even plants, humans can and must behave as intuitive statisticians. The difference with us is that we can do our calculations twice over. The first time intuitively, as though we were birds or fish. And then again explicitly, with pencil and paper or computer. It is tempting to say that the pencil and paper way gets the right answer, so long as we don't make some publicly detectable blunder like adding in the date, whereas the intuitive way may yield the wrong answer.