ESP and the significance of significance

As of the 23rd May 2022 this website is archived and will receive no further updates. It was produced by the Winton programme for the public understanding of risk, based in the Statistical Laboratory in the University of Cambridge. The aim was to help improve the way that uncertainty and risk are discussed in society, and to show how probability and statistics can be both useful and entertaining.

Many of the animations were produced using Flash and will no longer work.

A controversy about experiments in extra-sensory perception throws some light, and maybe some confusion, on the idea of statistical significance. This article discusses a common misinterpretation of the results of significance tests, and investigates some criticisms of significance tests in general.

A psychological dust-up

Recently (March 2011), the highly respected Journal of Personality and Social Psychology published in its printed edition a new paper by the distinguished psychologist Daryl Bem, professor emeritus at Cornell University in the USA. (The paper is on the journal’s website, behind a paywall, or there is a version on Bem’s own website.) You might be wondering what’s unusual about that, and why it’s relevant to a website about uncertainty.

The answer is that the paper reports a series of experiments which, Professor Bem claims, provide evidence for some types of extra-sensory perception (ESP). These can occur only if the generally accepted laws of physics are not all true. That’s a pretty challenging claim. And the claim is based largely on the results of a very common (and very commonly misunderstood) statistical procedure called significance testing.

Most psychological experiments involve people. People are variable and differ from one another in many ways. Their responses to psychological tasks will differ. Thus there is variability in the experimental results, often very considerable variability. This inevitably leads to uncertainty about the patterns and theories that lie behind the results. Psychologists make heavy use of statistical methods to help deal with this uncertainty.

Though Bem’s paper has only recently appeared in print, controversy about it has been going on for months. The journal published the paper online in January 2011, but even before that, it was available on the Internet on the author’s website. Because of the challenging nature of Bem’s claims, there was vigorous debate, not only in the blogosphere, but also in more traditional media including the science press and the New York Times. (Two New York Times articles on the issue are available here and here.)

A lot of this debate centred on the statistical methods used. Although in most ways these methods were quite standard, they had led to conclusions that many scientists simply could not believe to be true. One possible reason for that might be that there is something wrong with the statistical methods, and that’s the reason for examining them on this website. Bem’s experiments provide an excellent way into looking at how significance testing works, and at what’s problematic about it.

The Journal of Personality and Social Psychology recognised that the statistical aspects are crucial. They took the unusual step of publishing an editorial alongside Bem’s paper, explaining their reasons for publishing it, and also including in the same journal another paper by the Dutch psychologist Eric-Jan Wagenmakers and colleagues, called Why Psychologists Must Change the Way They Analyze Their Data (paywall). This paper is also available on the author’s website; a direct link is here.

Wagenmakers and colleagues argue that other ways of analysing Bem’s data, particularly a certain Bayesian approach, would have led to other results that did not provide evidence for ESP. Hence, Wagenmakers argues, what Bem’s paper really shows is that psychologists should change the way they do experiments and in particular the way they analyse the resulting data. That is, he argues that there’s something badly wrong with the way psychologists usually deal with uncertainty.

Needless to say, the controversy hasn’t stopped there. Daryl Bem, with statistician colleagues Jessica Utts and Wesley Johnson of the University of California, has made available here a response to Wagenmakers and his colleagues, arguing that there are things wrong with their Bayesian analysis. Wagenmakers and his colleagues have responded here. Several other researchers and commentators have joined in too, on statistical, psychological and other issues. This one will run and run.

However, it’s not my intention here to pick apart every aspect of this controversy. I want to make some quite specific points. So here’s a more detailed description of part of what Bem did.

Bem’s experiments and what he did with the data

Bem’s article reports the results of nine different experiments. They differ from one another quite a lot in detail, but what they have in common is that they start with a pretty standard experimental psychology setup, and then change something so that an effect will be observed only if the laws of causality are reversed, so that something happens before the event causing it has taken place.

Let’s look at Bem’s Experiment 2. This is based on well-established psychological knowledge about perception. Images that are flashed up on a screen for an extremely short time, so short that the conscious mind does not register that they have been seen at all, can still affect how an experimental subject behaves. Such images are said to be presented subliminally, or to be subliminal images. For instance, experimental participants (people) can be “trained” to choose one item rather than another, by presenting them with a pleasant, or at any rate not unpleasant (neutral) subliminal image after they have made the “correct” choice, and a very unpleasant one after they have made the “wrong” choice.

Although in this kind of experiment the participants have no conscious recall of seeing the subliminal images, there is nothing very mysterious about this. It fits in with psychologists’ understanding of how the perceptual system works.

Bem, however, did something rather different in his Experiment 2. As in a standard experiment of this sort, his participants (undergraduate students from Cornell University) had to choose between two closely matched pictures (projected clearly on a screen, side by side). In fact the pictures were mirror images of one another. Then they were presented a neutral subliminal image if they had made the “correct” choice, and an unpleasant subliminal image if they had made the “wrong” choice. The process was then repeated with a different pair of pictures to choose between. Each participant made their choice 36 times, and there were 150 participants in all.

But the new feature of Bem’s experiment was that, when the participants made their choice between the two pictures in each pair, nobody, not the participants, not the experimenters, could know which was the “correct” choice. The “correct” choice was determined by a random mechanism, after a picture had been chosen by the participant. If this random mechanism said the participant’s choice was “correct”, they were shown the neutral subliminal image, but if it said their choice was “wrong”, they were shown the unpleasant image.

If the experiment was working as designed, and if the laws of physics, relating to causes and effects, are as we understand them, then the subliminal images could have no effect at all on the participants’ choices of picture, because at the time they made their choice, there was no “correct” image to choose. Which image was “correct” was determined afterwards. Therefore, given the way the experiment was designed - I have not explained every detail - one would expect each participant to be “correct” in their choice half the time, on average. Because of random variability, some would get more than 50% right, some would get less, but on average, people would make the right choice 50% of the time.

What Bem found was that the average percentage correct, across his 150 participants, was not 50%. It was slightly higher: 51.7%.

There are several possible explanations for this finding, including the following.

  1. The rate was higher than 50% just because there is random variability, both in the way people respond and in the way the “correct” image was selected. That is, nothing very interesting happened.
  2. The rate was higher than 50% because the laws of cause and effect are not as we understand them conventionally, and somehow the participants could know something about which picture was “correct” before the random system had decided which was “correct”.
  3. The rate was higher than 50% because there was something wrong with the experimental setup, and in fact the participants could get an idea about which picture was “correct” when they made their choice, without the laws of cause and effect being broken.
  4. More subtly, these results are not typical, in the sense that actually more experiments were done than are reported in the paper, and the author chose to report the results that were favourable to the hypothesis that something happened that casts doubt on the laws of cause and effect, and not to report the others. Or perhaps more and more participants kept being added to the experiment until the results happened to look favourable to that hypothesis.

I won’t consider all these in detail. Various forms of Explanation 4 have been raised by some of the critics of Bem’s work. (The last possibility it mentions, that the experiment is continued until things look favourable enough to the ESP hypothesis, is rather subtle. Behind it lies a mathematical result which ensures that, even if all that is going on is chance variability, if you go on adding experimental participants long enough, you will be sure to reach a stage (possibly after a very long time) where it looks as if something beyond chance is involved.) However, Bem himself, in the original paper and in subsequent discussion, is adamant that this selection of favourable results did not occur. We won’t investigate Explanation 4 any further here.
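The point about adding participants until the results look favourable can be illustrated with a small simulation. This is only a sketch under assumptions of my own, and it is not a model of anything Bem did: the scores are made-up normal values with a known standard deviation of 1, the test is a one-sided z test rather than a t test, and the function names and settings (`simulate`, `n_min`, `n_max`) are hypothetical. In every simulated experiment the null hypothesis is true.

```python
import math
import random

def one_sided_p(sample_mean, n, sd=1.0):
    """One-sided p value for 'true mean is 0' against 'true mean > 0' (known sd)."""
    z = sample_mean * math.sqrt(n) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

def simulate(n_experiments=2000, n_min=10, n_max=200, alpha=0.05, seed=1):
    rng = random.Random(seed)
    fixed_hits = 0    # test once only, after all n_max participants
    peeking_hits = 0  # test after every participant from n_min onwards
    for _ in range(n_experiments):
        total = 0.0
        looked_significant = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)  # pure chance: there is no real effect
            if n >= n_min and not looked_significant:
                if one_sided_p(total / n, n) < alpha:
                    looked_significant = True  # the experimenter would stop here
        if looked_significant:
            peeking_hits += 1
        if one_sided_p(total / n_max, n_max) < alpha:
            fixed_hits += 1
    return fixed_hits / n_experiments, peeking_hits / n_experiments

fixed_rate, peeking_rate = simulate()
print(f"significant with a fixed sample size: {fixed_rate:.3f}")
print(f"significant at some point while adding participants: {peeking_rate:.3f}")
```

With a fixed sample size, the proportion of “significant” results should sit close to the nominal 5%. Testing after every new participant and stopping at the first small p value gives a noticeably higher proportion, even though chance alone is operating throughout.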

Explanation 3 is important, because the setup for this experiment and all the others had to be reasonably complicated to avoid this kind of bias, and it remains possible that some aspect was not dealt with properly. But whether Explanation 3 is plausible is not, mostly, a matter of uncertainty or randomness, so again I won’t pursue these details further.

Instead I’ll concentrate on how and why Bem ruled out Explanation 1 - or to be more precise, why he decided it was not a likely explanation.

He carried out a significance test. Actually he made several different significance tests, making slightly different assumptions in each case, but they all led to more or less the same conclusion, so I’ll discuss only the simplest. He did what’s known as a t test (more precisely, a one-sided (or one-tailed) one-sample t test). The resulting p value was 0.009. Because this value is small, he concluded that Explanation 1 (“it’s all just chance and random variability”) was not appropriate, and that the result was “statistically significant”.

This is a standard statistical procedure, very commonly used. But what does it actually mean, and what is this “p value”?

All significance tests involve a null hypothesis, which (typically) is a statement that nothing very interesting has happened. In a test comparing the effects of two drugs, the usual null hypothesis would be that, on average, the drugs do not differ in their effects. In a test of whether a particular coin is fair, the null hypothesis would be that its chance of coming up Heads when tossed is 50%. In Bem’s Experiment 2, the null hypothesis is that Explanation 1 is true - that is, that the true average proportion of “correct” answers is 50%, and any difference from 50% that is observed is simply due to random variability.

The p value for the test is found as follows. One assumes that the null hypothesis is really true. One then calculates the probability of observing the data that were actually observed, or something more extreme, under this assumption. That probability is the p value. So in this case, Bem used standard methods to calculate the probability of getting an average proportion correct of 51.7%, or greater, on the assumption that all that was going on was chance variability. He found this probability to be 0.009. (I’m glossing over a small complication here, that in this case the definition of “more extreme” depends on the variability of the data as well as their average, but that’s not crucial to the main ideas.)
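To make the recipe concrete, here is a simplified version of the calculation. It is not Bem’s t test. As an assumption of my own, it treats all 150 × 36 = 5400 choices as independent coin tosses and takes the number of “correct” choices to be 51.7% of 5400 rounded to a whole number. It then computes exactly the probability of getting that many correct choices, or more, when each choice has a 50% chance of being “correct”.

```python
from math import comb

n_trials = 150 * 36  # 150 participants, 36 choices each (from the article)
# Assumption: 51.7% of the trials, rounded to a whole number of correct choices.
n_correct = round(0.517 * n_trials)

# Null hypothesis: each choice is "correct" with probability 1/2.
# p value = P(at least n_correct correct choices out of n_trials).
tail_count = sum(comb(n_trials, k) for k in range(n_correct, n_trials + 1))
p_value = tail_count / 2 ** n_trials

print(f"correct choices assumed: {n_correct} out of {n_trials}")
print(f"one-sided exact binomial p value: {p_value:.4f}")
```

Because the assumptions differ from those of the t test, the result is a small probability of the same order as the 0.009 that Bem reports, but not identical to it. The structure is the same, though: assume the null hypothesis, then add up the probabilities of the observed result and of everything more extreme.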

Well, 0.009 is quite a small probability. So we have two possibilities here. Either the null hypothesis really is true, but nevertheless an unlikely event has occurred. Or the null hypothesis isn’t true. Since unlikely events do not occur often, we should at least entertain the possibility that the null hypothesis isn’t true. Other things being equal (which they usually aren’t), the smaller the p value is, the more doubt it casts on the null hypothesis. How small the p value needs to be in order for us to conclude that there’s something really dubious about the null hypothesis (and hence that, in the jargon, the result is statistically significant and the null hypothesis is rejected) depends on circumstances. Sometimes the values of 0.05 or 0.01 are used as boundaries, and a p value less than that would be considered a significant result.

This line of reasoning, though standard in statistics, is not at all easy to get one’s head round. In my view, anyone who thinks they understood what a p value is, the first time they met the idea, has actually misunderstood. (No doubt this is an overgeneralisation.) The whole idea is expressed very pedantically. Assume the null hypothesis is true, then under this assumption, calculate the probability, not of the data that were observed, but of observing data at least that extreme. All that is a bit of a mouthful. But, in order to make sense of what the p value is actually telling you, the pedantic details are all needed.

In an experiment like this one often sees the p value interpreted as follows.

(a) “The p value is 0.009. The probability of getting these results by chance is 0.009.”

That is WRONG. Or, at least, wrong in the way it’s usually understood.

A better way of putting it would be

(b) “The p value is 0.009. The probability of getting these results, by chance, is 0.009.”

How pedantic is that! Just adding commas makes a big difference.

What’s wrong with (a) is that people usually take it to mean something like the following: “Given that we’ve got these results, the probability that chance alone is operating is 0.009.” Or, putting it another way: “Given that we’ve got these results, the probability that the null hypothesis is true is 0.009.” But that’s not what is meant at all. Version (b) indicates more clearly that it’s the other way round: “Given that chance alone is operating, the probability of getting results like these is 0.009”, or “Given that the null hypothesis is true, the probability of getting results like these is 0.009.” That’s not quite the whole picture, because it doesn’t include the part about “results at least as extreme as these”, but it’s close enough for most purposes.

So the difference between (a) and (b), as they are usually understood, is that (a) says “Given that we’ve got these results, the probability that the null hypothesis is true is 0.009”, and (b) says “Given that the null hypothesis is true, the probability of getting results like these is 0.009”. These differ in that the “given” part and the part that has probability of 0.009 are swapped round.

It may well not be obvious why that matters. The point is that the answers to the two questions “Given A, what’s the probability of B?” and “Given B, what’s the probability of A” might be quite different. An example I sometimes use to explain this is to imagine that you’re picking a random person off the street in London. Given that the person is a Member of (the UK) Parliament, what’s the probability that they are a British citizen? Well that probability would be high. If after choosing this random person, it turned out they were an MP, then it’s very likely that they are a British citizen. Actually one doesn’t have to be a British citizen to be a UK MP; Irish and Commonwealth citizens are eligible as well. But in fact most MPs are British citizens, so the probability is high.

What about the other way round? Given that this random person is a British citizen, what’s the probability that they are an MP? I hope it’s clear to you that that probability would be very low. The great majority of the British citizens in London are not MPs.

So it’s fairly obvious (I hope!) that the answers to “Given this person is an MP, what’s the probability they are a British citizen?” and “Given this person is a British citizen, what’s the probability they are an MP?” cannot be the same. Swapping the “given” part and the probability changes things. And it’s just the same with p values. In Bem’s Experiment 2, the fact that the p value is 0.009 does not mean that “Given the results, the probability that only chance is operating is 0.009”. The p value does not tell you what this probability is.
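The swap can also be seen with numbers. The counts below are invented purely for illustration (they are not real figures for London or for Parliament), and the variable names are hypothetical. The same table of counts gives two very different conditional probabilities, depending on which part is “given”.

```python
# Invented counts of people who might be picked off the street; illustrative only.
british_mps = 600
non_british_mps = 10
british_non_mps = 6_000_000
non_british_non_mps = 3_000_000

# Given that the person is an MP, how likely are they to be a British citizen?
p_british_given_mp = british_mps / (british_mps + non_british_mps)

# Given that the person is a British citizen, how likely are they to be an MP?
p_mp_given_british = british_mps / (british_mps + british_non_mps)

print(f"P(British citizen | MP) = {p_british_given_mp:.3f}")
print(f"P(MP | British citizen) = {p_mp_given_british:.6f}")
```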

Nevertheless, Bem is interpreting his significance test in the commonly used way when he deduces, from a p value of 0.009, that the result is “significant” and that there may well be more going on than simply the effects of chance. But, just because this is common, that doesn’t mean it’s always correct.

Of the nine experiments that Bem reports, he found “significant” p values in all but one (where he seems to be taking “significant” as meaning “p less than 0.05” - the significant values range from 0.002 to 0.039, and the one he regards as non-significant has a p value of 0.096). Conventionally, these would indeed usually be taken as Bem interprets them, as pointing in the direction that most of the null hypotheses are probably not true, which in the case of these experiments means that there is considerable evidence of the laws of physics not applying everywhere. Can that really be the case?

What’s wrong with p values?

Despite the very widespread use of statistical significance testing (particularly in psychology), Bem’s work is very far from being the first situation in which doubt has been cast on the significance test approach. Significance testing has been heavily criticised, by psychologists themselves as well as by some statisticians and other scientists.

One common criticism relates to the central role of the null hypothesis. In many circumstances, it’s simply not plausible that the null hypothesis would be exactly true. For instance, in comparing two drugs, it’s not really credible that they would be exactly equal in their average effects. It’s practically certain that there would be some difference between them, even if that difference is very tiny. The interesting question is what the size of the average difference is, not whether the difference is exactly zero. Thus many statisticians and scientists would recommend that significance testing is routinely replaced by some sort of procedure to estimate the size of differences, and to give some idea of the possible variability of the difference.
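As a sketch of what such an estimation procedure might look like, the code below computes an approximate 95% confidence interval for a success rate. It uses the figures from Bem’s Experiment 2 under simplifying assumptions of my own: the 5400 choices are treated as independent, the count of correct choices is taken as 51.7% of 5400 rounded to a whole number, and the interval uses the usual normal approximation. It is not an analysis that appears in Bem’s paper.

```python
import math

n_trials = 150 * 36  # 150 participants, 36 choices each
# Assumption: 51.7% of the trials, rounded to a whole number of correct choices.
n_correct = round(0.517 * n_trials)

rate = n_correct / n_trials
standard_error = math.sqrt(rate * (1 - rate) / n_trials)

z95 = 1.96  # normal quantile for an approximate 95% interval
lower = rate - z95 * standard_error
upper = rate + z95 * standard_error

print(f"estimated rate: {rate:.4f}")
print(f"approximate 95% interval: {lower:.4f} to {upper:.4f}")
```

An interval like this reports how big the effect might plausibly be, rather than only whether it differs from 50%.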

In fact, however, research on ESP is (arguably) one area where this particular criticism does not carry so much force. If the laws of physics are operating as we understand them, then the null hypothesis simply has to be true. In Bem’s Experiment 2, the average rate of “correct” answers has to be 50%, with any deviation from that arising solely by chance. So in this situation, maybe the general approach of looking at an exact null hypothesis is justified. (This leaves out the possibility that there may still be some fault with the experimental setup, so that the rate would differ from 50% without the laws of physics being broken - but let’s ignore that for now.)

If we allow, for now, that it makes sense to think about an exact null hypothesis, that doesn’t bring the criticisms of significance testing to an end. For instance, if the size of the sample involved is large, a significance test may lead us to reject the null hypothesis even though the effect involved is too small to be of any scientific or practical interest.
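The effect of sample size is easy to see in a sketch. Suppose, hypothetically, that the observed rate of “correct” answers is 50.1%, a difference from 50% far too small to matter for most purposes. The function below (its name is my own, and it uses a normal approximation to the one-sided test of a 50% rate) shows how the p value for that same observed rate changes as the number of trials grows.

```python
import math

def one_sided_p_for_rate(observed_rate, n_trials, null_rate=0.5):
    """Approximate one-sided p value for H0: rate = null_rate (normal approximation)."""
    standard_error = math.sqrt(null_rate * (1 - null_rate) / n_trials)
    z = (observed_rate - null_rate) / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2))

# Same tiny observed effect, increasingly large (hypothetical) samples.
for n_trials in (1_000, 100_000, 10_000_000):
    p = one_sided_p_for_rate(0.501, n_trials)
    print(f"n = {n_trials:>10,}: p = {p:.3g}")
```

With a thousand trials the result is nowhere near significant; with ten million trials the same 50.1% gives a minute p value, although the size of the effect has not changed at all.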

This is just one aspect of what might be seen as a gap in the logic of significance testing. Remember the conclusion that was reached when the p value in Bem’s experiment was 0.009? “Either the null hypothesis really is true, but nevertheless an unlikely event has occurred. Or the null hypothesis isn’t true.” This is OK as far as it goes, but it says nothing about how likely the data were if the null hypothesis isn’t true. Maybe they are still unlikely even if the null hypothesis is false. Surely we can’t just throw out the null hypothesis without further investigation of what the probabilities are if the null hypothesis is false?

This is an issue that must be dealt with if one is trying to use the test to decide whether the null hypothesis is true or false. It should be said that the great statistician and geneticist R.A. Fisher, who invented the notion of significance testing, would simply not have used the result of a significance test on its own to decide whether a null hypothesis is true or false - he would have taken other relevant circumstances into account. (Fisher did change his mind on several statistical things during his life, but the view I attribute to him here is how he described things in his later writings at least.) But unfortunately not every user of significance tests follows Fisher’s approach.

The usual way to deal with the situation where the null hypothesis is false is to define a so-called alternative hypothesis. In the case of the Bem experiment, this would be the hypothesis that the average rate of correct answers is greater than 50%. (His analysis effectively doesn’t take account of the possibility that the rate might be less than 50%. It’s either exactly 50% (the null hypothesis) or it’s greater than 50% (the alternative hypothesis). However, putting in the possibility of a rate below 50% doesn’t change the essential point here.)

You might think that we could just deal with the alternative hypothesis in the same way that the null hypothesis is dealt with. That is, we could just calculate the probability of getting the data Bem did get, on the assumption that the alternative hypothesis is true. But there’s a snag. The alternative hypothesis simply says that the average rate is more than 50%. It doesn’t say how much more than 50%. If the real average rate were, let’s say, 99%, then getting an observed rate of 51.7% (as Bem did) isn’t very likely, but if the real average rate were 51.5%, then getting an observed rate of 51.7% is quite likely. But real averages of 99% and of 51.5% are both covered by the alternative hypothesis. So this isn’t going to get us off the hook.
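The snag shows up clearly in a rough calculation. As before, this is a simplification of my own rather than anything in Bem’s paper: the 5400 choices are treated as independent, and the observed count is assumed to be 2792 (51.7% of 5400, rounded). The code compares the log-probability of that observed count under three possible true rates, two of which belong to the alternative hypothesis.

```python
import math

def log_binom_pmf(k, n, theta):
    """Log of the binomial probability of k successes in n trials with rate theta."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(theta) + (n - k) * math.log(1 - theta)

n_trials = 5400
n_correct = 2792  # assumption: 51.7% of 5400, rounded

for true_rate in (0.5, 0.515, 0.99):
    log_prob = log_binom_pmf(n_correct, n_trials, true_rate)
    print(f"true rate {true_rate:.3f}: log-probability of the data = {log_prob:.1f}")
```

A true rate of 51.5% makes the observed data more probable than the null value of 50% does, while a true rate of 99% makes them vastly less probable. Both 51.5% and 99% are “the alternative hypothesis”, so there is no single probability of the data under it.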

Let’s do Bayes

One possibility is to meet head on the issue of misinterpreting the p value. I said that many people think that a p value of 0.009 actually means that the probability that the null hypothesis is true, given the data that were observed, is 0.009. Well, I explained why that’s not correct - but why do people interpret it that way? In my view it’s because what people actually want to know is how likely it is that the null hypothesis is true, given the data that were observed. The p value does not tell them this, so they just act as if it did.

The p value does not tell people what they want to know, because in order to find the probability that the null hypothesis is true (given the data), one needs to take a Bayesian approach to statistics. There’s more than one way to do that, but the one I’ll describe is as follows. (This involves more mathematical notation than we’ve had so far.)

It uses the odds form of Bayes’ Theorem, the theorem behind Bayesian approaches to statistics. There’s much more on this in our article on The Maths of Paul the “Psychic” Octopus. The odds form of Bayes’ Theorem, as I’ll use it here, says that

$$ \frac{ p( \hbox{alternative hypothesis} | \hbox{data} )}{ p( \hbox{null hypothesis} | \hbox{data} )} =
\frac{p( \hbox{data}| \hbox{alternative hypothesis})}{ p(\hbox{data}| \hbox{null hypothesis} )} \times \frac{ p( \hbox{alternative hypothesis})}{ p(\hbox{null hypothesis})}. $$

In this expression, $$ \frac{ p(\hbox{alternative hypothesis} | \hbox{data} )}{ p( \hbox{null hypothesis} | \hbox{data} )}$$ is known as the posterior odds for the alternative hypothesis.

We get to it by multiplying $$ \frac{ p( \hbox{alternative hypothesis})}{ p(\hbox{null hypothesis})},$$ the prior odds for the alternative hypothesis, by $$\frac{p( \hbox{data}| \hbox{alternative hypothesis})}{ p(\hbox{data}| \hbox{null hypothesis} )},$$ a quantity which is known as the Bayes factor.

Let’s look at this a bit more closely. We’re trying to find the probability that I said people really seem to want to know, the probability that the null hypothesis is true, given the data, or $p(\hbox{null hypothesis} | \hbox{data} )$. Now if the null hypothesis isn’t true, the alternative hypothesis must be true, so that $p( \hbox{alternative hypothesis} | \hbox{data} )= 1-p( \hbox{null hypothesis} | \hbox{data} )$.

This means that the posterior odds for the alternative hypothesis is actually $$ \frac{ 1 - p( \hbox{null hypothesis} | \hbox{data} )}{ p( \hbox{null hypothesis} | \hbox{data} )},$$ and if you know these posterior odds, it’s straightforward to work out $p(\hbox{null hypothesis} | \hbox{data} )$. If the posterior odds is 1, for instance, then $p(\hbox{null hypothesis} | \hbox{data} )=1/2$, and if the posterior odds is 5, then $p(\hbox{null hypothesis} | \hbox{data} )$ is $1/6$.
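The conversion from posterior odds to a probability is a one-line calculation: if the posterior odds for the alternative hypothesis are some value, then the probability of the null hypothesis given the data is 1 divided by (1 plus that value). The function name below is my own.

```python
def prob_null_from_posterior_odds(posterior_odds):
    """Convert posterior odds for the alternative into p(null hypothesis | data)."""
    return 1.0 / (1.0 + posterior_odds)

# The two cases mentioned in the text.
for odds in (1, 5):
    print(f"posterior odds {odds}: p(null | data) = {prob_null_from_posterior_odds(odds):.4f}")
```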

So far, so good. But to find the posterior odds for the alternative hypothesis, we need to know the prior odds for the alternative hypothesis, and the Bayes factor. It is the existence of the prior odds in the formula that puts some people off the Bayesian approach entirely.

The prior odds is a ratio of probabilities of hypotheses before the data have been taken into account. That is, it is supposed to reflect the beliefs of the person making the calculation before they saw any of the data, and people’s beliefs differ in a subjective way. One person may simply not believe it possible at all that ESP exists. In that case, they would say, before any data were collected, that $p( \hbox{alternative hypothesis})=0$ and $p( \hbox{null hypothesis})=1$. This means that, for this person, the prior odds for the alternative hypothesis is 0 divided by 1, which is just 0. It follows that the posterior odds must also be 0, whatever the value of the Bayes factor. Hence, for this person, $p(\hbox{alternative hypothesis} | \hbox{data} ) = 0$, whatever the data might be. This person started believing that ESP could not exist, and his or her mind cannot be changed by the data.

Another person might think it is very unlikely indeed that ESP exists, but not want to rule it out as being absolutely impossible. This may lead them to set the prior odds for the alternative hypothesis, not as zero, but as some very small number, say 1/10,000. If the Bayes factor turned out to be big enough, the posterior odds for the alternative hypothesis might nevertheless be a reasonably sized number, so that Bayes’ theorem is telling this person that, after the experiment, they should consider the alternative hypothesis to be reasonably likely.
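These two people can be put side by side in a short sketch. The Bayes factor used here is an invented, deliberately enormous number, chosen only to show the arithmetic; it is not a value calculated from Bem’s data, and the names in the code are hypothetical.

```python
def posterior_odds(prior_odds, bayes_factor):
    """Odds form of Bayes' theorem: posterior odds = Bayes factor x prior odds."""
    return prior_odds * bayes_factor

hypothetical_bayes_factor = 1_000_000  # invented for illustration only

people = {
    "certain that ESP is impossible": 0.0,
    "very sceptical, but open-minded": 1 / 10_000,
}

for description, prior in people.items():
    odds = posterior_odds(prior, hypothetical_bayes_factor)
    p_alternative = odds / (1 + odds)
    print(f"{description}: posterior odds = {odds:g}, "
          f"p(alternative | data) = {p_alternative:.3f}")
```

Prior odds of zero stay at zero however large the Bayes factor is, whereas prior odds of 1/10,000 can be turned into substantial posterior odds by sufficiently strong evidence.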

Thus different people can look at the same data and come to different conclusions about how likely it is that the null hypothesis (or the alternative hypothesis) is true. Also, the probability that the null hypothesis is true might or might not be similar to the p value — it all depends on the prior odds as well as on the Bayes factor. (In many cases, actually it turns out that the probability of the null hypothesis is very different from the p value for a wide range of plausible values of the prior odds, drawing attention yet again to the importance of being pedantic when saying what the p value is the probability of. This phenomenon is sometimes called Lindley’s paradox, after the renowned Bayesian statistician Dennis Lindley who drew attention to it.)

You might think that the issue of people having different prior odds could be avoided by concentrating on the Bayes factor. If I could tell you the Bayes factor for one of Bem’s experiments, you could decide what your prior odds were, and multiply them by the Bayes factor to give your own posterior odds.

In fact, the criticism by Wagenmakers and his colleagues of Bem’s work takes exactly that line, of concentrating on the Bayes factors. For Bem’s Experiment 2, for instance, Wagenmakers and colleagues calculate the Bayes factor as about 1.05. Therefore the posterior odds for the alternative hypothesis are not much larger than the prior odds, or putting it another way, they would say that the data provide very little information to change one’s prior views. They therefore conclude that this experiment provides rather little evidence that ESP exists, and certainly not enough to overturn the established laws of physics. They come to similar conclusions about several more of Bem’s experiments, and for others, they calculate the Bayes factor as being less than 1. In these cases, the posterior odds for the alternative hypothesis will be smaller than the prior odds; that is, one should believe less in ESP after seeing the data than one did beforehand.

Well, that’s an end of it, isn’t it? Despite all Bem’s significance tests, the evidence provided by his experiments for the existence of ESP is either weak or non-existent.

But no, it’s not quite as simple as that. The trouble is that there’s more than one way of calculating a Bayes factor. Remember that the Bayes factor is defined as $$\frac{p( \hbox{data}| \hbox{alternative hypothesis})}{ p(\hbox{data}| \hbox{null hypothesis} )}.$$ It’s reasonably straightforward to calculate $p(\hbox{data}| \hbox{null hypothesis} )$, but $p(\hbox{data}| \hbox{alternative hypothesis} )$ is harder. The alternative hypothesis includes a range of values of the quantity of interest. In Bem’s Experiment 2, it includes the possibility that the average percentage correct is 50.0001%, or that it is 100%, or anything in between. Different individuals will have different views on which values in this range are most likely. Putting it another way, the Bayes factor also depends on subjective prior opinions. Avoiding the issue of the prior odds, by concentrating on the Bayes factor, has not made the subjectivity go away.
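A rough calculation shows how much the choice can matter. This sketch is not the method used by Wagenmakers and colleagues, nor by Bem, Utts and Johnson. Under assumptions of my own, it treats the 5400 choices as independent, takes the observed count as 2792 (51.7% of 5400, rounded), and compares two hypothetical opinions about the true rate under the alternative hypothesis: that it is equally likely to be anywhere between 50% and 100%, or that it is equally likely to be anywhere between 50% and 55%.

```python
import math

def log_binom_pmf(k, n, theta):
    """Log of the binomial probability of k successes in n trials with rate theta."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_comb + k * math.log(theta) + (n - k) * math.log(1 - theta)

def averaged_likelihood(k, n, low, high, steps=20_000):
    """p(data | alternative) with a uniform prior on (low, high), by the midpoint rule."""
    width = (high - low) / steps
    total = 0.0
    for i in range(steps):
        theta = low + (i + 0.5) * width
        total += math.exp(log_binom_pmf(k, n, theta))
    return total / steps  # average of the likelihood over the prior range

n_trials = 5400
n_correct = 2792  # assumption: 51.7% of 5400, rounded

p_data_given_null = math.exp(log_binom_pmf(n_correct, n_trials, 0.5))
bf_wide = averaged_likelihood(n_correct, n_trials, 0.5, 1.0) / p_data_given_null
bf_narrow = averaged_likelihood(n_correct, n_trials, 0.5, 0.55) / p_data_given_null

print(f"Bayes factor, rate anywhere in 50%-100%: {bf_wide:.2f}")
print(f"Bayes factor, rate anywhere in 50%-55%:  {bf_narrow:.2f}")
```

The same data give a Bayes factor below 1 under the first opinion and well above 1 under the second. The first spreads belief over large effects that the data rule out, so it is penalised; the second concentrates belief on small effects, which is where the data lie.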

Wagenmakers and his colleagues used one standard method of calculating the Bayes factors for Bem’s experiment, but this makes assumptions that Bem (with Utts and Johnson) disagrees with. For Experiment 2 (where Wagenmakers and colleagues found a Bayes factor of 1.05, you will recall), Bem, Utts and Johnson calculate Bayes factors by four alternative methods, all of which they claim to be more appropriate than that of Wagenmakers, and they result in Bayes factors ranging from 2.04 to 6.09.

They calculate Bayes factors around 2 when they make what they describe as “sceptical” prior assumptions, that is, assumptions which they feel would be made by a person who is sceptical about the possibility of ESP, though evidently not quite as sceptical as Wagenmakers. The Bayes factor of about 6 comes from assumptions that Bem and his colleagues regard as being based on appropriate prior knowledge about the size of effect typically observed in psychological experiments of this general nature.

These Bayes factors indicate considerably greater evidence in favour of ESP, from the same data, than Wagenmakers considered to be the case. Bem, Utts and Johnson report similar results for the other experiments.

Since then, Wagenmakers has come back with further arguments as to why his original approach is better (and as to why Bem, Utts and Johnson’s “knowledge-based” assumptions aren’t based on the most appropriate knowledge). Who is right?

Well, in my view that’s the wrong question. It would be very nice if experiments like these could clearly and objectively establish, one way or the other, whether ESP can exist. But the hope that the data can “speak for themselves”, in an unambiguous way, is in vain. If there is really some kind of ESP effect, it is not large (otherwise it would have been discovered years ago). So any such effect must be relatively small, and thus not straightforward to observe against the inevitable variability between individual experimental participants. Since the data are therefore not going to provide really overwhelming evidence one way or the other, it seems to me that people will inevitably end up believing different things in the light of the evidence, depending on what else they know and to some extent on their views of the world. Just looking at the numbers from Bem’s experiments is not going to make this subjective element disappear.



At I try to put your important notes in a more general decision-making setting and link it to the work of Jack Good and 'weights of evidence'. Historically ESP was seen as a case where one needs 'modern Bayesianism' which is what real Bayesians seem to do in such cases, which seems a bit more reasonable than 'dogmatic Bayesianism': 'first, think of a prior probability distribution ... '.

There are Bayesian ways to assess null values that do not involve hypothesis comparison and Bayes factors. Here's a post regarding Bayesian meta-analysis of data from 63 ESP experiments: The post also includes discussion of the file drawer problem and the limits of statistical analysis.

Let's assume, for the sake of argument, that there wasn't any cheating and that the 9 experiments were set up properly.

Null hypothesis: the data occurred purely by chance.
Experimental hypothesis: the data were not caused purely by chance.

If I apply Fisher's Method to the 9 P values, I get a Chi Squared value of 82.9, which, with 2k (18) degrees of freedom, corresponds to a meta-P value of <0.000000001 (it was beyond the sensitivity of the calculator I was using). That's going to trump some pretty hefty Bayesian priors. The probability that the experiments were set up incorrectly or that the experimenter was dishonest is surely going to dwarf the probability that the data occurred by chance. So why is the question of whether the data occurred by chance still being debated?
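The tail probability quoted in this comment can be checked without a special calculator. Fisher's method combines k p values into the statistic −2 × (sum of their natural logarithms), which is compared with a chi-squared distribution on 2k degrees of freedom. For an even number of degrees of freedom, the chi-squared tail probability has a simple closed form. The sketch below takes the value 82.9 from the comment as given rather than recomputing it from the nine individual p values, which are not all listed here; the function names are hypothetical.

```python
import math

def fisher_statistic(p_values):
    """Fisher's combined statistic: -2 times the sum of the log p values."""
    return -2.0 * sum(math.log(p) for p in p_values)

def chi2_tail_even_df(x, df):
    """P(chi-squared with an even number of degrees of freedom exceeds x)."""
    if df % 2 != 0:
        raise ValueError("this closed form needs an even number of degrees of freedom")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds up (x/2)^i / i!
        total += term
    return math.exp(-half) * total

# Value quoted in the comment: chi-squared of 82.9 on 2 x 9 = 18 degrees of freedom.
combined_p = chi2_tail_even_df(82.9, 18)
print(f"combined p value: {combined_p:.2e}")
```

The calculation assumes, as Fisher's method does, that the nine experiments are independent, and it answers only the question of chance under the null hypotheses.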