Absence of evidence and confidence intervals

Here are some statistical interpretation anti-patterns:

This study found no significant evidence that X had any effect. Therefore…
(Sub-pattern of the first: because of conservation of expected evidence, this is evidence that the true effect is zero.)
Study A found significant evidence for X, but study B didn’t. Therefore, they disagree or there’s heterogeneity.

There are a couple of problems with claims like these.

First, absence of evidence can be only very slight evidence of absence. For instance, if I posted a large sign over my front door saying “BEN LIVES HERE,” that would be strong evidence that my name was Ben. Its absence, though, is only very weak evidence that my name isn’t Ben, because hardly anyone posts such a sign anyway. Mathematically, this cashes out to the fact that the ratio P(~E|H)/P(~E), the factor by which seeing ~E multiplies the probability of H, is very close to 1, so seeing ~E is only a very small negative update. (Conservation of expected evidence says that this factor must be below 1 if E is evidence for H, but it doesn’t say how close to 1 it can get!) So even if the old saying “absence of evidence is not evidence of absence” is not technically true, it’s a useful approximation in many cases.
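To make the sign example concrete, here is a minimal sketch of the update in both directions. All three probabilities are invented for illustration; only the shape of the result matters.

```python
def posterior(prior, p_obs_given_h, p_obs_given_not_h):
    """P(H | observation) via Bayes' rule."""
    p_obs = prior * p_obs_given_h + (1 - prior) * p_obs_given_not_h
    return prior * p_obs_given_h / p_obs

# Hypothetical numbers, chosen only for illustration.
prior = 0.01                  # P(H): the resident is named Ben
p_sign_given_ben = 1e-3       # P(E | H): a Ben posts the sign
p_sign_given_not_ben = 1e-6   # P(E | ~H): a non-Ben posts the sign

# Seeing the sign (E) versus not seeing it (~E).
post_sign = posterior(prior, p_sign_given_ben, p_sign_given_not_ben)
post_no_sign = posterior(prior, 1 - p_sign_given_ben, 1 - p_sign_given_not_ben)

# The factor P(~E|H) / P(~E) from the text.
p_no_sign = prior * (1 - p_sign_given_ben) + (1 - prior) * (1 - p_sign_given_not_ben)
update_factor = (1 - p_sign_given_ben) / p_no_sign

# Conservation of expected evidence: the posteriors average back to the prior.
expected_posterior = (1 - p_no_sign) * post_sign + p_no_sign * post_no_sign

print(f"P(Ben | sign)    = {post_sign:.3f}")
print(f"P(Ben | no sign) = {post_no_sign:.5f}")
print(f"update factor    = {update_factor:.5f}")
```

With these made-up inputs, the sign moves the probability from 1% to roughly 90%, while its absence multiplies the probability by a factor just under 1.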

Second, reporting merely whether a statistical hypothesis test is significant is barely informative at all. That’s because, if the test has low statistical power, it may have been extremely unlikely to come up significant even if the effect were real. This is analogous to my name-sign example above.
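Here is a rough sketch of what “low power” means numerically, assuming a two-sided z-test on an estimate that is normally distributed with a known standard error. The function name and the example effect size and standard errors are hypothetical.

```python
from statistics import NormalDist

def power_two_sided_z(true_effect, se, alpha=0.05):
    """Chance that a two-sided z-test of 'effect = 0' comes up significant,
    assuming the estimate is normal with mean true_effect and std. error se."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = true_effect / se
    # Probability of landing in either rejection tail.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Same hypothetical true effect, measured noisily and precisely.
low_power = power_two_sided_z(true_effect=0.2, se=0.5)
high_power = power_two_sided_z(true_effect=0.2, se=0.05)

print(f"noisy study:   power = {low_power:.3f}")
print(f"precise study: power = {high_power:.3f}")
```

In the noisy study a non-significant result is what you would expect almost regardless of whether the effect is real, so it tells you very little.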

For this reason, you shouldn’t really care about the mere fact that X’s effect isn’t significantly different from 0. Instead, you should care about the confidence interval around the effect. The confidence interval basically answers the question: “what’s a set of potential true effect sizes where it wouldn’t be too improbable to see the observed effect size that we actually saw?” (More precisely, if we define the error as the difference between the true effect size and the observed effect size, then the 95% confidence interval is the set of true effect sizes for which the error we observed falls within the range of errors one would see 95% of the time.)

(Be careful! This is different from being “95% certain” that the true effect size lies in that interval!)

The confidence interval around the effect size tells you a lot more than just whether the test was significant. In particular, instead of just telling you that the observed effect size is consistent with a true effect size of zero, it tells you the entire range of true effect sizes that the observed effect size is consistent with. That’s way more, and way more useful, information.
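As an illustration, here is a minimal sketch of a normal-approximation (Wald) interval and its link to hypothesis tests: a candidate true effect size lies inside the 95% interval when a two-sided test of that candidate has p of at least 0.05. The estimate and standard error are hypothetical, and the helper names are mine.

```python
from statistics import NormalDist

def normal_ci(estimate, se, level=0.95):
    """Wald confidence interval, assuming a normally distributed estimate."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z_crit * se, estimate + z_crit * se

def two_sided_p(estimate, se, hypothesized):
    """Two-sided p-value for 'true effect = hypothesized'."""
    z = abs(estimate - hypothesized) / se
    return 2 * NormalDist().cdf(-z)

estimate, se = 0.3, 0.2   # hypothetical study result
lo, hi = normal_ci(estimate, se)

print(f"95% CI: ({lo:.2f}, {hi:.2f})")
for candidate in (0.0, 0.6, 0.8):
    p = two_sided_p(estimate, se, candidate)
    print(f"true effect {candidate}: p = {p:.3f}, inside CI: {lo <= candidate <= hi}")
```

This made-up result is “not significant” against zero, but the interval shows it is just as compatible with a true effect of 0.6, which the bare significance verdict hides.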

So how does this play into evidence-of-absence issues? Most importantly, knowing the confidence interval gives you some information about P(~E|H). Suppose the data are consistent with every plausible effect size: say you’re examining the effect of donating blood on heart disease and your estimated risk ratio is 1.0 with a CI of (0.1-10.0). Then it’s very unlikely that you would have observed a significant outcome even if donating blood really did change heart disease risk, so the absence of evidence is only very weak evidence of absence.
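To put rough numbers on the blood-donation example, the sketch below backs out a standard error from the (0.1-10.0) interval and asks how often such a study would come up non-significant under two scenarios. It assumes the interval is a 95% Wald interval that is symmetric on the log scale, and the “true” risk ratio of 0.8 is a hypothetical alternative picked for illustration.

```python
from math import log
from statistics import NormalDist

Z = NormalDist()
Z_CRIT = Z.inv_cdf(0.975)  # about 1.96

def se_from_ci(lo, hi):
    """Standard error of log(risk ratio), assuming a 95% Wald interval."""
    return (log(hi) - log(lo)) / (2 * Z_CRIT)

def p_not_significant(true_rr, se):
    """P(~E | true risk ratio): chance the two-sided test of RR = 1 is NOT significant."""
    shift = log(true_rr) / se
    return Z.cdf(Z_CRIT - shift) - Z.cdf(-Z_CRIT - shift)

se = se_from_ci(0.1, 10.0)
p_null = p_not_significant(1.0, se)    # no effect at all
p_effect = p_not_significant(0.8, se)  # hypothetical 20% risk reduction
ratio = p_effect / p_null              # how much ~E favors "no effect"

print(f"P(not significant | RR = 1.0) = {p_null:.3f}")
print(f"P(not significant | RR = 0.8) = {p_effect:.3f}")
print(f"likelihood ratio = {ratio:.3f}")
```

Under these assumptions the non-significant result is almost exactly as likely with a real 20% risk reduction as with no effect, so it should barely move your beliefs.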

Similarly, if you run two studies and one turns out significant while the other doesn’t, then you might interpret this as having “mixed evidence” for donating blood helping with heart disease, or you might start worrying that the studies were in different populations or used different procedures (heterogeneity). But confidence intervals give you much richer data:

If you get CIs of (0.1-10) and (0.7-0.8) then there’s pretty strong evidence for an effect; the first study is very weak evidence against an effect because it was so underpowered.
If you get CIs of (0.1-0.9) and (0.5-1.5), then there’s moderate evidence for an effect, and at most weak-to-moderate evidence for heterogeneity, since both studies are consistent with values in the range (0.5-0.9).
If you get CIs of (0.1-0.2) and (0.9-1.1), then there’s strong evidence of heterogeneity because the CIs don’t overlap at all.
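One rough way to quantify the three cases above is to pool each pair with inverse-variance (fixed-effect) weights and to test whether the two estimates differ by more than their noise would explain. This is a sketch of a standard technique, not something the studies themselves did. It assumes each interval is a 95% interval for a risk ratio that is symmetric on the log scale, and the function names are mine.

```python
from math import exp, log, sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def log_scale(ci):
    """Turn a 95% CI for a ratio into (log estimate, standard error).

    Assumes the interval is symmetric on the log scale.
    """
    lo, hi = ci
    return (log(lo) + log(hi)) / 2, (log(hi) - log(lo)) / (2 * Z95)

def compare(ci_a, ci_b):
    """Return (pooled 95% CI, heterogeneity p-value) for two studies."""
    (est_a, se_a), (est_b, se_b) = log_scale(ci_a), log_scale(ci_b)
    # Fixed-effect pooling: weight each study by 1 / variance.
    w_a, w_b = 1 / se_a**2, 1 / se_b**2
    pooled = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    pooled_se = sqrt(1 / (w_a + w_b))
    pooled_ci = (exp(pooled - Z95 * pooled_se), exp(pooled + Z95 * pooled_se))
    # Heterogeneity: is the gap between the studies bigger than noise explains?
    z_diff = abs(est_a - est_b) / sqrt(se_a**2 + se_b**2)
    p_het = 2 * NormalDist().cdf(-z_diff)
    return pooled_ci, p_het

pairs = [((0.1, 10), (0.7, 0.8)), ((0.1, 0.9), (0.5, 1.5)), ((0.1, 0.2), (0.9, 1.1))]
results = [compare(a, b) for a, b in pairs]
for (a, b), (pooled_ci, p_het) in zip(pairs, results):
    print(a, b, "pooled CI: (%.2f, %.2f)" % pooled_ci, "heterogeneity p = %.3g" % p_het)
```

For the first pair, the wide study barely shifts the pooled interval and the heterogeneity p-value is large; for the third, the heterogeneity p-value is tiny, which is also a sign that a single pooled number is not a meaningful summary there.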

As you can see, reporting and interpreting confidence intervals, instead of mere significance/non-significance, can be extremely helpful when drawing conclusions from statistical studies.
