What, then, are we to make of the ubiquitous ‘statistically significantly related’? Not very much, I suspect. Even if you don’t read academic manuscripts, you will have heard it in news reports of medical research or in adverts for anti-ageing skin creams, typically to indicate that the probability of observing data at least as extreme as those observed, if the null hypothesis were true, is less than 5%.
You may have heard Hans Rosling joke in his TED talk about an experiment testing whether undergraduates knew which of five pairs of countries had the higher infant mortality rate. On average the students got 1.8 pairs correct, leading him to conclude that ‘Swedish top-level students know statistically significantly less than chimpanzees’ (who, choosing at random between the two countries in each pair, could be expected to get 2.5 pairs correct).
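To make the joke concrete: Rosling did not report his class size, so as a purely illustrative sketch suppose 100 students each judged the 5 pairs, and treat every answer as an independent 50:50 trial under the ‘chimpanzee’ hypothesis. A binomial test then shows how surprising a mean of 1.8 would be if the students were merely guessing:

```python
from scipy.stats import binomtest

# Hypothetical numbers: the talk does not give the class size,
# so assume 100 students, each judging 5 country pairs.
n_students = 100
n_pairs = 5
observed_mean = 1.8   # average pairs correct, from the talk
chance_p = 0.5        # a chimp picking at random is right half the time

k = round(observed_mean * n_students)   # total correct answers: 180
n = n_pairs * n_students                # total answers given: 500

# One-sided test: are the students *worse* than chance (mean 2.5)?
result = binomtest(k, n, p=chance_p, alternative="less")
print(f"{k}/{n} correct, p = {result.pvalue:.2e}")
```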
‘Statistically significant’ is a tremendously ugly phrase, but unfortunately that is the least of its shortcomings. What is far worse is that it is misleading. Significant has a plain-language meaning of ‘important’ (a significant breakthrough in an investigation), ‘large’ (a significant amount of money) or ‘meaningful’ (a significant statement). Statistical significance, on the other hand, may not correspond with any of those things, since it is a descriptor of the p-value alone, and for a given association the magnitude of the p-value is a simple function of the sample size. If statistical significance is all you want, just increase your sample size. In other words, our ability to detect differences is strongly associated with how hard we are looking.
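To see how mechanically the p-value tracks the sample size, here is a minimal sketch. The numbers are invented (a 0.2-unit difference in group means against a standard deviation of 10), and for simplicity the observed difference is taken to equal the true one. The identical, negligible effect moves from nowhere near significance to overwhelmingly ‘significant’ purely because n grows:

```python
import numpy as np
from scipy import stats

# A fixed, trivially small true effect (illustrative numbers only).
true_diff = 0.2   # difference in group means
sd = 10.0         # within-group standard deviation

for n in [100, 1_000, 10_000, 100_000, 1_000_000]:   # per group
    se = sd * np.sqrt(2 / n)    # standard error of the mean difference
    z = true_diff / se
    p = 2 * stats.norm.sf(z)    # two-sided p-value
    print(f"n per group = {n:>9,}  p = {p:.3g}")
```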
Imagine an environmentalist saying that oil contamination was detectable in a sample of water from a protected coral reef. The importance of that statement would change drastically depending on whether they were referring to a naked-eye assessment of the sample or an electron-microscope examination: the smaller the amount of oil, the harder we have to look to find it. The same is true of a clinical study that detects a statistically significant treatment effect. If the study is huge, statistical significance becomes uninformative, since even tiny and clinically unimportant differences will be found statistically significant. The p-value must therefore be interpreted with reference to the sample size and, ideally, to the effect size (which summarises how large the effect is). The International Journal of Epidemiology actively discourages the use of the term ‘statistically significant’ in submitted manuscripts, but this is far from the norm.
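The point about effect sizes can be sketched the same way, with simulated data from a hypothetical enormous trial: a true difference of 0.1 on a scale with a standard deviation of 10 yields a vanishing p-value alongside a Cohen’s d of about 0.01, negligible by any convention:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500_000   # patients per arm in a hypothetical huge trial

# Two groups whose true means differ by a clinically trivial amount.
control = rng.normal(loc=120.0, scale=10.0, size=n)
treated = rng.normal(loc=119.9, scale=10.0, size=n)

t, p = stats.ttest_ind(control, treated)

# Cohen's d: mean difference in units of the pooled standard deviation.
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
cohens_d = (control.mean() - treated.mean()) / pooled_sd

print(f"p = {p:.2e}, Cohen's d = {cohens_d:.3f}")
# The p-value looks impressive; d ≈ 0.01 says the effect is negligible.
```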
What we mean by a ‘statistically significant’ difference is that the difference is ‘unlikely to be zero’, a phrase that is itself unlikely to catch on. An alternative term, statistically discernible, has a number of advantages (even if it still includes an adverb). Firstly, discernibility has none of the aforementioned unhelpful meanings of significance; it merely implies that the effect is distinguishable from randomness. Secondly, when space is restricted in an abstract or manuscript it is often tempting to drop the ‘statistically’ from ‘statistically significant’, which results in a potentially misleading statement. Discernible has none of this baggage: no value is placed on the importance, magnitude or meaning of the result. It is simply observable.
Statistically discernible is still 50% adverb, however. What alternatives do we have? Distinguishable from zero? Inconsistent with random variation? Unattributable to chance? All of these sound more definitive than they have any right to be. Perhaps a sufficiently statistically literate audience could simply be presented with the p-value and left to attach their own qualitative labels to different thresholds, depending on their personal attitude to probability. HG Wells might approve of that approach, having said that ‘statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write’. What seems certain, however, is that ‘significance’, with its multiple meanings, is a great name for a magazine but a poor choice for a scientific term.