
Statistical thresholding (i.e. P-values) in fMRI research has become increasingly conservative over the past decade in an attempt to diminish Type I errors (i.e. false alarms) to a level traditionally allowed in behavioral science research. In this article, we examine the unintended negative consequences of this single-minded devotion to Type I errors: increased Type II errors (i.e. missing true effects), a bias toward studying large rather than small effects, a bias toward observing sensory and motor processes rather than complex cognitive and affective processes and deficient meta-analyses. Power analyses indicate that the reductions in acceptable P-values over time are producing dramatic increases in the Type II error rate. Moreover, the push for a mapwide false discovery rate (FDR) of 0.05 is based on the assumption that this is the FDR in most behavioral research; however, this is an inaccurate assessment of the conventions in actual behavioral research. We report simulations demonstrating that combined intensity and cluster size thresholds such as P < 0.005 with a 10 voxel extent produce a desirable balance between Types I and II error rates. This joint threshold produces high but acceptable Type II error rates and produces a FDR that is comparable to the effective FDR in typical behavioral science articles (while a 20 voxel extent threshold produces an actual FDR of 0.05 with relatively common imaging parameters). We recommend a greater focus on replication and meta-analysis rather than emphasizing single studies as the unit of analysis for establishing scientific truth. From this perspective, Type I errors are self-erasing because they will not replicate, thus allowing for more lenient thresholding to avoid Type II errors.
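The power trade-off the abstract describes can be made concrete with a small calculation. The sketch below uses a normal approximation to a one-sided test; the effect size d = 0.5 and sample size n = 20 are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: miss rate (Type II error) of a one-sided z test as the
# significance threshold alpha is tightened.  d and n are illustrative.
from statistics import NormalDist

_N = NormalDist()  # standard normal

def type2_rate(alpha, d, n):
    """Approximate Type II error rate for a one-sided test of a true
    effect of size d (Cohen's d) with n observations."""
    z_crit = _N.inv_cdf(1 - alpha)        # critical value under H0
    return _N.cdf(z_crit - d * n ** 0.5)  # P(statistic below cutoff | H1)

for alpha in (0.05, 0.005, 0.001):
    print(f"alpha = {alpha:<6}  Type II rate = {type2_rate(alpha, 0.5, 20):.3f}")
```

With these assumed numbers the miss rate roughly triples between alpha = 0.05 and alpha = 0.001, which is the qualitative pattern the abstract's power analyses point to.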


Lieberman, Matthew D.; Cunningham, William A.

There is clearly nothing fundamental about an alpha level of 0.05. Particle physics and cognitive science already choose different conventions, and this paper is free to argue that a particular discipline might be better served by a greater concern about type II errors at the cost of an inflated type I error rate (though see Johnson, 2013, for an argument that 0.05 is already too lax).

However, in the context of the massive multiple testing problem in imaging, this paper does not propose a relaxation of a well-defined (i.e. principled, see Bennett et al., 2009) significance level, such as the control of family-wise error or false discovery rate at 0.1 or 0.2 instead of 0.05. It instead puts forward the arbitrary combination of an uncorrected p < 0.005 threshold with a cluster-extent of 10 voxels. This is arbitrary in a far more serious way than the arbitrariness of the 0.05 level for a single test. Specifying a meaningless extent threshold without consideration of voxel-size or image smoothness is more akin to saying that the "p < 0.05" criterion should be replaced with a fixed "statistic > 6" criterion – without considering whether the statistic in question is a t, F, or any other!
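The reviewer's analogy can be illustrated with a toy simulation (an editorial sketch, not the reviewer's own; the 2D grid, smoothness values, and p < 0.005 / 10-voxel rule below are assumed stand-ins for the 3D case): the family-wise false-positive rate of a fixed 10-voxel extent cutoff swings from near zero to well above 0.05 depending only on smoothness.

```python
# Toy 2D illustration (assumed parameters, not from the paper or review):
# how often pure-noise maps contain a cluster of >= k voxels above the
# p < 0.005 threshold (z > 2.576), with and without spatial smoothing.
import numpy as np
from scipy.ndimage import gaussian_filter, label

def fwe_rate(sigma, k=10, z=2.576, shape=(64, 64), n_sims=200, seed=0):
    """Fraction of simulated null maps with at least one suprathreshold
    cluster of >= k voxels; sigma is the Gaussian smoothing sd in voxels."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        field = rng.standard_normal(shape)
        if sigma:
            field = gaussian_filter(field, sigma)
        field = field / field.std()      # re-standardise after smoothing
        labels, n = label(field > z)     # 4-connected suprathreshold clusters
        if n and np.bincount(labels.ravel())[1:].max() >= k:
            hits += 1
    return hits / n_sims

print("unsmoothed noise:  FWE ~", fwe_rate(sigma=0))
print("FWHM ~ 7 voxels:   FWE ~", fwe_rate(sigma=3))
```

With no smoothing, the 10-voxel rule almost never fires on pure noise; at a FWHM of about 7 voxels it fires on a substantial fraction of null maps – the "statistic > 6" problem in miniature.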

The authors admit in footnote 4 that "these values will change with different voxel sizes and smoothing kernels, however the conceptual implications of these simulations remain", but this note sweeps the issue aside far too casually. While they accept that their proposed criterion should not "be reified as a 'gold standard'", they do suggest that it is a "reasonable" criterion, and many of the (hundreds of) citing articles are indeed using exactly this criterion. Importantly, the paper (particularly the abstract) does not seem to recommend that authors perform their own simulations for every study, but rather that they could use the proposed 10 voxel extent threshold with uncorrected p < 0.005 to obtain "a desirable balance between type I and type II error rates". Such use of an arbitrary threshold cannot hope to achieve a "desirable" balance, since the actual balance will vary wildly between studies, as illustrated next.

Lieberman and Cunningham obtain their 10 or 20 voxel extent thresholds (rounded from 8 and 18) in the context of a simulated smoothness of 6mm full-width at half-maximum (FWHM), with a voxel-size of 3.5mm x 3.5mm x 5mm. They refer to these in the abstract as "relatively common imaging parameters", though the paper offers no further detail or reference supporting this choice. An evaluation of 241 fMRI articles (Carp, 2012) found that the most commonly applied smoothing kernel was 8mm (more than twice as common as the second most commonly used kernel of 6mm). Importantly, this applied smoothness is not actually the correct smoothness to use (as also noted by Bennett et al., 2009), as fMRI images have a non-trivial intrinsic smoothness (see e.g. Chumbley and Friston, 2009), meaning that the resultant FWHM will be higher than that of the applied kernel. For example, Woo et al. (2014) report that the average estimated FWHM from 9 studies in Nichols & Hayasaka (2003) is 16.6mm. Regarding the voxel-size, the default for images spatially normalised with the most common software (SPM, based again on Carp, 2012) is 2mm isotropic. Using a brain-mask (cf. footnote 4 in the reviewed paper), with 2mm cubic voxels and with 8mm FWHM smoothness, the cluster-extent that controls the family-wise error rate at 5% is not 18 voxels, but 164 (based on 1000 simulations with rest_Alphasim from the REST toolbox, Song et al., 2011). Worse still, the estimated family-wise error rate for a 20 voxel threshold is close to 100%.
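The AlphaSim-style calculation the review cites can be mimicked in miniature (a 2D sketch under assumed parameters – it is not the Song et al. tool): simulate null maps at a given smoothness, record the largest suprathreshold cluster in each, and read off the 95th percentile as the extent threshold controlling family-wise error at 5%. The threshold grows quickly with smoothness, which is the review's point.

```python
# Toy 2D analogue of an AlphaSim-style cluster-extent calibration.
# All parameters (grid, smoothness, simulation count) are assumptions
# for illustration, not those used by rest_Alphasim.
import numpy as np
from scipy.ndimage import gaussian_filter, label

def extent_threshold(sigma, z=2.576, shape=(64, 64), n_sims=300, seed=1):
    """95th percentile of the null distribution of the largest
    suprathreshold cluster size, i.e. the extent cutoff giving FWE ~ 0.05."""
    rng = np.random.default_rng(seed)
    max_sizes = []
    for _ in range(n_sims):
        field = gaussian_filter(rng.standard_normal(shape), sigma)
        field = field / field.std()      # re-standardise after smoothing
        labels, n = label(field > z)     # connected suprathreshold clusters
        max_sizes.append(np.bincount(labels.ravel())[1:].max() if n else 0)
    return int(np.percentile(max_sizes, 95))

print("extent cutoff, FWHM ~ 5 voxels:", extent_threshold(sigma=2))
print("extent cutoff, FWHM ~ 9 voxels:", extent_threshold(sigma=4))
```

Even in this small toy setting, roughly doubling the smoothness substantially raises the extent cutoff needed for the same error control – a fixed "10 voxels" cannot serve both cases.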

One of several reasonable points that the paper makes is that the practice in neuroimaging of only reporting significant findings is highly detrimental to meta-analysis. To reiterate, this valid point could justify the choice of less stringent (but still principled) significance levels, but it does not justify the blanket use of a 10 voxel extent threshold without consideration of the smoothness, voxel-size or analysis mask. It would also probably be dealt with much more successfully by encouraging the sharing of unthresholded statistical maps themselves (see e.g. NeuroVault) rather than the publication of results tables with the larger numbers of local maxima that would result from less stringent thresholds. Regarding the paper's concerns about power, these should be contrasted against the arguments made by Button et al. (2013); the issue is clearly still open to debate.

Ongoing discussion (1 comment)
    • Thomas E. Nichols | 6 years, 1 month ago

      The paper's take on meta-analysis is well-intended, but the solution should be full sharing of data, not arbitrary inference procedures that leave the reader wondering what's a valid finding and what are likely to be false positives.

All peer review content displayed here is covered by a Creative Commons CC BY 4.0 license.