Content of review 1, reviewed on May 29, 2014
There is clearly nothing fundamental about an alpha level of 0.05. Particle physics and cognitive science already choose different conventions, and this paper is free to argue that a particular discipline might be better served by a greater concern about type II errors at the cost of an inflated type I error rate (though see Johnson, 2013, for an argument that 0.05 is already too lax).
However, in the context of the massive multiple testing problem in imaging, this paper does not propose a relaxation of a well-defined (i.e. principled, see Bennett et al., 2009) significance level, such as the control of family-wise error or false discovery rate at 0.1 or 0.2 instead of 0.05. It instead puts forward the arbitrary combination of an uncorrected p < 0.005 threshold with a cluster-extent of 10 voxels. This is arbitrary in a far more serious way than the arbitrariness of the 0.05 level for a single test. Specifying a meaningless extent threshold without consideration of voxel-size or image smoothness is more akin to saying that the "p < 0.05" criterion should be replaced with a fixed "statistic > 6" criterion – without considering whether the statistic in question is a t, F, or any other!
The authors admit in footnote 4 that "these values will change with different voxel sizes and smoothing kernels, however the conceptual implications of these simulations remain", but this note sweeps the issue aside far too casually. While they accept that their proposed criterion should not "be reified as a 'gold standard'", they do suggest that it is a "reasonable" criterion, and many of the (hundreds of) citing articles are indeed using exactly this criterion. Importantly, the paper (particularly the abstract) does not seem to recommend that authors perform their own simulations for every study, but rather that they could use the proposed 10 voxel extent threshold with uncorrected p < 0.005 to obtain "a desirable balance between type I and type II error rates". Such use of an arbitrary threshold cannot hope to achieve a "desirable" balance, since the actual balance will vary wildly between studies, as illustrated next.
Lieberman and Cunningham obtain their 10 or 20 voxel extent thresholds (rounded from 8 and 18) in the context of a simulated smoothness of 6mm full-width at half-maximum (FWHM), with a voxel-size of 3.5mm x 3.5mm x 5mm. They refer to these in the abstract as "relatively common imaging parameters", though the paper offers no further detail or reference supporting this choice. An evaluation of 241 fMRI articles (Carp, 2012) found that the most commonly applied smoothing kernel was 8mm (more than twice as common as the second most commonly used kernel of 6mm). Importantly, the applied kernel width is not actually the correct smoothness to use (as also noted by Bennett et al., 2009), because fMRI images have a non-trivial intrinsic smoothness (see e.g. Chumbley and Friston, 2009), meaning that the resultant FWHM will be higher than that of the applied kernel. For example, Woo et al. (2014) report that the average estimated FWHM from 9 studies in Nichols & Hayasaka (2003) is 16.6mm. Regarding the voxel-size, the default for images spatially normalised with the most common software (SPM, based again on Carp, 2012) is 2mm isotropic. Using a brain-mask (cf. footnote 4 in the reviewed paper), with 2mm cubic voxels and with 8mm FWHM smoothness, the cluster-extent that controls the family-wise error rate at 5% is not 18 voxels, but 164 (based on 1000 simulations with rest_Alphasim from the REST toolbox, Song et al., 2011). Worse still, the estimated family-wise error rate for a 20 voxel threshold is close to 100%.
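To make the dependence on voxel-size and smoothness concrete, the following is a minimal sketch (not part of the reviewed paper or of the simulations reported above) that converts a fixed 10 voxel extent into physical volume and into resels, where one resel is taken as a cube with sides of one FWHM. The function names are hypothetical, and the 8mm figure is the applied kernel only, so it understates the true smoothness and, if anything, flatters the second setting.

```python
from math import prod


def cluster_volume_mm3(n_voxels, voxel_dims_mm):
    """Physical volume (mm^3) of a cluster of n_voxels voxels."""
    return n_voxels * prod(voxel_dims_mm)


def cluster_size_resels(n_voxels, voxel_dims_mm, fwhm_mm):
    """Cluster size in resels, assuming isotropic smoothness.

    One resel is treated here as a cube of side fwhm_mm (an assumption
    made for illustration; anisotropic smoothness is ignored).
    """
    return cluster_volume_mm3(n_voxels, voxel_dims_mm) / fwhm_mm ** 3


# Settings taken from the text above; the labels are illustrative only.
settings = {
    "simulated in the paper (3.5 x 3.5 x 5 mm voxels, 6 mm FWHM)":
        ((3.5, 3.5, 5.0), 6.0),
    "common in practice (2 mm isotropic voxels, 8 mm applied kernel)":
        ((2.0, 2.0, 2.0), 8.0),
}

for label, (dims, fwhm) in settings.items():
    volume = cluster_volume_mm3(10, dims)
    resels = cluster_size_resels(10, dims, fwhm)
    print(f"{label}: {volume:.1f} mm^3, {resels:.2f} resels")
```

Under these assumptions the same "10 voxels" corresponds to 612.5 mm^3 (about 2.84 resels) in the paper's simulated setting, but only 80 mm^3 (about 0.16 resels) with 2mm voxels and an 8mm kernel, a roughly 18-fold difference relative to the smoothness of the image.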
One of several reasonable points that the paper makes is that the practice in neuroimaging of only reporting significant findings is highly detrimental to meta-analysis. To reiterate, this valid point could justify the choice of less stringent (but still principled) significance levels, but it does not justify the blanket use of a 10 voxel extent threshold without consideration of the smoothness, voxel-size or analysis mask. The problem would also probably be dealt with much more successfully by encouraging the sharing of the unthresholded statistical maps themselves (see e.g. NeuroVault) rather than the publication of results tables with the larger numbers of local maxima that would result from less stringent thresholds. The paper’s concerns about power should be contrasted with the arguments made by Button et al. (2013), though the matter is clearly still open to debate.
© 2014 the Reviewer (CC BY-SA 3.0).