# Question 6

Please comment on the method. Key elements to consider:

- objective errors or fundamental flaws in the methodology
- purpose of new method or technique
- appropriateness of context
- comprehensive description of procedures
- quality of figures and tables

# Reviewer comment

I should start off by saying that I liked reading this paper, and I applaud the authors' approach of highlighting a much-neglected phenomenon that is usually present in b-CFS data whilst simultaneously presenting a solution to the potential problems this effect might pose. Nevertheless, I do have some remarks and suggestions regarding the way in which this phenomenon (i.e., the correlation) is contextualized, as well as the solution that is being proposed. I leave it up to the authors to consider whether these suggestions are sufficiently important to (considerably) revise the text or not.

The authors' basic starting observation is the correlation between effect size and overall response times as well as trial-to-trial variability. Given that some researchers talk about effect sizes in the standardized sense (i.e., Cohen's d), is there any particular motivation for the authors to use RT differences as effect size? I can imagine the correlations would disappear for standardized effect sizes. That is, when mean RT differences scale with the variance, this implies standardized effect sizes would remain constant. Basically, I am interested in why mean RT differences are considered the appropriate effect size measure in this context.
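To make this concrete, here is a minimal sketch in my own notation (the symbols are not taken from the manuscript). Let $\Delta_i$ be participant $i$'s mean RT difference between conditions, $\sigma_i$ the within-condition SD, and $d_i$ the standardized effect. If the raw difference scales with the SD by a common factor $c$, then

$$
\Delta_i = c\,\sigma_i \quad\Longrightarrow\quad d_i = \frac{\Delta_i}{\sigma_i} = c \quad \text{for all } i,
$$

so $\Delta_i$ correlates perfectly with $\sigma_i$ across participants while $d_i$ does not vary at all. In real data the proportionality will hold only approximately, so I would expect the correlation to weaken rather than vanish exactly.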

Related to my previous comment: in the Results section, the authors introduce the correlations between overall RT and the RT difference, as well as between the within-condition SD and the RT difference. However, only at the end of the "within-condition variability" section is it mentioned that mean RTs are well known to correlate with the SD of RTs (Wagenmakers & Brown, 2007), and this is touched upon again only in the Discussion. Personally, I think what we see here is the result of this "law". That is, *under the assumption of a constant standardized difference between conditions (i.e., stable Cohen's d at the participant level)*, the RT difference needs to increase if overall RT and the SD of the RT also increase. As can be read in the introduction of Wagenmakers & Brown (2007), not a lot of research has been performed on this linear relationship, but it has been consistently observed. My point is that this "general law of response times" could be the starting point of the observations, rather than being merely mentioned in passing (i.e., only in the General Discussion is it said that this correlation is probably inherent to response time data). If this aspect of RT data is the starting point, the authors could show the correlations that would be predicted from it, and then discuss the potential problems associated with this aspect of the data. As I highlighted, this is how I interpret the general pattern of results, and I do not know whether the authors agree with me on this.

In the Results section ("Overall response times"), the authors describe that the observed correlation could pose a potential power problem. However, this argument is only spelled out in words, without a proof or simulation of what the impact could be on Type 1 and/or Type 2 errors. I think it would therefore be helpful for future readers to get an idea of how strongly this between-subject variability impacts the results of statistical analyses. A demonstration could be shown just for the simple paired-sample t-test, with different levels of between-subject variability and different effect sizes. Alternatively, the authors could point to references that highlight this aspect and have worked out the details of how it influences the properties of the statistical procedure that is typically used.
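As an illustration of the kind of demonstration I have in mind, here is a minimal Python sketch of a power simulation for the paired-sample t-test. Everything in it is my own assumption rather than something taken from the manuscript: the function names, the purely multiplicative model (each participant's RT difference is a noisy proportion of their baseline RT), the parameter values (a 10% proportional effect, noise SD of 20%, 20 participants, a 2 s median baseline), and the shortcut of simulating per-participant differences directly instead of individual trials.

```python
import math
import random
import statistics

# Two-sided 5% critical value of Student's t with 19 df (valid for n_subjects=20 only).
T_CRIT_DF19 = 2.093


def paired_t(diffs):
    """t statistic of a paired-sample t-test, computed on per-participant differences."""
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def estimate_power(baseline_spread, effect=0.10, noise_sd=0.20,
                   n_subjects=20, n_sims=2000, seed=1):
    """Return (power on raw differences, power on proportional differences).

    baseline_spread is the SD of log baseline RT across participants
    (hypothetical parameter: 0 means all participants are equally fast).
    """
    rng = random.Random(seed)
    hits_raw = hits_prop = 0
    for _ in range(n_sims):
        raw, prop = [], []
        for _ in range(n_subjects):
            # Log-normally distributed baseline RT with a median of 2 s.
            baseline = math.exp(rng.gauss(math.log(2.0), baseline_spread))
            # Noisy proportional effect for this participant.
            rel = effect + rng.gauss(0.0, noise_sd)
            raw.append(baseline * rel)  # RT difference in seconds
            prop.append(rel)            # difference divided by baseline RT
        hits_raw += abs(paired_t(raw)) > T_CRIT_DF19
        hits_prop += abs(paired_t(prop)) > T_CRIT_DF19
    return hits_raw / n_sims, hits_prop / n_sims


if __name__ == "__main__":
    for spread in (0.0, 0.4, 0.8):
        power_raw, power_prop = estimate_power(spread)
        print(f"spread={spread:.1f}  raw={power_raw:.2f}  proportional={power_prop:.2f}")
```

Under this multiplicative model, the test on raw differences should lose power as the between-subject spread in baseline RT grows, whereas the test on proportional differences is unaffected by construction. Setting `effect=0.0` gives the corresponding Type 1 error rates. Whether the same pattern holds under a more realistic trial-level model is exactly what I would like the authors to show.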

In the Results section ("Overall response times"), the authors propose a normalization method that removes the between-subject variability and reduces the skewness of the dependent variable. I agree that this method removes between-subject variability, but is there any particular motivation to use it over, for example, normalization through z-scores? In psycholinguistics, researchers sometimes transform individual response time distributions to the same scale using the z-score transformation (see Faust, Balota, Spieler, & Ferraro, 1999, Psychological Bulletin, 125(6), 777-799). Would this type of transformation yield better or worse performance compared to the normalization based on proportional differences? Again, I think it could be helpful for the future reader to be able to compare the statistical power of the analysis before and after normalization. It is true that the normality assumption is better met after normalization, but violations of normality have been shown to be the least critical of all assumption violations with respect to a loss of statistical power.
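To clarify the comparison I am asking about, here is a minimal Python sketch of the two transformations applied to one participant's data. The function names and the toy RTs are hypothetical, and `proportional_effect` is only my reading of a "proportional difference" (condition difference divided by the participant's overall mean RT); the authors' exact normalization may differ.

```python
import statistics


def z_effect(cond_a, cond_b):
    """Condition difference after z-scoring all of one participant's trials."""
    pooled = cond_a + cond_b
    m, s = statistics.mean(pooled), statistics.stdev(pooled)
    za = [(x - m) / s for x in cond_a]
    zb = [(x - m) / s for x in cond_b]
    return statistics.mean(za) - statistics.mean(zb)


def proportional_effect(cond_a, cond_b):
    """Condition difference divided by the participant's overall mean RT."""
    grand_mean = statistics.mean(cond_a + cond_b)
    return (statistics.mean(cond_a) - statistics.mean(cond_b)) / grand_mean


# Toy data (seconds): the slow participant is the fast one scaled by a factor of 3.
fast_a, fast_b = [1.2, 1.5, 1.1, 1.9], [1.0, 1.3, 0.9, 1.4]
slow_a, slow_b = [3 * x for x in fast_a], [3 * x for x in fast_b]

for label, a, b in (("fast", fast_a, fast_b), ("slow", slow_a, slow_b)):
    raw = statistics.mean(a) - statistics.mean(b)
    print(label, round(raw, 3), round(z_effect(a, b), 3), round(proportional_effect(a, b), 3))
```

For participants who differ only by a scale factor, both transformations remove the difference in raw effect size. They would diverge when the SD does not scale proportionally with the mean, which is why a direct comparison on the authors' data would be informative.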

# Question 13

Is prior work properly and fully cited?

# Reviewer comment

No. I was slightly surprised to see the following statement by the authors: "This view seems at odds with conclusions drawn from recent b-CFS studies showing that semantic or conceptual information can drive the conscious access of initially suppressed visual input."

As the authors are both authorities in the field, I am sure they are aware of the fact that even in the b-CFS literature these observations have been challenged (i.e., there is not necessarily a discrepancy between binocular rivalry and CFS). I think two of my own studies have shown this for scene integration (Moors et al., 2016, Psych Sci) and semantic processing of words (Heyman and Moors, 2014, PLoS ONE). There are other recent b-CFS studies challenging such a view (e.g., Rabovsky et al., 2016).

# Question 14

Please add here any further comments on this manuscript.

# Reviewer comment

In their Introduction, the authors discuss that b-CFS might be a relatively sensitive measure to uncover differences in processing strength between stimulus conditions because larger response time differences are observed when baseline response times increase. I think the sensitivity would indeed increase only if the variability of the distributions does not scale with the mean of the distribution. That is, in statistical tests we always calculate some form of signal-to-noise ratio. If the mean difference between distributions increases irrespective of the variability of these distributions, this would greatly increase the signal-to-noise ratio. However, if the variability of the distributions increases along with the mean difference (as the authors observe), I would argue the signal-to-noise ratio remains more or less the same, irrespective of the size of the mean difference between distributions, no? So my question is, do larger raw response time differences always imply that a method is more sensitive? I think this question mostly derives from the fact that I often think about effect sizes in the standardized sense (e.g., Cohen's d) rather than the raw RT differences (that, obviously, have a more straightforward interpretation).

In the Introduction, the authors discuss paradigms consisting of manipulations of stimulus content and stimulus context. They say that some content manipulations like face inversion are interesting because only spatial orientation differs, ruling out many low-level confounds. Isn't it the case that the manipulations of context provide an even better low-level control because the invisible stimulus is not changed at all?

Very minor: there is a small typo in the last paragraph, which should read (Kanwisher, 2001) rather than (Kaniwsher, 2001).