Content of review 1, reviewed on February 09, 2023
Thank you for the interesting manuscript, which is clearly structured and easy to follow. The aim of raising awareness of statistical power when testing interactions is very welcome; this issue has received little attention so far. The authors' approach is well thought out and on target, and the argumentation is convincing. The presented app is clear and user-friendly and is a useful addition to the repertoire of power software.
Nevertheless, I see a need for revision on several points.
The main criticism I see is the use of a power of 80%. I know that this value has been commonly used and dates back to Jacob Cohen himself. However, Cohen never had in mind that this recommendation would be used in such a universal way as has happened in psychology in recent decades. Setting the power at 80% has no meaningful basis whatsoever. With the conventional alpha of .05, it means accepting a beta error (.20) four times as large as the alpha error, for which there is no comprehensible justification. It makes more sense to set equal values for alpha and beta, which at alpha = .05 leads to a power of 95%. Since the replication crisis at the latest, it has been argued that studies should have a power of 95%. So I would strongly recommend, especially in a seminal journal like AMPPS, refraining from the outdated and unsubstantiated practice of working with a power of 80% and instead working exclusively with 95%. This applies to the manuscript, simulations and tables as well as the app.
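To make the arithmetic behind this point concrete, here is a minimal sketch. It is not taken from the manuscript or the app: it assumes a simple two-group mean comparison with a two-sided test, a hypothetical standardized effect of d = 0.5, and the normal approximation to the power function. The helper names `beta_to_alpha_ratio` and `n_per_group` are illustrative only.

```python
from math import ceil
from statistics import NormalDist


def beta_to_alpha_ratio(alpha: float, power: float) -> float:
    """Return how many times larger the beta error is than the alpha error."""
    return (1.0 - power) / alpha


def n_per_group(d: float, alpha: float, power: float) -> int:
    """Approximate n per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2.
    This is an illustrative simplification, not the authors' procedure.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    z_power = z.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    for power in (0.80, 0.95):
        ratio = beta_to_alpha_ratio(alpha=0.05, power=power)
        n = n_per_group(d=0.5, alpha=0.05, power=power)
        print(f"power={power:.2f}  beta/alpha={ratio:.1f}  n per group={n}")
```

Under these assumptions, moving from 80% to 95% power raises the approximate requirement from 63 to 104 participants per group, which gives a sense of the cost attached to equalizing alpha and beta.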
Secondly, I was very irritated by the discussion on the question of one-sided or two-sided testing. I think it should go without saying that directed hypotheses should be formulated and tested whenever possible. Undirected hypotheses contribute little to the advancement of scientific knowledge and should be a rare exception. The authors claim that undirected hypotheses are formulated in the majority of cases and that this is also required by reviewers or editors. This would be very surprising to me and I would like to see an empirical basis for this claim. Especially for interactions, there should be concrete expectations. The classic case I see for an interaction is testing the effect of an intervention, where no change from measurement A to measurement B is expected for the control group, but a change is expected for the intervention group. Here it would be surprising if such an interaction were not formulated in a directed way. I would therefore strongly recommend redesigning the manuscript so that one-sided testing is the normal case, rather than a "strategy" that can be used to increase power, which strikes me as strange logic. Of course, the results for two-sided testing can still remain in the table and the app.
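As a rough illustration of why the choice between one-sided and two-sided testing matters for sample size, the sketch below compares the two under stated assumptions: a simple two-group comparison rather than an interaction, a hypothetical effect of d = 0.5, alpha = .05, and the normal approximation. The function name `n_per_group` and its `one_sided` parameter are illustrative and are not part of the authors' app.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float, power: float, one_sided: bool) -> int:
    """Approximate n per group for a two-group mean comparison.

    A one-sided test puts all of alpha in one tail; a two-sided test
    splits it, which raises the critical value and the required n.
    Normal approximation; an illustrative simplification only.
    """
    z = NormalDist()
    tail_alpha = alpha if one_sided else alpha / 2.0
    z_alpha = z.inv_cdf(1.0 - tail_alpha)
    z_power = z.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    for one_sided in (True, False):
        n = n_per_group(d=0.5, alpha=0.05, power=0.80, one_sided=one_sided)
        label = "one-sided" if one_sided else "two-sided"
        print(f"{label}: n per group = {n}")
```

With these hypothetical inputs the approximation gives 50 per group for the directed test and 63 per group for the undirected one. The difference follows from the hypothesis itself, not from an optional analytic choice.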
The reference to small, medium, and large effects is not sufficiently embedded in the current state of the literature. Especially in the cited paper by Schäfer and Schwarz (2019), it was shown that the use of uniform conventions across all psychological subdisciplines is not acceptable because effects differ considerably. This point should be addressed and discussed in a much more critical and detailed manner. Otherwise, there is a great danger that researchers will repeatedly fall into automatism and apply such conventions in an unquestioned and prescriptive manner.
Here follows a critical point that the authors themselves address on page 12: "researchers often have no clear sense of the effect size they should expect." This is precisely one of the central problems of scientific psychology. It simply means that one is in the dark and has no idea what one is actually studying and which effects would be of theoretical and practical interest. This is a very unsatisfactory state of affairs, which then consists only in the hunt for significances, but has no substance. I would strongly suggest that the authors address this point more specifically and explain precisely that it should be best practice to think about or justify the theoretically or practically relevant size of interaction effects, for instance, based on the results of prior studies. The use of conventions, as suggested here, should be the absolute ultima ratio and an exception. This should be made very clear. This is especially true for the authors' suggestion on page 13 about large, medium, and small effects.
The term "strategies" is misleading. After all, the three scenarios are not strategies that can be applied optionally and serve the sole purpose of increasing power. Rather, decisions on these three points should result from scientific and logical considerations. With respect to one-sided and two-sided testing, I already noted this above. One-sided testing is not a strategy in terms of data analysis, but is determined by theory and the research hypothesis. The same holds for deciding whether to use a mixed design. Formulating a contrast - which I strongly support - could possibly be called a strategy, but here again I would suggest a more neutral wording and simply speak of three "cases."
For Table 3 - which is the heart of the paper - I would suggest the following structure: The column with the analytical solution can be omitted, since these data have already been reported earlier. This way, the entire table would refer to the results of the simulation data, which would be easier to comprehend. For both the pure between designs and the mixed designs, the results for one-sided and for two-sided testing should each be reported. It is otherwise very confusing if, for the mixed designs, only the data for two-sided testing are shown (see above).
Some minor points:
In the example on page 6, two-sided testing is used, although a directed hypothesis was previously formulated. This would need to be changed. Likewise on page 8.
On page 7, it is claimed that researchers would intuitively assume that a sample size twice as large is needed to achieve sufficient power. Is there any empirical evidence for this claim?
On page 24, the authors give confidence intervals for the percentages. This frequentist information seems to me to make little sense, since we are not dealing with a random sample of data from a population. The descriptive information should be sufficient here.
In the spirit of an optimal data-ink ratio, Figures 3 and 4 could be removed. The body text already conveys this information sufficiently.
I advocate open science and sign my review:
Thomas Schäfer
© 2023 the Reviewer.
Content of review 2, reviewed on April 12, 2023
I thank the authors very much for their thorough revision of the manuscript and the detailed justifications in the cover letter. It is wonderful to see that the authors have given so much thought to all the comments and suggestions. Although I do not completely agree with all the rationales presented (such as on the issue of one-sided testing), I can understand the authors' reasoning well and find it sufficient for the strategies chosen. I am also very pleased that the authors have gone to the effort of adapting the app, which now allows the power level to be adjusted, for example. However, the entries for a power of 95% are still missing from Table 4. The authors should add that information. I see a clear improvement of the manuscript and the app and can recommend publication.
References
Sommet, N., Weissman, D. L., Cheutin, N., Elliot, A. J. 2023. How Many Participants Do I Need to Test an Interaction? Conducting an Appropriate Power Analysis and Achieving Sufficient Power to Detect an Interaction. Advances in Methods and Practices in Psychological Science.
