Content of review 1, reviewed on February 06, 2024

Dear Author,
Thank you for writing this tutorial, which is easy to follow and encourages researchers to use statistical tests when comparing predictive models. It is my impression that this is indeed needed as there are many studies in the literature comparing predictive models without any statistical tests, even though it isn’t very time consuming. This is also a very relevant tutorial for the special issue honouring Tormod Naes, as it builds nicely on e.g. Naes’ CVANOVA papers (ref 5,6 in manuscript) which I, and probably many others, have used as a guide for statistical comparisons previously. However, this paper also provides a overview considering both quantitative and qualitative calibrations, which is a useful addition and presents a broader picture. I recommend this manuscript for publication, and have a few suggestions below for further improvements/clarifications to the manuscript:

Overall
1. One of my recommendations is that, since this is a tutorial, the practical aspects of performing the different tests could be given more weight, for instance by adding examples of function calls in one or two programming languages (e.g. Matlab/R) and explaining the inputs to those calls. Since one of the author’s stated motivations is to encourage more researchers to run the tests where needed, I believe this would lower the threshold even further for readers to perform their “first ever” statistical comparison and get the ball rolling.
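As an illustrative sketch of what such a function-call example could look like (here in Python rather than Matlab/R; the data are invented for illustration and are not from the manuscript), a paired t-test comparing two models evaluated on the same test samples might be shown as:

```python
# Illustrative sketch: comparing two models' prediction errors on the same
# test set with a paired t-test. The error values below are made up.
from scipy import stats

# Squared prediction errors of two models on the same 8 test samples.
errors_model_a = [0.12, 0.08, 0.15, 0.11, 0.09, 0.14, 0.10, 0.13]
errors_model_b = [0.10, 0.09, 0.12, 0.10, 0.08, 0.11, 0.09, 0.12]

# ttest_rel pairs the observations sample-by-sample, which is essential
# when both models are evaluated on the same test objects.
result = stats.ttest_rel(errors_model_a, errors_model_b)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

A short explanation of each input (paired error vectors, one entry per shared test object) alongside such a call would, I think, be all many readers need to get started.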

Section 2
1. Checking whether normality assumptions hold may be one of the most ignored aspects of statistical testing. The author mentions non-parametric tests as an alternative when normality is heavily breached. A summarizing section or table stating the normality assumption for each test would be helpful. Is there a way to check the normality assumption more carefully than plotting reference vs. target and judging qualitatively? It could be worth discussing the normality assumptions a bit more, including what happens when they are breached for the different tests. Is it always better to use the non-parametric versions to be safe?
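For instance, a formal test could complement the visual check. A sketch of such a check (in Python as a stand-in for Matlab/R; the residuals here are simulated, not real data) might be:

```python
# Illustrative sketch: a formal complement to visually judging normality.
# shapiro() tests the null hypothesis that the values come from a normal
# distribution; a small p-value flags a breach. Residuals are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=0.5, size=50)  # stand-in residuals

stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk W = {stat:.3f}, p = {p:.3f}")
# If p is small (e.g. < 0.05), a non-parametric alternative such as the
# Wilcoxon signed-rank test may be the safer choice.
```

A Q-Q plot of the residuals would serve the same purpose graphically and is often easier to interpret than a single p-value, especially for small sample sizes.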

2. The description of ANOVA in Section 2.3 could benefit from a few more details, e.g. which choices in formulating the ANOVA model are important to get right: what is the difference between treating effects as “fixed” versus “random”, and what happens if you forget to treat the sample effect as “random”? What is the normality assumption in the ANOVA case? Also, the ANOVA approach is presented as a method for more than two classes. Could it not also be used in the case of r = 2?
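To make the role of the sample effect concrete, a sketch of a randomized-block two-way ANOVA (model × sample, in the spirit of the CVANOVA approach; written in Python with invented error values, not the manuscript’s code or data) could look like:

```python
# Illustrative sketch: two-way ANOVA comparing r models on the same s test
# samples, with "sample" included as a blocking factor. Data are invented.
import numpy as np
from scipy import stats

# errors[i, j] = squared error of model i on sample j (r=3 models, s=6 samples)
errors = np.array([
    [0.20, 0.35, 0.15, 0.40, 0.25, 0.30],
    [0.18, 0.33, 0.14, 0.37, 0.22, 0.28],
    [0.25, 0.40, 0.20, 0.45, 0.30, 0.36],
])
r, s = errors.shape
grand = errors.mean()

# Sums of squares for the two-way layout without replication.
ss_model = s * ((errors.mean(axis=1) - grand) ** 2).sum()
ss_sample = r * ((errors.mean(axis=0) - grand) ** 2).sum()
ss_total = ((errors - grand) ** 2).sum()
ss_resid = ss_total - ss_model - ss_sample

ms_model = ss_model / (r - 1)
ms_resid = ss_resid / ((r - 1) * (s - 1))
F = ms_model / ms_resid
p = stats.f.sf(F, r - 1, (r - 1) * (s - 1))
print(f"F = {F:.2f}, p = {p:.4f}")
# Omitting the sample effect would fold ss_sample into the residual,
# inflating ms_resid and masking real differences between the models.
```

Spelling out this kind of consequence (here, the sample-to-sample variation swamping the model effect if the sample term is dropped) would, I believe, help readers formulate the ANOVA model correctly.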


    © 2024 the Reviewer.

Content of review 2, reviewed on March 13, 2024

Dear author,
Thank you for addressing my concerns. I consider this an excellent tutorial and recommend that the manuscript be accepted for publication. A few minor comments below, which the author can consider at his convenience.

Matlab examples
• It would be helpful to state which Matlab version was used, in case of any previous or future changes to the functions.

Grammar
• Quote: “(…) carry out routine normality tests but to plot the data is some way”, should likely be “in some way”.


    © 2024 the Reviewer.

References

    Tom, F. 2024. Testing differences in predictive ability: A tutorial. Journal of Chemometrics.