Content of review 1, reviewed on February 06, 2024

My overall impression is that the use of a Subjective Evidence Evaluation Survey (SEES) is a very good idea for those running many-analysts studies. I think the paper illustrates this, but fails to really make the point empirically. Instead it is more of a showcase of how to use the SEES. I see this as a missed opportunity to make a simpler claim. In addition to all they have argued for here, it would be useful to make a final point comparing how many-analysts evidence has been evaluated in the past with how the authors evaluate it. In the past, many-analysts studies have used single or very basic questions to evaluate results and other teams’ models. Here, the focus is on each team subjectively evaluating the data and their own models and results using 18 questions. Why not make a direct comparison of the many-analysts results from the Religion project without these 18 questions and with them, and see how the evidence and possibly the conclusions differ? This, I would think, would be a key point to make in the article. It could come in a section right before Limitations, which appears too suddenly in my opinion. I would like to read a concrete summary of results in easy-to-understand language before going into the limitations. In other words, we want to know that a method is useful before seeing how to use it. I really am a fan of the research and method proposed here; therefore I am going to be relatively harsh in my comments, with the expectation that this is published.
1. The accepted or most common phrase to describe these studies is “many-analysts” or “many analysts”. A Google Scholar search reveals 126,000 papers using “many-analysts”, many of them with this in the title. A search for “multi-analysts” reveals 3,400 and “multi-analyst” reveals 28. The authors are using “multi-analyst”, the least common of all. I think the authors are promoting the wrong word and creating confusion. The Silberzahn and Uhlmann 2015 reference does not use the term “multi-analyst”, for example. They use “crowdsourcing”, which I think is a word worth keeping in the loop because it reflects the truth of what is taking place: crowdsourcing science.

  1. The distinction between “economics” and “cognitive sciences” is disturbing if not offensive. Especially in a homogeneous group of authors such as this, I would hope that economics is put in its rightful place as one among the “social and behavioral sciences” and nothing worthy of standing out on its own. Science has already suffered enough from economists tricking everyone into thinking they are something more scientific or ‘better’ than other (social) sciences.

  2. Our study (Breznau et al., 2022) had extensive usage of sub-group analyses, interactions and non-linear effects, and divergent numbers of models per team, and one team did not even run any analyses because they decided that the items in the scale were not measurement invariant. I guess if one were to cite studies that expose this ‘hidden universe’, ours is a central candidate. Yes, of course I am biased in self-promoting here, but I am also not wrong.

  3. The authors’ argument that attention only to effect sizes is “incomplete” is not at all novel, although they present it that way at the start of the section “Assessment of Subjective Evidence” (Auspurg & Brüderl, 2021; Liu et al., 2021; Mathur et al., 2022; Young & Holsteen, 2017).

  4. The authors state “The idea of collecting a subjective assessment of research evidence is uncommon in the quantitative social and behavioral sciences.” I don’t understand. Almost every single many-analysts study that fits into the social and behavioral sciences, including the ‘first’ one from Silberzahn et al., collected subjective feedback on the other teams’ models. This suggests exactly the opposite of this statement, but in a way that is cross-subjective, not self-subjective. Because every study tried to use subjective feedback, but reduced it to a single metric or was unclear about how to incorporate this evidence, it seems to me that many-analysts studies would improve greatly if their principal investigators took a more systematic approach, and in particular had the analysts report on themselves. The authors point a bit more towards this logic in the section about the development of the SEES, but the section preceding it felt very inaccurate or incorrect. Outside of psychology (sociology and political science in particular) there are sometimes very long-winded discussions about theory and results.

  5. The authors make a big deal about measurement but use a 4-point scale without a midpoint. Although there are good reasons to omit a midpoint, a 4-point scale is sub-optimal in nearly all studies of psychometrics. Their choice of this scale should be defended, or the failure of the authors (such a large team) to take into account best practices in scaling should be openly disclosed (see the life’s work of Krosnick, for example), or both, in my opinion.

  6. Table 1 is too sudden or poorly labeled in my opinion; I actually think they should throw it out (see next point). What does “Plausibility” mean (and all following constructs)? I guess the 8 items go with the single items above? If they really reflect the 8 items from above (single questions), why is question two labeled “Robustness” when it is really about “Power” (that there are not enough cases, see question 2 in the text)? This comment will no doubt frustrate the authors, but I am trying to relay my honest experience as a reader, so that the text becomes crystal clear. Maybe put the questions near the table in the text.

  7. I think Tables 1 and 2 and Figures 1 and 3 can be left out of the paper. If the authors label Figure 2 appropriately, it contains all the information (note that there is a discrepancy in error bar caps in the left panel of Figure 2 [FYI]). I also think that if the authors insist on keeping Tables 1 and 2 and Figures 1 and 3, they should present them in identical format. Otherwise the colorful and visually appealing figures seem to subconsciously promote Bayesian inference over simple descriptives. I am not a fan of this subtle attempt at making Bayes appear somehow better. This leads me to the next point:

  8. I am wholly not convinced that getting posterior distributions for single items is optimal; or if it is, the authors have not convinced me why we need this. Look at Figure 2. We gain basically nothing from using the Bayes approach. I think the real issue here is about credible intervals (standard errors), but this is not defended. When these Bayes intervals are presented they are called “item truths”. Really? The only “truth” we can rely on is what we observe. Everything else (standard errors, Bayes intervals) is an educated guess about the true population relating to different aspects of sampling theory. I suggest using a term that reflects what was calculated. Also, relatedly: What about a measurement model? What about a Bayesian measurement model? It sounds very possible that latent variables are present that reflect the participants’ overall impression of features of their work that are not exactly captured by the 8 questions, but nonetheless tapped into. The fact that some other studies used this cultural scaling method does not convince me. What is the underlying argument against using more basic and better understood approaches to measurement, and also against checking that the items measure what they say or what the investigators think they measure? The authors conclude with “We recommend using the outlined Bayesian cultural consensus theory model to analyze the SEES data, but also acknowledge that our analysis strategy is not necessary when employing the SEES. Instead, project leaders may opt to calculate sum scores per subscale and/or overall sum scores for the entire survey, especially when the number of participating analysis teams is low.” If I understand correctly, this argument has it backwards. As the N gets smaller, we stray from the reliability afforded by the central limit theorem; therefore, a Bayes approach might be preferable to calculate credible intervals. As the N gets larger, the simple descriptive (I do not use the term “frequentist” as it was invented by orthodox Bayesians to promote their religion) approach is more and more reliable.

  9. “In a multi-analyst project which aims to reduce ambiguity as much as possible, capturing the subjective evaluations of the analysis teams may not be advised.” I think that the authors should clarify what is meant by “ambiguity” here. Research itself is not confined to certain variables or strict control, so anything that imposes this may reduce the ambiguity in model specification, but certainly not in the reality of real-world research. They mostly defend this in the next paragraph, but I still see that two different types of ambiguity are being discussed somewhat interchangeably.

  10. As per my argument in the introductory paragraph, I think the presentation of this method would achieve its goals if the authors briefly mentioned how this approach would have improved previous many-analysts studies. Taking the football referee study as an example, I think that in one paragraph or less they could outline how much more we could have learned in this and possibly other studies. This would be the key argument to help convince future many-analysts studies to design their research differently. However, I think the SEES approach is applicable outside of many-analysts studies, and this is a great potential. It is worth mentioning and linking to the movement toward better specifying estimands (Lundberg et al., 2021).

Auspurg, K., & Brüderl, J. (2021). Has the Credibility of the Social Sciences Been Credibly Destroyed? Reanalyzing the “Many Analysts, One Data Set” Project. Socius, 7, 1–14. https://doi.org/10.1177/23780231211024421
Breznau, N., Rinke, E. M., Wuttke, A., Nguyen, H. H. V., Adem, M., Adriaans, J., Alvarez-Benjumea, A., Andersen, H. K., Auer, D., Azevedo, F., Bahnsen, O., Balzer, D., Bauer, G., Bauer, P. C., Baumann, M., Baute, S., Benoit, V., Bernauer, J., Berning, C., … Żółtak, T. (2022). Observing many researchers using the same data and hypothesis reveals a hidden universe of uncertainty. Proceedings of the National Academy of Sciences, 119(44), e2203150119. https://doi.org/10.1073/pnas.2203150119
Liu, Y., Kale, A., Althoff, T., & Heer, J. (2021). Boba: Authoring and Visualizing Multiverse Analyses. IEEE Transactions on Visualization and Computer Graphics, 27(2), 1753–1763. https://doi.org/10.1109/TVCG.2020.3028985
Lundberg, I., Johnson, R., & Stewart, B. M. (2021). What Is Your Estimand? Defining the Target Quantity Connects Statistical Evidence to Theory. American Sociological Review, 86(3), 532–565. https://doi.org/10.1177/00031224211004187
Mathur, M. B., Covington, C., & VanderWeele, T. J. (2022). Variation across analysts in statistical significance, yet consistently small effect sizes. Proceedings of the National Academy of Sciences.
Young, C., & Holsteen, K. (2017). Model Uncertainty and Robustness: A Computational Framework for Multimodel Analysis. Sociological Methods & Research, 46(1), 3–40. https://doi.org/10.1177/0049124115610347

Source

    © 2024 the Reviewer.

Content of review 2, reviewed on April 08, 2024

The revisions are impressive, and adequate in my opinion. I think the authors can be satisfied with the paper as is. But I would push for stronger language. For example, "The implementation of SEES may facilitate the identification of such nuanced insights that otherwise remain hidden and may prompt team leaders to delve deeper into analytic results underlying skeptical responses." is a very standoffish sentence. It is clear to me that the SEES facilitates identification, not that it "may". But either way, I am happy the point is brought forward, and if the authors are satisfied being so cautious in their wording, I think that is their prerogative. There were some small grammatical errors that I assume will get hammered out in final production (plus a citation in which a single word appeared, strangely, fully in bold).

Source

    © 2024 the Reviewer.

References

    Alexandra, S., Suzanne, H., Don, v. d. B., Balazs, A., J., A. C., Tim, A., Rotem, B., A., B. N., M., C. A., Berna, D., N., v. D. N. N., Anna, D., I., F. E., Rink, H., Sabine, H., Felix, H., Juergen, H., Nick, H., John, I., Magnus, J., Michael, K., Eric, L., Jan-Francois, M., Dora, M., J., M. A., Gustav, N., Don, v. R., Martin, S., Hannah, S., R., S. D., J., S. D., A., S. B., H., S. A., Barnabas, S., Darinka, T., Francis, T., L., U. E., Wolf, V., Jelte, W., Eric-Jan, W. 2024. Subjective evidence evaluation survey for many-analysts studies. Royal Society Open Science.