Content of review 1, reviewed on March 28, 2023

The paper reports two studies that try to replicate and extend Williams & Bargh (2008). No replication is found, and the results on possible moderations are inconclusive.

In general, I very much believe that replications of classic findings are important. There is much debate on how to select the right findings to attempt replication, and I think W&B2008 is a worthy target of replication. Unfortunately, I believe that the current studies and the paper as a whole do not live up to the standards of how such replications should be done, and the null finding cannot tell us much. I will lay out my reasoning:

The introduction provides a list of various theoretical arguments on how “embodiment” effects such as W&B2008 could be explained or predicted. Unfortunately, the arguments mostly contradict one another rather than forming a coherent argument. Spreading activation accounts are incompatible with Barsalou’s original work, Proffitt’s theory on hill slant judgment uses a completely different mechanism, and IJzerman’s homeostasis arguments are not compatible either. The introduction does not provide a solid foundation to predict anything, really.

The authors report a power analysis, but the paper doesn’t seem to follow it. First, W&B2008 was obviously vastly underpowered, so their effect size estimate is barely usable. The authors conclude that a sample of 125 would be suitable, but then actually implement another factor (awareness), for which no effect would be expected. No power analysis is provided for the interaction with personality.

On p. 7, the authors mention that the gender of the target was manipulated – this definitely needs to be a factor in the analysis.

It’s not clear how the “statistical control for both the collection round and the location” was done (it is not in Table 2). We never learn the results of that.

p. 7 reports values for a scale on awareness, but I could not find what those items or their response scale were.

Table 1 and the associated text report many analyses of the target effects in various subsamples, without any clear reason for why this should be done and without any statistical correction for multiple tests. Instead, interaction terms should be tested in the regression on the whole sample, and simple slopes should be estimated only if there is an interaction.

Table 1 suggests to me that Awareness is entered as a main effect – but it should be an interaction with the temperature factor?

It’s not clear what the estimates in Table 1 are. Betas? Bs? Note that reporting betas for a contrast-coded factor doesn’t really make sense.

The discussion of Study 1 and the intro to Study 2 seem to suggest that there was an effect that is worthy of exploration, but there clearly wasn’t. Surprisingly, I understood that Study 2 dropped the temperature manipulation completely, and just used warm cups with an additional factor.

Apparently, blood pressure and heart rate were measured in Study 2, but we never learn a rationale, nor results.

We never learn whether the Norwegian participants were tested with scales translated into Norwegian, or the original English. I would not collect data with scales in a language of which participants are not native speakers.

In sum, I cannot recommend this paper for publication. While the replication target study may be worth investigating further (although I personally think it’s not a stringent test of any theory), the study seems to be conducted with many additional factors that introduce noise, the analyses are reported in an unclear way and may not be suitable at all, and only Study 1 really attempts to replicate the effect of the warmth manipulation.

Source

    © 2023 the Reviewer.

Content of review 2, reviewed on November 07, 2023

I copy here again the comments I made on the analysis above, because I wasn't sure whether all of these get forwarded to the authors.

The page numbers below refer to the version with track changes.

This paper revised the original analyses by adding analyses of the combined sample and by adding Bayes factors. Study 2 was removed.

The paper reports analyses of the subsamples at the two locations in detail, before combining them and reporting no effect.

I see two issues: the handling of the power analysis, and the reporting of effects in subsamples.

The presented power analysis (p. 10) first states that N = 62 would be needed for a one-tailed test with 80% power of the effect size re-computed from W&B's data, which was reported here as eta squared = .09. I am wondering: why one-sided? Is this eta squared or partial eta squared? Wouldn't a Cohen's d be more appropriate given that we later see t-tests? But the cover letter actually explains that the power analysis was not done before the collection of these data, so I recommend simply reporting a sensitivity analysis for the collected sample. This is then actually done on p. 11, but this reports the total sample as 125. I don't understand this, because half of those participants were made aware of the manipulation AND hypothesis. I would expect them NOT to show the effect, so I would not include them in the sensitivity analysis.
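
To make the effect-size question concrete, here is a minimal sketch of how an eta squared of .09 translates into Cohen's d and an approximate per-group sample size. It rests on several assumptions that are mine and not the manuscript's: two equal-sized groups, a plain (not partial) eta squared, alpha = .05, and a normal approximation to the t-test. The function names are purely illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist


def eta_squared_to_d(eta_sq: float) -> float:
    """Convert eta squared to Cohen's d (assumes two equal-sized groups)."""
    f = sqrt(eta_sq / (1.0 - eta_sq))  # Cohen's f
    return 2.0 * f                     # d = 2f for two equal groups


def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05,
                one_sided: bool = True) -> int:
    """Approximate n per group for an independent-samples t-test.

    Uses the normal approximation n = 2 * ((z_alpha + z_beta) / d) ** 2,
    which slightly underestimates the exact noncentral-t result.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_beta) / d) ** 2)


if __name__ == "__main__":
    d = eta_squared_to_d(0.09)
    print(f"d = {d:.2f}")
    print("one-sided n per group:", n_per_group(d, one_sided=True))
    print("two-sided n per group:", n_per_group(d, one_sided=False))
```

Under these assumptions the conversion gives d of roughly 0.63, and the two-sided requirement is noticeably larger than the one-sided one, which is why the choice of a one-tailed test matters here. An exact noncentral-t calculation would typically give slightly larger values than this approximation.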

The Results section starts with a claim that exploratory analysis "indicated that location ... moderated the main effect of cup condition ...", but these analyses are never reported. The Combined Samples analysis (p. 18) adds location as a random factor and omits any interactions of location x temperature. We thus never learn whether there really is a significant interaction that would motivate splitting the sample. If there is not, I would not split the sample, but keep location as a (fixed) factor.

I fail to understand why the subsample analyses investigate the main effect of temperature on perceived warmth with a t-test across the awareness manipulation. Why not report the 2 x 2 ANOVA on p. 15 right away? That ANOVA is then actually run in the section "Interaction between beverage temperature and awareness" on p. 15, but only the non-significant interaction effect is reported, not the main effects. I would never bet that the temperature effect could be replicated, but it's weird to test it with a sample in which 50% of participants were made aware of it. I interpret the Introduction as hypothesizing that making participants aware of the hypothesis should counteract the priming (p. 7, citing Firestone).

By splitting across location and having the awareness manipulation, the cell size that actually replicates W&B's original becomes really small. Once the aware participants, for whom we do not expect an effect, are omitted, the location-split analyses test the temperature effect with 30 participants in two between-subjects conditions, substantially fewer than the 125 advertised in the sensitivity analysis, and in fact fewer than in Williams and Bargh's original study. But simply combining them with the aware participants doesn't solve the problem in my opinion.
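
A rough sensitivity sketch shows how quickly the smallest detectable effect grows as the usable cells shrink. It assumes a two-sided independent-samples t-test with equal group sizes, alpha = .05, 80% power, and a normal approximation. The group sizes in the usage lines are hypothetical illustrations (about half of 125 per group versus 30 participants split over two conditions), not cell counts taken from the manuscript.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_d(n_per_group: int, power: float = 0.80,
                     alpha: float = 0.05, one_sided: bool = False) -> float:
    """Smallest Cohen's d detectable with the given n per group.

    Normal approximation: d = (z_alpha + z_beta) * sqrt(2 / n).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha) if one_sided else z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return (z_alpha + z_beta) * sqrt(2.0 / n_per_group)


if __name__ == "__main__":
    # Hypothetical group sizes, for illustration only.
    for n in (62, 15):
        print(f"n per group = {n}: minimum detectable d = {min_detectable_d(n):.2f}")
```

With about 62 per group the design is sensitive to effects of roughly d = 0.5, whereas with 15 per group only effects of about d = 1.0 would be detectable under these assumptions.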

The following interaction analyses with raters' personalities and gender don't report whether they were run on the full sample or only on unaware participants. The first analysis doesn't report the F-test. I guess these were all done with aware participants in the sample.

In sum, I believe the location-split analyses are neither motivated nor useful. What they report isn't even a stringent test of the replication hypothesis.

The analysis of combined samples (p. 18) then finally reports on the whole sample, but it again involves some decisions that I don't understand. As mentioned before, there are no interactions with location. All the other moderators and their interactions are added simultaneously; I suspect that they would be somewhat correlated (gender, warmth, dominance?). I would rather test those separately. In any case, there is no evidence left for any effect of warmth, nor any interaction.

The Discussion (p. 29) claims that no effect was found in a "sample almost three times as large" as the original sample, which is wrong, because it again omits the awareness factor.

The next sentences are also not quite correct: "... we found an effect of warm cup on perceptions of warmth emerged among participants with higher levels of warm personality ... in one of the locations" - simple slopes or estimates at high/low rater warmth were not presented, and Figure 1 rather suggests that an effect emerged in the opposite direction for participants with very low levels of warm personality.

I understand that the authors want to publish these data to contribute to the replicability debate. I think it would suffice to simply show that there is no effect, rather than splitting and testing for moderators. The sample size doesn't allow for that because half of the sample underwent a manipulation that wasn't part of the original study, and the sample was collected in two very different locations and populations.

Source

    © 2023 the Reviewer.

Content of review 3, reviewed on February 26, 2024

Thank you for revising the manuscript so thoroughly.

I have one last suggestion: on p. 11, you report the awareness manipulation check. That check is not reported in the Materials section. You mention a sample question in brackets, but the scale is not presented. It is therefore unclear what the reported average numbers mean (or even whether a high number means low or high awareness). Furthermore, an interaction and one simple effect are reported, but it's not clear what the Norwegian sample showed (no difference?). Please remedy this.

Source

    © 2024 the Reviewer.

References

    Haavard, K., Thorvald, H., Lewend, M., Jaime, P. 2024. Physical and social warmth. Royal Society Open Science.