Content of review 1, reviewed on October 01, 2020

The current study analyses how interindividual differences affect memory contextualization (MC) of material of different valence at two critical time points (the acute phase and the recovery phase). MC was assessed with a face recognition and fear conditioning task at either 30 min (acute phase) or 2 h (recovery phase). To this end, several models were tested with MC as the dependent variable using two approaches, a theory-driven one and a data-driven one, plus a third, hybrid approach. The study is very interesting and the experiments appear to have been well designed and executed, resulting in a rich dataset that is well worth exploring. I think this paper can be a nice addition to the stress literature, but in my opinion certain methodological issues need to be addressed first.

A) The lack of correction for multiple comparisons, or the use of a less restrictive alpha significance threshold, is justified by the authors by the explorative nature of this study. This is understandable, but it should be made clearer, for example in the abstract, the introduction or even the title.

B) Methodological questions:
1. Was collinearity tested in the linear models for the theory-driven analysis?
2. On average, 3.2% of values were missing per variable. However, what was the percentage of participants with at least one missing value, and what was the resulting sample size? Would it make sense to run the same analysis on a dataset consisting only of complete cases?
3. Do the significant LM terms include interactions?
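To make points B1 and B2 concrete, here is a minimal sketch in plain Python of the two checks I have in mind. It is not the authors' pipeline: the function names, the use of `None` to mark missing values and the data layout are my own assumptions for illustration. The variance inflation factor (VIF) of each predictor is taken as the corresponding diagonal element of the inverse of the predictors' correlation matrix; values well above roughly 5 to 10 are a common rule-of-thumb warning sign for collinearity.

```python
import math


def missingness_summary(rows):
    """rows: one dict per participant; None marks a missing value (assumed encoding)."""
    n = len(rows)
    incomplete = sum(1 for r in rows if any(v is None for v in r.values()))
    return {
        "n_total": n,
        "n_complete": n - incomplete,  # sample size of a complete-case analysis
        "pct_with_missing": 100.0 * incomplete / n,
    }


def correlation_matrix(columns):
    """columns: one list of floats per predictor, all of equal length."""
    def corr(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        saa = sum((x - ma) ** 2 for x in a)
        sbb = sum((y - mb) ** 2 for y in b)
        return sab / math.sqrt(saa * sbb)
    return [[corr(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of predictor j = j-th diagonal element of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]
```

Calling `missingness_summary` on the raw data would answer the question about the share of participants with at least one missing value, and `vif` on the predictors of each linear model would document whether collinearity is a concern.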

C) Results:
1. Why weren’t the effects of stress (acute and recovery) on the SAM & HPA (cortisol and alpha-amylase) presented?
2. The study focused on predicting MC, but were the other relationships mentioned in this paper verified? For example: life adversity and trait anxiety shape the stress response; emotional memories are better encoded than neutral ones in the acute phase; the acute phase reduces MC, whereas the recovery phase increases MC.
3. The abstract states: “In the theory-driven model we postulated that life adversity and trait anxiety shape the stress-response, which impacts memory contextualization following acute stress.” Hence, the theory-driven model posits that MC is determined by the individual’s responsivity of the SAM-axis and HPA-axis, and that the reactivity of these systems is shaped by the individual’s trait anxiety and cumulative exposure to life adversity. I was expecting a mediation analysis in which the stress response was tested as the mediator between the individual traits (life adversity and trait anxiety) and memory contextualization. Why wasn’t this analysis done, or at least a study of the indirect path individual traits -> stress response -> MC?
4. Why wasn’t a model with the stress group (Group 1, 2 and 3) as a predictor (or as an instrumental variable) considered?
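As an illustration of the mediation analysis asked for in point C3, the following is a minimal single-mediator sketch in plain Python. It is a simplification under my own assumptions: one trait score `x`, one stress-response measure `m` and one MC score `y` (all hypothetical variable names), ordinary least squares for both paths, and a percentile bootstrap for the confidence interval of the indirect effect a*b. The actual design, with several traits and several stress markers, would need a multivariate version.

```python
import random


def _centered_sums(x, m, y):
    """Sums of squares and cross-products of the mean-centered variables."""
    n = len(x)
    mx, mm, my = sum(x) / n, sum(m) / n, sum(y) / n
    dx = [v - mx for v in x]
    dm = [v - mm for v in m]
    dy = [v - my for v in y]

    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    return dot(dx, dx), dot(dm, dm), dot(dx, dm), dot(dx, dy), dot(dm, dy)


def indirect_effect(x, m, y):
    """a*b, with a from M ~ X and b the coefficient of M in Y ~ X + M."""
    sxx, smm, sxm, sxy, smy = _centered_sums(x, m, y)
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)
    return a * b


def bootstrap_ci(x, m, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            estimates.append(indirect_effect(
                [x[i] for i in idx], [m[i] for i in idx], [y[i] for i in idx]))
        except ZeroDivisionError:
            continue  # skip degenerate resamples (no variance left)
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lo, hi
```

A bootstrap interval for a*b that excludes zero would support the indirect path traits -> stress response -> MC that the theory-driven model postulates.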
I have a particular problem with the data-driven results and the way they are presented, but perhaps the authors can argue or justify why some choices were made (feel free to reply to more than one point at the same time):
5. The data-driven results are presented without their corresponding statistics, either in the text or in the figures. Furthermore, the direction of the effect is attributed to Boruta, but this algorithm does not provide that information. In fact, the direction is inferred from a scatter plot with a regression line fitted to the data (without the corresponding statistic): “Boruta revealed that exposure to early life adversity was related to more contextualization of emotional fearful information directly after acute stress (Figure 4.A2, A4).”
6. The Methods state: “Because RF analyses do not indicate how the selected variables affect memory contextualization, a scatterplot for memory contextualization by each returned variable is created for interpretation purposes”. This can be very misleading, since it is not known how the RF is using the predictors. Packages are available for R and Python to understand how machine learning methods such as RFs use their predictors to produce an estimate (see, for example, Lundberg’s work on SHAP values, https://www.nature.com/articles/s42256-019-0138-9, or LIME; for tools also available in R, see https://pbiecek.github.io/ema/).
7. The points presented in section 4.2 are therefore not always supported by significant statistical tests. In my opinion this is a potential issue, since other studies might cite these data-driven findings.
8. In this reviewer’s opinion, the choice of Boruta shouldn’t replace a stronger validation of the features. This method compares the relevance of a feature against the best randomly shuffled feature, so it might still choose features that are not optimal. It is only a feature selection method; selection does not mean that the selected features will result in a predictive model. I would suggest either a statistical test of the pairwise association between these features and memory encoding, or some way of validating the performance of a model that uses these multiple features against a simpler model, in a test set or using random permutation statistics.
9. How can the reader interpret the magnitude of the RMSE in Figure 7? Lower RMSE values indicate better predictive accuracy, as stated in the figure legend, but are those values representative of a good prediction? Could the reader see a plot of the model predictions against the actual dependent variable, or a correlation coefficient?
10. There are other ways to evaluate models in regression problems. For example: mean absolute error, mean squared error, RMSE normalized by the interquartile range of the dependent variable or by the mean, mean absolute percentage error, R squared, etc.
11. After having found a predictive model, there are ways to assess feature importance and how the model uses them (SHAP values for example).
12. Why wasn’t a test/validation set used to evaluate how well the data-driven model generalizes to unseen data? I acknowledge what the authors wrote: “Note, the size of the experimental groups did not allow us to split the dataset in separate training, test and validation sets”. However, the feature selection with Boruta was done on the same dataset as the RMSE calculations. This might lead to overfitting that could only be ruled out with a test dataset. Could you justify your approach better (for example, by citing other studies where this was done)?
13. Without a strong validation of these predictors, I would discourage claims like the following in the abstract: “Newly identified predictors sparked novel hypotheses about non-anxious personality traits, age, mood and states during retrieval of context-related information”. It worries me that the data-driven results create more noise in the literature if they are not properly validated, or if it’s not clear how they were obtained.
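To illustrate points C8 to C10, here is a minimal sketch in plain Python of how the predictions could be summarised with several complementary metrics and checked against chance with a permutation test. The function names are my own, and the sketch assumes that `y_pred` holds out-of-sample predictions (for example from cross-validation in which the feature selection is repeated inside each fold). The permutation test shown here keeps the predictions fixed and shuffles the observed outcomes; a stricter variant would refit the whole pipeline on each shuffled outcome.

```python
import math
import random
import statistics


def rmse(y_true, y_pred):
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, RMSE normalised by the interquartile range, and R squared."""
    n = len(y_true)
    mae = sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n
    q1, _, q3 = statistics.quantiles(y_true, n=4)
    mean_y = statistics.fmean(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    error = rmse(y_true, y_pred)
    return {
        "rmse": error,
        "mae": mae,
        "nrmse_iqr": error / (q3 - q1),
        "r2": 1.0 - ss_res / ss_tot,
    }


def permutation_p_value(y_true, y_pred, n_perm=5000, seed=0):
    """One-sided p value: share of shuffles with an RMSE at least as low as observed."""
    rng = random.Random(seed)
    observed = rmse(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if rmse(shuffled, y_pred) <= observed:
            hits += 1
    # Add-one correction so that the p value can never be exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Reporting the normalised RMSE and R squared next to the raw RMSE would let the reader judge whether the values in Figure 7 reflect a good prediction, and the permutation p value would indicate whether the model does better than chance.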

D) Minor remarks:
1. “stress response” or “stress-response”: hyphenate consistently throughout the document.
2. Page 6: what is surprise retrieval and how was it used here?
3. Check whether the number of digits used when reporting statistics is appropriate (for example, 4 digits for p values).

Source

    © 2020 the Reviewer.

Content of review 2, reviewed on November 19, 2020

Thank you for taking my suggestions into account in your recent revision. I see that a great many modifications were made, which make this work very satisfying to read!
All my remarks were addressed by the changes in the analysis, results, discussion and title of the manuscript.
From my side, I have no further points to raise so I recommend this manuscript for publication.

References

    C., S. M. S., Marian, J., Elbert, G. 2022. Individual differences in the encoding of contextual details following acute stress: An explorative study. European Journal of Neuroscience.