Content of review 1, reviewed on February 01, 2023
This manuscript presents a method to improve the identification of substitutions in viral genomes associated with zoonotic spillovers. The authors propose to apply an established tool capable of accounting for population stratification (due to patterns associated with shared ancestry or to uneven sampling - a very common issue) in genome-wide association studies, in order to reduce the risk of spurious associations between viral genotypes and spillovers. This study addresses a challenging question of great relevance, as there are more and more studies reporting virus diversity in potential zoonotic virus reservoirs, and increased interest in identifying the genetic drivers of zoonotic spillovers. The introduction is brilliant. It is pedagogical and pleasant to read - which is also true for the discussion. The context is very well presented, and the study very well justified. The study itself combines a simulation study with an application to a real data set, enriching the literature with both methodological contributions and novel biological insights regarding the potential genetic drivers of Lassa virus spillovers, an important but neglected infectious disease. The study also includes a hybrid analysis, adding a simulated component to a real data set, ensuring the realism of the explored scenarios, while presenting a procedure to assess the performance of the analytical pipeline. I find the results clear, convincing, and not oversold, as the authors honestly discuss the limitations of the proposed method. I do have to admit that I have limited experience with methodologies for genetic analyses (my background being primarily in ecology and epidemiology) and cannot thoroughly review the methods. I provide suggestions below aiming at clarifying some aspects of the methods to make them more accessible to a broader audience, as I believe this study has the potential to be of interest to people beyond the genetics community.
Finally, the discussion also includes some practical recommendations. Overall, I believe this study has the potential to be an impactful one on all fronts by proposing a new methodology with a potentially large community of users, reporting novel biological results and guiding future studies in the very dynamic field of genotype-to-phenotype mapping of zoonotic diseases.
Minor comments
Line 50 - I would suggest citing Figure 1 only later, as at this stage of the manuscript I did not have the background to properly interpret it. The section starting at line 101 does a much better job of introducing the figure.
Line 74 - Could you explain a little bit more (a sentence or two) how EIGENSTRAT corrects for stratification / accounts for shared ancestry?
Line 81 - What is a "site" exactly? A nucleotide base? A codon?
Line 99 - Does the "state" of the host refer to reservoir versus human? If yes, I would move that statement (which is line 98 at the moment) there to clarify. If not, please clarify what this is referring to.
Line 111 - I do not get what the "binary character state" corresponds to. What are the two states considered?
Lines 119-124 - Regarding host attribution, I would have found it more intuitive (and closer to the actual biological process?) to control the state of the host at the end of the simulation of the genetic data, rather than changing the state of the allele - if I understood the procedure correctly. Why did you pick this option? If this was to avoid the final proportion of samples from the reservoir vs humans being "forced" by the genetic data, why not simply simulate more genomes (e.g., 2000) and then sample a subset, as this would happen in real life anyway? Is it because of the computation time?
Lines 125-126 - For epidemiologists reading the paper, I would suggest adding that Type II and Type I errors correspond to false negatives and false positives, respectively.
Line 140 - Does limiting the data set to genomes from Sierra Leone artificially decrease stratification? From the results and the discussion, I imagine so, and that you are aware of it (and that it was actually done on purpose), but at this stage of the manuscript, it does read like a potential bias. I suggest adding either something like "for the sake of simplicity for this case study, we restricted our analysis to Sierra Leone", or "to avoid really high stratification, we restricted our analysis to Sierra Leone (see Discussion)". Related to this, I do not understand why you did not run the analyses on a complete data set, including for instance Nigerian data. It looks like you are speculating on the outcomes of such an analysis (lines 301-305 - although I agree that the prediction looks easy and safe), while it would not have cost much more to run it. Could you explain why you did not run it?
Line 149 - I would add here the number of human and rodent genomes used in the study.
Line 199 - I would refer to Supplementary Figure 1 in the legend of Figure 1, saying "The relationship between error and lambda_A is shown in Supplementary Figure 1", and remove that sentence (it is too mysterious regarding the content of the figure to be useful here). Also, please add 0 and 1 on the x-axis of Supp. Fig. 1B, and the A and B labels.
Line 200 - Please remind us what phi stands for here.
Lines 223-226 - I would remove this paragraph as it is redundant with the following ones while bringing less information.
Line 282 - Do you think that 8 of the loci are not significantly associated with spillover in the data subset because of a lack of power due to reduced sample size, or because those associations were spurious - or is it impossible to say anyway?
In the discussion, I would suggest discussing the following points:
In the recommendations (around line 298), I would add a sentence or two on the importance of good metadata (following up on the results reported in lines 131-150).
The simulations are run on a scenario with only 1 locus associated with spillover. Is there any reason to believe that having more would change the performance of the method?
Could you discuss how applicable this approach would be to a system involving a bridge host (e.g., Hendra virus going from bats to horses to humans), which can be sampled or not (or even identified or not)?
The results of the analyses of the Lassa data point towards two SNPs associated with the polymerase structure and interaction with host mRNA. I find this very interesting, as most of the recent discussion regarding host range and adaptation has focused on receptor binding, neglecting all the other processes at play, notably replication. I would highlight this in the discussion (with just one sentence saying that, interestingly, the results point towards replication / within-cell processes as potentially key).
Source
© 2023 the Reviewer.
References
B., W. A. O., H., B. B. H., Bruno, G., J., D. A. J., Joseph, H., Jenna, N., Matej, V., Emmanuel, A., James, B., G., L. E. G., C., K. M. C., T., K. O. T., Anna, S., H., R. C. H., L., N. S. L. 2023. Identifying the genetic basis of viral spillover using Lassa virus as a test case. Royal Society Open Science.
