Content of review 1, reviewed on July 15, 2019

This paper presents the results of analyzing several datasets with a range of short read aligners and variant callers. The analysis is exhaustive and the results are important for researchers conducting these types of analyses, especially when using a single reference genome. The results seem to confirm results seen by others, specifically Bertels et al. (PMID:24600054) and Sahl et al. (PMID:28348869), neither of which is cited. The RealPhy paper suggests using multiple reference genomes and merging the results to mitigate the effects of a distant reference.

The goal of the paper is to analyze 'SNP pipelines', although only a single 'self-contained' SNP pipeline (Snippy) is included. I would argue that the rest of the analyses are based on aligner/variant caller pairs and not complete SNP pipelines. While this could be a semantic issue, comparing Snippy with these other methods could be considered an apples-to-oranges comparison. Out of the dozens of 'self-contained' pipelines, why was only Snippy used? The fact that Snippy is performing much better than its corresponding aligner/variant caller pairs suggests that it is doing additional work not performed by other 'pipelines'.

For introduced SNPs, it would be nice to know which SNPs are in paralogs and tandem repeats. These regions could be problematic and may introduce false positives due to mis-mapping. While the authors discuss that using long reads could fix some of these problems, the effects of including these regions on the results should be considered. For example, the true positive SNPs in the real data analyses are based on MUMmer and Parsnp, neither of which filters paralogous regions. The nature of the alignment algorithm would likely control how many false SNPs were reported in these regions and could impact overall performance.

Some discussion on how these effects could impact data interpretation would be helpful. In the case of transmission events, one would assume that a closely related reference would be chosen, which would mitigate biases and may make the results less sensitive to the aligner/caller used. How would these results affect large population genomics studies?

Declaration of competing interests

Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews: Reviewer #1: This paper presents the results of analyzing several datasets with a range of short read aligners and variant callers. The analysis is exhaustive and the results are important for researchers conducting these types of analyses, especially when using a single reference genome.

The results seem to confirm results seen by others, specifically Bertels et al. (PMID:24600054) and Sahl et al. (PMID:28348869), neither of which is cited. The RealPhy paper suggests using multiple reference genomes and merging the results to mitigate the effects of a distant reference.

Response: We have expanded the text to discuss other approaches to overcoming issues that arise when using a single reference genome, and have added the two references suggested by the reviewer. Specifically, we have added, from line 516, the text: “An alternative approach to reducing errors introduced when using a single reference genome could be to merge results from multiple reference genomes (the approach taken by REALPHY to reconstruct phylogenies from bacterial SNPs [98]) or from multiple aligners and/or callers, obtaining consensus calls across a set of methods. This is the approach taken by the NASP pipeline [99], which can integrate data from any combination of the aligners Bowtie2, BWA-mem, Novoalign and SNAP, and the callers GATK UnifiedGenotyper, mpileup, SolSNP and VarScan (ensemble approaches have similarly been used for somatic variant calling, for example by SomaticSeq [100]).”
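
To illustrate the consensus idea described in this added text, here is a toy sketch (in Python, for illustration only – not the NASP implementation; the function name and the default majority threshold are ours) of majority-vote merging of SNP calls across callers:

    from collections import Counter

    def consensus_snps(calls_by_caller, min_support=None):
        """Keep SNPs reported, with the same ALT allele, by a majority of callers.

        calls_by_caller: dict mapping caller name -> {(chrom, pos): alt_base}
        min_support: minimum number of agreeing callers (default: strict majority).
        """
        if min_support is None:
            min_support = len(calls_by_caller) // 2 + 1  # strict majority

        support = Counter()
        for calls in calls_by_caller.values():
            for site, alt in calls.items():
                support[(site, alt)] += 1

        return {site: alt for (site, alt), n in support.items() if n >= min_support}

    # Toy example: three callers; a call must be made identically by at least two.
    calls = {
        "freebayes": {("chrom", 100): "A", ("chrom", 250): "T"},
        "gatk":      {("chrom", 100): "A", ("chrom", 300): "G"},
        "mpileup":   {("chrom", 100): "A", ("chrom", 250): "T"},
    }
    print(consensus_snps(calls))  # {('chrom', 100): 'A', ('chrom', 250): 'T'}

NASP's actual behaviour (for example, its treatment of missing or ambiguous calls) is of course more involved.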

The goal of the paper is to analyze 'SNP pipelines', although only a single 'self-contained' SNP pipeline (Snippy) is included. I would argue that the rest of the analyses are based on aligner/variant caller pairs and not complete SNP pipelines. While this could be a semantic issue, comparing Snippy with these other methods could be considered an apples-to-oranges comparison. Out of the dozens of 'self-contained' pipelines, why was only Snippy used? The fact that Snippy is performing much better than its corresponding aligner/variant caller pairs suggests that it is doing additional work not performed by other 'pipelines'.

Response: We had used ‘pipeline’ as shorthand for ‘aligner/caller combination’, but we agree they are not synonymous. To that end, we now state early in the introduction (line 87) that:

“SNP calling pipelines are typically constructed around a read aligner (which takes FASTQ as input and produces BAM as output) and a variant caller (which takes BAM as input and produces VCF as output), often with several pre- and post-processing steps (for instance, cleaning a raw FASTQ prior to alignment, or filtering a BAM prior to variant calling). For the purpose of this study, when evaluating the two core components of aligner and caller, we use ‘pipeline’ to mean ‘an aligner/caller combination, with all other steps in common’.”

Further to the description of each aligner and caller used in this study, we now also note (line 106) that: “where possible, we applied a common set of pre- or post-processing steps to each aligner/caller combination, although note that these could differ from those applied within an ‘all-in-one’ tool (discussed further in Supplementary Text 1).”
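
As a purely illustrative sketch of this skeleton (not the authors' pipeline: it assumes bwa, samtools and freebayes are on the PATH, uses placeholder file names, omits the pre- and post-processing steps discussed above, and exact flags may vary between tool versions):

    import subprocess

    def run(cmd, stdout=None):
        """Run one pipeline step, raising an error if it fails."""
        subprocess.run(cmd, stdout=stdout, check=True)

    ref, r1, r2 = "ref.fasta", "reads_1.fastq", "reads_2.fastq"  # placeholder inputs

    run(["bwa", "index", ref])                              # index the reference
    with open("aln.sam", "w") as sam:
        run(["bwa", "mem", ref, r1, r2], stdout=sam)        # FASTQ -> SAM (aligner)
    run(["samtools", "sort", "-o", "aln.bam", "aln.sam"])   # SAM -> sorted BAM
    run(["samtools", "index", "aln.bam"])
    with open("calls.vcf", "w") as vcf:
        run(["freebayes", "-f", ref, "aln.bam"], stdout=vcf)  # BAM -> VCF (caller)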

The advantage to users (especially less experienced users) of having “all-in-one/self-contained” SNP analysis pipelines is nevertheless clear, in that they can substantially streamline bioinformatics workflows; we therefore believe they are useful to include in our study. We have now expanded the evaluation to contain two other ‘all-in-one’ pipelines, SpeedSeq and SPANDx, and discuss in the supplementary text (line 719) why some others could not reasonably be used – in certain cases, because they offer the user a choice of aligner and/or caller (such as PHEnix) and so cannot easily be evaluated as a single entity. Specifically, at line 436 of the main text, we have added: “in this study we sought to use all aligners and callers uniformly, with equivalent quality-control steps applied to all reads. To that end, while direct comparisons of any aligner/caller pipeline with ‘all-in-one’ tools (such as Snippy, SPANDx and SpeedSeq) are possible, the results should be interpreted with caution. This is because it is in principle possible to improve the performance of the former through additional quality control steps – that is, compared to an ‘all-in-one’ tool, it is not necessarily the aligner or caller alone to which any difference in performance may be attributed. For instance, although Snippy and SpeedSeq employ BWA-mem and Freebayes, both tools are distinct from the BWA-mem/Freebayes pipeline used in this study (Figure 7 and Supplementary Table 10). This is because they implement additional steps between the BWA and Freebayes components, as well as altering the default parameters relative to standalone use. Snippy, for example, employs samclip (https://github.com/tseemann/samclip) to post-process the BAM file produced by BWA-mem, removing clipped alignments in order to reduce false positive SNPs near structural variants”.
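
To make the samclip step concrete: the underlying idea is to drop alignments whose CIGAR strings show substantial soft or hard clipping. Below is only a rough sketch of that idea over a SAM stream (samclip itself is a Perl tool with additional logic, such as retaining clipped reads near contig ends; the 10-base threshold and script name here are illustrative):

    import re
    import sys

    CLIP = re.compile(r"(\d+)[SH]")  # soft (S) or hard (H) clip operations in a CIGAR

    def clipped_bases(cigar):
        """Total number of clipped bases recorded in a CIGAR string."""
        return sum(int(n) for n in CLIP.findall(cigar))

    def filter_sam(lines, max_clip=10):
        """Yield SAM lines, dropping alignments clipped by more than max_clip bases."""
        for line in lines:
            if line.startswith("@"):      # keep header lines untouched
                yield line
                continue
            cigar = line.rstrip("\n").split("\t")[5]  # column 6 of a SAM record
            if cigar == "*" or clipped_bases(cigar) <= max_clip:
                yield line

    # Usage (illustrative): samtools view -h aln.bam | python filter_clipped.py > kept.sam
    if __name__ == "__main__":
        sys.stdout.writelines(filter_sam(sys.stdin))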

For introduced SNPs, it would be nice to know which SNPs are in paralogs and tandem repeats. These regions could be problematic and may introduce false positives due to mis-mapping. While the authors discuss that using long reads could fix some of these problems, the effects of including these regions on the results should be considered. For example, the true positive SNPs in the real data analyses are based on MUMmer and Parsnp, neither of which filters paralogous regions. The nature of the alignment algorithm would likely control how many false SNPs were reported in these regions and could impact overall performance.

Response: We agree that the retention of paralogous regions would likely increase the rate of read mis-mapping and thereby the number of false positive calls, although assuming this to be a systematic error, it should not affect the rank order of pipelines. In the ‘study limitations’ section of the discussion, we have added this point to the main text (line 365): “For the strain-to-representative genome alignments in this study, we considered SNP calls only within one-to-one alignment blocks and cannot exclude the possibility that repetitive or highly mutable regions within these blocks have been misaligned. However, we did not seek to identify and exclude SNPs from these regions as, even if present, this would have a systematic negative effect on the performance of each pipeline. To demonstrate this, we re-calculated each performance metric for the 209 pipelines evaluated using real sequencing data after identifying, and masking, repetitive regions of the reference genome with self-self BLASTn (as in [77]). As we already required reference bases within each one-to-one alignment block to be supported by both nucmer and ParSnp calls (that is, implicitly masking ambiguous bases), we found that repeat-masking the reference genome had negligible effect on overall F-score although marginally improved precision (see Supplementary Text 1).”
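
For concreteness, the precision, recall and F-score referred to here are presumably the standard definitions used in variant-calling benchmarks; a minimal illustrative calculation from per-pipeline counts of true-positive (TP), false-positive (FP) and false-negative (FN) SNP calls:

    def precision_recall_f(tp, fp, fn):
        """Standard precision, recall and F-score from SNP call counts."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        return precision, recall, f_score

    # e.g. 95 true SNPs recovered, 5 spurious calls, 10 missed:
    print(precision_recall_f(tp=95, fp=5, fn=10))  # (0.95, 0.9047..., 0.9268...)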

Within Supplementary Text 1, we added the following text at line 662: “To demonstrate the effect of additional repeat-masking, we re-calculated precision, recall and F-score for each of the 209 pipelines evaluated using real sequencing data (i.e., when aligning 18 sets of non-simulated reads against one of the six representative Gram-negative genomes detailed in Supplementary Table 8). We did not test the effect of repeat-masking using the simulated E. coli datasets (as above) because this represents only one reference genome (i.e., E. coli K-12 substr. MG1655). Repetitive regions in each genome were first identified by self-self BLASTn (as in [78]), using BLAST+ v2.7.1 with default parameters, and we considered those with alignments of ≥ 95% identity over a length of ≥ 100 bp, with no more than 1 gap, and an E-value < 0.05 (not including the match of the entire genome against itself).” We also illustrate the effect of additional masking on the F-score, precision and recall distributions with a new figure within Supplementary Text 1 (on page 33).
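
The filtering step quoted above could be sketched as follows, assuming self-self BLASTn has already been run with tabular output (-outfmt 6, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). This is an illustrative Python rendering of the stated thresholds, not the authors' Perl implementation, and it treats the gapopen column as the gap count:

    import csv

    def repeat_regions(blast_tsv, min_ident=95.0, min_len=100, max_gaps=1, max_evalue=0.05):
        """Yield query-side intervals from self-self BLASTn (-outfmt 6) that look
        repetitive under the stated thresholds, as (chrom, start, end) BED-style tuples."""
        with open(blast_tsv) as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) < 12:
                    continue
                (qseqid, sseqid, pident, length, _mismatch, gapopen,
                 qstart, qend, sstart, send, evalue, _bitscore) = row[:12]
                # Skip the trivial match of the entire genome against itself.
                if qseqid == sseqid and qstart == sstart and qend == send:
                    continue
                if (float(pident) >= min_ident and int(length) >= min_len
                        and int(gapopen) <= max_gaps and float(evalue) < max_evalue):
                    lo, hi = sorted((int(qstart), int(qend)))
                    yield qseqid, lo - 1, hi  # 0-based, half-open (BED) coordinates

    # Example: write the intervals to a BED file for downstream masking.
    # with open("repeats.bed", "w") as bed:
    #     for chrom, start, end in repeat_regions("self_vs_self.blastn.tsv"):
    #         bed.write(f"{chrom}\t{start}\t{end}\n")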

Some discussion on how these effects could impact data interpretation would be helpful. In the case of transmission events, one would assume that a closely related reference would be chosen, which would mitigate biases and may make the results less sensitive to the aligner/caller used. How would these results affect large population genomics studies?

Response: We agree that this is a useful point to include, but would note that many transmission studies use a single reference so that when mapping all isolates (i.e. both putative outbreak and non-outbreak isolates), the reference is typically most similar to the outbreak isolates of interest, or is chosen because a particular genome has widespread prior use in similar evaluations. We have added to the discussion (line 478): “More closely related genomes would have lower Mash distances and so be more suitable as reference genomes for SNP calling. This would be particularly appropriate if, for example, studying transmission events as a closely-related reference would increase specificity, irrespective of the aligner or caller used. For larger studies that require multiple samples to be processed using a common reference, the choice of reference genome could be one which ‘triangulates’ between the set of samples – that is, has on average a similar distance to each sample, rather than being closer to some and more distant from others.”
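
The 'triangulating' heuristic described above could be implemented along these lines, given a precomputed table of Mash distances from each candidate reference to each sample (a toy sketch; scoring by mean plus spread is one possible choice, and one could equally minimise the maximum distance):

    from statistics import mean, pstdev

    def choose_reference(distances):
        """Pick the candidate reference that is both close to, and evenly distant
        from, all samples.

        distances: dict mapping reference name -> {sample name: Mash distance}
        """
        def score(sample_dists):
            values = list(sample_dists.values())
            return mean(values) + pstdev(values)  # penalise uneven distances

        return min(distances, key=lambda ref: score(distances[ref]))

    # Toy example: refB is slightly more distant on average but far more even.
    mash = {
        "refA": {"s1": 0.001, "s2": 0.015, "s3": 0.016},
        "refB": {"s1": 0.012, "s2": 0.013, "s3": 0.012},
    }
    print(choose_reference(mash))  # refB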

Reviewer #2: In this paper, Bush et al. evaluate a large number of bacterial SNP calling pipelines against variously divergent references. Their main conclusion is that different pipelines perform very differently as the reference diverges, and that Jaccard similarity is a good way to choose the "best" (closest) reference for mapping.

This paper is full of nice figures and analyses, and moreover we have seen the same thing in our work, so I agreed with the major points of the paper in advance!

The only real weakness I see in the paper is that the authors use simulated data, which comes with many advantages but also means that oddball sequencer mistakes are not necessarily measured. This is an acceptable tradeoff to me, but I wanted to mention it...

Response: We initially used simulated data from 10 species, although the latter half of the results section employed real data from 16 environmentally-sourced samples plus 2 reference strains (detailed from line 730 onwards and made available as Supplementary Dataset 2). The “real-world” isolates used are members of the Enterobacteriaceae bacterial family and are typically genetically complex (i.e. having multiple orthologs/paralogs, repeats, etc.), thus representing, in our minds, an appropriate analytical challenge.

I think the general conclusion that Jaccard similarity (or, really, ANI) is the best way to choose reference genomes is both important and indisputable, so it's nice to see a thorough evaluation of it.
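
For readers unfamiliar with the measure, the k-mer Jaccard similarity that Mash estimates (via MinHash) is simply the size of the shared k-mer set divided by the size of the union; an exact toy version (ignoring canonical k-mers for brevity):

    def kmers(seq, k=21):
        """Set of k-mers in a sequence (no reverse-complement canonicalisation)."""
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def jaccard(seq_a, seq_b, k=21):
        """Exact k-mer Jaccard similarity between two sequences."""
        a, b = kmers(seq_a, k), kmers(seq_b, k)
        return len(a & b) / len(a | b) if (a or b) else 0.0

    print(jaccard("ACGTACGTACGTACGTACGTACGTA", "ACGTACGTACGTACGTACGTACGGA", k=11))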

I encourage the authors to make their evaluation code, scripts, notebooks, figure generation, etc. available. I could not seem to find it. Reproducibility is minimal but acceptable given Supp Text 1.

Response: We agree that reproducibility is critical to benchmarking studies and to that end have supplemented the pseudocode of Supplementary Text 1 by: (a) Making the full set of evaluation and figure creation scripts available as a public archive, Supplementary Dataset 2 (https://ora.ox.ac.uk/objects/uuid:8f902497-955e-4b84-9b85-693ee0e4433e). This archive also contains both the raw data necessary for evaluation (i.e. reads and indexed reference genomes) alongside example output (i.e. VCFs and summary tables). (b) Adding an additional ‘operating notes’ section to Supplementary Text 1, detailing our specific experience with certain tools, with particular regard to bugs and workarounds. This section may be considered a ‘laboratory notebook’.

Source

    © 2019 the Reviewer (CC BY 4.0).

Content of review 2, reviewed on December 17, 2019

The authors did a good job at addressing my previous comments as well as expanding the analyses to cover a more diverse suite of tools. The authors still use 'pipeline' to sometimes describe an aligner/variant caller and also an all-in-one method, which may cause confusion, but is ultimately their decision. The authors still mention Snippy as one of the best performing tools, which seems odd considering the performance in Supplementary Table 10 using real data. Perhaps the authors could state that Snippy did well on simulated data, while other tools performed better on real data. The captions on the supplementary tables could also be updated to differentiate between simulated and real data.

Additionally, the authors include an analysis that masks repeats using BLAST. However, the thresholds chosen for BLAST will likely only mask very similar paralogs, while the more divergent paralogs are expected to have a greater impact on mis-mapping and variant discovery (this could just be a discussion point).

Some additional thoughts that may improve the manuscript:

L306: The authors should mention that they also now include 2 additional "all-in-one" pipelines.

L1127-1128: Please check this link. I received a 404 error when I tried to access it. The link in the response to reviewers did work for me.

Figure 7: The x-axis labels don't line up with the bars, which makes it difficult to interpret. Would staggering the labels between the top and bottom of the graph help with this?

Declaration of competing interests

Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews:

For the Perl scripts, we would recommend putting these in a code repository and including a software section at the end of the paper, structured as follows:

Availability of supporting source code and requirements
Project name: e.g. My bioinformatics project
Project home page: e.g. https://github.com/ISA-tools
Operating system(s): e.g. Platform independent
Programming language: e.g. Java
Other requirements: e.g. Java 1.3.1 or higher, Tomcat 4.0 or higher
License: e.g. GNU GPL, FreeBSD etc.
RRID: if applicable, e.g. RRID: SCR_014986 (see below)

Response: we have added the Perl scripts to a GitHub repository, https://github.com/oxfordmmm/GenomicDiversityPaper, now referred to on line 1130. Lines 1133-1141 specify the requirements:
Project name: “Genomic diversity affects the accuracy of bacterial SNP calling pipelines”
Project home page: https://github.com/oxfordmmm/GenomicDiversityPaper
Operating system(s): platform-independent
Programming language: Perl (v5.22.1)
Other requirements: third-party software prerequisites are detailed in documentation provided with Supplementary Dataset 2 (https://ora.ox.ac.uk/objects/uuid:8f902497-955e-4b84-9b85-693ee0e4433e)
License: GNU GPL

Reviewer reports: Reviewer #1: The authors did a good job at addressing my previous comments as well as expanding the analyses to cover a more diverse suite of tools. The authors still use 'pipeline' to sometimes describe an aligner/variant caller and also an all-in-one method, which may cause confusion, but is ultimately their decision. The authors still mention Snippy as one of the best performing tools, which seems odd considering the performance in Supplementary Table 10 using real data. Perhaps the authors could state that Snippy did well on simulated data, while other tools performed better on real data. The captions on the supplementary tables could also be updated to differentiate between simulated and real data.

Response: we removed from the abstract (line 47) the statement that “across the full range of genomes, among the consistently highest performing pipelines was Snippy” as this conclusion was drawn from its performance across both simulated and real datasets, when n=41 pipelines. However, with the expansion of the number of pipelines to 209, and the testing of these additional pipelines only on real data, we sought to keep the conclusions drawn based on real data distinct from those based on simulated data. To that end, we also amended line 549 to read “Nevertheless, Snippy, which employs Freebayes, is particularly robust to this, being among the most sensitive pipelines when evaluated using simulated data (Figure 5 and Supplementary Figure 4).” We have also amended the titles of Figure 5 and Supplementary Figure 4, and Supplementary Tables 3, 4, 6, 7, 13, 14, 15, 16 and 17 to emphasise their use of simulated data (the supplementary tables containing results from real data, numbers 9 and 10, were already so labelled).

Additionally, the authors include an analysis that masks repeats using BLAST. However, the thresholds chosen for BLAST will likely only mask very similar paralogs, while the more divergent paralogs are expected to have a greater impact on mis-mapping and variant discovery (this could just be a discussion point).

Response: we agree that the parameters used for repeat-masking are especially important and have expanded the discussion to include this. We have added, at line 377: “it is important to note that the parameters used for repeat-masking will determine which paralogues will be successfully masked. For the purpose of this study, we used reasonably conservative parameters (detailed in Supplementary Text 1) and so expect to have primarily masked more similar paralogues. The likelihood of mis-mapping (and thereby false positive SNP calling) would increase among more divergent paralogues, although optimising parameters to detect these is non-trivial. More lenient repeat-masking parameters, in masking more divergent positions, would also reduce the number of true SNPs it is possible to call.” This has also been added to the supplementary text, at lines 680-686.

Some additional thoughts that may improve the manuscript:

L306: The authors should mention that they also now include 2 additional "all-in-one" pipelines.

Response: we have revised the sentence to read “we next expanded the scope of the evaluation to 209 pipelines (representing the addition of 12 aligners, 4 callers, and 2 ‘all-in-one’ pipelines, SpeedSeq and SPANDx)…”

L1127-1128: Please check this link. I received a 404 error when I tried to access it. The link in the response to reviewers did work for me.

Response: I’m afraid we can’t replicate this 404 – we’ve re-checked the link (https://ora.ox.ac.uk/objects/uuid:8f902497-955e-4b84-9b85-693ee0e4433e) and do find it accessible.

Figure 7: The x-axis labels don't line up with the bars, which makes it difficult to interpret. Would staggering the labels between the top and bottom of the graph help with this?

Response: we have re-drawn Figure 7 with better-positioned x-axis labels.


Source

    © 2019 the Reviewer (CC BY 4.0).

References

    Bush S. J., Foster D., Eyre D. W., Clark E. L., De Maio N., Shaw L. P., Stoesser N., Peto T. E. A., Crook D. W., Walker A. S. Genomic diversity affects the accuracy of bacterial single-nucleotide polymorphism–calling pipelines. GigaScience.