Review of 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model

Content of review 1, reviewed on May 15, 2017

Luo, R. et al. describe 16GT, a new variant caller optimized for Illumina sequencing data that uses a new 16-genotype probabilistic model to unify SNP and indel calling. They demonstrate improved sensitivity for SNPs and comparable accuracy for indels compared with GATK HaplotypeCaller, using the NA12878 genome from the GIAB project. 16GT more comprehensively models 16 genotypes to unify SNP and indel calling in the same algorithm. 16GT appears to be a useful alternative tool for analyzing germline sequencing data from the Illumina platform. A few comments:

1. The authors need to emphasize that, at least at the moment, 16GT can only be applied to germline sequencing on the Illumina platform and is not appropriate for cancer genome sequencing, especially clinical cancer samples, where tumor cellularity varies greatly and does not fit those models.
2. Can the authors comment on whether the increased SNP sensitivity is due to the incorporation of indels into the model, or whether the additional SNPs called have an indel as the second allele?
3. Can the authors discuss the limitations of 16GT? What is the indel size limit? Should sex chromosomes be treated differently if gender is known?
4. I'm not keen to highlight better indel performance over GATK's UnifiedGenotyper, as it is known not to be a good indel caller and is not widely used for indels nowadays.
5. Given the run time in Table 2, I'm not sure "16GT ran faster" should be in the abstract.

Level of interest Please indicate how interesting you found the manuscript:
An article whose findings are important to those with closely related research interests.

Quality of written English Please indicate the quality of language in the manuscript:
Acceptable.

Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.
I have developed an NGS variant caller, VarDict, for cancer research, which was published in Nucleic Acids Research in 2016.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.

Authors' response to reviews:

Editor

Comments:

One referee flags the comparisons you make, so please make sure there are sufficient comparisons to and citations of the state of the art in this field (e.g. it has been highlighted on the pre-print that Scalpel, VarScan2, VarDict, Mutect2 and Strelka have not been included as benchmarks).

Response: A comparison to VarScan2 has been added to the manuscript. Table 2 now comprises comparisons to six germline variant callers: two state-of-the-art callers, GATK-HC and Freebayes, and four other callers, GATK-UG, Fermikit, ISAAC and VarScan2. VarDict, Mutect2 and Strelka are somatic variant callers and thus were not compared to 16GT. Scalpel is an indel caller that doesn’t detect SNPs, thus we did not compare it to 16GT. (Please note that one of us – MCS – is a co-author of Scalpel, so we know it well.)

Reviewer: 1

The authors present a new model that can call both SNPs and INDELs by expanding the number of possible allele states to 16. The paper is well written, the model is an interesting contribution, and the results are compelling. I would like to see a little more detail in a few sections of the paper.

The standard method for communicating the true positive / false negative trade-off in variant calling is a ROC-style line plot. The shape of this curve can be insightful for readers who need to place their experiments at different points along this plot depending on the particulars of their experiment. Since Table 2 only reports a single point on that curve, the readers do not have this context. It is also not clear that these numbers represent comparable points along their curves.

Response: We have added 7 ROC curves to our analysis, all shown in supplementary figure 1.
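As an illustration of what such a curve adds over a single operating point, the following minimal sketch traces recall and precision as the quality threshold is relaxed. It is not the procedure used in the manuscript; the function name `pr_curve`, the input layout, and the toy data are assumptions made for this example.

```python
from typing import List, Tuple

def pr_curve(calls: List[Tuple[float, bool]], n_truth: int) -> List[Tuple[float, float, float]]:
    """Return (threshold, recall, precision) points for a set of variant calls.

    calls   -- one (quality, is_true_positive) pair per reported variant
    n_truth -- total number of variants in the truth set (hypothetical input)
    """
    # Walk through the calls from most to least confident.
    ordered = sorted(calls, key=lambda c: c[0], reverse=True)
    points = []
    tp = fp = 0
    for i, (qual, is_tp) in enumerate(ordered):
        if is_tp:
            tp += 1
        else:
            fp += 1
        # Emit one point per distinct quality value (after the last tie).
        if i + 1 == len(ordered) or ordered[i + 1][0] != qual:
            points.append((qual, tp / n_truth, tp / (tp + fp)))
    return points

# Toy data: four calls against a truth set of four variants.
toy_calls = [(50.0, True), (40.0, True), (30.0, False), (20.0, True)]
curve = pr_curve(toy_calls, n_truth=4)
```

Each row of `curve` is one candidate operating point; a table such as Table 2 corresponds to picking a single row per caller.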

I don't understand why the proportion of false positives in dbSNP v138 is interesting when calling against NA12878 and why having a higher proportion in dbSNP v138 is better. I recognize that these are polymorphic sites, but what about that property is relevant to this analysis?

Response: For a set of variants that are reported by any variant caller, previous studies show that variants found in dbSNP are much more likely to be true positives, because as you say these sites are known to be polymorphic in the population. Thus for any variant caller, a higher rate of overlap with dbSNP suggests a higher true positive rate. Similarly, if a "false positive" is also reported as a variant in dbSNP, previous studies suggest that it might not be false at all. This is why we mention how many of 16GT's "false" predictions are found in dbSNP - it suggests that some of them are true rather than false.

This idea has been used in multiple papers and presentations. Here we list and excerpt three of them:

1) Screening the human exome: a comparison of whole genome and whole transcriptome sequencing, Cirulli et al., 2010. “SNVs called in the gDNA and cDNA were also compared with entries in dbSNP. It was found that 90% of the gDNA exonic SNVs corresponded to a dbSNP entry, while this was true of only 56% of the cDNA SNVs. However, a further breakdown revealed that 94% of the true positive cDNA SNVs corresponded to a dbSNP entry, while only 23% of the false positives did the same. The false negatives corresponded to dbSNP entries 89% of the time.” Link: https://dx.doi.org/10.1186%2Fgb-2010-11-5-r57

2) Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples, Cibulskis et al., 2013. “Figure 3d: Somatic miscall error rate for true germ-line heterozygous single-nucleotide polymorphism sites by sequencing depth in the normal sample when the site is known to be variant in the population (in dbSNP) and previously unknown (not in dbSNP)” Link: http://www.nature.com/nbt/journal/v31/n3/full/nbt.2514.html

3) Improving the Specificity of SNP Calls in the 1000 Genomes Project, Melgar et al., 2009. “Slide 7: SNPs that passed filter have 91% dbSNP. SNPs removed by filter have 33% dbSNP” Link: https://www.broadinstitute.org/files/shared/diversity/summerprogram/2009/mmelgar_presentation.pdf

We agree that the suggestion is relative, not absolute. Thus, we highlighted in the manuscript that further experimental validation would be required to confirm this observation.
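The overlap statistic discussed above can be sketched as a simple set intersection. This is a minimal illustration under assumed inputs: the function name `dbsnp_fraction`, the (chromosome, position, ref, alt) key, and the toy records are hypothetical and are not taken from the manuscript.

```python
from typing import Iterable, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)

def dbsnp_fraction(calls: Iterable[Variant], known_sites: Iterable[Variant]) -> float:
    """Fraction of a call set that is also present in a set of known sites."""
    call_set = set(calls)
    if not call_set:
        return 0.0  # avoid dividing by zero for an empty call set
    return len(call_set & set(known_sites)) / len(call_set)

# Toy "false positive" calls and a toy known-sites list.
toy_false_positives = [
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 50, "G", "GA"),
    ("chr2", 75, "T", "C"),
]
toy_known = [
    ("chr1", 100, "A", "G"),
    ("chr1", 200, "C", "T"),
    ("chr2", 75, "T", "C"),
]
fraction = dbsnp_fraction(toy_false_positives, toy_known)  # 0.75
```

A high fraction among putative false positives is only suggestive; as noted, it does not replace experimental validation.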

The model has several "empirically defined" parameters. It would be nice to describe this analysis so that users could modify the parameters for their own experiments. For example, the model will need to be retuned for long reads.

Response: The empirically defined parameters are Ps, the SNP error rate; Pd, the indel error rate; θ, the rate of single-nucleotide differences between two unrelated haplotypes; and ω, the rate of single indel differences between two unrelated haplotypes. We found that the appropriate values for these appear to be stable across different species, including human, thus we do not suggest that users modify them. For advanced users, we added comments to the code so that the parameters can be changed easily. One parameter that should be changed is ε, the transition-to-transversion ratio: we have now highlighted in the manuscript that ε is preset to the value for human and needs to be changed for other species.
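For readers who need a species-specific value, a transition-to-transversion ratio can be estimated from a trusted set of SNPs. The sketch below is a minimal illustration and is not part of 16GT; the function name `titv_ratio` and the toy SNP list are assumptions made for this example.

```python
from typing import Iterable, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def titv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Transition-to-transversion ratio for (ref, alt) single-base substitutions."""
    ts = tv = 0
    for ref, alt in snps:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio is undefined")
    return ts / tv

# Toy input: three transitions and one transversion.
toy_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
ratio = titv_ratio(toy_snps)  # 3.0
```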

16GT does not appear to support multi-sample calling. I think the model presented here is good, but unless the software can handle many samples, or at least produce a GVCF, it may see little use.

Response: We highlighted in the discussion that our next steps in extending 16GT’s functionality will include 1) supporting multi-sample variant calling and GVCF output, 2) supporting somatic variant detection, and 3) extending the model to support variant calling in species with more than two haplotypes.

Ryan Layer, University of Utah

Reviewer: 2

Luo, R. et al. describe 16GT, a new variant caller optimized for Illumina sequencing data that uses a new 16-genotype probabilistic model to unify SNP and indel calling. They demonstrate improved sensitivity for SNPs and comparable accuracy for indels compared with GATK HaplotypeCaller, using the NA12878 genome from the GIAB project. 16GT more comprehensively models 16 genotypes to unify SNP and indel calling in the same algorithm. 16GT appears to be a useful alternative tool for analyzing germline sequencing data from the Illumina platform. A few comments:

The authors need to emphasize that, at least at the moment, 16GT can only be applied to germline sequencing on the Illumina platform and is not appropriate for cancer genome sequencing, especially clinical cancer samples, where tumor cellularity varies greatly and does not fit those models.

Response: We now emphasize in the conclusion that, for now, 16GT can only be applied to germline variant detection. In the future, we will improve 16GT to support multi-sample variant calling and GVCF output, to support somatic variant detection and extend the model to support variant calling in species with more than two haplotypes.

Can the authors comment on whether the increased SNP sensitivity is due to the incorporation of indels into the model, or whether the additional SNPs called have an indel as the second allele?

Response: The 16GT model performs better than the traditional 10-genotype model at lower depth and when the authentic variant signal is mingled with noise of the other type. For example, an investigation of the 3,710 indels that were detected by 16GT but missed by UnifiedGenotyper shows that 95.7% of them are at sites below the mean depth and are mingled with at least one mismatch. We observed that 16GT called more SNPs with an indel as the second allele than UnifiedGenotyper, but not more than HaplotypeCaller.
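To make the difference between the 10-genotype and 16-genotype state spaces concrete, the sketch below enumerates both. The ten base genotypes follow directly from the four bases. The six indel-containing genotypes (AX, CX, GX, TX, XX, XY, with X and Y standing for two distinct indel alleles) are an assumption about the model's layout that is not spelled out in this exchange and should be checked against the manuscript.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"

# Traditional model: every unordered pair of bases (10 diploid genotypes).
base_genotypes = ["".join(pair) for pair in combinations_with_replacement(BASES, 2)]

# Assumed extension: X and Y denote two different indel alleles at the site.
indel_genotypes = [base + "X" for base in BASES] + ["XX", "XY"]

all_genotypes = base_genotypes + indel_genotypes

print(len(base_genotypes), len(all_genotypes))
```

Under this assumption, a site that carries both a mismatch and an indel signal can be assigned to a genotype such as AX rather than being forced into a SNP-only or indel-only state.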

Can the authors discuss the limitations of 16GT? What is the indel size limit? Should sex chromosomes be treated differently if gender is known?

Response: The largest indel 16GT can detect is bounded by the aligner used for input generation. 16GT’s algorithm has no limit on indel sizes. The 16GT implementation automatically detects the input gender and treats sex chromosomes differently.

I'm not keen to highlight better indel performance over GATK's UnifiedGenotyper, as it is known not to be a good indel caller and is not widely used for indels nowadays.

Response: We agree with the reviewer that UnifiedGenotyper has not been widely used for indels since HaplotypeCaller was released. But since 16GT and UnifiedGenotyper are both based on Bayesian models, a comparison between them can give readers some clues about how the better model improves indel calling performance. Note also that this is just one of the many comparisons we have included.

Given the run time in Table 2, I'm not sure "16GT ran faster" should be in the abstract.

Response: We removed “ran faster” from the abstract.

References

Luo, R., Schatz, M. C., Salzberg, S. L. 2017. 16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model. GigaScience.
