Content of review 1, reviewed on June 23, 2014

The authors present a pipeline for SV detection based on genome mapping technology. The results presented in this manuscript indicate a significant advancement in SV prediction. As such, the method described in this manuscript has the potential to be transformative to the field of SV discovery and personal genomics. However, the description of the methodology is extremely shallow, leaving the impression that the description is more of a "press release" than a scientific manuscript. Considerable revisions are required to bring the text in line with the level of detail expected from SV prediction methods.

Major Compulsory Revisions
  1. Many methodological pieces are missing from the manuscript. Since the Irys platform and genome maps are new to me, and I suspect to most readers, a more detailed discussion of these is important. For example, I understand that the Irys system does not actually involve sequencing but straightens large DNA molecules so they can be imaged intact. What resolution along the DNA molecule is achieved? What are the common sources of error? How detailed are the DNA maps? In addition, the assembly pipeline is presented as a "black box" with no supporting citations or information. Since the problem of assembly with large single-molecule data is of scientific and computational interest, a more detailed description is needed. From my perspective the following questions remain about the assembly/in silico maps:

  2. What assembly procedure is used? What constitutes redundant and spurious edges?

  3. How are contigs mapped to the reference genome? (What is done with a consensus sequence that has multiple mappings? Is a single position chosen?)

  4. What is meant by label positions? (This seems to be an attribute of the alignment/mapping to the reference sequence.)

  5. How, exactly, are contigs stitched together?

  6. Similar to the above comment, methodological procedures are missing from the description of SV detection. As the focus of the manuscript is on the highly accurate prediction of SVs, this description in particular needs further elaboration.

  7. What was the dynamic program used? And how does it compare to other such methods for SV detection (e.g., AGE)? Are different DPs used to predict whether there is an inversion, insertion, or deletion?

  8. What is the likelihood model used? What parameters are used?

  9. How are repetitive/multiple mappings handled?

  10. How are breakpoints determined and reported?

  11. In particular, how are complex rearrangements classified? Even if the true alignment is known - it can be challenging to dissect deletions/inversions which interact.

  12. What is meant by "high scoring regions" - how are scoring thresholds determined?

  13. While not related to the method itself, discussing the theoretical false discovery rate would be helpful.

  14. The authors should provide some guidance on the sensitivity and specificity of their results as a function of coverage and of the relevant parameters of their pipeline. For example, how do the predictions vary with coverage? What trade-offs are there in the likelihood model?

  15. The authors should compare their results to a different prediction process.

  16. The authors write "we applied the NGS discordant paired-end mapping and read depth based methods"; however, they did not in fact compare to published methods. They defined their own heuristic for determining when a region was detectable with NGS technologies. With the abundance of published implementations of SV methods, I recommend the authors compare to a published method. In particular, methods like AGE and CREST are designed to use a similar assembly-based strategy for prediction.

  17. A useful first step might be to create a simulated genome and simulate both genome maps and NGS sequencing data and then run appropriate methods on each.
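To make the suggestion in point 17 concrete, here is a minimal sketch (in Python) of how a simulated genome map might be generated. It is my own illustration and not the authors' pipeline. The recognition motif, the noise model (random label dropout plus Gaussian sizing error), and all function names are assumptions chosen for illustration; the authors would substitute the actual labeling chemistry and error characteristics of their platform.

```python
import random

# Assumed recognition motif for the labeling enzyme; replace as appropriate.
MOTIF = "GCTCTTC"


def revcomp(seq):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def label_positions(seq, motif=MOTIF):
    """In silico map: sorted start coordinates of the motif on either strand."""
    sites = set()
    for m in (motif, revcomp(motif)):
        start = seq.find(m)
        while start != -1:
            sites.add(start)
            start = seq.find(m, start + 1)
    return sorted(sites)


def random_genome(length, rng):
    """Uniform random sequence; a stand-in for a realistic simulated genome."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def apply_deletion(seq, start, size):
    """Return seq with `size` bases removed beginning at `start`."""
    return seq[:start] + seq[start + size:]


def observe(labels, rng, sizing_sd=0.0, miss_rate=0.0):
    """Toy noise model: drop each label with probability miss_rate,
    then add Gaussian jitter (standard deviation sizing_sd) to the rest."""
    kept = [p + rng.gauss(0, sizing_sd) for p in labels if rng.random() >= miss_rate]
    return sorted(kept)


if __name__ == "__main__":
    rng = random.Random(0)
    reference = random_genome(200_000, rng)
    sample = apply_deletion(reference, 80_000, 5_000)  # simulated 5 kb deletion
    ref_map = label_positions(reference)
    sample_map = observe(label_positions(sample), rng, sizing_sd=50.0, miss_rate=0.1)
    print(len(ref_map), "reference labels;", len(sample_map), "observed sample labels")
```

Running the SV caller on maps produced this way, and published NGS callers on reads simulated from the same altered genome, would give each method a known truth set against which sensitivity and specificity can be measured.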

Minor Essential Revisions

Figure 1 is incredibly important; the authors should consider a revised version of the figure that more clearly illustrates the generation of a genome map.

Discretionary Revisions

  1. The authors have presented a pipeline to predict SVs from a particular type of data. However, it would be helpful to comment on how their approach generalizes to include other sequencing platforms. For example, great success has been found when using NGS data to error-correct the longer, higher-error single-molecule reads. Would such a procedure be helpful? Could the more accurate NGS reads be helpful with genome maps?

  2. It might also be beneficial to have some type of discussion of optimal strategy for a fixed cost. It seems - to me - that the coverage required for meaningful SV predictions from the Irys platform would have to be higher than with NGS data. If this is the case, then one might be better off with lower-coverage NGS data.

  3. Although the focus of the paper is on SV detection, since high coverage is obtained for a genome, the same pipeline may be useful for SNP detection. Can the authors comment on whether this is a possibility?

Minor issues not for publication (typos, spelling)

  1. Background, first paragraph. It is more correct to say that "SVs are associated with a number of human diseases" rather than "SVs account for human disease"

  2. Background, second paragraph. The first sentence ends awkwardly with "overall excellent performance". I would say something like, "A variety of experimental and computational approaches exist for SV detection. As we will detail below, each procedure has distinct biases and limitations in SV prediction."

  3. Background, second paragraph. I believe it is not entirely fair to characterize the NGS methods as "tedious". The present pipeline requires extremely high coverage (>90X) and extensive pre- and post-processing. It is not clear to me that this is less tedious than most NGS pipelines.

  4. Background, fourth paragraph. It's a bit presumptuous to use the phrase "superior technology" without describing in what ways the technology represents an improvement. Are there longer reads, lower error rates, unbiased coverage, etc.?

  5. Structural Variation Analysis, first paragraph. Grammatical error, should read "We present these 59 events separately..."

Level of interest An exceptional article

Quality of written English Needs some language corrections before being published

Statistical review No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests I declare that I have no competing interests.

Source

    © 2014 the Reviewer (CC BY 3.0 - source).

Content of review 2, reviewed on September 20, 2014

The authors have successfully addressed my concerns with the previous manuscript. I have no further criticism to offer.

Level of interest An exceptional article

Quality of written English Needs some language corrections before being published

Statistical review No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests I declare that I have no competing interests.

Source

    © 2014 the Reviewer (CC BY 3.0 - source).

References

    Hongzhi, C., R., H. A., Dandan, C., T., L. E., Yuhui, S., Haodong, H., Xiao, L., Liya, L., Warren, A., Saki, C., Shujia, H., Xin, T., Michael, R., Thomas, A., Anders, K., Huanming, Y., Han, C., Xun, X. 2014. Rapid detection of structural variation in a human genome using nanochannel-based genome mapping technology. GigaScience.