Content of review 1, reviewed on June 30, 2015

The manuscript by Warren et al. describes LINKS, their new algorithm for scaffolding de novo genome assemblies with long Nanopore reads, similar to previously published algorithms for use with PacBio and Oxford Nanopore reads. The method is well described, although the potential impact is limited, since the results only marginally advance the state of the art for Nanopore-based assembly, and only for certain datasets.

By their own analysis, their code neither advances nor competes with the state of the art in E. coli: the published Nanopolish paper from Loman et al. in Nature Methods, as well as our own Nanocorr results, both achieve essentially perfect de novo assemblies of the same E. coli dataset. Their scaffold results achieve just a fraction of this contiguity, with no improvement in accuracy.

The S. typhi analysis is incomplete and does not include an evaluation against SSPACE-LongRead, Nanocorr, or Nanopolish, which by their own analysis was the most effective approach for E. coli.

In yeast, they have grossly misreported the Nanocorr results. Both the polished and raw assemblies, available on the Nanocorr website since February, have a contig NG50 size of 585 kbp, but this is misreported in their figures:

http://labshare.cshl.edu/shares/schatzlab/www-data/nanocorr/W303_ONT_Assembly_CA_polished.fa.gz

http://labshare.cshl.edu/shares/schatzlab/www-data/nanocorr/W303_ONT_Assembly_CA.fa.gz
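The NG50 of either file can be checked independently with a short script. The following is a minimal sketch, not the QUAST implementation: the function names `read_fasta_lengths` and `ng50` are illustrative, and the reference genome size must be supplied by the user, since NG50 is defined relative to the genome size rather than the assembly size.

```python
import gzip


def read_fasta_lengths(path):
    """Return the length of every sequence in a FASTA file (plain or .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    lengths = []
    current = 0
    seen_header = False
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: store the length of the previous one.
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            elif seen_header:
                current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths


def ng50(lengths, genome_size):
    """Length L such that sequences of length >= L cover half of genome_size.

    Returns 0 when the assembly covers less than half of the genome.
    """
    target = genome_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0
```

Calling `ng50(read_fasta_lengths(path), genome_size)` on a downloaded assembly, with the genome size taken from the reference used in the manuscript, gives a value that can be compared directly with the reported figure.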

It is also not appropriate to evaluate the raw results of assembly for correctness. The polishing was performed with the Pilon algorithm, published last year (Walker et al. 2014, PLOS ONE), analogous to how Quiver is used for evaluating PacBio assemblies.

Regardless, the manuscript's presentation of the assemblies is very misleading, considering that the Nanocorr result consists of 95 contigs, while the LINKS result consists of 6,580 scaffolds. When the scaffolds are decomposed into contigs, the contig N50 size is only 79 kbp, with the contigs separated by 308,358 ‘N’ characters and hundreds of gaps. This suggests that the “improvements” to the assembly come from masking out the most difficult regions with Ns. It is also not discussed how to filter the thousands of scaffolds to identify the useful sequences.
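The decomposition of scaffolds into contigs described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the exact procedure used for this review: it treats every run of one or more `N` characters as a gap, and the function names (`split_scaffold`, `n50`, `summarize`) are hypothetical.

```python
import re

# Assumption: any run of one or more N/n characters marks a scaffold gap.
GAP_PATTERN = re.compile(r"[Nn]+")


def split_scaffold(sequence):
    """Split one scaffold into its contigs and the lengths of its gaps."""
    gaps = [len(match.group()) for match in GAP_PATTERN.finditer(sequence)]
    contigs = [piece for piece in GAP_PATTERN.split(sequence) if piece]
    return contigs, gaps


def n50(lengths):
    """Length L such that sequences of length >= L hold half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(scaffolds):
    """Report contig count, contig N50, total N characters, and gap count."""
    contig_lengths = []
    n_characters = 0
    gap_count = 0
    for sequence in scaffolds:
        contigs, gaps = split_scaffold(sequence)
        contig_lengths.extend(len(contig) for contig in contigs)
        n_characters += sum(gaps)
        gap_count += len(gaps)
    return {
        "contigs": len(contig_lengths),
        "contig_n50": n50(contig_lengths),
        "n_characters": n_characters,
        "gaps": gap_count,
    }
```

Passing the scaffold sequences of an assembly to `summarize` yields the contig-level statistics; comparing `contig_n50` with the scaffold N50 shows how much of the reported contiguity depends on N-filled gaps.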

Note, I am considering the W303NANOlinks29.scaffolds.fa file posted on their website (dated May 29, 2015, ftp://ftp.bcgsc.ca/supplementary/LINKS/GS/W303NANOlinks29.scaffolds.fa), since I was unable to run the program even after recompiling it and updating the LINKS code as documented in their readme. The documentation should be improved to discuss strategies for solving this problem.

The extension to the white spruce genome should be benchmarked against other approaches to demonstrate the value of the approach; perhaps SSPACE or another scaffolder would perform equally well or better. It should also be completed on a genome with a much higher quality reference genome available, especially since their validation rate decreases with the gap length. The longest gaps are exactly the ones of most interest and also the ones most vulnerable to mis-assemblies. Here we are left wondering if ~40% of the longest links are invalid.

The discussion point about 10X Genomics is completely unnecessary and should be removed from the manuscript. The 10X data also differ in many important regards, especially in that the read clouds are unordered while the long reads are sequential.

  • Michael Schatz

Level of interest: An article whose findings are important to those with closely related research interests

Quality of written English: Acceptable

Statistical review: Yes, and I have assessed the statistics in my report.

Declaration of competing interests: I declare that I have no competing interests.

Authors' response to reviewers: (http://www.gigasciencejournal.com/imedia/5046484471782784_comment.pdf)

Source

    © 2015 the Reviewer (CC BY 4.0 - source).

Content of review 2, reviewed on July 15, 2015

The manuscript is improved, especially the discussion clarifying that this is a scaffolding approach that requires less coverage. I maintain my reservations about whether it truly advances the state of the art for Nanopore assemblies, especially given that competing algorithms give greatly improved results with E. coli, and that LINKS gives marginal to no improvement with the yeast W303 assembly.

Two items must be stricken from the manuscript:

1) The abstract claims: "... LINKS leverages long-range information in S. cerevisiae W303 nanopore reads to yield an assembly with less than half the errors of competing applications"

However, this claim is not supported by their results: Figure 3 and Table 2 both show the W303 assembly to have a tiny reduction in misassemblies: 166 in the Nanocorr result versus 161 with LINKS-Nanocorr. Note that we have recently revised the Nanocorr approach, and according to Quast the NG50 has been improved to 677,989 bp, although we have not reduced the number of misassemblies reported; many of these are true biological differences.

2) The discussion claims "To our knowledge, LINKS is the first publicly available scaffolder designed specifically for nanopore long reads". However, Ashton et al. 2015 (Nature Biotechnology) used Nanopore reads to scaffold Illumina contigs of Salmonella Typhi using a public version of SPAdes. More recently, the report by Karlsson et al. (2015) (Nature Scientific Reports) used the publicly available algorithm SSPACE-LR to scaffold a bacterial genome. As such, the claim of being the first publicly available algorithm must be removed.

Level of interest: An article of importance in its field

Quality of written English: Acceptable

Statistical review: No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests: I declare that I have no competing interests.

Authors' response to reviews: (http://www.gigasciencejournal.com/imedia/2597824401816219_comment.pdf)

References

    Warren, R. L., Yang, C., Vandervalk, B. P., Behsaz, B., Lagman, A., Jones, S. J. M., Birol, I. 2015. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience.