Review of LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads

Content of review 1, reviewed on June 12, 2015

The manuscript “Scaffolding draft genomes with nanopore reads” by Warren et al. describes software dedicated to genome assembly improvement, using data produced by the Oxford Nanopore technology. The LINKS algorithm differs from existing methods, which are alignment-based, and is elegant. Moreover, the methods are very well described. The authors used several datasets to assess the performance of the LINKS tool, and I should mention that LINKS is, to my knowledge, the first study describing a scaffolding method based on nanopore reads.

The LINKS software is available online, easy to install and use, and it runs very fast. I would like to congratulate all authors.

My main concern about the manuscript is related to the low quality of the results. The improvements in continuity obtained with the nanopore reads are not convincing. Indeed, several peer-reviewed articles have already reported near-perfect assemblies of bacterial genomes using either a combination of short and long reads (Madoui MA et al, Genome assembly using nanopore-guided long and error free DNA reads, BMC Genomics, 2015) or only Pacific Biosciences long reads (Chin CS et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods, 2013). Moreover, I am doubtful about the application of the algorithm presented here to scaffolding more complex genomes using nanopore reads.

I do not feel adequately qualified to assess the statistics, especially the description of the statistical properties of errors and the description of mixture models. Furthermore, an existing study, which is not cited by Warren et al., already describes error rates of the MinION device (Jain M. et al, Improved data analysis for the MinION nanopore sequencer, Nature Methods, 2015). The authors should mention this study and describe how their analysis differs from it.

Major Compulsory Revisions
1) Several high-quality genome assemblies of E. coli MG1655 using MinION and Illumina reads have been reported in the last six months (http://www.genoscope.cns.fr/nas/ and http://schatzlab.cshl.edu/data/nanocorr/). The authors should compare their tool to these existing results and highlight the benefits of using LINKS. For example, the optimal coverage needed for the two approaches (de novo assembly vs scaffolding) could be compared.
2) The comparison with the SSPACE long read tool shows similar results. However, when LINKS is used in iterative mode, the final assemblies seem to have higher continuity. I suggest that the authors make it the default mode of the method. Indeed, gradually increasing the distance between k-mer pairs is a good way to take advantage of long reads and makes for a fair comparison with alignment-based scaffolders.
3) The scaffolding of the white spruce genome assembly shows that LINKS scales well to larger genomes. However, using a closely related genome to scaffold a second draft genome is unsafe, as specific structural variations will be missed. That is why it should be combined with other kinds of data, such as sequencing data or maps. Very few scaffolders (for example, Gritsenko AA et al, GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies, Bioinformatics, 2012) can take this additional data into account, so I suggest that the authors highlight this feature. Nevertheless, this part of the work is not in accordance with the title of the manuscript. Second, the continuity of the final assembly is not very well documented. The authors only discuss the NG50 (P9;L11-12); I suggest adding a table with all descriptive metrics of both the input and final assemblies. Finally, it is not clear how the quality assessment using gap-filling software and MPET data is achieved. The fact that the validation rate decreased at each iteration is expected, as mentioned by the authors. As a consequence, the authors could not validate scaffold integrity when parameter d is larger than 12kb. In my opinion, the sentence P9L22-23 is a shortcut.
4) As described, for example, in Figure 1 of the following study (Schatz M. et al, Assembly of large genomes using second-generation sequencing, Genome Research, 2010), k-mer uniqueness is highly dependent on the k value and the input genome. I suspect that k=15 is not enough for the large majority of genomes, even with pairing information. To deal with more complex genomes, users will therefore have to increase the k parameter, which is not compatible with the current error rate of the nanopore technology. The algorithm itself could be applied to large and complex genomes, but probably not with nanopore reads, as suggested in the title.
5) One key aspect of a scaffolder is the estimation of gap sizes; however, this point is never addressed in the manuscript.
6) Table 1 first suggests that iterative mode provides better results (NG50 of 633Kb vs 293Kb, 1D and 1F). However, the NGA50 and NA50 are similar, suggesting that iterative mode produces longer scaffolds but with a higher error rate.
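Regarding point 4 above, the dependence of k-mer uniqueness on k can be checked directly on any draft assembly. The following is a minimal sketch, not the procedure used by Schatz et al. or by LINKS: the function name `unique_kmer_fraction` and the toy sequence are hypothetical, only the forward strand is considered, and uniqueness is measured as the fraction of distinct k-mers that occur exactly once.

```python
from collections import Counter


def unique_kmer_fraction(sequence: str, k: int) -> float:
    """Return the fraction of distinct k-mers in `sequence` that occur exactly once.

    Assumption: forward strand only, no canonical (reverse-complement) k-mers.
    """
    # Count every k-mer window of the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    if not counts:
        # Sequence shorter than k: no k-mers to evaluate.
        return 0.0
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(counts)


if __name__ == "__main__":
    # Hypothetical toy sequence containing a short tandem repeat.
    toy = "ACGTAC" * 3 + "GGATCCTTAGC"
    for k in (3, 6, 12):
        print(k, round(unique_kmer_fraction(toy, k), 3))
```

Running such a count for k=15 and for larger k on the target genome would show whether the chosen k leaves enough unique k-mers to anchor links in repeat-rich regions.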

Minor Essential Revisions
1) P3L7: The cited manuscript has now been peer-reviewed, and the correct reference is from Nature Biotechnology. I think the authors should at least mention two other peer-reviewed manuscripts (Chin CS et al., Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data, Nature Methods, 2013) and (Madoui MA et al, Genome assembly using nanopore-guided long and error free DNA reads, BMC Genomics, 2015).
2) P313: “At the moment, sequence reads ….and indels rates”. As mentioned previously, several methods have produced high-quality genome assemblies using the MinION device.
3) P6L20: The ABySS reference is missing.
4) Table S2 is a good summary; I suggest that the authors use it in the main text.
5) P9L19: It is not clear where the 58.6% comes from. Can it be found in Figure 3?
6) P10L3-8: The analogy with the 10X Genomics technology is not appropriate. That technology does generate linked short sequences, but saying that linked short sequences can be generated starting from long reads is not very informative.
7) P13L13: Using LAST with default parameters is not optimal, as already stated in (Quick J. et al, A reference bacterial genome dataset generated on the MinION™ portable single-molecule nanopore sequencer, GigaScience, 2014) and (Jain M. et al, Improved data analysis for the MinION nanopore sequencer, Nature Methods, 2015). Moreover, in Figure S2, the authors used blastn to align nanopore reads, and surprisingly they found that 184 regions of E. coli are devoid of nanopore reads. This is odd, as de novo assembly of this dataset yields a perfect circular genome.
8) Figure 1: If I understand correctly, contig3 and contig2 overlap. This is not clear in the second part of the figure, as no k-mer pair establishes a link. In the third part, what do the black arrows mean?
9) Figure 2 is a good way to show the quality and the continuity of several assemblies at the same time. However, this figure is hard to read; I suggest that the authors create one figure for each genome.
10) Figure 3: The panels are not ordered logically with respect to the corresponding legend. In the legend, where does the number 84,529 come from?
11) P4L12: Why is Figure S2 cited here?
12) The authors sometimes use version 1.5+ (P9L5) and sometimes version 1.5 (P11L2) of LINKS. Why this discrepancy?

Level of interest: An article whose findings are important to those with closely related research interests
Quality of written English: Acceptable
Statistical review: Yes, but I do not feel adequately qualified to assess the statistics.
Declaration of competing interests: I declare that I have no competing interests; however, I should mention that I am part of the MinION® Access Programme (MAP).

Authors' response to reviewers: (http://www.gigasciencejournal.com/imedia/5046484471782784_comment.pdf)

Content of review 2, reviewed on July 22, 2015

The authors present an improved manuscript, “Scaffolding draft genomes with long reads”, and my major compulsory revisions were taken into account. Moreover, they added a new dataset, based on Pacific Biosciences data, to highlight the value of their method. It is clear that a lot of work and many experiments have been done. However, I still have some comments concerning this new benchmark. This section should be checked carefully.

Major Compulsory Revisions
1) The A. thaliana dataset was added during the revision, and the authors do not present the results adequately. For example, the metrics are not clearly stated (what is the NG50 of each assembly?), and the comparison between LINKS and existing assemblers (HGAP, ECTools and PacBioToCA) is not presented in the main text.
2) If I understand Figure S8 correctly, the NG50 of the LINKS assembly with raw PacBio data is 453.8kbp and the NG50 of the LINKS assembly with corrected PacBio data is 2650.7kbp. These metrics are not in accordance with Figure S7.
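The NG50 values requested and compared above can be recomputed from the scaffold lengths of each assembly. The sketch below assumes the usual definition (the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the estimated genome size); the function name `ng50` and the example lengths are hypothetical and are not taken from the manuscript.

```python
def ng50(lengths, genome_size):
    """Return the NG50 of an assembly, or 0 if half the genome size is never reached.

    lengths: iterable of scaffold (or contig) lengths in bp.
    genome_size: estimated genome size in bp (this is what distinguishes NG50 from N50).
    """
    running_total = 0
    # Walk scaffolds from longest to shortest, accumulating covered bases.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= genome_size / 2:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical scaffold lengths for a 100 bp toy genome.
    print(ng50([40, 30, 20, 10], genome_size=100))
```

Reporting this value for each assembly in a single table, computed against the same genome size, would make the figures directly comparable.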

Minor Essential Revisions
1) Warren et al. use numerous datasets, so I suggest that the authors create a section for each dataset in order to make the article more readable.
2) The legend of Figure S7 is unintelligible.
3) Figure S8 is hard to understand. I suggest that the authors (a) split the table and the figure and (b) add metrics of the LINKS assemblies to the table. Please use separate colors to help distinguish the LINKS assemblies (red and blue dots).
4) NaS assemblies exist for E. coli K12 and S. cerevisiae W303 (http://www.genoscope.cns.fr/externe/nas/assemblies.html); I suggest that the authors add these assemblies to Table 2 and Figure 3.
5) I suggest that the authors add a column with the input coverage to Table 1 and Table 2. It is not clear what input coverage was used to generate the LINKS assemblies.
6) I suggest that the authors add a column with the number of unknown bases (N’s) to Table 1 and Table 2.

Authors' response to reviews: (http://www.gigasciencejournal.com/imedia/2597824401816219_comment.pdf)

References

Warren, R. L., Yang, C., Vandervalk, B. P., Behsaz, B., Lagman, A., Jones, S. J. M., Birol, I. 2015. LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads. GigaScience.
