Content of review 1, reviewed on August 28, 2012

General comments:

The manuscript presents SOAPdenovo2, an upgrade of the popular short-read sequence assembly tool. In order to demonstrate the novel features and improvements of this new version, the authors have applied SOAPdenovo2 to two different datasets: the raw data used for the Assemblathon 1 project and the YH human whole-genome sequencing data. These datasets are distributed as part of the manuscript’s submission in the spirit of the GigaScience journal. This is very valuable, as it allows reviewers and readers to reproduce the results and evaluate other aspects of the software interface. We welcome this approach and hope this manuscript will set the standard for similar bioinformatics software publications in the future. We have evaluated this manuscript based both on the quality of the presentation of the article and on the assembled sequences.

The new developments in SOAPdenovo2 are focused on four areas:

  • more efficient algorithms and data structures to reduce the memory requirements of the de Bruijn graph construction step;

  • use of a multiple k-mer approach to improve the handling of errors and low-coverage regions;

  • better handling of heterozygosity with a reduction of misassemblies and chimeric sequences at the scaffold level; and

  • an improved method to implement gap closure.

Although we have seen significant improvements in the past couple of years, the assembly problem for short reads remains an important challenge. The main conclusion from some of the recent competitions and comparative studies, such as the Assemblathon and GAGE, is that there is no “one-size-fits-all” solution; rather, every study and dataset seems to require a tailored and customised tool. In this context only a few tools have been successfully applied across a wide range of genomes, and SOAPdenovo is one example. We believe that the authors' decision to focus on improving the genome structure rather than the contig level is the correct direction, as the weaknesses of the current tools are indeed in the scaffolding stage. We therefore would like to recommend the publication of this manuscript and its associated datasets in GigaScience. In order to ensure the best quality for the final version, however, we ask the authors to address some major points before the article is accepted for publication.

Major Compulsory Revisions:

  • A major contribution of GigaScience is to provide a framework to publish “data-driven” scientific studies alongside their datasets, following open-access open-data principles. As mentioned above, we believe this is an important advance in the way algorithms and software are published in bioinformatics. With this in mind, we would like to request that the authors provide a detailed flow-chart of the steps to follow to reproduce (at least partially) some of the results reported in the manuscript. The parameters and software used should also be described, to allow reviewers and readers to run the tools as the authors did. We understand there will be certain points that might require manual intervention or the use of unpublished software, but those steps should be clearly indicated. We tried to follow the steps described in the “readme” document, obtaining the following results using the Assemblathon 1 data and running just up to the assembly step:

  • Before scaffolding: there are 77,636 contigs longer than 100 bp,
    total sum 114,997,317 bp, with average contig length 1,481 bp.
    Contig N50 is 3,146 bp, contig N90 is 818 bp.
    460,293 contig(s) longer than 32 bp.

  • After scaffolding: longest scaffold 21,794,522 bp,
    scaffold and singleton number 4,093,
    total sum 118,860,825 bp, average length 29,040 bp,
    scaffold N50 8,018,180 bp, N90 3,409,497 bp.

Although we tried different parameters and effectively produced four different assemblies, no combination of parameters allowed us to obtain statistics similar to the figures reported in Table 1; this is why we think reproducibility would be greatly improved by the inclusion of the requested flow-chart. In a similar way, a list of all the software or tools used should be included, detailing for each whether it is publicly available, being published in this work, soon to be released, or to remain closed.

  • It is not clear whether the impressive improvement in the YH assembly is a result of the new libraries added to the dataset or of the novel algorithms in SOAPdenovo2. There are two new libraries, one of them with an insert size four times larger than anything used before. Was the previous version of SOAPdenovo able to cope with this type of data effectively?

  • From the readme document distributed with the data it is clear that a consensus step (step 6) was conducted after the assembly. Was this step used to modify the assembly result? Are the metrics (i.e. substitution rates) affected by this step?

  • Although the authors indicate that SOAPdenovo2 outperforms SOAPdenovo in memory usage, the level of improvement is not clear from Table 1. The use of extensive read filtering and correction should be taken into account, and it should be stated whether the two versions were compared under equal or different scenarios.

Minor Essential Revisions:

  • More detail is needed to understand some of the new algorithms and improvements. This is a list of points that we suggest the authors address to improve the quality of the final version of the manuscript:

  • Sparse graph. It is not clear which sparse graph approach has been used in SOAPdenovo2. From the reference in the text, the indication is that it is the "sparse de Bruijn graph" as implemented by SparseAssembler1. A more recent version of this approach, SparseAssembler2, has tackled some of the problems of the original development based on a "sparse k-mer graph". There seems to be a mismatch between the description in the text and the citation. If this was the first version, could this stage be improved by the use of the newer approach? More concretely: at which stage is the sparse graph used? From running SOAPdenovo2 ourselves, we believe this new data structure is only used at the initial stage of building the graph, resorting to a traditional graph for later stages. The two example assemblies have apparently not been generated with the sparse graph; is this because of some trade-off, or is the sparse graph now the default and has it effectively been used in all the assemblies? On a technical note, will the sparse graph construction directly impact the ability to preserve important meta-information such as sequence coverage? How did this information loss affect the final assemblies?
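To clarify what a sparse k-mer graph trades away, here is a minimal sketch of the SparseAssembler-style idea as we understand it (our own toy illustration, not SOAPdenovo2 code; all names are ours): only every g-th k-mer is stored as a node, and each node keeps the up-to-g bases bridging it to the next stored node, cutting node count by roughly a factor of g while keeping the path walkable.

```python
def sparse_kmer_path(seq, k, g):
    """Store only every g-th k-mer of `seq`, together with the (up to g)
    bases that bridge each stored k-mer to the next stored one."""
    nodes = []  # (k-mer, bridging bases), in path order
    for i in range(0, len(seq) - k + 1, g):
        nodes.append((seq[i:i + k], seq[i + k:i + k + g]))
    return nodes

def walk(nodes):
    """Rebuild the original sequence by following the sparse path."""
    seq = nodes[0][0]
    for _, bridge in nodes:
        seq += bridge
    return seq

seq = "ACGTACGTGGTACCAGTT"
nodes = sparse_kmer_path(seq, k=5, g=4)
dense = len(seq) - 5 + 1  # 14 k-mers in a conventional de Bruijn graph
# only 4 nodes are stored instead of 14, yet the sequence is recoverable
```

Note that in such a scheme per-k-mer coverage counts can only be kept for the stored k-mers, which is precisely why we ask above how much meta-information survives.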

  • Multi k-mer approach. The authors recommend starting with a small k-mer: how small? How many k-mers are tried, and what are the criteria to preserve information (if any) from one k-mer level to the next? The relevant ‘-m’ option is apparently only used in the contig step of the assembly, so we assume that the k value is still used as before, but could larger k values be used on ambiguous regions when traversing the graph in the contig step? Is this a contig-step-only modification, or is there some kind of support from the other steps? This parameter forces the use of the configuration file as a parameter too; is this because it is effectively re-evaluating the k-mer content of the input files?
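As a small illustration of the intuition behind these questions (a toy of our own, not SOAPdenovo2's algorithm): a repeat longer than k makes a node ambiguous in the de Bruijn graph, and raising k beyond the repeat length removes the branch, which is the usual motivation for multi-k assembly.

```python
from collections import defaultdict

def branching_nodes(reads, k):
    """Count k-mer nodes with more than one outgoing edge (ambiguous nodes)."""
    succ = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k):
            succ[r[i:i + k]].add(r[i + 1:i + 1 + k])
    return sum(1 for targets in succ.values() if len(targets) > 1)

read = "ATGCGTCGTAAC"  # "CGT" occurs twice, followed by different bases
small_k = branching_nodes([read], 3)  # the repeat creates a branch
large_k = branching_nodes([read], 6)  # k exceeds the repeat: no branch
```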

  • Heterozygosity. The authors mention that a topology-based method has been developed to work around heterozygous sites. Which kind of structure/topology does this method detect in the graph?

  • Read correction. The main goal is to reduce memory consumption and remove a source of spurious contigs. How does the procedure used in SOAPdenovo2 compare to that of other assemblers, particularly the SOAPdenovo run on YH? Is there any evidence that this helps SOAPdenovo2 obtain longer and more reliable contigs? Should read correction always be used as a preliminary step in SOAPdenovo2? We would also like to suggest that the authors describe any rules and principles for setting the read-correction parameters. From the data it seems that the only reads corrected are from PE libraries; is this linked to the chimeric contigs introduced by these libraries?

  • Library insert size. The insert sizes appear to be very exact for some of the libraries (the ones coming from the v1 YH data, including the LMP) and just an estimation for other libraries. Is this parameter checked/calibrated at the scaffolding stage? Is there any particular restriction on the distribution of insert sizes that could be taken into account for the scaffolder to work optimally? The GapCloser configuration file specifies min and max parameters for insert sizes: how are these values calculated or set? In particular, we would like to know why these limits are not the same as the ones used in the previous steps.

Discretionary revisions:

  • The results from the Assemblathon 1 dataset suggest a modest enhancement at the contig level but a remarkable improvement at the scaffold level (even if not in the range of 44-fold). This scenario is consistent with the described features of SOAPdenovo2 and should be emphasized.

  • Although the N50 statistic is a widely used metric for assembly quality, when evaluating assembly algorithms it is important to look at the full set of contigs/scaffolds and their lengths. A plot of accumulated sequence length vs. contig count could provide better insight into the improvements of the algorithms than N50 alone, which is effectively a single first-derivative value of this curve at the point where it reaches half of its final value.
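For concreteness, the statistic and the curve we have in mind can be sketched as follows (helper names are our own):

```python
from itertools import accumulate

def n_stat(lengths, fraction=0.5):
    """Nxx statistic: sort contig lengths longest-first and return the
    length at which the cumulative sum first reaches `fraction` of the total."""
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    for length, cum in zip(ordered, accumulate(ordered)):
        if cum >= threshold:
            return length
    return 0

def cumulative_curve(lengths):
    """(contig count, accumulated length) points; plotting these shows the
    whole length distribution, of which N50 is a single point."""
    return list(enumerate(accumulate(sorted(lengths, reverse=True)), start=1))

lengths = [400, 300, 200, 100]
n50 = n_stat(lengths)       # 300: the two longest contigs cover half of 1,000
n90 = n_stat(lengths, 0.9)  # 200
curve = cumulative_curve(lengths)
```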

  • The authors suggest that chimeric scaffolds are mainly introduced by smaller PE libraries. Is this just an effect of sequencing depth, as shorter inserts are usually generated with higher coverage? In that case, wouldn't a simple coverage-based step be enough to deal with chimeras?

  • The term "coverage" is used for two different concepts across the text when referring to:

  • how much sequencing data has been generated with respect to the estimated target genome size, and
  • how much of the target genome/sequence is assembled into contigs. Although in most cases this can be resolved from the context, it is potentially confusing for less experienced readers.

  • The solution to resolve heterozygosity by taking a "majority rules" strategy is in principle fine but will inevitably lead to mosaic consensi where two or more haplotypes will be represented within a single scaffold/contig. This is not per se an important issue but we suggest this is clarified in the text.

  • What is the error profile of the sequence added by the gap closer? Is it clearly different from the rest of the assembly in this respect? Is there some situation in which the use of such a tool is highly inconvenient for this reason? We believe this is an important issue that hasn’t been properly discussed in recent publications. Hopefully, as assembly tools improve, this point will be raised as a priority.

  • We suggest the authors name the files in the submission with a schema that is consistent with the procedure used to generate the data.

  • Running scripts depend on some pre-existing tools, such as SGE; this could be stated for clarity. Although there are options not to use those features, it is not necessarily straightforward to figure them out.

  • Some manual tuning is obviously in place (mapping reads with different k sizes, manual creation of the peGrads file). What are the criteria behind the parameters used for these manual steps?

  • Mapping for scaffolding on the v1 YH data has been done with k-mer=45, but gap closing is done with overlap=31:

  • this means that gap closing is done in a more permissive way than scaffolding, which makes sense; but mapping of the new reads is then done with k=31, why is this?
  • Is there any rule for setting the values of these parameters?
  • Are the tools used for combining the mappings being distributed as part of SOAP2?
  • Do structural errors account for the lower "contig/scaffold path NG50" vs. "contig/scaffold N50" ratio, compared with the results obtained with ALL-PATHS?

Level of interest: An article of outstanding merit and interest in its field

Quality of written English: Acceptable

Statistical review: No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests: We declare we have no competing interests. We have ongoing informal collaboration with BGI but not directly working on assembly tools.

Bernardo Clavijo and Mario Caccamo
The Genome Analysis Centre
Norwich Research Park
Norwich NR4 7UH
UK

Source

    © 2012 the Reviewer (CC-BY 4.0 - source).

References

    Luo, R., Liu, B., Xie, Y., Li, Z., Huang, W., Yuan, J., He, G., Chen, Y., Pan, Q., Liu, Y., Tang, J., Wu, G., Zhang, H., Shi, Y., Liu, Y., Yu, C., Wang, B., Lu, Y., Han, C., Cheung, D. W., Yiu, S.-M., Peng, S., Zhu, X., Liu, G., Liao, X., Li, Y., Yang, H., Wang, J., Lam, T.-W., Wang, J. 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience.