Review of SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler

Content of review 1, reviewed on August 15, 2012

This paper introduces an update to the SOAPdenovo program, describes its major improvements and shows improved results on two datasets. We reviewed this paper with a group of four people from the same research group.

Major Compulsory Revisions

We have the following major issues with this paper:

The explanation of the improvements in SOAPdenovo2 lacks sufficient detail for the reader to fully understand them. Papers of this kind usually explain the approaches and algorithms used in much more detail. The authors should look at other papers describing new versions of existing software, such as the recent ALLPATHS_LG paper (Ribeiro et al., 2012, http://genome.cshlp.org/content/early/2012/07/24/gr.141515.112.abstract), or even the article describing the first version of SOAP (Li et al., 2009). The text needs to be improved so that the reader can understand what changes were implemented and exactly how they improved the program.
Even though we were given access to the underlying raw data and obtained a pre-release version of SOAPdenovo2 from the authors, we could not replicate the results described in the paper because the section on 'Testing and Assessment' lacks detail: the exact commands used for the assemblies are not given.
The article is very biased towards assembly of human genomes. However, SOAPdenovo can be, and often is, used for the assembly of bacterial genomes. The authors use the Assemblathon1 data for their analyses of SOAPdenovo2. In the 'Background' section, the GAGE assembly competition is mentioned, which focusses on comparing programs for assembly of bacterial-sized genomes. However, SOAPdenovo2 was not evaluated against the GAGE data, something we feel is an omission.

One of us tested SOAPdenovo2 on the Rhodobacter sphaeroides dataset from GAGE, and ran the same analysis script as was used for the GAGE publication (http://gage.cbcb.umd.edu/results/index.html). We have included a summary of this analysis as a PDF attached to this report. From the results, we find the following:

SOAPdenovo2, like the first version of the program, still produces many errors in contigs and scaffolds (the 'corrected' N50 values are much lower than the N50 values of the sequences generated by SOAPdenovo2).
In our tests of the 'sparse assembly graph' approach, a better assembly was obtained by providing an estimated genome size larger than the real size. Do the authors have an explanation for this effect?
The 'sparse assembly graph' runs improved uncorrected scaffold sizes; however, they resulted in a larger number of scaffolds. Also, the corrected scaffold N50 values of these assemblies were in fact lower than those reported in the GAGE article for SOAPdenovo1.
We did see an improvement in the contigs from SOAPdenovo2 relative to the first version: fewer errors and higher corrected N50 values, but at the cost of higher contig numbers.
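
To make the distinction between N50 and 'corrected' N50 in the points above concrete, the following is a minimal sketch, not the GAGE analysis script. It assumes that a corrected N50 is obtained by breaking each sequence at the positions of detected errors and recomputing N50 on the resulting pieces. The function names, the toy lengths and the error positions are all hypothetical. The sketch uses the total assembly length as the denominator; some evaluations use the reference genome size instead.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together cover at least half of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def split_at_breakpoints(length, breakpoints):
    """Split one sequence of the given length at internal error positions."""
    internal = sorted({b for b in breakpoints if 0 < b < length})
    cuts = [0] + internal + [length]
    return [end - start for start, end in zip(cuts, cuts[1:])]


def corrected_n50(lengths, errors):
    """N50 after breaking sequences at error positions.

    `errors` maps a sequence index to a list of error positions
    (hypothetical input; a real pipeline would derive these from
    an alignment against a reference genome).
    """
    pieces = []
    for index, length in enumerate(lengths):
        pieces.extend(split_at_breakpoints(length, errors.get(index, [])))
    return n50(pieces)


# Toy data: five contigs, with one error each in the two longest.
toy_lengths = [100, 80, 60, 40, 20]
toy_errors = {0: [50], 1: [40]}
uncorrected = n50(toy_lengths)
corrected = corrected_n50(toy_lengths, toy_errors)
```

In this toy example the two longest contigs each contain one error, so the corrected value falls below the uncorrected one, which is the pattern described above for the SOAPdenovo2 assemblies.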

In conclusion, we do not see significant improvements using SOAPdenovo2 versus the first version of the program on the Rhodobacter dataset. We feel the authors should document the performance of SOAPdenovo2 on small genomes with an available reference genome, for example using the data that was the basis of the GAGE competition.

We also tried SOAPdenovo2 on data from one of our own large eukaryotic genomes. The 'default' version of the program crashed; only when we used the sparse assembly graph version did we get the program running. This may have been because we were not able to compile the program on our system and could only use the provided binaries.
GigaScience's description of a technical note requires 'the code described be documented and tested to high standards.' We did not have access to the source code and therefore cannot judge whether the code was well documented. Also, the few tests reported in the paper leave us uncertain whether the code can be considered 'tested to high standards' (see also above).
The paper makes many claims that are not supported by references or actual data. For example, it is written "Scaffold construction is another area that needs improvement in NGS de novo assembly programs." Can the authors point to some references to back up this claim? Similarly, when discussing the original SOAPdenovo program, the authors give three problematic areas as examples: improper handling of heterozygous contigs, chimeric scaffolds, and false contig relationships. However, no documentation of these problems is provided, such as real tests of assemblies of datasets with a reference genome in which these problems can be shown.
The authors tested new YH 2x100 Illumina data with SOAPdenovo2 but did not show comparable analyses of the same data with the original SOAPdenovo program. To fully elucidate the improvements made in the upgrade to SOAPdenovo2, the authors should report on the analysis of these new YH data with both versions of the program.
The authors used analyses from the Assemblathon1 (published February 2011) in their comparison of SOAPdenovo2 with the ALLPATHS_LG program. However, new versions of ALLPATHS_LG have been released since February 2011. As such, we feel that the authors should test the most recent version of ALLPATHS_LG against SOAPdenovo2 (using the same data) to ensure a fair comparison between the two programs.

Minor Essential Revisions

There is no reference to Table 2 in the main text.
The DOI link for reference 11 (http://dx.doi.org/10.5524/100038) was not resolving at the time this manuscript was submitted for review.

Discretionary Revisions

None

Level of interest: An article of importance in its field

Quality of written English: Needs some language corrections before being published

Statistical review: No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests: I declare that I have no competing interests.

Names and affiliations of the reviewers of this report:

Lex Nederbragt, Ole Kristian Tørresen and Karin Lagesen: Centre for Ecological and Evolutionary Synthesis (CEES), Dept. of Biology, University of Oslo, Oslo, Norway

Jeremy Chase Crawford (currently guest researcher at CEES): Dept. of Integrative Biology & Museum of Vertebrate Zoology, University of California, Berkeley, USA


Content of review 2, reviewed on November 30, 2012

The authors have done a good job responding to all the criticism. We thank them for the changes and additions they made to the manuscript. We particularly appreciate the effort that went into the supplementary material.

We therefore recommend accepting the manuscript for publication, pending some minor revisions.

Major remarks:

The authors write in their response "We will release the source code of SOAPdenovo2 as soon as the paper is accepted." We still feel strongly that the reviewers should have been given access to the source code of the program(s). However, we decided not to reject the paper because of this omission.

The authors write on page 5 "Notably, SOAPdenovo1.05 was released two years after SOAPdenovo1 and already included several improvements and new features from SOAPdenovo2, including the new contig and scaffold construction improvements, but without the new error correction and gap closure modules." We wonder whether it is then fair to compare SOAPdenovo2 only with SOAPdenovo1. The authors should have included SOAPdenovo1.05 as well when testing the new 100bp PE reads. At the very least they should acknowledge the limits of their comparison.

Minor Essential Revisions

Main text

page 3, in the sentence "However, the error correction module in SOAPdenovo was designed for short Illumina reads (35-50bp), which consumes excessive amount of computational time and memory on longer reads, say over 150GB memory running for two days using 40x 100bp paired-end Illumina HiSeq 2000 reads.": it is not clear what the 40x refers to. Is this a genome the size of the human genome?
page 5: "The SOAPdenovo2 assembly also had a much lower amount of copy number errors, but did have more substitution errors." Please include a description of how these measures are calculated.
please be careful to keep the colors consistent between figures 1 and 2, to avoid confusion
table 1: Copy Number Error-rate --> how is this defined, what is the unit?
table 2: please indicate v1 or v2 for the 'version' column of each row; please explain 'scaffold size'
both tables: what are the units? (usually bp)
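
The ambiguity of "40x" in the page 3 remark above can be illustrated with a short calculation. This is a minimal sketch, assuming that "40x" denotes average sequencing depth (total sequenced bases divided by genome size). The genome sizes used here (roughly 3 Gb for a human-sized genome, and a hypothetical 5 Mb bacterial-sized genome) are illustrative assumptions, not figures from the manuscript.

```python
def bases_required(genome_size_bp, depth):
    """Total number of sequenced bases needed for a given average depth."""
    return genome_size_bp * depth


def reads_required(genome_size_bp, depth, read_length_bp):
    """Number of reads of a fixed length needed for a given average depth.

    Uses integer ceiling division so that a partial read counts as one read.
    """
    total = bases_required(genome_size_bp, depth)
    return -(-total // read_length_bp)


# Assumed, approximate genome sizes for illustration only.
human_sized = 3_000_000_000   # ~3 Gb
bacterial_sized = 5_000_000   # hypothetical 5 Mb genome

human_reads = reads_required(human_sized, 40, 100)
bacterial_reads = reads_required(bacterial_sized, 40, 100)
```

Under these assumptions, the same "40x of 100bp reads" corresponds to about 1.2 billion reads for the human-sized genome but only 2 million reads for the bacterial-sized one, so the quoted memory and run-time figures are not interpretable without the genome size.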

Supplementary

could page numbers be included in the table of contents of the supplementary material?
section 1: the text mentions 'SOAPec-1.0' and 'SOAPec-2.0', however, these terms are used nowhere else. Please be consistent
section 1: "The algorithm is based on k-mer frequency spectrums (i.e. KFS), but the algorithm is quite different from other KFS tools" Please reference these other tools.
section 7: "As shown in Table 2, the scaffold N50 of SOAPdenovo2 overwhelmed ALLPATHS-LG and increased more than 4-fold compare to SOAPdenovo. But the contig N50 of ALLPATHS-LG is the longest" We object to the word 'overwhelmed'. There is a difference, but it is not overwhelming. By the same standard, the larger contig N50 of ALLPATHS-LG would equally 'overwhelm' SOAPdenovo2. Please rephrase.
section 7: "However, the contig N50 could be further improved for SOAPdenovo2 by using 3’-end connected reads and larger k-mer size as ALLPATHS-LG do." Why didn't the authors try this then?
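
For readers unfamiliar with the term used in the section 1 remark above, a k-mer frequency spectrum can be sketched as follows. This illustrates only the general concept (a histogram of how often each k-mer occurs in the reads); it is not the SOAPec algorithm, whose details we could not assess. The function names and the toy reads are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


def kmer_frequency_spectrum(reads, k):
    """Map each occurrence count to the number of distinct k-mers with that count.

    In real data, k-mers seen only once or twice are often sequencing
    errors, while genuine genomic k-mers cluster around the sequencing depth.
    """
    counts = count_kmers(reads, k)
    return dict(Counter(counts.values()))


# Toy reads for illustration only.
toy_reads = ["ACGTACGT", "ACGTAC"]
kmer_counts = count_kmers(toy_reads, 4)
spectrum = kmer_frequency_spectrum(toy_reads, 4)
```

Error-correction tools that rely on such a spectrum typically choose a count threshold below which k-mers are treated as likely errors; how that threshold is chosen is one of the details that differ between tools.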

Discretionary Revisions

we recommend that the authors ask a native English speaker to have a look at the text, especially that of the Supplementary material

Level of interest: An article of importance in its field

Quality of written English: Needs some language corrections before being published

Statistical review: No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests: I declare that I have no competing interests


References

Ruibang, L., Binghang, L., Yinlong, X., Zhenyu, L., Weihua, H., Jianying, Y., Guangzhu, H., Yanxiang, C., Qi, P., Yunjie, L., Jingbo, T., Gengxiong, W., Hao, Z., Yujian, S., Yong, L., Chang, Y., Bo, W., Yao, L., Changlei, H., W., C. D., Siu-Ming, Y., Shaoliang, P., Xiaoqian, Z., Guangming, L., Xiangke, L., Yingrui, L., Huanming, Y., Jian, W., Tak-Wah, L., Jun, W. 2012. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience.
