Content of review 1, reviewed on December 19, 2018
In the manuscript, Bushmanova et al. have proposed an extension of the SPAdes genome assembler named "rnaSPAdes" and have drawn parallels between rnaSPAdes and the SPAdes assembler in single-cell mode (since single-cell sequencing also gives rise to non-uniform coverage). The authors have also compared rnaSPAdes to various other transcriptome assemblers. They have presented their results in the form of statistics obtained from assembly evaluation tools such as DETONATE, rnaQUAST and TransRate. I have the following concerns:
Major:
- Overall, it is hard to identify novel methodological contributions in the paper. One of the major modifications relative to their SPAdes genome assembler is the graph simplification procedure. Here the authors remove bubbles and tips from the graph based on k-mer coverage, the length of the tip/bubble, and the sequence similarity between the tip/bubble and the alternative edge. This is similar to previous work, except perhaps that tips are only removed if they have a similar sequence, which is not done by other methods; how large the effect of this simple change is remains unclear (a minimal sketch of such a rule is given after these two points). The authors have also modified the path extension algorithm of the SPAdes assembler to allow for paths belonging to different isoforms, but the greedy algorithm is similar to those of other assemblers. They mention strategies to remove chimeric reads, but it is unclear what the impact of these removal strategies is. Overall, it is not clear whether the methodological differences account for the improvements in their current experiments, given the similarities to the other methods.
- Another distinction from most of the methods compared in the manuscript is that rnaSPAdes includes an external error-correction step inherited from SPAdes using BayesHammer (which does not work on the de Bruijn graph). I am not sure whether any methodological change has been made to the BayesHammer approach to account for the specifics of RNA-seq data (its original purpose was single-cell genomic data, which shares the coverage non-uniformity), but it has been shown several times before that error correction of RNA-seq data before assembly improves the contiguity of RNA assemblies. Tools like Rcorrector and SEECER, which are made specifically for RNA-seq data, are likely to lead to a bigger boost than what is reported here (when one compares the assembly results of any method after correcting the reads). And clearly any of these de novo correction methods can be used before assembly with any of the assemblers tested here. For example, it would be interesting to see what difference it makes to assemble the BayesHammer-corrected reads with the competing methods; how does that compare to the results with rnaSPAdes?
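To make the simplification criteria from the first point above concrete, here is a minimal, purely illustrative Python sketch of such a combined coverage/length/similarity rule for tips. The graph representation, the thresholds and the edit_distance helper are hypothetical and are not taken from rnaSPAdes:

```python
# Illustrative sketch of coverage/length/similarity-based tip removal.
# The graph model, thresholds and helper names are hypothetical; the actual
# simplification procedures in rnaSPAdes differ in detail.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Edge:
    seq: str                 # nucleotide sequence of the edge
    coverage: float          # average k-mer coverage of the edge
    is_tip: bool             # True if the edge ends in a dead-end vertex
    alternatives: List["Edge"] = field(default_factory=list)  # edges sharing the start vertex


def edit_distance(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (illustration only)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def should_remove_tip(tip: Edge, max_len: int = 200,
                      cov_ratio: float = 0.1, max_rel_dist: float = 0.2) -> bool:
    """Remove a tip only if it is short, weakly covered relative to some
    alternative edge AND sequence-similar to that alternative; the similarity
    requirement is the criterion the review singles out as unusual."""
    if not tip.is_tip or len(tip.seq) > max_len:
        return False
    for alt in tip.alternatives:
        similar = edit_distance(tip.seq, alt.seq[:len(tip.seq)]) <= max_rel_dist * len(tip.seq)
        weakly_covered = tip.coverage <= cov_ratio * alt.coverage
        if similar and weakly_covered:
            return True
    return False
```

Toggling the similarity condition in a rule of this kind and re-running the benchmarks would be one way to quantify how much this particular change actually contributes.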
The authors have compared rnaSPAdes against various other transcriptome assemblers and have shown that rnaSPAdes sometimes performs better (in some statistics). The k-mer size is one of the most important parameters in an assembly procedure. The authors have optimized the k-mer parameter for their own algorithm but have kept the default k-mer parameters for the other algorithms, which often differ substantially from those chosen by rnaSPAdes. Hence, the comparison is unfair, as the assemblers are not evaluated on equal footing. Combined with the fact that there are no clear methodological improvements, the results remain mostly inconclusive, except perhaps that the additional use of error correction is helpful, as reported before.
All the datasets the authors have used have very low coverage (fewer than 11 million reads for all but one dataset, which has 30 million). This is a bit strange, as the generation of high-coverage datasets is quite common these days. Including at least one additional high-coverage dataset, which is more standard nowadays, would be important for judging assembly performance as well as runtime and memory consumption. In terms of runtime and memory, rnaSPAdes is neither particularly fast nor memory-efficient compared to current tools.
The authors claim that they have tested the algorithm on metatranscriptomic data and obtained decent assemblies, but no results are shown in either the manuscript or the supplementary data.
Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.
I declare that I have no competing interests.
I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
Authors' response to reviews: (https://drive.google.com/open?id=1ie4D8_PtHewMk8XdVGO1HkGhYjMBdSzT)
Source
© 2018 the Reviewer (CC BY 4.0).
Content of review 2, reviewed on May 17, 2019
In this work, Bushmanova et al. have proposed an algorithm for transcriptome assembly inspired by the genome assembler SPAdes. Compared to the previously proposed method, the authors have made certain changes to their algorithm, namely 1) removal of the error-correction step based on the BayesHammer algorithm and 2) addition of a 'removal of isolated edges' step to the graph processing stage. The authors have also added two new high-coverage datasets for analyzing the performance of their algorithm. I appreciate the authors' response to my previous comments, but unfortunately they have not addressed the major caveat of the study.
1. My previous question: - The authors have compared rnaSPAdes against various other transcriptome assemblers and have shown that rnaSPAdes sometimes performs better (in some statistics). The k-mer size is one of the most important parameters in an assembly procedure. The authors have optimized the k-mer parameter for their own algorithm but have kept the default k-mer parameters for the other algorithms, which often differ substantially from those chosen by rnaSPAdes. Hence, the comparison is unfair, as the assemblers are not evaluated on equal footing. Combined with the fact that there are no clear methodological improvements, the results remain mostly inconclusive, except perhaps that the additional use of error correction is helpful, as reported before.
- Their answer: We absolutely agree that k-mer size is one of the most important parameters for de novo sequence assembly, and this is exactly the reason why we decided to optimize it. However, we did not choose an optimal k value for each dataset separately, but developed a universal strategy that automatically computes a nearly optimal k for any kind of data depending on the read length. Thus, selecting an optimal value for each assembler on every dataset seems to be unfair, especially taking into account the fact that in real assembly projects the ground truth is unknown and choosing the best assembly becomes non-trivial. The procedure of selecting optimal k-mer values can also be considered a methodological improvement compared to other tools (which have just fixed k-mer size(s) for all cases) and a part of the developed pipeline. Based on our experience with assembly software users, we see that the majority of them use the default k-mer values and rarely change them. Additionally, running all assemblers with different k-mer sizes on several datasets and assessing their quality would require roughly several processor-years.
- My new response: It is good that we agree on the importance of k-mer values. However, it is not novel to suggest using more than one k-mer value for de novo transcriptome assembly. The Trans-ABySS paper (cited, 2010), the Oases paper (cited, 2013) and more recent work (the KREATION package; Informed kmer selection for de novo transcriptome assembly, Bioinformatics 2016) have clearly demonstrated that using more than one k-mer, in particular a smaller one (more sensitive for lowly expressed transcripts) and a larger one (to deal with excessive coverage and resolve repeats), clearly boosts overall assembly performance. Thus, there is no novelty in the observation that using two k-mers is better than using one. Even worse, Trans-ABySS, for example, allows merging the results from two k-mers; why was this not done, if the authors believe that using two k-mers is much better than using one? Why do they not run the other assemblers at these good k-mer values (where possible)? Concerning this point, I think the contribution of this work could, in the best case, be to show that easy rules suffice for the selection of the two k-mers, for example in comparison to the more data-driven strategy of the KREATION approach (see the sketch directly below). But this is not what they analyze with their method comparison. Instead they use a set of diverse assemblers, each at its default k-mer values, which are, of course, not ideal for all datasets. The only exception is IDBA-tran, which runs over several k-mer values by default. Unless they change the parameters of the other assemblers and run those for which this is feasible (Trans-ABySS, Bridger, IDBA-tran) with the same two k-mers, it remains unclear whether their software in fact has an advantage.
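To make concrete what is being asked for, below is a minimal Python sketch of (a) the kind of easy rule that could pick a small and a large odd k from the read length and (b) a naive merge of the two resulting assemblies. All constants, file names and the substring-based redundancy filter are hypothetical illustrations; they are not the procedures implemented in rnaSPAdes, Trans-ABySS or KREATION.

```python
# Illustration only: a toy read-length-based rule for choosing two k-mer sizes
# and a naive merge of the two resulting contig sets. Real pipelines use
# alignment-based redundancy removal; the substring filter below is a
# deliberate simplification.

from typing import Dict, Iterable, List, Tuple


def choose_k_pair(read_length: int, small_k: int = 21, max_k: int = 127) -> Tuple[int, int]:
    """Return a small, sensitive k (for lowly expressed transcripts) and a
    large k of roughly half the read length (for repeat resolution)."""
    def make_odd(k: int) -> int:
        return k if k % 2 == 1 else k - 1
    large_k = make_odd(max(small_k, min(max_k, read_length // 2)))
    return make_odd(small_k), large_k


def read_fasta(path: str) -> Dict[str, str]:
    """Minimal FASTA reader."""
    contigs, name, chunks = {}, None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    contigs[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        contigs[name] = "".join(chunks)
    return contigs


def merge_two_k(contigs_small_k: Iterable[str], contigs_large_k: Iterable[str]) -> List[str]:
    """Pool contigs from both assemblies and drop any contig that is an exact
    substring of a longer kept contig (or of its reverse complement)."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    pool = sorted(set(contigs_small_k) | set(contigs_large_k), key=len, reverse=True)
    kept: List[str] = []
    for seq in pool:
        rc = seq.translate(comp)[::-1]
        if not any(seq in longer or rc in longer for longer in kept):
            kept.append(seq)
    return kept


# Hypothetical usage for 150 bp reads, assuming the same assembler was run
# twice, once at each k, producing assembly_k21.fasta and assembly_k75.fasta:
#   k_small, k_large = choose_k_pair(150)          # -> (21, 75)
#   merged = merge_two_k(read_fasta("assembly_k21.fasta").values(),
#                        read_fasta("assembly_k75.fasta").values())
```

Running, for instance, Trans-ABySS, Bridger and IDBA-tran at the same two k values and merging the results (with their own merge utilities or a simple filter like the one above) would put the comparison in Table 1 on more equal footing.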
Similarly, the first results section of the paper, where they state that SPAdes performs better than other de novo transcriptome assemblers, is based on an unfair comparison, and the results may look different if the k-mer usage is corrected (Table 1).
Minor comment: In all the tables, the authors have given the column names of the assemblers as IDBA, SOAP, ABySS and Bloom, which are the names of genome assemblers (although they are named correctly in the table legends). I would suggest keeping the naming consistent, as the current form might create confusion for readers.
Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.
I declare that I have no competing interests.
I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
Authors' response to reviews: (https://drive.google.com/open?id=1uVwDMz22bk3OL6Osua1kgqcnhYCkUPbE)
Source
© 2019 the Reviewer (CC BY 4.0).
References
Bushmanova, E., Antipov, D., Lapidus, A., Prjibelski, A. D. 2019. rnaSPAdes: a de novo transcriptome assembler and its application to RNA-Seq data. GigaScience.
