Content of review 1, reviewed on February 15, 2016
Zhang et al. describe a new program for using RNA-seq data to scaffold genome assemblies. The manuscript is nicely written and the program seems to perform well. My only major concern is that the performance evaluation is very limited.
Major points:
1. The program is only tested on simulated data. Although this has the benefit of facilitating a very controlled evaluation, the results may not sufficiently reflect performance on real data. Therefore, both AGOUTI and RNAPATH should also be tested on real data. There are many options for this. My suggestion would be to use data from the same species used for the simulation (C. elegans). Since the program is presented in the context of short-read sequencing, I would suggest creating an initial assembly from Illumina shotgun reads and applying the programs to scaffold that assembly. Since the current official C. elegans assembly is of very high quality, it should be suitable as a proxy for the true genome sequence in the evaluation.
2. AGOUTI should also be tested (in comparison to RNAPATH) on at least one other species, to demonstrate generality.
3. The authors refer to a similar program called L_RNA_scaffolder (reference 4). Why was it not included in the comparison alongside RNAPATH?
Minor points:
4. The manuscript does not refer much to previous work in the area. Two papers that seem relevant are Chen et al. 2015 Sci Rep 5:18019 (PubMed ID 26658305) and Riba-Grognuz et al. 2011 Bioinformatics 27:3425 (PubMed ID 21994228).
5. AGOUTI uses a greedy algorithm for scaffolding (pages 5-6). The optimal solution is defined as a path recruiting all vertices. However, there could be several such paths. Would it not be better to find the path with the maximum weight? Or the path that has maximum weight among the paths recruiting all vertices? Is there an efficient (e.g. dynamic programming) algorithm for that problem? I am not suggesting that the algorithm implemented in AGOUTI needs to be changed for this paper, but it would be good to discuss these questions in the manuscript, to shed some light on the choices made in the design of this algorithm.
6. I am confused by this sentence in the description of the scaffolding algorithm: "This process terminates at any vertex whose connection with the next would have intervening gene models between them (Fig 5)" (page 6). Is this check for intervening gene models needed, since the earlier denoising step removes read pairs crossing such gene models?
7. Please state which genome sizes AGOUTI can be applied to, and provide some details about memory usage and compute time.
8. Which parameters were used for AUGUSTUS (page 7)?
9. Why were RNA-seq reads mapped using BWA instead of a dedicated RNA-seq alignment program such as STAR or HISAT? Such programs can map reads across introns and therefore achieve much higher accuracy. Does AGOUTI accept intron-spanning read mappings as input?
10. I find the following sentence on page 8 confusing: "We tested this by re-running AGOUTI on the six noise-free datasets, and decreased the minimum number of supporting joining-pairs to 2 (i.e. k=2), the same setting as was used for RNAPATH runs". This sentence seems to introduce results for k=2. However, results for k=2 were already described in the previous two paragraphs, which refer to tables 2 and 3, where results for k=2 are presented.
11. In the sentence on page 9 ending in "two exons of a single gene for each contig pair", I would remove "for each contig pair", since this is stated earlier in the same sentence.
12. In the description of the three cases of merged genes (page 9), it would be informative to provide the percentage of genes for each case, in the text.
13. Also relating to the merged genes, it should be noted that RNA-seq data also captures transcriptional noise, e.g. there may be reads connecting adjacent genes due to rare failures of transcriptional termination. Some of these cases could also be artifacts from library construction.
14. In figure 7, I found the numbering of cases confusing. I would suggest either using numbers 1-4, i.e. include the "standard case" as case 1, or name the other cases "alternative case 1-3" to distinguish them from the standard case.
15. Additional file 2 is a supporting data description, but does not actually say what data it is referring to. It is a large data set (40 Gb). Is it the simulation data, or something else?
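To make the question in point 5 concrete, the following is a minimal sketch, not AGOUTI's implementation. It assumes a tiny hypothetical contig graph in which edge weights are the numbers of supporting joining-pairs. The graph, the weights, and the function names `greedy_path` and `best_spanning_path` are illustrative assumptions. It compares a greedy extension from a chosen start vertex with an exhaustive search for the maximum-weight path that recruits all vertices.

```python
from itertools import permutations

# Hypothetical contig graph (an assumption for illustration only):
# keys are contig pairs, values are counts of supporting joining-pairs.
EDGES = {
    ("A", "B"): 5,
    ("A", "C"): 6,
    ("B", "C"): 4,
    ("B", "D"): 5,
    ("C", "D"): 1,
}
VERTICES = ["A", "B", "C", "D"]


def weight(u, v):
    """Weight of the undirected edge u-v, or 0 if the contigs are not linked."""
    return EDGES.get((u, v), EDGES.get((v, u), 0))


def path_weight(path):
    """Sum of edge weights along consecutive vertices of a path."""
    return sum(weight(u, v) for u, v in zip(path, path[1:]))


def greedy_path(start, vertices):
    """Extend the path from `start`, always taking the heaviest unused neighbour."""
    path = [start]
    while True:
        candidates = [
            (weight(path[-1], v), v)
            for v in vertices
            if v not in path and weight(path[-1], v) > 0
        ]
        if not candidates:
            return path
        path.append(max(candidates)[1])


def best_spanning_path(vertices):
    """Exhaustively find the maximum-weight path that visits every vertex.

    Brute force over all orderings; only feasible for very small graphs.
    """
    best = None
    for perm in permutations(vertices):
        connected = all(weight(u, v) > 0 for u, v in zip(perm, perm[1:]))
        if connected and (best is None or path_weight(perm) > path_weight(best)):
            best = perm
    return list(best) if best is not None else None


greedy = greedy_path("A", VERTICES)
best = best_spanning_path(VERTICES)
print(greedy, path_weight(greedy))
print(best, path_weight(best))
```

In this toy graph both paths recruit all four vertices, yet the greedy path started at A has a lower total weight than the best spanning path, which is the situation point 5 asks the authors to discuss. The exhaustive search is exponential in the number of vertices, and finding a maximum-weight path through all vertices is NP-hard on general graphs, so an efficient exact algorithm would have to exploit additional structure in the contig graph.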
Level of interest
Please indicate how interesting you found the manuscript:
An article of importance in its field.
Quality of written English
Please indicate the quality of language in the manuscript:
Acceptable.
Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from an
organisation that may in any way gain or lose financially from the publication of this
manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the
manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that
holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests'
below. If your reply is yes to any, please give details below.
I declare that I have no competing interests.
I agree to the open peer review policy of the journal. I understand that my name will be included
on my report to the authors and, if the manuscript is accepted for publication, my named report
including any attachments I upload will be posted on the website along with the authors'
responses. I agree for my report to be made available under an Open Access Creative Commons
CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
which I do not wish to be included in my named report can be included as confidential comments
to the editors, which will not be published.
I agree to the open peer review policy of the journal.
Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0136-3/13742_2016_136_AuthorComment_V1.pdf)
Content of review 2, reviewed on May 01, 2016
I only have a few comments on the revised version:
1. It should be indicated which assembly was used as "truth" in the evaluation on real data (N2/CB), e.g. to compute the error counts for Table 3.
2. I think the following statement, which occurs both in the abstract and main text, could be misleading: "genomes sequenced using short-read, next-generation sequencing technologies are error-filled and fragmented into thousands of small sequences". The term "error-filled" is imprecise and could lead readers to think that the sequences are so full of errors that they are nearly useless, which is certainly not the case. For small genomes the number of contigs can be far fewer than 1000, so "thousands" should also be changed.
3. Another reason for the worse performance on real data (page 12) could be that actual breakpoints are more common in intergenic regions, e.g. due to enrichment of repeats in those regions. This could easily be checked.
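The check suggested in point 3 could be sketched as follows. This is a minimal illustration under stated assumptions: breakpoint positions and gene coordinates are given for a single reference sequence, coordinates are closed intervals, and the function names and example numbers are hypothetical rather than taken from the manuscript.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping (start, end) gene intervals into disjoint, sorted ones."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def is_genic(pos, merged):
    """True if `pos` falls inside any merged gene interval (closed coordinates)."""
    starts = [s for s, _ in merged]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and merged[i][0] <= pos <= merged[i][1]


def intergenic_fraction(breakpoints, genes):
    """Fraction of breakpoint positions that lie outside all gene intervals."""
    if not breakpoints:
        raise ValueError("no breakpoints given")
    merged = merge_intervals(genes)
    n_intergenic = sum(not is_genic(p, merged) for p in breakpoints)
    return n_intergenic / len(breakpoints)


# Made-up example data: three gene models and four assembly breakpoints.
genes = [(100, 200), (150, 300), (500, 600)]
breakpoints = [50, 250, 400, 550]
fraction = intergenic_fraction(breakpoints, genes)
print(fraction)
```

To interpret the result, the observed fraction would need to be compared with the fraction of the reference sequence that is intergenic, since breakpoints placed uniformly at random would already fall in intergenic regions at that rate.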
Level of interest
Please indicate how interesting you found the manuscript:
An article whose findings are important to those with closely related research interests.
Quality of written English
Please indicate the quality of language in the manuscript:
Acceptable.
Declaration of competing interests
I declare that I have no competing interests.
I agree to the open peer review policy of the journal.
Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0136-3/13742_2016_136_AuthorComment_V2.pdf)
References
Zhang, S. V., Zhuo, L., Hahn, M. W. 2016. AGOUTI: improving genome assembly and annotation using transcriptome data. GigaScience.