Review of The case for using mapped exonic non-duplicate reads when reporting RNA-sequencing depth: examples from pediatric cancer datasets

Content of review 1, reviewed on September 28, 2020

Beale and colleagues describe the utility of quantifying mapped exonic non-duplicate (MEND) reads as a way to summarise the quality and depth of RNA-seq data. In their report they use a large number of pediatric cancer studies to demonstrate how MEND is informative for QC. I think this report has some merit but I have a number of concerns and suggestions:

Since most readers will only go as far as the title and abstract it is imperative these are very clear. The title suggests we should "use" MEND counts for RNA-seq. I comprehend this as a recommendation to use MEND counts for all downstream analysis which is contrary to current best practices and a number of previous reports on the topic (PMID: 27156886, PMID: 30001700).
In the abstract authors state "we propose using only definitively informative reads, MEND reads, for the purposes of asserting the accuracy of gene expression measured in a bulk RNA-Seq experiment." However the terms "using" and "accuracy" are not defined or demonstrated in the paper.
rRNA may explain the long tail of the unmapped reads. In my experience rRNA carry-over is the main source of unmapped reads in RNA-seq. This can be quantified by mapping reads to the set of rRNA genes with BWA or similar.
Panels Fig 1B and Fig 2A are not mentioned in the body text.
Figure 3A x-axis should be percentages not proportions.
In the work around Supplemental Figure 2, the detection threshold is set as a certain TPM number but the logic behind this is not sound. If a TPM based threshold is used, then the threshold increases with increasing sequencing depth resulting in losing genes when adding more seq coverage. Rather, setting a static threshold is more logical, for example an average of 10 reads per gene across the experiment.
MEND reads are similar to the number of assigned reads that are reported in most RNA-seq studies, with the difference that duplicate reads are discarded. What is the added benefit of reporting MEND reads as compared to simply the assigned reads?
Many of the tools used for RNA-seq quantification such as Kallisto or Salmon are extremely fast and can be run on relatively low-powered computers. What is the relative computational cost/time of calculating MEND reads?
Can you suggest a rule of thumb for the number of MEND reads or %MEND reads for excluding poor quality datasets? Does applying this threshold improve the accuracy of downstream analysis?
Lastly, authors make some recommendations that RNA-seq studies should report MEND reads, which on the face of it is very sensible, but this should be based on solid data. This recommendation is based largely on the observed correlation data shown in Fig 3C, but this is insufficient. Authors should consider one or more of the following to further support their recommendations:

a. Threshold analysis. Does applying MEND thresholds (omitting samples that do not meet acceptable MEND thresholds) improve the accuracy of downstream analysis (i.e. the discovery of truly differentially expressed genes)?

b. Saturation analysis. Does the number of DEGs discovered for high %MEND datasets increase faster than low %MEND datasets as a function of increasing sequencing depth?

c. After randomly subsampling to a set read number, do low %MEND datasets exhibit poor kmer complexity? This could be quantified with kPAL (PMID: 25514851) or another kmer tool.

d. An analysis of RNA-seq with spike-ins and unique molecular identifiers (UMIs) to confirm that MEND reads are informative and reliable while duplicate reads are a source of bias. This could be done with a parallel analysis that does/doesn't include duplicate removal followed by differential analysis.

Declaration of competing interests: I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews:

Reviewer 1 Comment 1: The authors suggest that in order to assess the accuracy of an RNAseq experiment, researchers should state the MEND read count for their experiment rather than total reads. I found the article interesting and well written. Just one query: could the authors include p-values for all correlations, and check the normality of their variables to ensure that a Pearson correlation is the most appropriate statistic to use?

Answer: We thank the reviewer for this insight. We found that the data is not sufficiently normal to warrant a Pearson correlation. We have adopted Spearman correlations and added p values to Figure 4 (previously Figure S1), which is discussed in paragraph 4 of the results section.
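
To illustrate the switch described in this answer, the following is a minimal Python sketch of a Spearman correlation computed as a Pearson correlation on ranks. The function names and the example values are hypothetical and are not taken from the manuscript, and no p-values are computed here.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values: monotonic but strongly non-linear relationship.
total_reads = [1, 2, 3, 4, 5]
mend_reads = [1, 4, 9, 16, 100]

print("Pearson: ", pearson(total_reads, mend_reads))
print("Spearman:", spearman(total_reads, mend_reads))
```

Because the rank-based statistic depends only on ordering, it is not distorted by the single extreme value, whereas the Pearson coefficient is.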

Reviewer 2: Beale and colleagues describe the utility of quantifying mapped exonic non-duplicate (MEND) reads as a way to summarise the quality and depth of RNA-seq data. In their report they use a large number of pediatric cancer studies to demonstrate how MEND is informative for QC. I think this report has some merit but I have a number of concerns and suggestions:

Reviewer 2 Comment 1: Since most readers will only go as far as the title and abstract it is imperative these are very clear. The title suggests we should "use" MEND counts for RNA-seq. I comprehend this as a recommendation to use MEND counts for all downstream analysis which is contrary to current best practices and a number of previous reports on the topic (PMID: 27156886, PMID: 30001700).

Answer: We thank the reviewer for this suggestion. We have added clarifying language to the title (the new title is “The case for using Mapped Exonic Non-Duplicate (MEND) reads when reporting RNA sequencing depth”), abstract (conclusions section), and body (paragraphs 1 and 3 of the background, paragraphs 2 and 4 of results, paragraph 8 of the conclusion) of the document to indicate that we are discussing read counts for the purpose of reporting sequencing depth, not individual gene measurements or downstream analysis such as differential expression.

Reviewer 2 Comment 2: In the abstract authors state "we propose using only definitively informative reads, MEND reads, for the purposes of asserting the accuracy of gene expression measured in a bulk RNA-Seq experiment." However the terms "using" and "accuracy" are not defined or demonstrated in the paper.

Answer: We have replaced "accuracy" with "reproducibility" throughout the document and demonstrate what we mean by reproducibility in the first paragraph of the background section.

As described in item 1, we have added statements to the title, abstract and body of the document clarifying that we are proposing using MEND counts for reporting sequencing depth, not individual gene measurements.

Reviewer 2 Comment 3: rRNA may explain the long tail of the unmapped reads. In my experience rRNA carry-over is the main source of unmapped reads in RNA-seq. This can be quantified by mapping reads to the set of rRNA genes with BWA or similar.

Answer: We agree that rRNA reads may be the cause of the high number of unmapped reads in a few datasets. We feel however that this analysis is beyond the scope of the manuscript. We're demonstrating that read type composition varies between datasets, not trying to explain the underlying causes of that variability.

Reviewer 2 Comment 4: Panels Fig 1B and Fig 2A are not mentioned in the body text.

Answer: Fig 1B is mentioned in the data description subsection of the methods, in the last sentence before the data analysis section. We fixed the reference to Figure 1A in the first paragraph of the results that was intended to refer to Figure 2A.

Reviewer 2 Comment 5: Figure 3A x-axis should be percentages not proportions.

Answer: We thank the reviewer for identifying this. We have fixed it.

Reviewer 2 Comment 6: In the work around Supplemental Figure 2, the detection threshold is set as a certain TPM number but the logic behind this is not sound. If a TPM based threshold is used, then the threshold increases with increasing sequencing depth resulting in losing genes when adding more seq coverage. Rather, setting a static threshold is more logical, for example an average of 10 reads per gene across the experiment.

Answer: We thank the reviewer for this insight. We agree. We repeated the analysis with counts and counts per kb of transcript length. We found that the changes in correlation between gene expression and read depth observed in more highly expressed genes are confounded by the fact that we include duplicate reads in gene expression quantification, as is standard in the field (PMID: 27156886, PMID: 30001700), but not in MEND counts. In hindsight, we feel that this analysis is ancillary to the main point of the manuscript, and we have omitted it. (The main narrative of the manuscript is that previous work shows that two categories of reads aren't used for gene expression quantification, and a third, duplicates, can be spurious. In the survey of RNA-Seq datasets from a variety of sources, we show that the fraction of each read type varies greatly across datasets. This sufficiently supports our argument that neither mapped nor total reads are sufficient for describing the amount of data contributing to the reproducibility of a dataset.)
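
The difference between a TPM-based and a count-based detection threshold raised in this comment can be sketched with a small worked example. The following Python snippet is only an illustration under stated assumptions: the three-gene library, the gene lengths, the 1000 TPM cutoff, and the function names are all hypothetical, and the snippet does not reproduce the omitted analysis.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million from raw read counts and gene lengths in kb."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


def reads_at_tpm(threshold_tpm, gene_length_kb, counts, lengths_kb):
    """Read count at which a gene of the given length sits at the TPM
    threshold, treating the library's summed reads-per-kb as fixed."""
    total_rate = sum(c / l for c, l in zip(counts, lengths_kb))
    return threshold_tpm * gene_length_kb * total_rate / 1e6


# Hypothetical three-gene library, sequenced at two depths.
lengths_kb = [1.0, 2.0, 3.0]
shallow = [100, 900, 9000]
deep = [2 * c for c in shallow]  # same composition, twice the depth

# TPM values are unchanged by uniform scaling of depth ...
print(tpm(shallow, lengths_kb))
print(tpm(deep, lengths_kb))

# ... but the read count corresponding to a fixed TPM cutoff doubles.
print(reads_at_tpm(1000, 2.0, shallow, lengths_kb))
print(reads_at_tpm(1000, 2.0, deep, lengths_kb))
```

In this sketch a fixed TPM cutoff corresponds to a read count that scales with depth, whereas a static count cutoff (such as the reviewer's suggested average of 10 reads per gene) does not.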

Reviewer 2 Comment 7: MEND reads are similar to the number of assigned reads that are reported in most RNA-seq studies, with the difference that duplicate reads are discarded. What is the added benefit of reporting MEND reads as compared to simply the assigned reads?

Answer: In some datasets, nearly all reads are duplicates. If only assigned reads were considered, the datasets would appear to be high quality. We address this issue and previous studies in paragraph 3 of the Conclusion.
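
The point made in this answer can be shown with a short numerical sketch. The Python snippet below assumes that "assigned" reads means mapped reads assigned to exons, and it uses two invented datasets; the class name, field names, and all counts are hypothetical and not drawn from the manuscript.

```python
from dataclasses import dataclass


@dataclass
class ReadCounts:
    total: int                 # all sequenced reads
    mapped: int                # reads that map to the genome
    exonic: int                # mapped reads assigned to exons ("assigned")
    exonic_non_duplicate: int  # assigned reads left after duplicate removal (MEND)

    def assigned_fraction(self) -> float:
        return self.exonic / self.total

    def mend_fraction(self) -> float:
        return self.exonic_non_duplicate / self.total


# Two hypothetical datasets with identical total, mapped, and assigned counts.
typical = ReadCounts(50_000_000, 45_000_000, 35_000_000, 30_000_000)
mostly_duplicates = ReadCounts(50_000_000, 45_000_000, 35_000_000, 2_000_000)

for name, rc in [("typical", typical), ("mostly duplicates", mostly_duplicates)]:
    print(f"{name}: assigned {rc.assigned_fraction():.0%}, MEND {rc.mend_fraction():.0%}")
```

Both invented datasets report the same assigned fraction, so only the MEND fraction separates the one dominated by duplicates.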

Reviewer 2 Comment 8: Many of the tools used for RNA-seq quantification such as Kallisto or Salmon are extremely fast and can be run on relatively low-powered computers. What is the relative computational cost/time of calculating MEND reads?

Answer: It's (frankly) quite slow! Our next step is to make a faster version. We've added a "Computing requirements for MEND pipeline" section to the results on page 10 to describe the speed of the process and a note in the conclusion (paragraph 6) about the benefit of developing a faster version.

Reviewer 2 Comment 9: Can you suggest a rule of thumb for the number of MEND reads or %MEND reads for excluding poor quality datasets? Does applying this threshold improve the accuracy of downstream analysis?

Answer: We have added a discussion of this question to the conclusion in paragraph 7.

Reviewer 2 Comment 10: Lastly, authors make some recommendations that RNA-seq studies should report MEND reads, which on the face of it is very sensible, but this should be based on solid data. This recommendation is based largely on the observed correlation data shown in Fig 3C, but this is insufficient. Authors should consider one or more of the following to further support their recommendations:

b. Saturation analysis. Does the number of DEGs discovered for high %MEND datasets increase faster than low %MEND datasets as a function of increasing sequencing depth?

c. After randomly subsampling to a set read number, do low %MEND datasets exhibit poor kmer complexity? This could be quantified with kPAL (PMID: 25514851) or another kmer tool.

Answer: We thank the reviewer for this insight. We feel however that this analysis is beyond the scope of the manuscript. We're demonstrating that read type composition varies between datasets. By the nature of gene expression quantification methods, reads that do not map to genes will not contribute to the quantification, and excessive duplicate reads do not reflect the underlying gene expression.

Content of review 2, reviewed on January 06, 2021

Many thanks and congratulations to the authors for taking on board most of the points.

Authors' response to reviews: Thank you for all your work on this manuscript. I've added all the changes suggested.

Notes from 2/1: 1) Can we add Olena Vaske (senior author) as co-corresponding author? I added that notation to the manuscript. 2) Can we add Matthew A. Cattle and Liam T. McKay (who are manuscript authors) to the dataset authorship list in the same places as they are on the manuscript author list? If this is possible, I'll also need to update the author list on the citation for the dataset. I'm sorry I didn't see this in my earlier review of the dataset citation.

Notes from 2/5: I updated the citation for GigaDB and also fixed another error where I referred to Drew Thompson as Drew Thomson in the author list and GigaDB citation.

References

C., B. H., M., R. J., A., C. M., T., M. L., A., T. D. K., Katrina, L., Geoffrey, L. A., T., K. E., Rob, C., Lam, D. L., Lauren, S., Jacob, P., John, V., Isabel, B., R., S. S., David, H., M., V. O. The case for using mapped exonic non-duplicate reads when reporting RNA-sequencing depth: examples from pediatric cancer datasets. GigaScience.
