Review of SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data

Content of review 1, reviewed on February 19, 2019

This work attempts to derive a computational approach for subtracting background transcripts, which are a technical problem in droplet-based single-cell RNA-seq (the majority of current commercial platforms are droplet-based). The idea is appealing, but the methods used to implement it in this manuscript are very complex and not well described. The methods section is sufficiently long, but the formulas, the descriptions of the parameters, and the utility of the methods are difficult to follow because of errors and unexplained parameters. A major criticism is that the authors use a heuristic approach, which is very difficult to follow, to identify genes that are not expressed in a given cell in order to determine the contaminating soup fraction. Wouldn't a more robust statistical approach be more useful for the community? A mixed-effects model or something similar would be a more useful tool for estimating the contamination from the soup and then obtaining the residuals for downstream analysis.

Over-arching critiques:

1. The goal of the majority of the methods in this manuscript is to calculate the fraction of transcripts that come from the cell (pc) and the fraction of a gene's transcripts that come from the soup (pgs). The methods use a heuristic approach to identify cells that do not express the background transcripts, and use those cells to estimate these fractions. However, the cells in the experiment are likely the source of the background transcripts. How likely is it that such cells and transcripts can be identified in every case? How generalizable is this approach? The need to identify them seems to be a major weakness of the approach. The housekeeping gene approach has always been problematic because such genes may be variable under certain conditions.

2. Finding these transcripts in each cell requires a very complex set of operations to filter genes and cells, culminating in a final multinomial test. It is difficult to follow how this process is conducted, and even more difficult to work out what the final product means statistically.

3. The use of negative binomial, binomial, multinomial, and Poisson distributions is difficult to understand. For instance, the negative binomial has been shown to be a better representation of RNA-seq data than the Poisson. Why, then, use the Poisson to estimate the contamination fraction?

4. Wouldn't a method like a mixed-effects model be a better approach for identifying the fraction of the expression that comes from the cells? Why would your method be superior to a less heuristic approach? For example, the authors could have adapted this negative binomial mixed-effects modeling approach, which is used for microbiome studies: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1441-7

5. The analysis presented in Figure 2, showing the reduction of potentially contaminating effects, is interesting, but it does not address whether the method is robust, nor whether it accurately removes only soup counts. The method needs some form of validation beyond these improvements in biological interpretation, which are not themselves validated.
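
To make the first over-arching critique concrete, the following is a minimal sketch of how a contamination fraction could be estimated for one cell from genes assumed not to be expressed in it. It is not the manuscript's implementation. The names `estimate_contamination`, `soup_profile`, `unexpressed`, and `rho` are hypothetical, and the counts are made-up illustrative numbers. Under a Poisson model, the counts of the unexpressed genes have mean rho × (total UMIs in the cell) × (soup fraction of those genes), so the maximum-likelihood estimate of rho is the observed count divided by the count expected if the cell were entirely soup.

```python
import numpy as np


def estimate_contamination(cell_counts, soup_profile, unexpressed):
    """Estimate the contamination fraction rho for a single cell.

    cell_counts  : UMI counts per gene for the cell (hypothetical input).
    soup_profile : fraction of soup transcripts per gene, summing to 1.
    unexpressed  : indices of genes ASSUMED not to be expressed by the cell.
    """
    cell_counts = np.asarray(cell_counts, dtype=float)
    soup_profile = np.asarray(soup_profile, dtype=float)

    n_umis = cell_counts.sum()
    observed = cell_counts[unexpressed].sum()
    # Counts expected for these genes if every UMI in the cell came from the soup.
    expected_if_all_soup = n_umis * soup_profile[unexpressed].sum()
    if expected_if_all_soup == 0:
        raise ValueError("chosen genes are absent from the soup profile")
    # Poisson maximum-likelihood estimate of rho.
    return observed / expected_if_all_soup


# Illustrative numbers only: gene 0 is assumed unexpressed in this cell.
rho = estimate_contamination([10, 0, 90], [0.5, 0.3, 0.2], [0])
print(rho)  # 0.2
```

The estimate is only as good as the assumption that the chosen genes are truly unexpressed in the cell, which is exactly the generalizability concern raised above. If the cell does express them, rho is overestimated.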

Specific critiques:

1. In the formulas, please be sure to describe each parameter. For example:
- In formula #3, what are (~g) and G?
- On page 5, column 2, line 33 there is a statement that is not numbered, and its notation is difficult to decipher.
- Please go through the remaining formulas to make sure they are properly labeled; other terms were also not described.

2. It is unclear how formula 5 was derived. It does not look like a standard or adapted form of the negative binomial, which is commonly written as: https://wikimedia.org/api/rest_v1/media/math/render/svg/cbbb4081dff51e77322547c061613edb89a800f2

3. It is also unclear how formula 6 was derived, and what the purpose of the calculation is.

4. How is batch defined?

5. If the binomial has flaws, why not use the multinomial alone?

6. Why select only the cells with between 2 and 9 UMIs?

7. It is very difficult to glean any information from Figure 2B. Both the description in the main text and the legend are difficult to understand.
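
For reference on the critique of formula 5 above, the commonly cited textbook form of the negative binomial probability mass function is given below. Whether the linked image renders exactly this parameterization is an assumption. The symbols k, r, p, and \mu are standard notation and are not taken from the manuscript.

```latex
\Pr(X = k) = \binom{k + r - 1}{k} (1 - p)^{k} p^{r}, \qquad k = 0, 1, 2, \dots

\mu = \mathbb{E}[X] = \frac{r(1 - p)}{p}, \qquad
\operatorname{Var}(X) = \frac{r(1 - p)}{p^{2}} = \mu + \frac{\mu^{2}}{r}
```

Here X counts failures before the r-th success, each trial succeeding with probability p. The variance exceeds the mean by \mu^{2}/r, and as r grows with \mu held fixed the distribution approaches the Poisson, which is why the choice between the two distributions should be justified.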

Content of review 2, reviewed on February 21, 2020

The ideas presented in SoupX are interesting and would likely be of interest to the general single-cell community if the authors could address a couple of major issues in the current implementation of the tool. The reformatting of the methods section and the inclusion of more useful formulas with more thorough descriptions are very helpful. However, the authors have chosen to remove from the paper the selection of genes used to calculate the background. The previous heuristic approach was criticized because it was difficult to assess whether such a complicated heuristic would be generalizable, or whether there would be cases where it might fail. The decision to remove this part from the SoupX tool makes the tool incomplete and detracts significantly from its usability.

Major Critiques:

1. Statistical assessments are needed to demonstrate the benefit of using SoupX on a dataset. Currently only anecdotal evidence is given to suggest that the method works as intended. Such statistical tests are vital for determining the utility of the SoupX tool.

2. The authors must address the lack of appropriate datasets for testing the tool. If there really are no datasets that would allow them to test it, how can we ever know that the tool functions appropriately? The authors should devise and conduct experiments that would allow them to test their tool.

3. Can the authors make the heuristic for selecting genes simpler, or at least demonstrate that the complexity is necessary by applying it to datasets and showing that it performs better than simpler approaches?

4. Is the approach generalizable? Please demonstrate this with statistical assessments.

Declaration of competing interests: I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews: (https://drive.google.com/file/d/1DE8hxxx91vHiaQ9wVroMabuLlzeZzvly/view?usp=sharing)

References

Young, M. D., Behjati, S. 2020. SoupX removes ambient RNA contamination from droplet-based single-cell RNA sequencing data. GigaScience.
