Content of review 1, reviewed on July 05, 2021

Summary

This study addresses the problem of large-scale integration of single-cell RNA sequencing (scRNA-seq) data from different studies using batch integration methods, and generates a comprehensive scRNA-seq dataset of the retina across multiple species. The authors have developed an R package (scPOP) to facilitate evaluation of batch integration performance, a Snakemake-based workflow (SnakeQUANT) for gene expression quantification, and a Snakemake-based workflow (SnakePOP) to assess batch integration performance. Final processed data objects for the integrated retina dataset (scEiaD) are provided in several formats. An online code notebook is also provided to demonstrate how to integrate external datasets into the retina dataset.

These resources are highly useful, clearly described, and relevant to a wide range of readers. The integrated retina dataset includes >700,000 cells from 34 datasets across 3 species. Reproducible code and data objects are provided through several repositories.

Major comments

To select a top-performing batch integration method (scVI) for generating the integrated retina dataset, the authors calculate a total of 8 performance measures (lines 130-131), and then average these (sumZScale) to give a final combined score, and rank the methods by this score. However, a simple average of multiple performance metrics in this manner seems quite arbitrary - each of the 8 metrics is effectively given equal weight. Could the authors provide some more detail on the sensitivity of the final ranking to the set of metrics used? For example, is the ranking consistent between the individual metrics? (Line 347 in Methods notes that the balance between metrics can be adjusted, with a default value of 1.)

The installation of the scPOP R package from GitHub requires compilation, which may not work easily on all systems (e.g. Windows). It would be useful to test this on multiple systems, and consider adding continuous integration (GitHub Actions or submitting to Bioconductor) to ensure that it can be installed by all users, or at least providing detailed installation instructions.

Minor comments

It is somewhat unclear where the main datasets (scEiaD) are stored. The processed data objects can be found by following the GitHub link (https://github.com/davemcg/scEiaD) and scrolling down on the readme. It would be useful to make the links to the data objects more prominent, or make them accessible via a dedicated website. There is also a separate link to the data via the NIH website (https://plae.nei.nih.gov/), but this does not appear to be mentioned in the manuscript. Adding a "Code and Data Availability" section (or similar) with all links could be helpful.

Processed data objects are provided in Seurat and AnnData format. It would also be useful to provide SingleCellExperiment (Bioconductor) format, to further increase accessibility. However, this should not be considered a requirement for publication if the authors prefer Seurat and AnnData.

The term "hyper-variable genes" used in the manuscript is usually written as "highly variable genes" in the literature.

There is inconsistency in the text and headings about whether the integrated retina dataset contains 33 or 34 datasets.

The number of quality control passed cells mentioned on line 94 does not match Supplementary Figure 1.

On line 94, it would be useful to add a reference for the standard quality control metrics used.

The Zenodo link (line 274) would be easier to access if provided as a URL.

Double hyphens in command-line arguments (e.g. line 287) have been incorrectly formatted as long dashes.

Supplementary file "make_seurat_obj_functions.R" (line 317) is missing.

Declaration of competing interests

Please complete a declaration of competing interests, considering the following questions:

  • Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?

  • Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?

  • Do you hold or are you currently applying for any patents relating to the content of the manuscript?

  • Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?

  • Do you have any other financial competing interests?

  • Do you have any non-financial competing interests in relation to this paper?

If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews:

Reviewer comment will start with "#"

Author response will start with "-"

However, I consider that the comparison of the batch effect correction methods presented in the manuscript should be expanded by showing UMAP plots for the different methods tested. The authors should show figures similar to the benchmarks published in: Tran, H. T. N., Ang, K. S., Chevrier, M. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 21, 12 (2020). https://doi.org/10.1186/s13059-019-1850-9

  • I fully agree. Part of the reason this was not included in the first submission was that I hadn't tracked down the integration/UMAP files - which is, admittedly, a poor excuse. After your comment I spent more time looking and concluded that I must have panic-deleted them at some point to regain space on the compute cluster. So we bit the bullet and re-ran all the integration methods to re-create Figure 2a,b,c, and I have created three new supplementary files (PDFs) that show UMAPs from all comparisons in Figure 2c for which a UMAP run completed (we have had a recurring issue where the UMAP fails to complete for a small number of integration/normalization combinations, and I have been unable to resolve this despite a large amount of time invested). The results changed slightly because we changed the Louvain clustering k-nearest neighbors from 7 to 20 to match the current Seurat (and anndata) default; a minimal sketch of this setting follows this response.

The UMAP cell points are colored in three ways (hence the three files): known cell type, study (batch), and organism of origin. As we were concerned they would not reproduce well in the paper format, we provide these as supplementary files.

I have also added a citation for the paper you mentioned. Thanks - we had not come across this publication.
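For illustration, the clustering setting mentioned above could be reproduced with Seurat calls like the sketch below; the object name (`seu`), the reduction name, and the dimensions used are placeholders rather than the authors' actual pipeline parameters.

```r
# Minimal sketch (not the authors' pipeline code): Louvain clustering with
# k.param = 20 nearest neighbors, matching the current Seurat default.
library(Seurat)

# `seu` is a placeholder Seurat object assumed to already carry a
# low-dimensional embedding named "scVI" from an integration step.
seu <- FindNeighbors(seu, reduction = "scVI", dims = 1:30, k.param = 20)
seu <- FindClusters(seu, algorithm = 1)   # algorithm = 1 is Louvain
seu <- RunUMAP(seu, reduction = "scVI", dims = 1:30)
```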

To select a top-performing batch integration method (scVI) for generating the integrated retina dataset, the authors calculate a total of 8 performance measures (lines 130-131), and then average these (sumZScale) to give a final combined score, and rank the methods by this score. However, a simple average of multiple performance metrics in this manner seems quite arbitrary - each of the 8 metrics is effectively given equal weight. Could the authors provide some more detail on the sensitivity of the final ranking to the set of metrics used? For example, is the ranking consistent between the individual metrics? (Line 347 in Methods notes that the balance between metrics can be adjusted, with a default value of 1.)

  • This is an excellent point, and one we have internally argued about. Informally, our feeling was that the z-scale + sum approach was fairly robust, because when we were toying with the code we did not see dramatic changes. As feelings aren't really a valid way of doing research, we looked further into your comment in two ways:

  • We binned the metrics into three categories, depending on whether they measured batch mixing (e.g. LISI batch), cell type or cluster purity (e.g. ASW/silhouette - cell type), or balanced both (e.g. NMI). We then multiplied either the batch scores or the cluster/cell type purity scores by 3x (giving that group 3x more weight in the final sumZScale score). The results are shown in Supplementary Figure 3. We added the following text: "As the LISI and silhouette metrics provide independent cluster and cell type ("purity") and batch ("mixing") scores, we looked to see whether the sumZScale scoring is highly influenced by changing the weight placed on purity or mixing, by multiplying either by a factor of three (Supplementary Figure 3). While scVI still performs well no matter the weight chosen, there were some changes when more weight was placed on batch mixing. Most notably, fastMNN with 8 latent dimensions receives a higher score compared to fastMNN with 30 latent dimensions, and when a higher weight was placed on batch mixing, CCA with 30 latent dimensions has a higher final rank."

  • In a more extreme example, we multiplied each z-scaled metric by a random number between 0.1 and 10, repeated this 1000 times, extracted the final rank after summing all of the (randomly scaled) metrics, and plotted the distributions in Supplementary Figure 4. I think this figure demonstrates that the final rankings are fairly stable (because the distributions are fairly tight) despite aggressive scaling of the individual metrics.
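To make the two checks above concrete, here is a minimal, self-contained R sketch of the approach. This is not the scPOP implementation: the metric names, toy values, and object names are hypothetical, while the 3x weighting and the 0.1-10 random multipliers follow the description above.

```r
# Minimal sketch (not the scPOP code): z-scale each metric, apply weights,
# sum to a combined score, then test rank stability under random reweighting.
set.seed(1)

# Toy stand-in for the benchmark table: one row per integration run,
# one column per metric (hypothetical names).
metrics <- data.frame(
  lisi_batch = runif(10),   # batch mixing
  silhouette = runif(10),   # cell-type purity
  nmi        = runif(10),   # balances both
  ari        = runif(10)
)

z <- scale(as.matrix(metrics))           # z-scale each metric column

# Weighted sumZScale-style score, e.g. up-weight batch mixing by 3x.
weights <- c(lisi_batch = 3, silhouette = 1, nmi = 1, ari = 1)
score   <- as.vector(z %*% weights[colnames(z)])
rank_w  <- rank(-score)                  # rank 1 = best run

# Robustness check: multiply each z-scaled metric by a random factor in
# [0.1, 10], re-rank, repeat 1000 times, and inspect the rank distribution.
rank_dist <- replicate(1000, {
  w <- runif(ncol(z), min = 0.1, max = 10)
  rank(-as.vector(z %*% w))
})
# rank_dist is a runs x 1000 matrix; tight rows indicate stable rankings.
apply(rank_dist, 1, quantile, probs = c(0.05, 0.5, 0.95))
```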

The installation of the scPOP R package from GitHub requires compilation, which may not work easily on all systems (e.g. Windows). It would be useful to test this on multiple systems, and consider adding continuous integration (GitHub Actions or submitting to Bioconductor) to ensure that it can be installed by all users, or at least providing detailed installation instructions.

  • We have added further instructions on how to compile on multiple platforms, and we have submitted scPOP to CRAN.
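For readers who want to install the package, a minimal sketch follows; the CRAN package name is taken from the manuscript, while the GitHub path is a placeholder rather than a confirmed repository location.

```r
# Install the released version from CRAN (CRAN typically provides binary
# builds for Windows/macOS, avoiding local compilation):
install.packages("scPOP")

# Or install the development version from GitHub; "<owner>/scPOP" is a
# placeholder path, and this route needs build tools (e.g. Rtools on Windows).
# install.packages("remotes")
remotes::install_github("<owner>/scPOP")
```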

It is somewhat unclear where the main datasets (scEiaD) are stored. The processed data objects can be found by following the GitHub link (https://github.com/davemcg/scEiaD) and scrolling down on the readme. It would be useful to make the links to the data objects more prominent, or make them accessible via a dedicated website. There is also a separate link to the data via the NIH website (https://plae.nei.nih.gov/), but this does not appear to be mentioned in the manuscript. Adding a "Code and Data Availability" section (or similar) with all links could be helpful.

  • Agree - this is confusing. We have added a new Supplementary Table 5 that summarizes where to find the data (both from Zenodo, which is accessioned, and via direct download links from our institution's server).

Processed data objects are provided in Seurat and AnnData format. It would also be useful to provide SingleCellExperiment (Bioconductor) format, to further increase accessibility. However, this should not be considered a requirement for publication if the authors prefer Seurat and AnnData.

  • As you can convert a Seurat object to a SingleCellExperiment object in one line (as.SingleCellExperiment(Seurat_object)), we did not think it was necessary to provide download links for both R formats.
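For completeness, that one-line conversion looks like the sketch below; the RDS filename and object name are placeholders, and the SingleCellExperiment package must be installed for the conversion to work.

```r
library(Seurat)

# Load a downloaded Seurat object (placeholder filename) and convert it to
# a Bioconductor SingleCellExperiment in one step.
seurat_obj <- readRDS("scEiaD_seurat.rds")
sce <- as.SingleCellExperiment(seurat_obj)
sce
```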

The term "hyper-variable genes" used in the manuscript is usually written as "highly variable genes" in the literature.

  • Today I learned. This has been fixed.

There is inconsistency in the text and headings about whether the integrated retina dataset contains 33 or 34 datasets.

  • Yes. Good eye. This is a mistake. 33 is the correct number and has been corrected.

The number of quality control passed cells mentioned on line 94 does not match Supplementary Figure 1.

  • This was written poorly. This count (790k) is the sum of the QC-passed cells and the in silico doublets (i.e., BEFORE doublet removal). As this was confusing, we rewrote this to now say: "We then removed cells which had more than 10% mitochondrial reads across all gene counts, fewer than 200 unique genes quantified, or were identified as an in silico doublet (see methods). After these quality control (for review see ref. 46) steps we were left with 766,615 cells (Supplemental Figure 1)."
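As a rough illustration of the thresholds quoted above, a Seurat-style filter might look like the sketch below; the object name, the mitochondrial gene pattern, and the omitted doublet-calling step are assumptions rather than the authors' exact code.

```r
library(Seurat)

# Fraction of counts from mitochondrial genes; the "^MT-" pattern assumes
# human-style gene symbols and would differ for other species.
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-")

# Keep cells with at most 10% mitochondrial reads and at least 200 genes.
seu <- subset(seu, subset = percent.mt <= 10 & nFeature_RNA >= 200)

# In silico doublet removal would happen in a separate step (e.g. with a
# dedicated doublet-calling tool); it is not shown here.
```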

On line 94, it would be useful to add a reference for the standard quality control metrics used.

  • Added a citation for Luecken and Theis, "Current best practices in single-cell RNA-seq analysis: a tutorial".

The Zenodo link (line 274) would be easier to access if provided as a URL.

Double hyphens in command-line arguments (e.g. line 287) have been incorrectly formatted as long dashes.

  • Fixed

Supplementary file "make_seurat_obj_functions.R" (line 317) is missing.

  • Correct - this has now been attached, along with "merge_methods.R", which was also missing.

Source

    © 2021 the Reviewer (CC BY 4.0).

Content of review 2, reviewed on August 10, 2021

The authors have comprehensively addressed the previous set of comments, and I have no further comments.

Declaration of competing interests

Please complete a declaration of competing interests, considering the following questions:

  • Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?

  • Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?

  • Do you hold or are you currently applying for any patents relating to the content of the manuscript?

  • Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?

  • Do you have any other financial competing interests?

  • Do you have any non-financial competing interests in relation to this paper?

If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews:

Added a reference to CRAN for the scPOP package (accepted by CRAN a couple of weeks ago).

Fixed links to tables (removed references to the "hpc.nih.gov" download links in favor of the Zenodo links).

Removed the saucy comments from the merge_methods.R script.

Added the required sections (declarations, etc.) and moved one subsection as per the editor's request.

Source

    © 2021 the Reviewer (CC BY 4.0).

References

    Swamy, V. S., Fufa, T. D., Hufnagel, R. B., McGaughey, D. M. Building the mega single-cell transcriptome ocular meta-atlas. GigaScience.