Content of review 1, reviewed on February 19, 2020
The manuscript deals with the simulation of OTU tables, the basic data structures used in microbiome studies, using a generative adversarial network (GAN), a modern way to produce artificial data with properties similar to those of real data. Although GANs have already been applied to produce various kinds of data, they have never been used to simulate microbiomes before. Although the authors provide some potential implications, I do not feel that the presented tool, "MB-GAN", in its current form brings any substantial advance in simulating microbiomes that would be usable in practice. I found substantial limitations in the methodology, and I think additional simulation and verification are needed to support the conclusions. The manuscript is written clearly in very good English. Although the statistics used for verification of the simulated results are correct, they are not sufficient. Other aspects of the data should be considered, as the whole topic is much more complex. Because of the issues listed below, I recommend rejecting the manuscript in its current form.
Major issues:
In the section "Potential implications" you state "researchers can use MB-GAN to simulate microbiome abundances for a certain sample size, impose effect sizes on a subset of taxa for different phenotype groups". I have to disagree, as a detected microbial composition is highly influenced by sequencing depth. The estimated diversity reflects not only the differences between samples but also their size, because the number of species depends nonlinearly on the number of individuals in the sample. In microbial studies, the size of a sample is determined by the sequencing depth, which differs between samples even when the collection technique, library preparation procedure and other sequencing parameters are the same for all samples. As MB-GAN takes and also produces an OTU table of relative abundances, this information is completely lost. Therefore, one cannot use MB-GAN to simulate any size effects. To be able to do that, MB-GAN would need to work with rarefaction. In its current form, MB-GAN can be used to simulate only a dataset of the same size as the real one, and I think this is not enough for practical use. You might have considered this during the preparation of the OTU table for training MB-GAN, but according to the manuscript, you have not. Even if you had, you would need to prepare different OTU tables for a given dataset by resampling the data to the desired depth. This means you would be able to train the model for just one specific sequencing depth, which does not make sense for practical use.
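To illustrate the dependence of observed richness on sequencing depth, here is a minimal rarefaction sketch in Python. The sample counts and the function names `rarefy` and `richness` are hypothetical and are not taken from MB-GAN; the sketch only assumes that raw read counts per OTU are available.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from a vector of OTU counts."""
    rng = random.Random(seed)
    # Expand the counts into one entry per read, labelled by OTU index.
    reads = [otu for otu, n in enumerate(counts) for _ in range(n)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the sample")
    rarefied = [0] * len(counts)
    for otu in rng.sample(reads, depth):
        rarefied[otu] += 1
    return rarefied

def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for n in counts if n > 0)

# Hypothetical sample: a few dominant OTUs and many rare ones.
sample = [5000, 3000, 1000, 500] + [5] * 40 + [1] * 60

# Observed richness at several sequencing depths, then at full depth.
for depth in (100, 1000, 5000, sum(sample)):
    print(depth, richness(rarefy(sample, depth)))
```

Once the table is converted to relative abundances, the total read count per sample is gone, so this kind of resampling can no longer be performed on the output.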
You state that MB-GAN computes the weighted UniFrac distance. I do not understand this. I went through the MB-GAN GitHub repository, and MB-GAN takes an OTU table of relative abundances as input. One cannot compute a UniFrac distance without a phylogenetic tree. Therefore, you would need to provide MB-GAN either with the corresponding phylogenetic tree of OTUs inferred during OTU picking or with the sequences of the individual OTUs from which a tree can be inferred. I wonder what metric MB-GAN actually uses. It probably derives a taxonomy tree just from the groups/taxa assigned at different taxonomic levels. This is not applicable to real data, where many OTUs cannot be identified to the species level.
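To make explicit why a tree is required, here is a minimal sketch of the unnormalized weighted UniFrac distance on a toy tree, written in Python. The tree, its branch lengths, the OTU names and the function names are all hypothetical; this is not the MB-GAN implementation, only an illustration of the inputs the metric needs.

```python
def branch_proportions(tree, abundances):
    """Fraction of a sample's abundance that descends from each branch.

    `tree` maps node -> (parent, branch_length); the root has parent None.
    `abundances` maps leaf (OTU) -> count or relative abundance.
    """
    total = sum(abundances.values())
    props = {node: 0.0 for node in tree}
    for leaf, value in abundances.items():
        node = leaf
        # Walk from the leaf up to the root, crediting every branch on the way.
        while node is not None:
            props[node] += value / total
            node = tree[node][0]
    return props

def weighted_unifrac(tree, a, b):
    """Unnormalized weighted UniFrac: sum of branch_length * |p_a - p_b|."""
    pa = branch_proportions(tree, a)
    pb = branch_proportions(tree, b)
    return sum(length * abs(pa[n] - pb[n]) for n, (_, length) in tree.items())

# Hypothetical tree: otu1 and otu2 are close relatives, otu3 is distant.
tree = {
    "root": (None, 0.0),
    "n1": ("root", 1.0),
    "otu1": ("n1", 0.1),
    "otu2": ("n1", 0.1),
    "otu3": ("root", 2.0),
}
a = {"otu1": 1.0, "otu2": 0.0, "otu3": 0.0}
b = {"otu1": 0.0, "otu2": 1.0, "otu3": 0.0}
c = {"otu1": 0.0, "otu2": 0.0, "otu3": 1.0}

print(weighted_unifrac(tree, a, b))  # small: the two OTUs are close relatives
print(weighted_unifrac(tree, a, c))  # large: the OTUs are distantly related
```

The branch lengths enter every term of the sum, so an abundance table alone is not enough to evaluate the distance.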
In "Evaluation on sample-level properties" you state "Here, since the samples are compared based on their species level abundances with no additional taxonomy information, we employed the Bray-Curtis metric to calculate the two-dimension nMDS values". In real microbiome studies the whole phylogeny is considered, so this is very atypical. Later, you state that you used the Bray-Curtis metric on purpose because the model already tries to minimize UniFrac (although in fact, it probably does not). Still, it would be nice to also have PCoA (nMDS) UniFrac plots in the supplement; it seems interesting and important to show these.
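For reference, the Bray-Curtis dissimilarity used for the ordination can be sketched in a few lines of Python. The three samples below are hypothetical relative-abundance vectors, and the function names are illustrative only. The sketch shows that the metric compares abundances taxon by taxon and uses no information about how the taxa are related.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    den = sum(x) + sum(y)
    if den == 0:
        raise ValueError("both samples are empty")
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / den

def distance_matrix(rows):
    """Pairwise Bray-Curtis matrix, e.g. as input to PCoA or nMDS."""
    return [[bray_curtis(r1, r2) for r2 in rows] for r1 in rows]

# Hypothetical relative abundances of three taxa in three samples.
samples = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]

for row in distance_matrix(samples):
    print(row)
```

Reordering or relabelling the taxa leaves the matrix unchanged, which is why a phylogeny-aware ordination would be a useful complement.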
Minor issues:
"The original sequencing data from the fecal samples are available in the European Bioinformatics Institute database with accession code ERP002061" seems to be a rather odd formulation. The accession number given is from the ENA database. Just write "are available in the European Nucleotide Archive (ENA) database." Note that the ENA database is maintained by the EMBL-EBI consortium, not just EBI itself.
Supplementary Figure S1 (referred to in the paper as just Supplementary Figure 1) is the last one referenced in the paper; it would be nice to renumber the supplementary figures according to the order in which they are first referenced in the text.
Figure 3: Are the top 10% most abundant taxa the same for all datasets? Not necessarily, I think. It would help to add these correlation matrices to the supplement, including taxa information.
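The question about Figure 3 could be checked with something like the following Python sketch, which compares the sets of top 10% taxa between two datasets. The taxon names, abundance values and the helper `top_taxa` are hypothetical; the sketch assumes that a mean relative abundance per taxon is available for each dataset.

```python
def top_taxa(mean_abundance, fraction=0.10):
    """Names of the most abundant taxa, keeping the top `fraction` of them."""
    k = max(1, round(len(mean_abundance) * fraction))
    ranked = sorted(mean_abundance, key=mean_abundance.get, reverse=True)
    return set(ranked[:k])

# Hypothetical mean abundances for 20 taxa in the real dataset.
real = {f"taxon_{i:02d}": 1.0 / (i + 1) for i in range(20)}

# Hypothetical simulated dataset in which two taxa have swapped ranks.
simulated = dict(real)
simulated["taxon_01"], simulated["taxon_05"] = real["taxon_05"], real["taxon_01"]

shared = top_taxa(real) & top_taxa(simulated)
only_real = top_taxa(real) - top_taxa(simulated)
print(sorted(shared), sorted(only_real))
```

If the sets differ, the correlation matrices in the figure are not computed over the same taxa, so the taxa labels should be reported alongside them.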
Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.
I declare that I have no competing interests.
I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
Authors response to reviews: (https://drive.google.com/file/d/1UmBGLCVtqx5B5JNPBPDgSEL3_aQkVT3t/view?usp=sharing)
Source
© 2020 the Reviewer (CC BY 4.0).
Content of review 2, reviewed on October 07, 2020
The manuscript is a revised version of a formerly rejected manuscript that I reviewed earlier this year (as Reviewer #3), dealing with the simulation of OTU tables for microbial studies using a generative adversarial network (GAN). Since the authors have elaborated the analyses and addressed the issues raised during the previous review, I believe the manuscript is now worth publishing. Although I appreciate the changes made by the authors, I still have some concerns that should be addressed before the manuscript is finally accepted:
1) In response to my former major issue 1 you state "We have accordingly revised the sentence as following: researchers can use MB-GAN to simulate microbiome abundances for a certain sample size and impose the statistical effect sizes on a subset of taxa for different phenotype groups (Zhao et al. 2015)", but in the manuscript the sentence remains in its previous form. Please revise the sentence according to your response; I think it is truly more informative, and I understand your meaning now.
2) In the response to my former minor issue 5 you state that you have rearranged the supplementary figures according to the order in which they are first referenced in the main text, but this is not true. Figure S11 is referenced directly after Figure S3, then you reference Table S1, which is above Figure S11 in the supplement, and so on. So while reading the text, the reader has to scroll through the supplement searching for the correct items, which is inconvenient and might lead to misunderstandings.
3) On lines 286-289, describing Figures S6(a) and S7(a), and in fact for all other correlograms in the manuscript, you state "A blue ellipse represents a negative correlation, while a red one suggests a positive correlation." However, according to the scales next to the correlograms, blue represents positive correlations and red represents negative ones. I would rather recolor the figures than change the text, since with a red-blue mapping it is more common to use red for positive correlations.
4) Please update your GitHub repository. I was not able to find the revised code, nor the code to perform the metaSPARSim simulation. The manuscript should not be published before the code is available.
Declaration of competing interests
I declare that I have no competing interests.
I agree to the open peer review policy of the journal.
Source
© 2020 the Reviewer (CC BY 4.0).
References
Rong, R., Jiang, S., Xu, L., Xiao, G., Xie, Y., Liu, D. J., Li, Q., Zhan, X. MB-GAN: Microbiome Simulation via Generative Adversarial Network. GigaScience.