Content of review 1, reviewed on February 11, 2021
Wu et al. present a computational tool for classifying virome sequences as virulent or temperate phage-derived fragments. They constructed artificial contigs extracted from 127 complete phage genomes with lifestyle annotations and built a framework integrating one-hot encoding and convolutional neural networks for a binary classification task. The authors tested six different architectures, varying parameters such as the encoding representation and removing certain layers, on different groups of artificial contigs (binned by length) and found that their DeePhage architecture had the best overall performance. The authors then used their method to predict sequences in metavirome data from bovine rumen and from an ulcerative colitis (UC) study as applications to real metavirome data. Throughout the manuscript the authors use the previously published tool Phage Classification Tool Set (PHACTS) as a point of comparison, but in fact use the same dataset for addressing essentially the same problem. The novelty proposed here appears to be that DeePhage is a read- or contig-based approach adapted for metagenomic fragments.
(1) While each individual step is acceptable, there is a high degree of abstraction and redundancy involved in the study, including the use of artificial contigs (are there real-world annotated virome datasets that could be used as inputs?) and the same dataset as PHACTS (are there any other high-confidence annotations?). The authors should systematically review dataset construction strategies and test different input setups where possible.
(2) The authors used PCA on 4-mer frequencies (Figure 1) to show differences in sequence signatures, but the variance captured by the PCs is not striking, and this approach is also not statistically rigorous. On top of that, 4-mer frequencies are not used as an input to the model. The authors should perform statistical analyses on the input data to their ML model to show significant differences.
(3) When comparing PHACTS and DeePhage on metavirome data from bovine rumen, the authors generated a mini temperate phage-derived protein set to produce a surrogate set of "temperate" contigs. However, only 16/118918 contigs had homology. The authors should test other real-world datasets to see whether this is a general feature of metavirome data or just characteristic of bovine rumen. Furthermore, they should test their tool on the contigs without homology to assess false positive rates.
(4) The authors analyze virome data from ulcerative colitis (UC) patients and healthy people to find associations between phages and gut microbiota, but should describe the utility and biological correlates of their analyses more clearly and also ideally demonstrate them on another real-world dataset.
(5) Can the authors use DeePhage to make inferences in a species-aware manner?
(6) Are there any sequence-based features learned throughout the training process (not just separation of labels)?
(7) Minor - the manuscript could use some proof-reading throughout, e.g. "following" to "followed" on line 71, "making" to "make" on line 75.
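For concreteness on point (2): a 4-mer frequency profile is simply the normalized count of each of the 256 possible 4-mers along a sequence, which can then be fed to PCA or a statistical test. A minimal, hypothetical sketch (pure Python; `kmer_freq` is an illustrative helper, not the authors' code):

```python
from collections import Counter
from itertools import product

def kmer_freq(seq, k=4):
    """Return the k-mer frequency vector (length 4**k) of a DNA sequence.
    Illustrative sketch only; the authors' pipeline may differ."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = max(1, len(seq) - k + 1)
    return [counts[m] / total for m in kmers]

profile = kmer_freq("ACGTACGTACGTACGT")
assert len(profile) == 256            # one entry per possible 4-mer
assert abs(sum(profile) - 1.0) < 1e-9  # frequencies sum to 1
```

Such vectors, computed per genome or contig, are what a statistical comparison (e.g. a per-feature test with multiple-testing correction) would operate on, rather than PCA projections alone.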
Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests.
I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
Authors' response to reviews: (https://drive.google.com/file/d/1J-yPA1K-IqclMnz16tU9L0HlZLs5NHVF/view?usp=sharing)
Source
© 2021 the Reviewer (CC BY 4.0).
Content of review 2, reviewed on June 25, 2021
Overall, the authors have developed a deep-learning-based tool for classifying phage genome fragments as virulent or temperate, with an expanded dataset and several new analyses in this revision.
First, there is no point-by-point response to the previous round of reviewer comments, which is essential. From reading R1 itself, it appears the authors made efforts to address the previous comments, but this is not elaborated in the response.
While the tool may be useful, there are still some reservations regarding the quality of the data for both training and downstream analyses. The reviewer understands that some of these limitations are due to the data currently available.
Specific comments below:
(1) The authors discuss the difference between DeePhage and PHACTS, emphasizing that their tool focuses on metagenomic fragments. They also introduce discussion of and comparisons to the new tool PhagePred. The reviewer acknowledges the value of developing a classifier for phage lifestyle based on metagenomic data, but the main concern was that the dataset (e.g. phage genomes and lifestyle annotations) overlapped exactly with the McNair dataset. To address this point, the authors curated another dataset (termed "Dataset-2") that included more phage RefSeq genomes from the NCBI database. The authors write that they "used all Dataset-2 and four fifths of Dataset-1 as the training set, and one fifth of Dataset-1 as the test set" - how was the input dataset partitioned? Was the test set selected randomly? Was it selected by genome or by short sequences? Notably, "Additional File 2" does not label genomes by whether they were used for the test or training set. If different partitions were used for cross-validation and the full dataset for the final model, this should be stated explicitly. In fact, it would be helpful if the authors provided figures detailing basic statistics (e.g. labels, partitions, lineages) of their datasets.
(2) Considering the significant difference in size and nature between Dataset-1 and Dataset-2, and therefore between the initial pipeline and the revised one, the authors should compare (at least internally) and clarify which dataset was used for which analyses. Notably: have the entire pipeline, the subsequent analyses (e.g. PCA), and the GitHub implementation described in the main text been updated with the combined dataset? How does the performance change between Dataset-1 only and the combined dataset?
(3) The authors write "We have tested that the traditional k-mer frequency encoding form was not superior to the one-hot encoding form." - can they elaborate on this?
(4) The authors find differential category distributions according to the COG database for proteins predicted from virulent or temperate phages - the authors should contextualize these findings with what is known in the literature.
Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests.
I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
Authors' response to reviews: To Reviewer 1: General Comments: "The manuscript was much improved from the last one. The response to the reviewer's concerns has greatly strengthened the manuscript. I have a few additional (unimportant) recommendations as follows." We thank Reviewer 1 for his/her careful reading and efforts to improve our manuscript. We are glad that our previous responses adequately addressed all the points raised by Reviewer 1. The positive comment that "the manuscript was much improved from the last one" and that "the response to the reviewer's concerns has greatly strengthened the manuscript" is really encouraging. The remaining comments raised by Reviewer 1 will help greatly in improving the quality of the manuscript. Below, we itemize our revisions in response to Reviewer 1's points.
Minor comments: Minor comment 1: Table 2. "Hits length" -> "Hit length (bp)". We thank Reviewer 1 for this comment. We have changed "Hits length" to "Hit length (bp)" in Table 2 in the revised manuscript. (Please refer to Line 419, Page 15 in the revised manuscript.)
Minor comment 2: Figure S1. Check upper/lower cases in labels of x- and y-axes. "First Principal Component" -> "First principal component", and "Second Principal Component" -> "Second principal component". We thank Reviewer 1 for noting the typos in Figure S1. In Figure S1, we have corrected "First Principal Component" to "First principal component" and "Second Principal Component" to "Second principal component". (Please refer to Figure S1 in the revised Additional File 1.)
Minor comment 3: Additional file 2. Taxonomic columns should be sorted in decreasing order (i.e., Kingdom, Phylum, Class, Order, Family, Genus, Species). We thank Reviewer 1 for reminding us to sort the taxonomic columns in decreasing order. In the revised Additional File 2, we have sorted all phage and host taxonomic columns in decreasing order. (Please refer to the revised Additional File 2.)
To Reviewer 2: General Comments: "Overall, the authors have developed a deep learning based tool for classifying phage genome fragments as virulent or temperate, with an expanded dataset and several new analyses in this revision. First, there is no point-by-point response to the previous round of reviewer comments, which is essential. From reading R1 itself it looks like the authors made efforts to address previous comments, however it's not elaborated in the response. While the tool may be useful, there are still some reservations regarding the quality of the data both for training and downstream analyses purposes. The reviewer understands that some of the limitations are due to the limitations in the data currently available." We thank Reviewer 2 for the comments on the last round of review responses. We are pleased to receive the positive judgement that "the authors have developed a deep learning based tool for classifying phage genome fragments as virulent or temperate, with an expanded dataset and several new analyses in this revision". Also, we appreciate that Reviewer 2 understood that some of our limitations are owing to limitations of the currently available data. We believe such limitations will be alleviated by the future development of high-quality data. In addition, we did our best to present our answers to all of Reviewer 2's concerns point-by-point in the last revision; although Reviewer 2 mentioned that "there is no point-by-point response to the previous round of reviewer comments", the Editor noticed that there might be some misunderstanding and emailed our last response to Reviewer 2 separately. We thank both the Editor and Reviewer 2 for considering our work seriously and responsibly. We report our responses to Reviewer 2's further comments as follows.
Specific comment 1: "The authors discuss the difference between DeePhage and PHACTS, emphasizing that their tool focuses on metagenomic fragments. They also introduce discussions about and comparisons to the new tool PhagePred. The reviewer acknowledges the value of developing a classifier for phage lifestyle based on metagenomic data, but the main concern was that the dataset (e.g. phage genomes and lifestyle annotations) overlapped exactly with the McNair dataset. To this point, the authors curated another dataset (termed 'Dataset-2') that included more phage RefSeq genomes from the NCBI database. The authors write that they 'used all Dataset-2 and four fifths of Dataset-1 as the training set, and one fifth of Dataset-1 as the test set' - how was the input dataset partitioned? Was the test set selected randomly? Was it selected by genome or by short sequences? Notably, 'Additional File 2' does not label genomes by whether they were used for the test or training set. If different partitions were used for cross-validation and the full dataset for the final model, this should be stated explicitly. In fact, it would be helpful if the authors provided figures detailing basic statistics (e.g. labels, partitions, lineages, etc.) of their datasets." We are glad that Reviewer 2 acknowledges the value of our work, which differs from PHACTS as we noted in the last round of review responses. However, Reviewer 2 was confused about the usage of the new dataset (Dataset-2), and we realize that we had not clarified this important point. Therefore, we describe our data processing in detail, point by point. (1) "How was the input dataset partitioned? Was the test set selected randomly? Was it selected by genome or by short sequences?" All of Dataset-2 and four fifths of Dataset-1 were used as the training set, and one fifth of Dataset-1 was used as the test set. In detail, the partition of Dataset-1 into training and test sets was made randomly by genome.
We performed five-fold cross validation as follows. Dataset-1 (225 phage genomes) was randomly separated into five parts. Each part was used as the test set in one cross validation, with the remaining parts plus all of Dataset-2 used as the training set. Statistically, there are nearly 1,820 phage genomes for training (180 from four fifths of Dataset-1 and 1,640 from all of Dataset-2) and 45 phage genomes (one fifth of Dataset-1) for testing in each rotation. Thus, each phage genome in Dataset-1 is used in the test set exactly once. The MetaSim software was then used to simulate 80,000 and 20,000 short fragments for training and testing, respectively. To clarify this point, we revised the sentence "Therefore, we used all phages of Dataset-2 and four-fifth phages of Dataset-1 to be the training set, and one-fifth phages of Dataset-1 to be the test set." to "Therefore, based on genomes, we used all phages of Dataset-2 and four-fifth phages of Dataset-1 to be the training set, and one-fifth phages of Dataset-1 to be the test set." (Please refer to Line 146-148, Page 5 in the revised manuscript.) More information about the data processing is described in the manuscript as "To test whether DeePhage can distinguish the lifestyle of novel phages or not, for each validation, we divided the training set and the test set based on complete genomes rather than artificial contigs, and then simulated 80,000 training sequences and 20,000 test ones using MetaSim [26]." (Please refer to Line 293-296, Page 11 in the revised manuscript.) (2) "'Additional File 2' does not label genomes by whether they were used for the test or training set." To be clearer and in accordance with the Reviewer's concerns, we added a usage label ("Training usage" or "Test usage") for each phage genome in Additional File 2. Usage labels indicate in which cross validation the phage genome was used in the training or test set.
We added the sentence "Moreover, a usage label (Training usage or Test usage) was used to indicate in which cross validation the phage genome was used for training or test set in Additional File 2." (Please refer to Line 149-151, Page 6 in the revised manuscript.) Also, we added a usage-label column for each phage genome in Additional File 2. (Please refer to the revised Additional File 2.) (3) "It would be helpful if the authors provided figures detailing basic statistics (e.g. labels, partitions, lineages, etc.) of their datasets." Label and partition information has been added to Additional File 2 as described in items (1) and (2). As for lineages, our previous manuscript already included the lineage information of each phage genome in Additional File 2. Also, we have sorted all phage and host taxonomic columns in decreasing order, following the suggestion of Reviewer 1. (Please refer to the revised Additional File 2.)
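The genome-level partitioning described above, where every fragment derived from a genome lands in the same fold, can be sketched as follows (a hypothetical illustration, not the authors' code; `genome_level_folds` is an assumed helper name):

```python
import random

def genome_level_folds(genome_ids, k=5, seed=0):
    """Split genomes (not fragments) into k folds, so that all artificial
    contigs simulated from a genome stay in the same fold. Illustrative
    sketch only; the authors' actual partitioning code may differ."""
    ids = list(genome_ids)
    random.Random(seed).shuffle(ids)       # reproducible random split
    return [ids[i::k] for i in range(k)]   # round-robin into k folds

genomes = [f"phage_{i}" for i in range(225)]  # Dataset-1 holds 225 genomes
folds = genome_level_folds(genomes, k=5)
assert [len(f) for f in folds] == [45] * 5               # one fifth per fold
assert len({g for f in folds for g in f}) == 225         # each genome exactly once
```

In each rotation, one fold would serve as the test set while the other four folds plus the full Dataset-2 form the training pool from which fragments are simulated.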
Specific comment 2: "Considering both the significant difference in size and nature of Dataset-1 and 2, and therefore the initial pipeline and the revised one, the authors should compare (at least internally) and clarify which dataset was used for which analyses. Notably: have the entire pipeline, subsequent analyses (e.g. PCAs), and GitHub implementations described in the main text been updated with the combined dataset? How does the performance change between Dataset-1 only and the combined dataset?" The PCA analysis was not updated, since it still uses phage genomes only from Dataset-1. To be clearer, we revised the phrase "among virulent phage genomes and temperate phage genomes" to "among virulent phage genomes and temperate phage genomes from Dataset-1". (Please refer to Line 168-169, Page 6 in the revised manuscript.) Internally, DeePhage's performance on the five-fold cross validation improved with the combined dataset compared with Dataset-1 alone: in each length group, the Acc criterion increased by 1-2%, indicating that expanding the dataset is useful. For the remaining analyses, the performance of DeePhage with the combined dataset changed only slightly, with no effect on any conclusion. Thus, we updated all results in the current manuscript using the combined dataset, except for the PCA analysis. The updated results include the five-fold cross validation, all phage CDS sequences, the bovine rumen metavirome dataset, the metagenomes/metaviromes from healthy individuals and ulcerative colitis patients, and so on. These changes were also reported in the response to Reviewer 2's comment #1 in the last round of review responses.
Specific comment 3: "The authors write 'We have tested that the traditional k-mer frequency encoding form was not superior to the one-hot encoding form.' - can they elaborate on this?" As shown in Figure S2 of the previous Additional File 1, we designed a Kmer-4 model, which used k-mer frequencies as the encoding representation; the architecture of the Kmer-4 model was similar to that of DeePhage except for the input layer. We also showed that the Kmer-4 model performed poorly, as shown in Table S1 of the previous Additional File 1. Thus, to be clearer, we revised the sentence "We have tested that the traditional k-mer frequency encoding form was not superior to the one-hot encoding form." to "We have tested that the traditional k-mer frequency encoding form was not superior to the one-hot encoding form, since Kmer-4 model did a terrible prediction as shown in Additional File 1 (Table S1)." (Please refer to Line 536-537, Page 19 in the revised manuscript.)
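For readers comparing the two encoding forms: one-hot encoding preserves positional information (one 4-dimensional vector per base), whereas a k-mer frequency vector collapses the sequence into a single composition profile. A minimal sketch of the one-hot form (assumed channel order A/C/G/T; the authors' exact encoder may differ):

```python
def one_hot(seq):
    """One-hot encode a DNA sequence as a list of 4-dim vectors
    (channels A, C, G, T); ambiguous bases map to an all-zero row.
    Illustrative sketch only, not the authors' implementation."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]

mat = one_hot("ACGN")
assert mat == [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
```

The resulting length-by-4 matrix is the kind of input a convolutional layer can scan for local motifs, which a fixed-length k-mer frequency vector cannot expose.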
Specific comment 4: "The authors find differential category distribution according to the COG database for proteins predicted for virulent or temperate phages - the authors should contextualize their findings with what is known in the literature." We thank Reviewer 2 for this suggestion. As shown in Figure S4, the biggest difference between virulent and temperate protein sequences lies in the "Mobilome: prophages, transposons" category. Temperate phages can exist as prophages and mediate horizontal gene transfer via transduction, while such genetic exchanges may be rare for virulent phages. Thus, this difference can be comprehended and evidently learned by DeePhage. We added the sentence "As we can see in Figure S4, the biggest difference between virulent and temperate protein sequences is located at Mobilome: prophages, transposons category. Temperate phages could exist as prophages and mediate horizontal gene transfer via transduction [51], while such genetic exchanges might be rare when involving virulent phages [52]. Thus, such difference could be comprehended and obviously learned by DeePhage." to the revised manuscript. (Please refer to Line 584-588, Page 21 in the revised manuscript.)
Source
© 2021 the Reviewer (CC BY 4.0).
References
Wu, S., Fang, Z., Tan, J., Li, M., Wang, C., Guo, Q., Xu, C., Jiang, X., Zhu, H. DeePhage: distinguishing virulent and temperate phage-derived sequences in metavirome data with a deep learning approach. GigaScience.
