Content of review 1, reviewed on January 25, 2021
In this manuscript, Wu et al. develop a novel bioinformatics tool, DeePhage, for classifying viral genomes into virulent and temperate phage types in metagenomic settings. The classification is based on a deep learning method and showed better performance (i.e., accuracy and calculation speed) than PHACTS, a similar tool that focuses on near-complete viral genomes and is unsuitable for metagenomic sequencing reads. In addition, applying DeePhage to real metagenomic and metaviromic datasets yielded several interesting findings.
The manuscript is overall well written. The purpose of this study is also important, because the viral lifecycle is almost always overlooked in genomic analyses of environmental microbes and viruses. While I am not a specialist in machine learning, the developed method seems typical and reasonable. In addition to the manuscript, the online manual and GitHub page are well prepared and should facilitate easy usage by researchers, including bioinformatics beginners. Although further improvement of performance and evaluation of the analysis will be required in future work, mainly because of the insufficient training data currently available, the novel tool could help shed light on viral ecology in nature, and thus the study has potential impact on wide fields of microbiology and virology.
I have several suggestions, both major and minor, that if addressed would increase the clarity and impact of this manuscript.
Major comments:
In this study, a total of 225 annotated viral genomes were used as training data. One of my concerns is that this training set is likely still too small for reliable classification. I am also concerned that the dataset may be composed of unevenly selected genomes; for example, were some specific viral lineages disproportionately abundant in the dataset? The authors should provide a list of the viral genomes used (perhaps in the same form as the McNair dataset), with descriptions and statistics where possible (e.g., host lineage, viral lineage, genome size, etc.), in this paper or on the GitHub site. This would not only document the analysis but also allow others to maintain, improve, and expand the training dataset. Because such database biases can affect downstream analysis and method evaluation, similar to the unevenness of positive/negative samples (as described in L136-143), it would be better to add more explanation of the potential database biases and of the limits on classification accuracy/sensitivity when using the current training dataset.
L315-320. I think this cutoff is useful in many real analyses, because 'reliable' results are more valuable in scientific studies than results mixed with uncertain classifications. Could the authors provide a recommended cutoff value and set it as a default option of the tool? Why was the 'uncertain' label not used in the downstream analyses in this study? Besides, I am also interested in the distribution of the score when the tool is applied to each dataset: the artificial virome, rumen virome, human gut metagenome, and human gut virome. A bipolar distribution (i.e., scores near 1 or near 0) might be expected in a bar plot of score vs. number of contigs, right? In contrast, I presume that in the analysis of RefSeq bacterial genomes (L521-543), most scores were close to 0.5, because bacterial reads will not be classified into either the virulent or temperate type with higher reliability than actual viral contigs. Is this correct?
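To illustrate the kind of score cutoff I have in mind, here is a minimal Python sketch (the 0.3/0.7 thresholds and the function name are hypothetical examples, not values from the manuscript or the tool's defaults):

```python
# Hypothetical three-way labelling of DeePhage-style scores, where a score
# near 1 indicates virulent and near 0 indicates temperate. The 0.3/0.7
# cutoffs below are purely illustrative, not the tool's actual defaults.
def label_score(score, lower=0.3, upper=0.7):
    if score >= upper:
        return "virulent"
    if score <= lower:
        return "temperate"
    return "uncertain"

scores = [0.95, 0.08, 0.52, 0.71]
print([label_score(s) for s in scores])
# -> ['virulent', 'temperate', 'uncertain', 'virulent']
```

Reporting the 'uncertain' fraction per dataset would also directly answer my question about the score distribution.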
Minor comments:
Title. Because the main result of this study is the development of the tool, I think it would be better to add the term 'DeePhage' to the title to clarify this.
L41. Add a citation. ([1]?)
L44. What does 'untargeted' mean? Is this extra term necessary in this manuscript?
L68-69. Some prophages are known as 'plasmid prophages', which are not integrated into the host chromosome but instead exist independently, like a plasmid, in the host cell (e.g., https://www.sciencedirect.com/science/article/pii/S0042682217304245).
L162-165. The text states 'they have different sequence signature', but it seems rather hard to separate virulent and temperate phage genomes in Figure 1. PCA often fails to visualize clear separation between categories, and this statement could be misleading. Hence I recommend moving Figure 1 to the supplementary information and removing the latter part of this sentence (', showing that ~ phage genomes' in L164-165).
L352-359. Was this performance measured using one CPU thread? This is a comment and is not required for this manuscript, but I think an additional option for parallel calculation with multiple threads would be much more convenient for most users, because typical metagenomic/virome datasets contain numbers of sequencing reads that incur great computational cost in these analyses.
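As a sketch of the parallel option I have in mind, contigs could be scored concurrently with a worker pool; note that `score_contig` below is only a stand-in (here it just computes GC content), not DeePhage's actual prediction routine:

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder per-contig scoring function; a real implementation would call
# the tool's prediction on each sequence instead of computing GC content.
def score_contig(seq):
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

contigs = ["ACGT" * 50, "GGCC" * 50, "ATAT" * 50]

# Score contigs concurrently across several worker threads.
with ThreadPoolExecutor(max_workers=4) as ex:
    scores = list(ex.map(score_contig, contigs))

print(scores)
```

For CPU-bound deep learning inference, batching sequences per worker process (rather than per thread) would likely give the real speed-up.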
L408. The comparative analysis of the human gut metagenome and virome datasets is complicated. An additional figure of the analysis scheme and/or a summary table of the results would greatly improve readability.
L533. How were the 120 genomes chosen from RefSeq? Random selection?
Figure 3. Why does the y-axis start from 40% instead of 0%? This could give a false impression to readers.
Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests.
I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
Authors' response to reviews: (https://drive.google.com/file/d/1J-yPA1K-IqclMnz16tU9L0HlZLs5NHVF/view?usp=sharing)
© 2021 the Reviewer (CC BY 4.0).
Content of review 2, reviewed on May 21, 2021
The manuscript was much improved from the last one. The response to the reviewer's concerns has greatly strengthened the manuscript. I have a few additional (unimportant) recommendations as follows.
Minor:
Table 2. "Hits length" -> "Hit length (bp)".
Figure S1. Check upper/lower cases in the labels of the x- and y-axes: "First Principal Component" -> "First principal component", and "Second Principal Component" -> "Second principal component".
Additional file 2. The taxonomic columns should be sorted in decreasing order (i.e., Kingdom, Phylum, Class, Order, Family, Genus, Species).
Authors' response to reviews:
To Reviewer 1: General Comments: The manuscript was much improved from the last one. The response to the reviewer's concerns has greatly strengthened the manuscript. I have a few additional (unimportant) recommendations as follows.
We thank Reviewer 1 for his/her careful reading and efforts to improve our manuscript. We are glad to see that our previous responses adequately addressed all the points raised by Reviewer 1. The positive comment that "The manuscript was much improved from the last one. The response to the reviewer's concerns has greatly strengthened the manuscript" is really encouraging. The remaining comments raised by Reviewer 1 will help a lot in improving the quality of the manuscript. Below, we itemize our revisions in response to Reviewer 1's points.
Minor comment 1: Table 2. Hits length -> Hit length (bp)
We thank Reviewer 1 for this comment. We have changed "Hits length" to "Hit length (bp)" in Table 2 in the revised manuscript. (Please refer to Line 419, Page 15 in the revised manuscript.)
Minor comment 2: Figure S1. Check upper/lower cases in labels of x- and y-axes. "First Principal Component" -> "First principal component", and "Second Principal Component" -> "Second principal component"
We thank Reviewer 1 for noting the typos in Figure S1. In Figure S1, we have corrected "First Principal Component" to "First principal component" and "Second Principal Component" to "Second principal component". (Please refer to Figure S1 in the revised Additional File 1.)
Minor comment 3: Additional file 2. Taxonomic columns would be sorted in decreasing order (i.e., Kingdom, Phylum, Class, Order, Family, Genus, Species).
We thank Reviewer 1 for reminding us to sort the taxonomic columns in decreasing order. In the revised Additional File 2, we have sorted all the phage and host taxonomic columns in decreasing order. (Please refer to the revised Additional File 2.)
To Reviewer 2: General Comments: Overall, the authors have developed a deep learning based tool for classifying phage genome fragments as virulent or temperate, with an expanded dataset and several new analyses in this revision. The authors also perform additional analysis. First, there is no point-by-point response to the previous round of reviewer's comments, which is essential. From reading R1 itself it looks like the authors made efforts to address previous comments; however, this is not elaborated in the response. While the tool may be useful, there are still some reservations regarding the quality of the data, both for training and for downstream analyses. The reviewer understands that some of the limitations are due to limitations in the data currently available.
We thank Reviewer 2 for the comments on the last round of review responses. We are pleased to receive the positive judgement that overall we have developed a deep learning based tool for classifying phage genome fragments as virulent or temperate, with an expanded dataset and several new analyses in this revision. Also, we appreciate that Reviewer 2 understood that some of our limitations are owing to data limitations. We believe that these limitations will be mitigated by the future development of high-quality data. In addition, we tried our best to present our answers to all of Reviewer 2's concerns point by point in the last revision; although Reviewer 2 mentioned that "there is no point-by-point response to previous round of reviewer's comments", the Editor noticed that there might be some misunderstanding and emailed our last response to Reviewer 2 separately. Herein we thank both the Editor and Reviewer 2 for considering our work seriously and responsibly. We report our responses to the Reviewer's further comments as follows.
Specific comment 1: The authors discuss the difference between DeePhage and PHACTS, emphasizing the focus of their tool on metagenomic fragments. They also introduce discussions about and comparisons to the new tool PhagePred. The reviewer acknowledges the value of developing a classifier for phage lifestyle based on metagenomic data, but the main concern was that the dataset (e.g. phage genomes and lifestyle annotations) overlapped exactly with the McNair dataset. To this point, the authors curated another dataset (termed "Dataset-2") that included more phage RefSeq genomes from the NCBI database. The authors write that they "used all Dataset-2 and four fifths of Dataset-1 as the training set, and one fifth of Dataset-1 as the test set" - how was the input dataset partitioned? Was the test set selected randomly? Was it selected by genome or by short sequences? Notably, "Additional File 2" does not label genomes by whether they were used in the test or training set. If different partitions were used for cross-validation and the full dataset was used for the final model, this should be stated explicitly. In fact, it would be helpful if the authors provided figures detailing basic statistics (e.g. labels, partitions, lineages, etc.) of their datasets.
We are glad to hear that Reviewer 2 acknowledges the value of our work, which differs from PHACTS as we noted in the last round of review responses. However, Reviewer 2 was confused about the usage of the new dataset (Dataset-2), and we realize that we had not clarified this important point. Therefore, we describe our data processing in detail, point by point. (1) "How was the input dataset partitioned? Was the test set selected randomly? Was it selected by genome or by short sequences?" All of Dataset-2 and four fifths of Dataset-1 were used as the training set, and one fifth of Dataset-1 was used as the test set. In detail, the partition of Dataset-1 into training and test sets was made randomly, by genome.
We did a five-fold cross-validation as follows. Dataset-1 (225 phage genomes) was randomly separated into five parts. Each part was in turn used as the test set in one cross-validation round, with the rest used as the training set. Statistically, there were about 1,820 phage genomes for training (180 from four fifths of Dataset-1 and 1,640 from all of Dataset-2) and 45 phage genomes (one fifth of Dataset-1) for testing in each rotation. Thus, each phage genome in Dataset-1 was used in the test set exactly once. The MetaSim software was then used to simulate 80,000 and 20,000 short fragments for training and testing, respectively. To clarify this point, we revised the sentence "Therefore, we used all phages of Dataset-2 and four-fifth phages of Dataset-1 to be the training set, and one-fifth phages of Dataset-1 to be the test set." to "Therefore, based on genomes, we used all phages of Dataset-2 and four-fifth phages of Dataset-1 to be the training set, and one-fifth phages of Dataset-1 to be the test set." (Please refer to Line 146-148, Page 5 in the revised manuscript.) More information about the data processing is described in the manuscript as "To test whether DeePhage can distinguish the lifestyle of novel phages or not, for each validation, we divided the training set and the test set based on complete genomes rather than artificial contigs, and then simulated 80,000 training sequences and 20,000 test ones using MetaSim [26]." (Please refer to Line 293-296, Page 11 in the revised manuscript.) (2) ""Additional File 2" does not have genomes labelled by whether they were used for the test or training class." To be clearer, and in accordance with the Reviewer's concern, we added a usage label ("Training usage" or "Test usage") for each phage genome in Additional File 2. Usage labels indicate in which cross-validation round the phage genome was used in the training or test set.
We added the sentence "Moreover, a usage label (Training usage or Test usage) was used to indicate in which cross validation the phage genome was used for training or test set" regarding Additional File 2. (Please refer to Line 149-151, Page 6 in the revised manuscript.) Also, we added the usage-label column for each phage genome in Additional File 2. (Please refer to the revised Additional File 2.) (3) "It would be helpful if the authors provide figures detailing basic statistics (e.g. labels, partitions, lineages, etc.) on their datasets." Label and partition information has been added to Additional File 2 according to items (1) and (2). As for lineage information, our previous manuscript already included the lineage of each phage genome in Additional File 2. Also, we have sorted all the phage and host taxonomic columns in decreasing order following the suggestion of Reviewer 1. (Please refer to the revised Additional File 2.)
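For illustration, the genome-level five-fold partition described above can be sketched as follows (a simplified Python sketch; the genome identifiers are placeholders, not the actual Dataset-1 accessions, and our actual code may differ in detail):

```python
import random

# Sketch of genome-level five-fold partitioning: genomes (not fragments)
# are shuffled and split into five parts, so all fragments simulated from
# one genome fall entirely in either the training or the test set.
def five_fold_by_genome(genome_ids, n_folds=5, seed=0):
    ids = list(genome_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_set = folds[k]
        train_set = [g for j, fold in enumerate(folds) if j != k for g in fold]
        yield train_set, test_set

# Placeholder IDs standing in for the 225 Dataset-1 genomes.
genomes = [f"phage_{i:03d}" for i in range(225)]
for train_set, test_set in five_fold_by_genome(genomes):
    assert set(train_set).isdisjoint(test_set)
    print(len(train_set), len(test_set))  # 180 45, in every rotation
```

In the actual pipeline, all Dataset-2 genomes are additionally appended to each training fold before fragment simulation with MetaSim.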
Specific comment 2: Considering the significant difference in size and nature between Dataset-1 and Dataset-2, and therefore between the initial pipeline and the revised one, the authors should compare (at least internally) and clarify which dataset was used for which analyses. Notably: have the entire pipeline, the subsequent analyses (e.g. PCA), and the GitHub implementation described in the main text been updated with the combined dataset? How does the performance change between Dataset-1 only and the combined dataset?
The PCA analysis was not updated, since we still used phage genomes only from Dataset-1. To be clearer, we revised the phrase ", among virulent phage genomes and temperate phage genomes" to ", among virulent phage genomes and temperate phage genomes from Dataset-1". (Please refer to Line 168-169, Page 6 in the revised manuscript.) Internally, DeePhage's performance on the five-fold cross-validation set improved after using the combined dataset, compared with using only Dataset-1: in each length group, the Acc criterion increased by 1-2%. This indicates that expanding the dataset is useful. For the remaining analyses, the performance of DeePhage using the combined dataset changed only slightly, with no effect on any conclusion. Thus, we updated all the results in the current manuscript using the combined dataset, except for the PCA analysis. The updated results include the analyses of five-fold cross-validation, all phage CDS sequences, the metavirome from the bovine rumen dataset, the metagenomes/metaviromes from healthy individuals and ulcerative colitis patients, and so on. These changes were also reported in the response to Reviewer 2's comment #1 in the last round of review responses.
Specific comment 3: The authors write "We have tested that the traditional k-mer frequency encoding form was not superior to the one-hot encoding form." - can they elaborate on this?
As shown in Figure S2 of the previous Additional File 1, we designed a Kmer-4 model that used k-mer frequencies as the encoding representation; the architecture of the Kmer-4 model was similar to that of DeePhage except for the input layer. We also showed that the Kmer-4 model performed poorly, as presented in Table S1 of the previous Additional File 1. Thus, to be clearer, we revised the sentence "We have tested that the traditional k-mer frequency encoding form was not superior to the one-hot encoding form." to "We have tested that the traditional k-mer frequency encoding form was not superior to the one-hot encoding form, since Kmer-4 model did a terrible prediction as shown in Additional File 1 (Table S1)." (Please refer to Line 536-537, Page 19 in the revised manuscript.)
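For illustration, the two encoding forms compared here can be sketched as follows (a toy Python example; the actual input length and implementation details of DeePhage and the Kmer-4 model may differ):

```python
from itertools import product

BASES = "ACGT"

# One-hot encoding: one 4-dimensional binary vector per base, preserving
# positional information along the sequence.
def one_hot(seq):
    return [[1 if b == base else 0 for base in BASES] for b in seq]

# 4-mer frequency encoding: a single 256-dimensional vector of normalized
# k-mer counts, discarding positional information.
def kmer_freq(seq, k=4):
    kmers = ["".join(p) for p in product(BASES, repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = max(len(seq) - k + 1, 1)
    return [counts[km] / total for km in kmers]

seq = "ACGTACGTAACG"
print(len(one_hot(seq)))    # 12 positions x 4 channels
print(len(kmer_freq(seq)))  # one 256-dimensional frequency vector
```

The one-hot form keeps positional signal that convolutional layers can exploit, which may explain the gap in performance.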
Specific comment 4: The authors find a differential category distribution according to the COG database for proteins predicted from virulent or temperate phages - the authors should contextualize their findings with what is known in the literature.
We thank Reviewer 2 for this suggestion. As shown in Figure S4, the biggest difference between virulent and temperate protein sequences lies in the "Mobilome: prophages, transposons" category. Temperate phages can exist as prophages and mediate horizontal gene transfer via transduction, while such genetic exchanges might be rare for virulent phages. Thus, this difference could be captured and learned by DeePhage. We added the sentence "As we can see in Figure S4, the biggest difference between virulent and temperate protein sequences is located at Mobilome: prophages, transposons category. Temperate phages could exist as prophages and mediate horizontal gene transfer via transduction [51], while such genetic exchanges might be rare when involving virulent phages [52]. Thus, such difference could be comprehended and obviously learned by DeePhage." in the revised manuscript. (Please refer to Line 584-588, Page 21 in the revised manuscript.)
© 2021 the Reviewer (CC BY 4.0).
References
Wu, S., Fang, Z., Tan, J., Li, M., Wang, C., Guo, Q., Xu, C., Jiang, X., Zhu, H. DeePhage: distinguishing virulent and temperate phage-derived sequences in metavirome data with a deep learning approach. GigaScience.