Content of review 1, reviewed on December 13, 2018
The article by Fang et al describes a novel tool aimed at simultaneous identification of viral and plasmid sequences in metagenomic datasets. Presented tool can be of great value to many researchers analyzing microbiome data, especially for those interested in viral communities and/or mobile elements involved in the AMR spread and may as such also be widely cited. Authors used the novel neural network approach, which improves sequence classification for shorter sequences over other existing tools. Especially, usage of both noncoding and coding information is an important improvement over other existing tools, what authors clearly show in their manuscript.
I find the manuscript well written and informative, the methods are appropriate and analysis is well conducted. However, I do have several requests that should not be too difficult to accomplish, as well as comments and questions, before I can recommend the manuscript for publication.
Authors do not state what is the exact target of their software. Should it be used mainly with the metagenomic assemblies or can it be also easily applied to raw sequencing reads (because one of the models used in the work was trained on relatively short sequences (100-400 bp))? How will it work with isolate genome sequences?
Although the training procedure and the algorithm itself are quite well described I missed some details regarding data preprocessing and preparation of models. What parameters were used in MetaSim to generate sequences used for training? Authors should note that MetaSim does not produce artificial contigs but synthetic sequencing reads, with technology-specific errors introduced. Did authors use 'exact' preset to return fragments perfectly matching reference sequences? It would be also great if all the codes and scripts (e.g. those for preprocessing of sequences, and neural network training) are available online or as Supplemental materials.
Have authors tried to build model similar to BiPathCNN for longer sequences? As authors claim that codon path is beneficial for distinguishing plasmids, phages and chromosomes (p. 16, lines 7-10), this additional information should increase kmer-based approach, especially that longer fragments are more likely to contain coding sequences. I also wonder why authors chose the hexamer frequencies and not any odd-number kmer?
I find really interesting part in which authors used likelihood scores generated by PPR-Meta to predict phage lifestyle or plasmid transmissability. Results shown are really encouraging and may make PPR-Meta an important tool in MGE studies. In the discussion of this phenomenon authors should cite the paper of Suzuki et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2976448/), which discusses usage of genome signatures to predict the evolutionary host range of plasmids. Additionally, similarity between host and plasmid nucleotide composition is a known phenomenon, and so called genome amelioration is a term properly describing convergence of plasmid and host sequence patterns. Therefore, the statement "Since temperate phages and non-transmissible plasmids experience longer residence times within the host cell, they may adjust the sequence pattern toward the host" seems correct.
For me it is really worth emphasizing that authors tested their software in the context of 3rd generation sequencing technologies like PacBio and Nanopore. As there are more and more reports showing applicability of single molecule sequencing technologies for describing microbial communities, it would be really great to have a tool which is tested and works well on such datasets. Authors used 1% error rates for their simulated datasets; however, in my opinion they should also test higher error rates, as for the aforementioned technologies the error rate can reach up to 10%, especially for indels. Both technologies are sensitive to homopolymer sequences, therefore one would expect more errors in such regions. Additionally, I would also like to see if PPR-Meta is able to predict phage/plasmid sequences on real 3rd generation sequencing data. They can use for example recently published virome (https://www.biorxiv.org/content/early/2018/11/12/345041.full.pdf+html) or mock microbial community (https://www.biorxiv.org/content/early/2018/12/04/487033.full.pdf+html) nanopore data. The former one can also help with "Since we lack samples in which only chromosomes are enriched and all the extrachromosomal elements are filtered, estimating whether related tools will misjudge chromosomes as MGEs directly is difficult using real data" (p. 21, l. 9-11)
It should be also noted that in case of plasmid prediction PPR-Meta presents different approach than previously published cBar and PlasFlow only for sequences shorter than 10kb, as neural network trained on Group D essentially use the 6-mer frequencies for prediction. Additionally in the PlasFlow manual it is said that such approach does not work well on short sequences (recommended length is > 1kb), therefore comparisons using test datasets from GroupA and GroupB should be done with having this in mind. In my opinion authors should also comment on different lengths of sequences used for training in cBar, PlasFlow and PPR-Meta: cBar was trained with whole chromosome/plasmid sequences whereas PlasFlow on 10kb fragments and PPR-Meta using 4 datasets with differing lengths, what may significantly influence their performance.
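To make the k-mer representation discussed above concrete, here is a minimal Python sketch of a normalized k-mer frequency vector (k = 6 by default). It is an illustration only and not the code of cBar, PlasFlow or PPR-Meta; the function name `kmer_frequencies` and the decision to skip windows containing ambiguous bases are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return a dict mapping every possible k-mer to its relative frequency.

    Hypothetical helper for illustration. Windows containing characters
    other than A, C, G, T (e.g. N) are skipped, which is an assumption.
    """
    alphabet = "ACGT"
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(alphabet):
            counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product(alphabet, repeat=k))
    # Fixed-length vector: 4**k entries, zero for k-mers that never occur.
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}
```

A short fragment yields few windows relative to the 4^6 = 4096 possible hexamers, so most entries are zero; this is one intuitive reason why purely frequency-based approaches are expected to be less reliable on short sequences.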
Specific comments:
Page 5 lines 2-4: "extracted" should be "extract"
Page 5 line 21: Authors may also cite the following tool: https://www.ncbi.nlm.nih.gov/pubmed/30383524
Page 6 line 6 "this tool applies SMO" should be changed to: "this tool applies SOM" (Self Organizing Map)
page 8 lines 13-19 "phage metagenomic data of bovine rumen [19], which were downloaded from MG-RAST [29] (Accessions: mgm4534202.3 and mgm4534203.3) as raw reads and assembled by SPAdes" and "20 samples of healthy human gut [32], downloaded from the NCBI Short Read Archive [33] and assembled by SPAdes."
I miss details on the SPAdes assembly. What settings were used and what was the quality of the assembly (N50, number of contigs and so on)?
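For reference, the N50 statistic requested above can be computed from a list of contig lengths as in the following minimal Python sketch. The function name `n50` is hypothetical; the definition used is the common one (the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly size).

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total reaches half of all assembled bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly
```

For example, contigs of lengths 2 to 10 sum to 54 bases; the three longest (10, 9, 8) already cover 27 bases, so the N50 is 8.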
page 12 lines 5-11
Which approach is used for sequences 5-10 kb? FNN or biPath-CNN (groupC model)? If biPath-CNN - what is its performance on this dataset compared to FNN (group D)? It is not clear which approach the software will use for the real datasets of this length.
I would also like to see the comparison to other software regarding fragments longer than 10kb (which are easily achievable with current metagenomic sequencing techniques).
Additionally, please test accuracy of each model on testing datasets from other models, e.g. model for group A on test datasets for groups B, C and D, model for group B on test datasets for groups A, C and D, etc. It is not likely that in real datasets sequences will be distributed such uniformly. It is also interesting how biPath-CNN performs on long sequences, as coding information should significantly increase its performance (like it was shown for shorter fragments). And, maybe, any of the single models is good enough to be used on fragments of all lengths; for me this possibility can not be excluded by looking at presented data.
Page 15 line 12
I would remove the word "obviously". Please be less advertising and more informative.
page 16 lines 10-13
"Compared with other sequence representation methods that ignore the coding or non-coding region, such as method based on k-mer frequencies, PPR-Meta uses a more detailed method of describing a sequence and achieves a higher performance."
Authors should explicitly note that it relates only to sequences shorter than 5 (or 10, see my note above) kb.
Page 22, lines 16-17 "PPR-Meta is designed with the option to adjust the default threshold of discriminant criteria"
It should be described more precisely. Although in the Manual it is noted, that "In this way, sequences with the phage (or plasmid) score higher than the other two categories and the threshold will be regarded as phage (or plasmid)", it is not mentioned that sequences not exceeding threshold for phage or plasmid category will fall into the "chromosome" category, what may not be the best option, increasing False Negative Rate. I also lack more information on the accuracy of PPR-Meta run with different thresholds (Table S1). Can you include also AUC in the table? And compare with PlasFlow, using the same thresholds?
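The thresholding behaviour discussed above can be sketched as follows. This is a minimal Python illustration of the rule as quoted from the Manual, not PPR-Meta's actual code; the function name `classify`, the score dictionary layout and the `fallback` parameter are assumptions introduced for the example.

```python
def classify(scores, threshold=0.0, fallback="chromosome"):
    """Assign a label from three class scores using a threshold.

    scores: dict with keys 'phage', 'chromosome' and 'plasmid' (assumed layout).
    A sequence is called phage (or plasmid) only if that score is the highest
    of the three AND exceeds the threshold; otherwise it receives `fallback`.
    """
    best = max(scores, key=scores.get)
    if best in ("phage", "plasmid") and scores[best] <= threshold:
        return fallback
    return best
```

With `fallback="chromosome"` a sequence scored 0.6 phage, 0.3 chromosome and 0.1 plasmid is reported as chromosome at a threshold of 0.7, which is the situation that may inflate the false negative rate. Passing a hypothetical label such as `fallback="uncertain"` would keep such sequences out of the chromosome class.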
This is only the suggestion, but all the data presenting performance of PPR-Meta in comparison to other tools can be also presented as graphs, what would allow for easy assessment of differences.
Specific comments regarding software usability:
Using output file extension other than .csv throws an error:
Error using writetable (line 124) Unrecognized file extension '.tsv'. Use the 'FileType' parameter to specify the file type
Error in PPR-Meta(line 124)
MATLAB:table:write:UnrecognizedFileExtension
This should be better documented, and the user should be warned at the beginning of computation that using a custom file extension will prevent the output file from being written.
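One way to give such an early warning is to validate the output file name before any computation starts. The sketch below shows the idea in Python purely for illustration; PPR-Meta itself appears to run on MATLAB according to the error trace above, and the names `SUPPORTED_EXTENSIONS` and `check_output_path`, as well as the assumption that only `.csv` is accepted, are hypothetical.

```python
import os

# Assumption for illustration: only .csv output is supported.
SUPPORTED_EXTENSIONS = {".csv"}


def check_output_path(path):
    """Fail fast, before prediction starts, if the output extension is unsupported."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        supported = ", ".join(sorted(SUPPORTED_EXTENSIONS))
        raise ValueError(
            f"Unsupported output extension '{ext}'. Supported: {supported}"
        )
    return path
```

Calling such a check as the very first step means that a run with `result.tsv` stops immediately with a clear message instead of failing after all sequences have been processed.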
Again, I find your work really interesting and solid and I hope I managed to help you improve the manuscript in some way. Good luck with other review!
Declaration of competing interests: I'm the author of PlasFlow, one of the tools compared to PPR-Meta in the manuscript. I declare that I have no other competing interests.
I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
Authors' responses to reviews:
To Reviewer #1:
General Comments: The article by Fang et al describes a novel tool aimed at simultaneous identification of viral and plasmid sequences in metagenomic datasets. Presented tool can be of great value to many researchers analyzing microbiome data, especially for those interested in viral communities and/or mobile elements involved in the AMR spread and may as such also be widely cited. Authors used the novel neural network approach, which improves sequence classification for shorter sequences over other existing tools. Especially, usage of both noncoding and coding information is an important improvement over other existing tools, what authors clearly show in their manuscript. Herein we are glad to see Reviewer 1’s positive comments on the present work as “… be of great value to many researchers analyzing microbiome data, especially for those interested in viral communities and/or mobile elements involved in the AMR spread and may as such also be widely cited.” We especially thank Reviewer 1 for his/her careful reading of our manuscript. The suggestions and comments raised by Reviewer 1 were certainly very helpful for us to improve the manuscript. Below, we itemize the revisions in response to Reviewer 1’s points.
Authors do not state what is the exact target of their software. Should it be used mainly with the metagenomic assemblies or can it be also easily applied to raw sequencing reads (because one of the models used in the work was trained on relatively short sequences (100-400 bp))? How will it work with isolate genome sequences? We first thank Reviewer 1 for reminding us with a clear statement about the exact target of PPR-Meta. In the current work, PPR-Meta is primarily designed for metagenomic assemblies generated by next-generation sequencing technology. However, there are many sequences that are assembled poorly or even remain unassembled with short length in most metagenomic assemblies, especially the sequences from low abundance species or in condition of low coverage sequencing. This is the reason why we also trained and tested PPR-Meta using short sequences from 100 to 400 bp. So PPR-Meta can also be easily applied to raw sequencing reads. The only requirement for users is to dispose of all sequences in a “fasta” format file. To make the target clearer for readers, we revised the sentence in Section “Abstract” as follows: “We present PPR-Meta, a three-class classifier that allows simultaneous identification of both phage and plasmid fragments from metagenomic assemblies.” (Please refer to Line 8-10, Page 2 in the revised manuscript.) Of course, PPR-Meta can also work with isolated genome sequences. We collected complete genomes of the Acidianus tailed spindle virus (NC_029316.1), Ichthyobacterium seriolicida (AP014564.1) and Methylobacterium populi plasmid pMPPM01 (AP014810.1) from the test set (shown in the first line in the corresponding sheet in Additional File 1), and we found that PPR-Meta could correctly identify them as a phage, chromosome and plasmid, respectively. However, identifying complete genomes is not the main purpose of PPR-Meta because this is not a difficult issue and has been well addressed by other tools. 
In addition, we have added a new section to provide an example application to show how PPR-Meta can be used to analyse metagenomic data. We employed PPR-Meta to identify phage and plasmid sequences on a series of metagenomic datasets of the human digestive tract, including the gut, throat and oral cavity, from the Human Microbiome Project (HMP). The finding is interesting and may be significant to the study of human health. We found that in the position closer to the outer end of the digestive tract, the percentages of phages and plasmids tended to be higher. For example, in the gut, the inner end of the digestive tract, the percentages of phages and plasmids were lower; in the oral cavity, the outer end of the digestive tract, the percentages of phages and plasmids were higher. Please refer to the new section “Phages and plasmids in the human digestive tract” for more details (Please refer to Line 18, Page 24 - Line 21, Page 25 in the revised manuscript.).
Although the training procedure and the algorithm itself are quite well described I missed some details regarding data preprocessing and preparation of models. What parameters were used in MetaSim to generate sequences used for training? Authors should note that MetaSim does not produce artificial contigs but synthetic sequencing reads, with technology-specific errors introduced. Did authors use 'exact' preset to return fragments perfectly matching reference sequences? It would be also great if all the codes and scripts (e.g. those for preprocessing of sequences, and neural network training) are available online or as Supplemental materials. Herein we realized that we did not provide a clear description of the data pre-processing and the preparation of the models. Actually, we used MetaSim to generate synthetic sequencing reads without technology-specific errors, that means we just used MetaSim to extract DNA sequences with different lengths that perfectly match reference sequences or that be modified with base substitutions or indels that are evenly distributed over the sequences. To make this process clearer, we have added the description to Section “Methods” as: “To generate artificial contigs with no error for both training and test set, we used the “exact” preset to return fragments exactly matching reference sequences. In each group, the “DNA Clone Size Distribution Type” was set to “Uniform”. To generate artificial contigs modified with sequencing errors, we used the “Sanger” preset, which allowed users to modify sequences according to their settings. Note that as we were not going to generate sequences with technology-specific errors, the following settings do not reflect the real situation of the Sanger technology. 
For the generation of sequences with 1% base substitutions, the “Read Length Distribution Type” was set to “Uniform”, and the “Mate Pair Probability” was set to 0; both the “Error Rate at Read Start” and the “Error Rate at End of Read” were set to 0.01; and both the “Insertion Error Rate” and “Deletion Error Rate” were set to 0. For the generation of sequences with 1% base insertions or deletions, most settings were the same as mentioned above, except that both the “Insertion Error Rate” and “Deletion Error Rate” were set to 0.5.” (Please refer to Line 5-19, Page 33 in the revised manuscript.) Also, we have added the following sentences in Subsection “Structure of deep learning neural networks” to describe how we prepared the BiPathCNN and selected hyperparameters: “The selection of the related hyperparameters of each path mentioned above was referred to LeNet-5 [37] and VGG [38], two classic Convolutional Neural Networks in the field of artificial intelligence. Specifically, the distribution of layers was referred to LeNet-5, which contained three convolution layers, and there was a pooling layer between every two convolution layers. Meanwhile, the distribution of the number of convolution kernels was referred to VGG, in which the number of convolution kernels in the different layers was increased by doubling. We also referred to VGG to use ReLU as the activation function.” (Please refer to Line 5-12, Page 12 in the revised manuscript). The citations of LeNet-5 and VGG have also been added to the list of the References. In addition, we have uploaded the Keras scripts for the construction of the neural network as well as other scripts for data preprocessing to the GigaScience Database and our website (Link: http://cqb.pku.edu.cn/ZhuLab/PPR_Meta/data/). We hope that this is the best way to provide access to readers who want to reproduce or improve PPR-Meta.
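The fragment generation described in this response (fragments that exactly match the reference, with uniformly distributed lengths, optionally modified by evenly distributed base substitutions) can be illustrated with the following minimal Python sketch. It does not call or reproduce MetaSim; the function names `sample_fragment` and `add_substitutions` are hypothetical, and the sketch assumes the reference is at least `min_len` bases long.

```python
import random


def sample_fragment(reference, min_len, max_len, rng):
    """Cut one fragment of uniformly chosen length from a random position.

    The fragment matches the reference exactly (no errors introduced).
    Assumes len(reference) >= min_len.
    """
    length = rng.randint(min_len, min(max_len, len(reference)))
    start = rng.randint(0, len(reference) - length)
    return reference[start:start + length]


def add_substitutions(fragment, rate, rng):
    """Replace each base with a different base with probability `rate`."""
    bases = "ACGT"
    out = []
    for base in fragment:
        if rng.random() < rate:
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)
```

For example, `add_substitutions(sample_fragment(genome, 100, 400, rng), 0.01, rng)` with `rng = random.Random(0)` would produce a Group A-sized fragment carrying roughly 1% substitutions; indels are not covered by this sketch.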
Have authors tried to build model similar to BiPathCNN for longer sequences? As authors claim that codon path is beneficial for distinguishing plasmids, phages and chromosomes (p. 16, lines 7-10), this additional information should increase kmer-based approach, especially that longer fragments are more likely to contain coding sequences. I also wonder why authors chose the hexamer frequencies and not any odd-number kmer? We thank Reviewer 1 for this suggestion of using BiPathCNN for longer sequences. Firstly, we would like to describe the improvements that we made in the revised version of the PPR-Meta tool, which may also be closely related to the other comments. In the original version of PPR-Meta, we built four neural network models for sequences of different lengths. Among these neural networks, we used BiPathCNN, which contains a base path and a codon path, for model A, B and C, and we used a Fully Connected Neural Network (FNN), which takes k-mer frequencies as inputs, for model D. In the revised version of the PPR-Meta tool, we removed model D and kept model A, B and C. In practical applications, PPR-Meta uses model A to predict sequences between 100 and 400 bp, model B to predict sequences between 400 and 800 bp, and model C to predict sequences between 800 and 1200 bp. For sequences longer than 1200 bp, a scan window will move across the sequence without overlapping, and the weighted average of all windows’ predictions is calculated. The length of the window is set to 1200 bp (or less if the window is beyond the sequence boundary). For example, given a sequence of length 2500 bp, the scan window will first cover the bases from the 1st to 1200th positions, then the window will move to the bases from the 1201st to 2400th positions, and finally, the window will move to the bases from the 2401st to 2500th positions. Then, PPR-Meta uses model C, model C and model A to predict the subsequences under the first, second and third windows, respectively. 
To generate the final score for the whole sequence, PPR-Meta calculates the weighted average of these windows. The weights of these three windows are 1200/2500, 1200/2500 and 100/2500, respectively. We made this change because we found that the revised version of PPR-Meta could achieve a higher performance on long sequences. For example, for sequences with a length of 30k bp, the AUCs of both the phage identification and plasmid identification demonstrate higher performance. In particular, the TPR of phages increases from 93.76% to 99.84%, and almost all phages were identified. Although most sequences in the current metagenomic data are short fragments, a few reads from high-abundance species can be assembled into long contigs containing tens of thousands of bases, and we think that the revised PPR-Meta can be better adapted to these species. Additionally, considering that the third-generation sequencing technology is becoming more and more widely used, we hope that PPR-Meta can also promote studies using long sequencing technology, even though PPR-Meta is designed primarily for the next-generation sequencing technology. In the revised manuscript, we have added comparisons with related tools using sequences longer than 10k bp and sequences from real third-generation sequencing technology, which we will mention in the responses to the below comments. We then address the questions in this comment. We tried to train BiPathCNN for longer sequences, but we failed to do this because it was very time consuming and had high hardware requirements. Thus, using a scan window to move across a long sequence may be a good alternative. In terms of the reason we used the 6-mer in the original version of PPR-Meta, it seems that the choice of k is not so significant. We tried different k values around 6, such as 5 and 7, and found that the results were comparable. In our opinion, the design of the neural network structure may be more important than the choice of the k value. 
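The scan-window scheme described in this response (non-overlapping 1200 bp windows, a model chosen by window length, and a length-weighted average of the window predictions) can be sketched in Python as follows. This is an illustration under stated assumptions, not the PPR-Meta implementation: the names `windows`, `choose_model` and `score_sequence` are hypothetical, `predict` stands in for the trained networks, and the handling of the exact boundaries (400 and 800 bp) and of windows shorter than 100 bp is an assumption because the response does not specify it.

```python
def windows(seq_len, window=1200):
    """Non-overlapping (start, end) windows; the last one may be shorter."""
    return [(s, min(s + window, seq_len)) for s in range(0, seq_len, window)]


def choose_model(length):
    """Pick a model by subsequence length (boundary handling is assumed)."""
    if length <= 400:
        return "A"
    if length <= 800:
        return "B"
    return "C"


def score_sequence(seq, predict, window=1200):
    """Length-weighted average of per-window scores.

    predict(model_name, subsequence) must return three scores
    (phage, chromosome, plasmid); it is a placeholder for the real models.
    """
    n = len(seq)
    totals = [0.0, 0.0, 0.0]
    for start, end in windows(n, window):
        sub = seq[start:end]
        scores = predict(choose_model(len(sub)), sub)
        weight = len(sub) / n  # e.g. 1200/2500, 1200/2500, 100/2500
        for i, s in enumerate(scores):
            totals[i] += weight * s
    return tuple(totals)
```

For the 2500 bp example in the response, `windows(2500)` gives three windows of 1200, 1200 and 100 bases, handled by models C, C and A with weights 0.48, 0.48 and 0.04.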
Because of the improvements we made in the revised PPR-Meta tool, as we mentioned at the beginning of this response, some of the results in the manuscript have also been updated. Herein, we would like to describe the updated content in our manuscript that is the result of these changes. None of the updated results mentioned below affect any conclusions that we have made in this manuscript. The revised version of the PPR-Meta tool has slight differences only on long sequences, while most of the test data we used in the manuscript are shorter than 5k bp, which is dominant in the current metagenomic sequences, and the revised PPR-Meta generates the same results for sequences shorter than 5k bp. Thus, the magnitude of all of the changes is small, except that the program has a longer running time for sequences longer than 5k bp, as shown in item (15) below. The changes in the manuscript include the following: (1). The original Figure 2, which describes the structure of the FNN, was removed. The second paragraph from the last in Subsection “Structure of deep learning neural networks”, which describes the FNN, was also removed. (2). In Subsection “Mathematical model of DNA sequences”, the sentence “Here, we use a more detailed approach to represent the short sequences in Group A, Group B and Group C.” has been revised to “Here, we use a more detailed approach to represent the DNA fragments.” (Please refer to Line 7-8, Page 9 in the revised manuscript.) Also, the last paragraph of this section, which described using k-mer to represent DNA fragments in Group D, was removed. (3). In Subsection “Structure of deep learning neural networks”, the sentences “…we trained corresponding neural networks for each group. For Group A, B and C, we designed BiPathCNN to improve the performance (Figure 1).” have been revised to: “…we trained three neural networks for Group A, B and C. 
To improve the performance, we designed BiPathCNN (Figure 1), a novel neural network structure, to make reliable predictions.”(Please refer to Line 15-17, Page 10 in the revised manuscript.) (4). In Figure 2 in the revised manuscript (as Figure 3 in the original manuscript), the confusion matrix of Group D was updated. Also, in Subsection “Overall performance”, the phrase “shown in Figure 3” has been revised to “shown in Figure 2”. (Please refer to Line 15, Page 13 in the revised manuscript.) (5). In Figure 4, the ROCs of Group D, which described the potential of using life_score and trans_score to classify the phage lifestyle and plasmid transmissibility, were updated. Also, the legend of Figure 4 has been revised to: “(a) Classify virulent phages and temperate phages using life_score. In order of sequence length, the AUC is 0.63, 0.69, 0.71 and 0.76. (b) Classify transmissible plasmid and non-transmissible plasmid using trans_score. In order of sequence length, the AUC is 0.58, 0.55, 0.60 and 0.62. (6). In Section “Methods”, Line 21-22, Page 33, the sentence “Considering the memory size, running time and accuracy, a total of 3,060,000 artificial contigs were generated to train PPR-Meta.” has been revised to “Considering the memory size, running time and accuracy, a total of 2,700,000 artificial contigs were generated to train PPR-Meta.” Also, the phrase “and 120,000 from Group D” has been removed from the sentence “The number of training contigs of each phage, chromosome and plasmid is 300,000 from Group A to C and 120,000 from Group D.” (Please refer to Line 22, Page 33-Line 2, Page 34 in the revised manuscript.) (7). In Table 1, Table 3 and Table 4, the performance of PPR-Meta on Group D (the fifth row from the last of each table) was updated. (8). In Table 5, the prophage recognition rate of PPR-Meta on Group D (the third row from the last) was updated. (9). 
In Subsection “Performance comparison”, the sentence “The TPR of PPR-Meta was approximately 3%~13% higher than that of VirFinder and the FPR was approximately 6%~9% lower” has been revised to “The TPR of PPR-Meta was approximately 10% higher than that of VirFinder, and the FPR was approximately 5~10% lower.” (Refer to Line 13-15, Page 15 in the revised manuscript.) (10). In Subsection “Performance comparison”, the sentences “For PPR-Meta, our FPR was much lower than that of cBar and PlasFlow. Although PPR-Meta achieved a slightly lower TPR than PlasFlow in Group A, our TPR remained highest in all other cases” have been revised to “For PPR-Meta, the TPR was comparable with that of PlasFlow, while the FPR was approximately 25~40% lower.” (Please refer to Line 1-2, Page 16 in the revised manuscript.) (11). In Subsection “Evaluation in real metagenomic data”, the sentence “VirFinder and PPR-Meta were much better than VirSorter and identified 68.86% and 76.88% of the contigs, respectively, showing that PPR-Meta had the highest coverage of this data set” has been revised to “VirFinder and PPR-Meta were much better than VirSorter and identified 68.86% and 76.90% of the contigs, respectively, showing that PPR-Meta had the highest coverage of this data set.” (Refer to Line 21, Page 21 in the revised manuscript.) (12). In Subsection “Evaluation in real metagenomic data”, the sentences “For PPR-Meta, total of 82.00% of the sequences were identified as MGEs, in which 49.16% were phages and 32.84% were plasmids. More than half of the sequences (64.72%) predicted as phages by PPR-Meta were also predicted as phages by VirFinder, and most of the sequences (74.72%) predicted as plasmids by PPR-Meta were also predicted by PlasFlow” have been revised to “For PPR-Meta, total of 81.96% of the sequences were identified as MGEs, in which 49.18% were phages and 32.78% were plasmids. 
More than half of the sequences (64.73%) predicted as phages by PPR-Meta were also predicted as phages by VirFinder, and most of the sequences (74.74%) predicted as plasmids by PPR-Meta were also predicted by PlasFlow.” (Please refer to Line 16-20, Page 22 in the revised manuscript.) (13). In Subsection “Evaluation in real metagenomic data”, the sentence “In terms of phage identification, PPR-Meta, VirFinder and VirSorter predicted an average of 4.18%, 11.03% and 0% of the 16S-like contigs as phages, respectively, indicating that PPR-Meta likely generated fewer false positive predictions than VirFinder” has been revised to “In terms of phage identification, PPR-Meta, VirFinder and VirSorter predicted an average of 3.43%, 11.32% and 0% of the 16S-like contigs as phages, respectively, indicating that PPR-Meta likely generated fewer false positive predictions than VirFinder.” (Please refer to Line 17, Page 23 in the revised manuscript.) (14). In Subsection “Evaluation in real metagenomic data”, the sentence “In terms of plasmid identification, PPR-Meta, PlasFlow and cBar predicted an average of 25.46%, 53.74% and 63.83% of the 16S-like contigs as plasmids, respectively, indicating that the PPR-Meta may generate the lowest number of false positive predictions” has been revised to “In terms of plasmid identification, PPR-Meta, PlasFlow and cBar predicted an average of 26.69%, 52.57% and 63.36% of the 16S-like contigs as plasmids, respectively, indicating that the PPR-Meta may generate the lowest number of false positive predictions.” (Refer to Line 22, Page 23 in the revised manuscript.) (15). 
In Subsection “Usage of PPR-Meta”, the sentence “We tested the running time of PPR-Meta using 90,000 sequences from 100 to 10k bp and found that this tool can handle all sequences in approximately 15 minutes on a machine with the following configuration: CPU: Intel Core i7 6700; GPU: NVIDIA GTX1060; and Memory: 64G, DDR4” has been revised to: “… and found that this tool can handle all sequences in approximately 45 minutes on a machine with the following configuration …”(Please refer to Line 10, Page 27 in the revised manuscript.) In items (13) and (14), the results of the comparative tools were also slightly different because we found that one setting in the data pre-processing script may not be the best option. Thus, we re-ran the pre-processing procedure to make the results more precise. The dataset pre-processing procedure is provided in Section “Methods” (Please refer to Line 3-16, Page 34 in the revised manuscript), and the data pre-processing script (file “20_gut.sh”) is stored on our website and in the GigaScience Database.
I find really interesting part in which authors used likelihood scores generated by PPR-Meta to predict phage lifestyle or plasmid transmissability. Results shown are really encouraging and may make PPR-Meta an important tool in MGE studies. In the discussion of this phenomenon authors should cite the paper of Suzuki et al. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2976448/), which discusses usage of genome signatures to predict the evolutionary host range of plasmids. Additionally, similarity between host and plasmid nucleotide composition is a known phenomenon, and so called genome amelioration is a term properly describing convergence of plasmid and host sequence patterns. Therefore, the statement "Since temperate phages and non-transmissible plasmids experience longer residence times within the host cell, they may adjust the sequence pattern toward the host" seems correct. We appreciate Reviewer 1 for providing a reference that supports our hypothesis. Firstly, to make the description more rigorous, we have revised the sentence “This phenomenon may be due to the adaptation of foreign DNA to the host” to “This phenomenon may be due to the genome amelioration of foreign DNA to the host.” (Please refer to Line 7, Page 30 in Section “Discussion and conclusions”.) In addition, we have added the following sentence to the revised manuscript as: “For example, research has shown that the comparison of the trinucleotide composition between a plasmid and bacterial chromosome can be used to predict the host range of plasmids [53].” (Please refer to Line 8-10, Page 30 in Section “Discussion and conclusions”.) The new citation has also been added to the list of the References.
For me it is really worth emphasizing that authors tested their software in the context of 3rd generation sequencing technologies like PacBio and Nanopore. As there are more and more reports showing applicability of single molecule sequencing technologies for describing microbial communities it would be really great to have a tool which is tested and works well on such datasets. Authors used 1% error rates for their simulated datasets, however, in my opinion they should test also higher error rates, as for aforementioned technologies they can reach up to 10%, especially for indels. Both technologies are sensitive to homopolymer sequences therefore one would expect more errors in such regions. Additionally I would also like to see if PPR-Meta is able to predict phage/plasmid sequences on real 3rd generation sequencing data. They can use for example recently published virome (https://www.biorxiv.org/content/early/2018/11/12/345041.full.pdf+html) or mock microbial community (https://www.biorxiv.org/content/early/2018/12/04/487033.full.pdf+html) nanopore data. The former one can also help with "Since we lack samples in which only chromosomes are enriched and all the extrachromosomal elements are filtered, estimating whether related tools will misjudge chromosomes as MGEs directly is difficult using real data" (p. 21, l. 9-11). We would like to thank Reviewer 1 for the concern about testing PPR-Meta using artificial contigs with a higher error rate and real third generation sequencing data. In the revised manuscript, we tested PPR-Meta and related tools using artificial contigs with a high error rate. We used MetaSim to generate artificial contigs modified with 10% base substitutions and 10% indels in Group D, whose lengths were close to the raw reads generated from third generation sequencing data. The two types of errors were tested separately.
The results showed that the AUCs of PPR-Meta remained the highest (>90%), although the performance fluctuated somewhat, especially in the presence of 10% indels. Please refer to Additional File 3, Figure S3, for more details. On the other hand, we consider that a tool that can tolerate a 1% error rate is competent for handling third generation sequencing data. Although the error rate of third generation sequencing technology can reach up to 10%, many basecalling tools have been developed to help improve the accuracy to over 99%. Thus, we believe that PPR-Meta can generate reliable predictions on third generation sequencing data that have gone through Quality Control (QC). We have added the following sentences to the revised manuscript: “Considering that the error rate of the raw data generated from the third-generation sequencing technology may be much higher, we also tested PPR-Meta and the related tools using artificial contigs modified with 10% base substitutions and 10% insertions or deletions in Group D, whose lengths were close to the raw reads generated from third-generation sequencing technology. The results are shown in Additional File 3 and Figure S3. The results showed that the AUCs of PPR-Meta remained the highest (>90%), although the performance was somewhat fluctuating, especially in the presence of 10% insertions or deletions. Recently, many basecalling tools for the third-generation sequencing technology have been developed to help improve the accuracy over 99% [41], therefore the extremely high error rate on the raw data will not affect the usage of PPR-Meta.” (Please refer to Lines 4-15, Page 19 in Subsection “Performance in the presence of sequencing errors” in the revised manuscript.) The new citation has also been added to the list of the References. We also tested whether PPR-Meta and the related tools can predict phage sequences on real third generation sequencing data using virome data from the reference provided by Reviewer 1.
The results showed that PPR-Meta could identify more viruses in this dataset. We have added the following sentences to the revised manuscript as: “Considering that the third-generation sequencing technology is more and more widely used to analyse metagenomes, we also used real virome data generated by MinION [46] to test whether PPR-Meta and the related tools can identify phages from third-generation sequencing technology. The virome was downloaded as assembled sequences (accession: GCA_900491955.1), containing 1500 sequences. The results showed that PPR-Meta, VirFinder and VirSorter could identify 79.20%, 76.27% and 30.40% of viral sequences respectively, indicating that PPR-Meta has the highest performance. Therefore PPR-Meta can also handle data from the third-generation sequencing technology, although it is designed primarily for the next-generation sequencing technology.” (Please refer to Line 6-16, Page 24 in Subsection “Evaluation in real metagenomic data”.) The new citation has also been added to the list of the References.
It should be also noted that in case of plasmid prediction PPR-Meta presents a different approach than previously published cBar and PlasFlow only for sequences shorter than 10kb, as the neural network trained on Group D essentially uses the 6-mer frequencies for prediction. Additionally in the PlasFlow manual it is said that such approach does not work well on short sequences (recommended length is > 1kb), therefore comparisons using test datasets from Group A and Group B should be done with having this in mind. In my opinion authors should also comment on different lengths of sequences used for training in cBar, PlasFlow and PPR-Meta: cBar was trained with whole chromosome/plasmid sequences whereas PlasFlow on 10kb fragments and PPR-Meta using 4 datasets with differing lengths, which may significantly influence their performance. We thank Reviewer 1 for reminding us to note that PPR-Meta uses a similar approach to cBar and PlasFlow for long sequences. As we mentioned in the response to General Comment #3, in the revised version of the PPR-Meta tool, we removed the k-mer-based Fully Connected Neural Network (FNN) and instead used BiPathCNN to predict sequences of all lengths, which helped to improve its performance for very long sequences. Indeed, we found that the k-mer-based approach did not work well for short sequences. For example, we used the FNN in the original PPR-Meta tool to predict the sequences in Group A (100~400 bp), and we found that the performance was much poorer: the AUC of phage identification decreased from 91.84% to 72.70%, and the AUC of plasmid identification decreased from 83.05% to 64.79%. One of the innovations of PPR-Meta is that we do not use k-mer frequencies to represent DNA sequences, as we have emphasized in our manuscript.
For example, we mentioned that “Although k-mer frequencies have been widely used in many studies, such frequencies may present serious fluctuations in short sequences” and “The performance improvement on short sequences demonstrates that our sequences representation method is more detailed than the k-mer frequencies”. To further emphasize the difference between PPR-Meta and the comparative tools, we have added the following sentence to the revised manuscript: “Overall, PPR-Meta presented a much better performance than other homology-search-based tools such as VirSorter and k-mer-based tools such as VirFinder, PlasFlow and cBar.” (Please refer to Lines 11-13, Page 16 in Subsection “Performance comparison”.) In our opinion, the k-mer-based methods are more sensitive to sequence length than our BiPathCNN. The distribution of the k-mer frequencies may be different between long sequences and short sequences, and the variance of the k-mer frequencies for short sequences may be much higher. Thus, the k-mer-based classifier constructed using short sequence data may not be applicable for long sequences, and vice versa. We also tested the accuracy of each BiPathCNN on test sets from other groups, as per Reviewer 1’s suggestion in Specific Comment #7. We found that although the overall accuracy was slightly reduced when testing each group using the non-corresponding BiPathCNNs from the other groups, the decrease was not obvious (Please refer to Additional File 3, Figure S2), indicating that our approach was not very sensitive to the sequence length. We have added the following sentences to Section “Discussion and conclusions” in the revised manuscript: “On the other hand, k-mer-based methods may also be more sensitive to the sequence length than the BiPathCNN method in the current work. The distribution of k-mer frequencies may be different between long sequences and short sequences, and the variance of the k-mer frequencies for short sequences may be much higher.
Thus, the k-mer-based classifier constructed using short sequence data may not be applicable for long sequences, and vice versa. Among the k-mer-based tools, cBar was trained with complete genomes and PlasFlow was trained on 10k bp fragments, which might make them hard to adapt to metagenomic data with a wide range of lengths. Differently, our BiPathCNN directly extracts sequence features from the raw data represented by the one-hot matrix and may be less sensitive to the sequence length. Tests of each BiPathCNN on test datasets from different groups (Additional File 3, Figure S2) also showed that although the overall accuracy was slightly reduced when testing each group using a non-corresponding BiPathCNN from the other groups, the decrease was not obvious, indicating that our approach is not quite sensitive to the sequence length.” (Please refer to Lines 5-20, Page 28 in the revised manuscript.)
Specific Comments: 1. Page 5 lines 2-4: "extracted" should be "extract" We thank Reviewer 1 for the careful check of our manuscript, and we have corrected this mistake. The sentence has been revised to “Such approaches primarily used a scan window to move across the complete bacterial chromosome and extract regions that seem to be phages based on a similarity search against viral databases.” (Please refer to Line 4, Page 5, in Section “Introduction”.) Meanwhile, after proofreading our manuscript, we have also revised some phrases to improve the language and presentation. Herein we list the revision as follows: Line 4, Page 15 and Line 14, Page 22, “In term of” have been revised to “In terms of”. Line 20, Page 15, “appearing” has been revised to “appeared”. Line 22, Page 31, “sequencing” has been revised to “sequenced”. Line 16, Page 9, “For BOH” has been revised to “For BOH in PPR-Meta”. Line 11-12, Page 20, “The decline in performance may be due to…” has been revised to “The lower recognition rate of prophages compared with that of the phages in the NCBI database may be due to…” Line 20-21, Page 28, we have replaced “Moreover…” with “Another shortcoming of k-mer-based tools may be that…”
Page 5 line 21: Authors may also cite following tool: https://www.ncbi.nlm.nih.gov/pubmed/30383524. We thank Reviewer 1 for reminding us that we missed a related tool in our manuscript. Now, the corresponding sentence has been revised to “In terms of plasmids, most of the current tools for plasmid identification were designed for WGS or even specific species, such as PlasmidFinder [20], PLACNET [21], PlasmidSeeker [22] and mlplasmids [23].” (Please refer to Line 21, Page 5, in Section “Introduction”.)
Page 6 line 6 "this tool applies SMO" should be changed to: "this tool applies SOM" (Self Organizing Map) We thank Reviewer 1 for this careful check. According to the reference of cBar, SMO refers to “sequential minimal optimization”. To make it clearer to readers, we have revised the sentence to “This tool applies sequential minimal optimization (SMO) as a classifier based on k-mer frequencies.” (Please refer to Line 6, Page 6, in Section “Introduction”.) In addition, the term “SMO” has also been added to Section “List of abbreviations.” (Please refer to Line 3, Page 36.)
page 8 lines 13-19 "phage metagenomic data of bovine rumen [19], which were downloaded from MG-RAST [29] (Accessions: mgm4534202.3 and mgm4534203.3) as raw reads and assembled by SPAdes" and "20 samples of healthy human gut [32], downloaded from the NCBI Short Read Archive [33] and assembled by SPAdes." I miss details on SPAdes assembly. What settings were used and what was the quality of assembly (N50, number of contigs and so on). We apologize that we missed some details in our manuscript. In the revised manuscript, we have added these details to Section “Methods” as follows: “We also used real metagenomic data to evaluate PPR-Meta and the related tools. We used SPAdes to assemble the raw reads, as we mentioned in the main text. The phage metagenomic data of the bovine rumen were downloaded from MG-RAST, and we used the command “spades.py --meta -1 file1.fastq -2 file2.fastq -o out_folder” to assemble the pair-end raw reads. In the assembly, the contig number, N50, average length, maximum length and minimum length were 107529, 288, 312.06, 75508 and 56, respectively. To download the 20 samples of the healthy human gut, we used the command “prefetch SRRaccession” from the SRA Toolkit. All samples were downloaded as “.sra” files. We then used the command “fastq-dump --split-files accession.sra” from the SRA Toolkit to convert the sra file into two pair-end fastq files and used SPAdes with the same settings as mentioned above to assemble the raw reads. The information about the contig number, N50, average length, maximum length and minimum length is provided in Additional File 1.” (Please refer to Line 3-16, Page 34 in the revised manuscript.) Also, the sentence “Additional details on the dataset construction are provided in Methods section.” has been moved from the second paragraph from the last to the last paragraph in Section “Dataset construction”. (Please refer to Line 1, Page 9 in the revised manuscript.) 
In addition, the scripts used to calculate the quality of the assemblies were provided in the GigaScience Database and on our website.
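The assembly statistics quoted above (contig number, N50, average, maximum and minimum length) can be computed from a list of contig lengths. The following is only a minimal Python sketch using the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that covers at least half of the total assembly length); it is not the authors' script, and the function name `assembly_stats` is hypothetical.

```python
def assembly_stats(lengths):
    """Summarise an assembly from its contig lengths (in bp).

    Hypothetical helper for illustration; not the authors' script.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        # N50: first contig at which the cumulative length
        # reaches at least half of the total assembly length.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "n50": n50,
        "mean": total / len(ordered),
        "max": ordered[0],
        "min": ordered[-1],
    }
```

In practice the lengths would be read from the contig FASTA file produced by the assembler; that parsing step is omitted here.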
page 12 lines 5-11: Which approach is used for sequences 5-10 kb? FNN or biPath-CNN (groupC model)? If biPath-CNN - what is its performance on this dataset compared to FNN (group D)? It is not clear which approach the software will use for the real datasets of this length. We are very sorry that we did not provide a clear statement of this issue in our manuscript. As we mentioned in the response to General Comment #3, we removed the FNN from the revised PPR-Meta tool. For sequences longer than 1200 bp, a scan window is used to make a prediction. To make this clearer to the readers, we have added the following sentences to the revised manuscript as: “In practical applications, PPR-Meta uses BiPathCNN A to predict sequences between 100 and 400 bp, BiPathCNN B to predict sequences between 400 and 800 bp, and BiPathCNN C to predict sequences between 800 and 1200 bp. For sequences longer than 1200 bp, such as sequences in Group D, a scan window will move across the sequence without overlapping, and the weighted average of all windows’ predictions is calculated. The length of the window is set to 1200 bp (or less if the window ends beyond the sequence boundary). For example, given a sequence of length 2500 bp, the scan window will first cover the bases from the 1st to 1200th positions, then the window will move to bases from the 1201st to 2400th positions, and finally, the window will move to bases from the 2401st to 2500th positions. Then, PPR-Meta uses BiPathCNN C, BiPathCNN C and BiPathCNN A to predict the subsequences under the first, second and third windows, respectively. To generate the final score for the whole sequence, PPR-Meta calculates the weighted average of these windows. The weights of these three windows are 1200/2500, 1200/2500 and 100/2500, respectively.” (Please refer to Line 15, Page 12-Line 8, Page 13, Subsection “Structure of deep learning neural networks” in the revised manuscript.)
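The scan-window scheme described above can be sketched as follows. This is an illustrative Python sketch, not PPR-Meta's own code: the function names (`split_windows`, `pick_model`, `weighted_score`) are hypothetical, the per-window scores are supplied by the caller in place of real network outputs, and the assignment of lengths of exactly 400 or 800 bp to a network is an assumption, since the response does not state how boundaries are handled.

```python
WINDOW = 1200  # maximum window length in bp, as stated in the response


def split_windows(seq_len, window=WINDOW):
    """Return non-overlapping (start, end) windows, 0-based and half-open."""
    return [(s, min(s + window, seq_len)) for s in range(0, seq_len, window)]


def pick_model(length):
    """Choose a network by subsequence length.

    Boundary handling (<= 400 -> A, <= 800 -> B) is an assumption.
    """
    if length <= 400:
        return "A"
    if length <= 800:
        return "B"
    return "C"


def weighted_score(seq_len, window_scores, window=WINDOW):
    """Length-weighted average of the per-window scores."""
    windows = split_windows(seq_len, window)
    if len(windows) != len(window_scores):
        raise ValueError("one score per window is required")
    return sum((end - start) / seq_len * score
               for (start, end), score in zip(windows, window_scores))
```

For the 2500 bp example in the response, `split_windows(2500)` gives three windows of 1200, 1200 and 100 bp, `pick_model` selects networks C, C and A, and the weights are 1200/2500, 1200/2500 and 100/2500.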
I would also like to see the comparison to other software regarding fragments longer than 10kb (which are easily achievable with current metagenomic sequencing techniques). We would like to thank Reviewer 1 for this concern about making comparisons for fragments longer than 10k bp. We have added comparisons on 15k and 30k bp fragments to the revised manuscript, and the performance of PPR-Meta is still the best. The results are shown in Additional File 3, Figure S1, and we have added the following sentences to the revised manuscript as: “In some cases, a few assembled sequences from high-abundance species may be much longer, so we also tested PPR-Meta and the related tools using 15k bp and 30k bp fragments (shown in Additional File 3, Figure S1). The results showed that the performance of PPR-Meta was still the best for these long sequences.” (Please refer to Line 2-6, Page 16, Subsection “Performance comparison” in the revised manuscript.)
Additionally, please test accuracy of each model on testing datasets from other models, e.g. model for group A on test datasets for groups B, C and D, model for group B on test datasets for groups A, C and D, etc. It is not likely that in real datasets sequences will be distributed such uniformly. It is also interesting how biPath-CNN performs on long sequences, as coding information should significantly increase its performance (like it was shown for shorter fragments). And, maybe, any of the single models is good enough to be used on fragments of all lengths; for me this possibility can not be excluded by looking at presented data. We understand the consideration of Reviewer 1 that using a single model may be good enough for fragments of all lengths. In fact, using different neural networks for sequences of different groups can not only help to improve the accuracy but also speed up the program. As shown in Additional File 3, Figure S2, we tested the accuracy as well as the running time of each neural network on the test datasets from all groups. The results showed that using a non-corresponding neural network from another group to predict sequences from a specific group would lead to a lower accuracy and longer running time (the reason will be described below). Indeed, the sequence lengths in a real dataset may not be distributed uniformly. In many cases, the distribution of the sequence lengths in real metagenomic data is more like a Poisson distribution, with most of the lengths around 0.5k-2k bp and a few of them much longer. Since most of the sequences are short, we considered that it is essential to construct different neural networks for short sequences. As we mentioned in the response to General Comment #3, we also tried to train BiPathCNN for long contigs but failed because it was very time consuming and had high hardware requirements. 
As an alternative, PPR-Meta uses a scan window for long sequences and predicts the subsequence in each window using the corresponding BiPathCNN. The evaluation on long sequences, such as sequences of 15k bp and 30k bp, has shown the effectiveness of this approach. Herein, we would like to explain why using a non-corresponding BiPathCNN would increase the running time. In fact, once the BiPathCNN is constructed, the input size of the neural network is fixed. For example, the BiPathCNN of Group C can handle sequences with a maximum length of 1200 bp. If the BiPathCNN of Group C is used to predict a sequence of 100 bp, the “base one-hot matrix” must be padded with a number of rows of [0,0,0,0], in which all bits are zero, to fit the input size of the neural network, and the same applies to the “codon one-hot matrix”. In general, padding with zeros does not significantly affect the accuracy but will add unnecessary calculations for the neural network, which will also increase the running time. Similarly, the BiPathCNN of Group A can handle sequences with a maximum length of 400 bp. If the BiPathCNN of Group A is used to predict a sequence of 1200 bp, a scan window must be used to split the sequence into 3 subsequences of 400 bp. Each subsequence will be predicted separately, and then an average score for the whole sequence will be calculated. Since the total number of sequences is increased, the running time will also be longer, although the accuracy will not be significantly reduced, as we showed in Figure S2. Overall, we consider that the usage of three BiPathCNNs together with the scan window for long sequences in the revised version of PPR-Meta may be a good choice for handling sequences of different lengths.
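The zero-padding of the base one-hot matrix described above can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions rather than the tool's implementation: the column order A, C, G, T, the treatment of ambiguous bases as all-zero rows, and the function names `one_hot` and `pad_to` are all hypothetical.

```python
BASES = "ACGT"  # column order is an assumption made for this sketch


def one_hot(seq):
    """Encode a DNA string as one 4-bit row per base.

    Characters outside A/C/G/T (e.g. N) become all-zero rows (assumption).
    """
    return [[1 if base == column else 0 for column in BASES]
            for base in seq.upper()]


def pad_to(matrix, target_len):
    """Append [0, 0, 0, 0] rows until the matrix has target_len rows."""
    if len(matrix) > target_len:
        raise ValueError("sequence is longer than the network input size")
    padding = [[0, 0, 0, 0] for _ in range(target_len - len(matrix))]
    return matrix + padding
```

A 100 bp sequence passed to a network with a fixed input of 1200 rows would therefore carry 1100 all-zero rows, which is the extra computation the response refers to.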
In the revised manuscript, we have added the following sentence as: “In addition, we tested the accuracy as well as the running time of each BiPathCNN on test datasets from different groups and found that using a non-corresponding BiPathCNN to predict sequences from specific groups would lead to a lower accuracy and longer running time (shown in Additional File 3, Figure S2).” (Please refer to Line 6-10, Page 16, Subsection “Performance comparison” in the revised manuscript.)
Page 15 line 12: I would remove the word "obviously". Please be less advertising and more informative. We are sorry for our inappropriate description. In the revised manuscript, this sentence has been removed because this sentence was part of the discussion of the performance of the FNN that we used in the original version of PPR-Meta, but we removed the FNN in the revised version.
page 16 lines 10-13: "Compared with other sequence representation methods that ignore the coding or non-coding region, such as method based on k-mer frequencies, PPR-Meta uses a more detailed method of describing a sequence and achieves a higher performance." Authors should explicitly note that it relates only to sequences shorter than 5(or 10, see my note above) kb. We thank Reviewer 1 for bringing to our attention this inappropriate description. However, as we mentioned in the response to General Comment #3, the revised version of the PPR-Meta tool employs BiPathCNN for all sequences. Thus, this description seems to be appropriate in the revised manuscript.
Page 22, lines 16-17: "PPR-Meta is designed with the option to adjust the default threshold of discriminant criteria" It should be described more precisely. Although in the Manual it is noted, that "In this way, sequences with the phage (or plasmid) score higher than the other two categories and the threshold will be regarded as phage (or plasmid)", it is not mentioned that sequences not exceeding threshold for phage or plasmid category will fall into the "chromosome" category, which may not be the best option, increasing False Negative Rate. I also lack more information on the accuracy of PPR-Meta run with different thresholds (Table S1). Can you include also AUC in the table? And compare with PlasFlow, using the same thresholds? We thank Reviewer 1 for this comment. We agree that assigning sequences that do not exceed the threshold for the phage or plasmid category to the chromosome category may increase the False Negative Rate. In the revised version of PPR-Meta, we followed the threshold usage of PlasFlow. Specifically, given a threshold set by the user, a sequence whose highest score is lower than the threshold will be labelled as “uncertain”. In this way, the outputs of PPR-Meta contain six categories: phage, uncertain phage, chromosome, uncertain chromosome, plasmid and uncertain plasmid. In Additional File 3, Figure S5, we evaluated the uncertain prediction rate, accuracy, AUC, TPR and FPR under different thresholds. The accuracy was defined as the ratio of the number of correctly predicted fragments to the total number of fragments, rather than being calculated on either the phages or plasmids separately. Thus, the accuracy can reflect the overall performance of PPR-Meta. The accuracy, AUC, TPR and FPR were calculated only on the certain predictions. In general, with a higher threshold, the accuracy, AUC, and TPR as well as the uncertain prediction rate will be higher, while the FPR will be lower.
We have added the following sentences to the revised manuscript as: “To meet users’ actual requirements, PPR-Meta is designed with the option to adjust the threshold to filter out the uncertain predictions so that the remaining predictions may be more reliable. Given a threshold, a sequence with a highest score lower than the threshold will be labelled as “uncertain”. In this way, the outputs of PPR-Meta contain six categories: phage, uncertain phage, chromosome, uncertain chromosome, plasmid and uncertain plasmid. We evaluated the uncertain prediction rate, accuracy, AUC, TPR and FPR under different thresholds, and the results are shown in Additional File 3, Figure S5. In general, with a higher threshold, the accuracy, AUC, and TPR as well as the uncertain prediction rate will be higher, while the FPR will be lower.” (Please refer to Line 11-21, Page 26, Subsection “Usage of PPR-Meta” in the revised manuscript.) Herein, we compared the AUC of plasmid identification between PPR-Meta and PlasFlow, both using 0.7 as the threshold (the default threshold in PlasFlow). In Group A (100-400 bp), the AUC of PPR-Meta increased from 83.05% to 90.61%, while 53.36% of sequences were labelled as uncertain; the AUC of PlasFlow increased from 56.30% to 77.05%, while 37.70% of sequences were labelled as uncertain. In Group B (400-800 bp), the AUC of PPR-Meta increased from 89.64% to 93.99%, while 32.71% of sequences were labelled as uncertain; the AUC of PlasFlow increased from 62.50% to 81.85%, while 39.34% of sequences were labelled as uncertain. In Group C (800-1200 bp), the AUC of PPR-Meta increased from 91.84% to 94.68%, while 23.40% of sequences were labelled as uncertain; the AUC of PlasFlow increased from 68.01% to 84.79%, while 38.21% of sequences were labelled as uncertain. 
In Group D (5k-10k bp), the AUC of PPR-Meta increased from 96.02% to 97.63%, while 25.11% of sequences were labelled as uncertain; the AUC of PlasFlow increased from 88.42% to 93.86%, while 25.46% of sequences were labelled as uncertain.
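The thresholding rule described in this response, which yields the six output categories, can be sketched in a few lines of Python. This is a hedged illustration only: the function name `classify`, the dictionary input, and the exact label strings are assumptions, and the default of 0.7 is simply the PlasFlow default used in the comparison above.

```python
def classify(scores, threshold=0.7):
    """Label a sequence from its three class scores.

    scores: dict with keys 'phage', 'chromosome' and 'plasmid'.
    The class with the highest score wins; if that score is below the
    threshold, the label is prefixed with 'uncertain'. Ties are broken
    by dictionary order, which is an arbitrary choice in this sketch.
    """
    label = max(scores, key=scores.get)
    if scores[label] < threshold:
        return "uncertain " + label
    return label
```

Raising the threshold moves more sequences into the three "uncertain" categories, which is consistent with the higher uncertain prediction rates reported above.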
This is only the suggestion, but all the data presenting performance of PPR-Meta in comparison to other tools can be also presented as graphs, what would allow for easy assessment of differences. We thank Reviewer 1 for this suggestion. In the main text, we presented the results as tables to make the information more precise. However, we quite agree that too many tables will make the manuscript difficult to read. Thus, we presented all of the supplementary results mentioned above as graphs in Additional File 3 to make them easy to assess. Correspondingly, the description of Additional File 3 in the “Additional file” section has been revised to: “Additional file 3: Figure S1 to Figure S5”. (Please refer to Line 19, Page 35.)
Specific Comments Regarding Software Usability 1. Using output file extension other than .csv throws an error: Error using writetable (line 124) Unrecognized file extension '.tsv'. Use the 'FileType' parameter to specify the file type Error in PPR-Meta(line 124) MATLAB:table:write:UnrecognizedFileExtension This should be better documented, and the user should be warned at the beginning of computation that using custom file extensions will prevent the output file from being written. We thank Reviewer 1 for testing our program and discovering the “bug” in PPR-Meta. We have addressed this “bug”, and PPR-Meta will now automatically check the extension. If the user does not use “.csv” as an extension, the program will directly add the “.csv” extension to the output file and give a warning. We have added the following sentences to the manual to remind users of this: “Note: the current version of PPR-Meta uses “comma-separated values (CSV)” as the format of the output file. Please use “.csv” as the extension of the output file. PPR-Meta will automatically add the “.csv” extension to the file name if the output file does not take “.csv” as its extension.” (Please refer to Section 5, Part Ⅰ in the manual.)
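The extension check described in this response could look roughly like the following Python sketch. PPR-Meta itself is a MATLAB program, so this is only an analogy in another language: the function name `ensure_csv` is hypothetical, and appending ".csv" to the full name (rather than replacing an existing extension) is an assumption based on the wording "add the “.csv” extension to the file name".

```python
import os
import warnings


def ensure_csv(path):
    """Return an output path that ends in '.csv', warning if it was changed.

    Appending instead of replacing the extension is an assumption.
    """
    _, ext = os.path.splitext(path)
    if ext.lower() == ".csv":
        return path
    warnings.warn(
        "Output is written as comma-separated values; "
        "'.csv' has been appended to the output file name."
    )
    return path + ".csv"
```

Performing this check before any prediction starts means the user sees the warning at the beginning of the run, as the reviewer requested, rather than an error after the computation has finished.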
To Reviewer #2:
General Comments: Authors present a tool which is able to perform multiclass prediction of phage, plasmid and chromosome sequences in metagenomic data. Using this deep learning approach along with its architecture and the weighted average of all windows is worthy of publication by itself. Really interesting. Tool, source code, virtual machine (along with a video explaining how to use it) and supporting datasets are all publicly available. Also, the discussion section is really interesting. Noting that the differences between chromosome and phage scores may reflect phage lifestyle may be a relevant finding. We first thank Reviewer 2 for the positive comments on our present work. We would especially like to thank Reviewer 2 for these comments and suggestions, which were certainly helpful for us to improve our work. For Reviewer 2’s following concerns on our work, we present our responses and the corresponding improvements or revisions as follows.
- I am somehow concerned by the biological support of trying to predict these three classes. My concern is mainly about the overlapping characteristics that plasmids and chromosomes usually present. And the same thing may be said about phages/prophages, chromosomes and plasmids. An example of this is explored by authors in results reported in page 20. As a suggestion, authors should state clearly that the tool is aimed to perform a three-class prediction. Perhaps adding sentences to the abstract or even to the title. We fully understand Reviewer 2’s concern about the biological support of trying to predict these three classes. Indeed, genome amelioration is widely observed in foreign DNA, so it is not strange that plasmids (or phages) and chromosomes usually present overlapping characteristics. In fact, the similarity of the nucleotide composition between plasmids (or phages) and chromosomes is often used to predict the host of a given plasmid or phage (see Galiez et al., WIsH: who is the host? Predicting prokaryotic hosts from metagenomic phage contigs. Bioinformatics, 2017; 33(19): 3113-3114; Suzuki et al., Predicting plasmid promiscuity based on genomic signature. J. Bacteriol., 2010; 192(22): 6045–6055.) However, the similarity of the nucleotide compositions between plasmids (or phages) and chromosomes does not contradict using the characteristics of the nucleotide composition to identify plasmids or phages because the plasmids or phages can still maintain some specific sequence patterns. For example, experiments conducted by Ren et al. showed that a virus and its host shared some similar k-mers, which could help predict the host of a given virus, while viruses would also share more similar k-mers with each other, which could help distinguish them from the hosts (see Ren et al., VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome, 2017; 5(1): 69.).
Thus, we consider that using the nucleotide composition to identify phages and plasmids is biologically feasible. To make it clearer to readers that PPR-Meta is intended to perform a three-class prediction, we have revised the sentence in Section “Abstract” as: “We present PPR-Meta, a three-class classifier that allows simultaneous identification of both phage and plasmid fragments from metagenomic assemblies.” (Please refer to Line 8-10, Page 2 in the revised manuscript.) We have also revised the sentence in Section “Introduction” as: “In this paper, we present the PPR-Meta (Phage and Plasmid Recognizer for Metagenomes), a three-class classifier for identifying metagenomic fragments as phages, plasmids or chromosomes based on the deep learning algorithm.” (Please refer to Line 19, Page 6 in the revised manuscript.) In addition, we have added a new subsection to provide an example application to show how PPR-Meta can be used to analyse metagenomic data. We employed PPR-Meta to identify phage and plasmid sequences on a series of metagenomic datasets of the human digestive tract, including the gut, throat and oral cavity, from the Human Microbiome Project (HMP). The finding is interesting and may be significant to the study of human health. We found that in positions closer to the outer end of the digestive tract, the percentages of phages and plasmids tended to be higher. For example, in the gut, the inner end of the digestive tract, the percentages of phages and plasmids were lower, while in the oral cavity, the outer end of digestive tract, the percentages of phages and plasmids were higher. Please refer to the new subsection “Phages and plasmids in the human digestive tract” for more details (Please refer to Line 18, Page 24 - Line 21, Page 25 in the revised manuscript.).
Minor comments:
1. Page 5, line 13: "However, research has shown that viral sequences are highly fragmented in the metagenome [19], which may prevent binning, thereby limiting the usage of MARVEL." This is an unfair statement, since its only supporting reference is a 2013 article, and it is safe to say that much has been done to improve metagenomic assemblers since then. Recent publications have reported the retrieval of complete and/or almost complete phage genomes using only assembly and binning approaches. I refer especially to the IMG/VR database (versions 1.0 and 2.0), a repository of thousands of viral sequences retrieved from metagenomic datasets from all around the world. There are many publications along these lines, such as: Paez-Espino, David, et al. "IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses." Nucleic acids research (2016): gkw1030. Paez-Espino, David, et al. "IMG/VR v. 2.0: an integrated data management and analysis system for cultivated and environmental viral genomes." Nucleic acids research (2018). Paez-Espino, David, et al. "Uncovering Earth's virome." Nature 536.7617 (2016): 425. Sangwan, Naseer, Fangfang Xia, and Jack A. Gilbert. "Recovering complete and draft population genomes from metagenome datasets." Microbiome 4.1 (2016): 8.
We agree with Reviewer 2 that this point should be described more appropriately. We have therefore revised these sentences to: “The tool MARVEL can assign metagenomic bins as phages or bacteria and demonstrates better performance than previous tools. In the other hand, in order to identify sequences from low-abundance phages, which may fall into binning, we also need tools that can directly judge each fragment.” (Please refer to Line 12-16, Page 5, Section “Introduction” in the revised manuscript.)
Also, we have removed the sentence “Compared with the other tools, VirFinder is more suitable for metagenomes” from Section “Introduction”.
Page 27, line 12 and other parts of the article: It is not clear whether the authors extracted prophages from chromosomes to train the algorithm with more phages or to remove noise from the chromosome datasets. Do the chromosome datasets still contain their prophages?
We realize that we did not provide a clear description of the dataset construction. In fact, we extracted the prophages from the chromosomes both to add more phages and to remove noise from the chromosome dataset. On the one hand, the number of phage genomes in the current RefSeq database is much smaller than that of bacterial chromosomes, while the abundance of viruses in real microbial communities is estimated to be much higher than that of bacteria. At the time we constructed PPR-Meta, we downloaded 10,090 complete prokaryote chromosomes, while only 2,279 complete phage genomes were collected. Thus, extracting prophages from prokaryote chromosomes is a reasonable way to expand the phage training set. On the other hand, some bacteria contain several prophages, which may account for up to 20% of the host chromosome. If all of these prophages were labelled as chromosomes when training PPR-Meta, the accuracy would be reduced, especially the sensitivity of phage identification. Thus, in the training set, all of the extracted prophages were added directly to the phage dataset, and the chromosome dataset did not contain prophages. However, all of these prophages were predicted by ProphET, a tool that extracts prophages from prokaryote chromosomes based on similarity search, and were not subjected to experimental verification, so they could not be used as a benchmark. Therefore, in the test set, we removed the prophages entirely, so that neither the phage dataset nor the chromosome dataset contains them, which we have emphasized in the manuscript. To test the prophage identification of PPR-Meta and the related tools, we additionally collected 267 manually annotated prophages (please refer to Line 2, Page 8, Subsection “Dataset construction”). These manually annotated prophages have been widely used as benchmarks for related computational software, such as Prophinder, Phage_Finder, PHAST, PHASTER and VirSorter. To make this clearer to readers, we have added the following sentence to the revised manuscript: “Moving prophages from a chromosome dataset to a phage dataset can help to both expand the phage dataset and remove noise from the chromosome dataset.” (Please refer to Line 20-22, Page 7, Subsection “Dataset construction” in the revised manuscript.)
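The dataset handling described in this response can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `split_prophages`, the 0-based half-open coordinate convention and the handling of overlapping predictions are assumptions made for illustration, and the intervals stand in for whatever coordinates a predictor such as ProphET reports.

```python
from typing import List, Tuple

# Hypothetical coordinate convention: 0-based, half-open [start, end).
Interval = Tuple[int, int]


def split_prophages(
    chromosome: str, prophages: List[Interval]
) -> Tuple[List[str], List[str]]:
    """Split a chromosome into predicted prophage segments and the rest.

    Returns (prophage_segments, chromosome_segments). Overlapping
    predictions are clipped so that no base is emitted twice.
    """
    phage_parts: List[str] = []
    chrom_parts: List[str] = []
    cursor = 0  # first base not yet assigned to either list
    for start, end in sorted(prophages):
        start = max(start, cursor)  # clip overlap with an earlier interval
        if start >= end:
            continue  # interval fully covered by an earlier one
        if cursor < start:
            chrom_parts.append(chromosome[cursor:start])
        phage_parts.append(chromosome[start:end])
        cursor = end
    if cursor < len(chromosome):
        chrom_parts.append(chromosome[cursor:])
    return phage_parts, chrom_parts
```

Under the scheme the response describes, a training set would add the first list to the phage data and the second to the chromosome data, while a test set would keep only the second list and discard the predicted prophages.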
A more practical question: For each query given by the user, will the tool automatically decide which model to use for prediction? Page 12 states how the authors proceeded for each query size, but my question regards the tool's behavior.
We are sorry that we omitted some details about the tool’s behaviour; we respond to Reviewer 2’s question as follows. First, we would like to describe some improvements that we have made in the revised version of the PPR-Meta tool, which are closely related to this comment. In the original version of PPR-Meta, we built four neural network models for sequences of different lengths. Among these neural networks, we used BiPathCNN, which contains a base path and a codon path, for models A, B and C, and we used a Fully Connected Neural Network (FNN), which takes k-mer frequencies as inputs, for model D. In the revised version of the PPR-Meta tool, we removed model D and kept models A, B and C. In practical applications, PPR-Meta uses model A to predict sequences between 100 and 400 bp, model B to predict sequences between 400 and 800 bp, and model C to predict sequences between 800 and 1200 bp. For sequences longer than 1200 bp, a scan window moves across the sequence without overlapping, and the weighted average of all windows’ predictions is calculated. The length of the window is set to 1200 bp (or less if the window extends beyond the sequence boundary). For example, given a sequence of length 2500 bp, the scan window will first cover the bases from the 1st to 1200th positions, then move to the bases from the 1201st to 2400th positions, and finally move to the bases from the 2401st to 2500th positions. PPR-Meta then uses model C, model C and model A to predict the subsequences under the first, second and third windows, respectively. To generate the final score for the whole sequence, PPR-Meta calculates the weighted average of these windows.
The weights of these three windows are 1200/2500, 1200/2500 and 100/2500, respectively. We made this change because we found that the revised version of PPR-Meta could achieve a higher performance on long sequences. For example, for sequences with a length of 30k bp, the AUCs of both the phage identification and plasmid identification are higher. In particular, the TPR of phages increases from 93.76% to 99.84%, and almost all phages were identified. Although most sequences in the current metagenomic data are short fragments, a few reads from high-abundance species can be assembled into long contigs containing tens of thousands of bases, and we think that the revised PPR-Meta can be better adapted to these species. Additionally, considering that the third-generation sequencing technology is becoming more and more widely used, we hope that PPR-Meta can also promote studies using long sequencing technology, even though PPR-Meta is designed primarily for the next-generation sequencing technology. In the revised manuscript, we have added comparisons between PPR-Meta and the related tools using 15k bp and 30k bp sequences, which are much longer than the sequences used for the comparisons in the main text. The results showed that PPR-Meta was still the best performing tool. (Please refer to Additional File 3, Figure S1.) In addition, we also used real metagenomic data of viromes generated from third-generation sequencing technology. The results showed that PPR-Meta could identify more sequences as phages compared with the related tools. (Please refer to Line 6-16, Page 24, in the “Evaluation in real metagenomic data” section.) We now answer the question in this comment. Actually, the PPR-Meta tool’s behaviour is the same as our above description about the practical applications of PPR-Meta. 
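The window-and-weighting procedure described in this response can be sketched in a few lines of Python. This is an illustrative reading of the description, not the PPR-Meta implementation: the names `pick_model`, `split_windows` and `score_sequence` are hypothetical, the models are represented as plain callables returning three class scores, and two details the response does not specify are assumptions here, namely which model handles a length of exactly 400 or 800 bp and how a trailing window shorter than 100 bp is treated.

```python
from typing import Callable, Dict, List, Sequence, Tuple

WINDOW = 1200  # window length in bp, as stated in the response

# Hypothetical model interface: sequence -> (phage, plasmid, chromosome) scores.
Model = Callable[[str], Sequence[float]]


def pick_model(length: int) -> str:
    """Choose model A, B or C from the fragment length.

    Assumption: boundary lengths (400, 800) go to the shorter-range model,
    and fragments shorter than 100 bp also fall back to model A.
    """
    if length <= 400:
        return "A"
    if length <= 800:
        return "B"
    return "C"


def split_windows(seq_len: int, window: int = WINDOW) -> List[Tuple[int, int]]:
    """Non-overlapping windows as 0-based, half-open (start, end) pairs."""
    return [(s, min(s + window, seq_len)) for s in range(0, seq_len, window)]


def score_sequence(seq: str, models: Dict[str, Model]) -> List[float]:
    """Length-weighted average of per-window class scores."""
    n = len(seq)
    total = [0.0, 0.0, 0.0]
    for start, end in split_windows(n):
        sub = seq[start:end]
        scores = models[pick_model(len(sub))](sub)
        weight = len(sub) / n  # e.g. 1200/2500, 1200/2500, 100/2500
        total = [t + weight * s for t, s in zip(total, scores)]
    return total
```

For a 2500 bp input, `split_windows(2500)` yields the windows (0, 1200), (1200, 2400) and (2400, 2500), which `pick_model` maps to models C, C and A, matching the example in the response.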
To make the PPR-Meta tool’s behaviour clearer, we have added the following sentences to the revised manuscript as: “In practical applications, PPR-Meta uses BiPathCNN A to predict sequences between 100 and 400 bp, BiPathCNN B to predict sequences between 400 and 800 bp, and BiPathCNN C to predict sequences between 800 and 1200 bp. For sequences longer than 1200 bp, such as sequences in Group D, a scan window will move across the sequence without overlapping, and the weighted average of all windows’ predictions is calculated. The length of the window is set to 1200 bp (or less if the window ends beyond the sequence boundary). For example, given a sequence of length 2500 bp, the scan window will first cover the bases from the 1st to 1200th positions, then the window will move to bases from the 1201st to 2400th positions, and finally, the window will move to bases from the 2401st to 2500th positions. Then, PPR-Meta uses BiPathCNN C, BiPathCNN C and BiPathCNN A to predict the subsequences under the first, second and third windows, respectively. To generate the final score for the whole sequence, PPR-Meta calculates the weighted average of these windows. The weights of these three windows are 1200/2500, 1200/2500 and 100/2500, respectively.” (Please refer to Line 15, Page 12-Line 8, Page 13, Subsection “Structure of deep learning neural networks” in the revised manuscript.) Because of the improvements we made in the revised PPR-Meta tool, as we mentioned at the beginning of this response, some of the results in the manuscript have also been updated. Herein, we would like to describe the updated content in our manuscript that is the result of these changes. None of the updated results mentioned below affect any conclusions that we have made in this manuscript. 
The revised version of the PPR-Meta tool differs slightly only on long sequences. Most of the test data we used in the manuscript are shorter than 5k bp, which is the dominant length range in current metagenomic sequences, and the revised PPR-Meta generates the same results for sequences shorter than 5k bp. Thus, the magnitude of all of the changes is small, except that the program has a longer running time for sequences longer than 5k bp, as shown in item (15) below. The changes in the manuscript include the following:
(1). The original Figure 2, which describes the structure of the FNN, was removed. The second paragraph from the last in Subsection “Structure of deep learning neural networks”, which describes the FNN, was also removed.
(2). In Subsection “Mathematical model of DNA sequences”, the sentence “Here, we use a more detailed approach to represent the short sequences in Group A, Group B and Group C.” has been revised to “Here, we use a more detailed approach to represent the DNA fragments.” (Please refer to Line 7-8, Page 9 in the revised manuscript.) Also, the last paragraph of this section, which described using k-mer to represent DNA fragments in Group D, was removed.
(3). In Subsection “Structure of deep learning neural networks”, the sentences “…we trained corresponding neural networks for each group. For Group A, B and C, we designed BiPathCNN to improve the performance (Figure 1).” have been revised to: “…we trained three neural networks for Group A, B and C. To improve the performance, we designed BiPathCNN (Figure 1), a novel neural network structure, to make reliable predictions.” (Please refer to Line 15-17, Page 10 in the revised manuscript.)
(4). In Figure 2 in the revised manuscript (Figure 3 in the original manuscript), the confusion matrix of Group D was updated. Also, in Subsection “Overall performance”, the phrase “shown in Figure 3” has been revised to “shown in Figure 2”. (Please refer to Line 15, Page 13 in the revised manuscript.)
(5). In Figure 4, the ROCs of Group D, which described the potential of using life_score and trans_score to classify the phage lifestyle and plasmid transmissibility, were updated. Also, the legend of Figure 4 has been revised to: “(a) Classify virulent phages and temperate phages using life_score. In order of sequence length, the AUC is 0.63, 0.69, 0.71 and 0.76. (b) Classify transmissible plasmid and non-transmissible plasmid using trans_score. In order of sequence length, the AUC is 0.58, 0.55, 0.60 and 0.62.”
(6). In Section “Methods”, Line 21-22, Page 33, the sentence “Considering the memory size, running time and accuracy, a total of 3,060,000 artificial contigs were generated to train PPR-Meta.” has been revised to “Considering the memory size, running time and accuracy, a total of 2,700,000 artificial contigs were generated to train PPR-Meta.” Also, the phrase “and 120,000 from Group D” has been removed from the sentence “The number of training contigs of each phage, chromosome and plasmid is 300,000 from Group A to C and 120,000 from Group D.” (Please refer to Line 22, Page 33-Line 2, Page 34 in the revised manuscript.)
(7). In Table 1, Table 3 and Table 4, the performance of PPR-Meta on Group D (the fifth row from the last of each table) was updated.
(8). In Table 5, the prophage recognition rate of PPR-Meta on Group D (the third row from the last) was updated.
(9). In Subsection “Performance comparison”, the sentence “The TPR of PPR-Meta was approximately 3%~13% higher than that of VirFinder and the FPR was approximately 6%~9% lower” has been revised to “The TPR of PPR-Meta was approximately 10% higher than that of VirFinder, and the FPR was approximately 5~10% lower.” (Refer to Line 13-15, Page 15 in the revised manuscript.)
(10). In Subsection “Performance comparison”, the sentences “For PPR-Meta, our FPR was much lower than that of cBar and PlasFlow. Although PPR-Meta achieved a slightly lower TPR than PlasFlow in Group A, our TPR remained highest in all other cases” have been revised to “For PPR-Meta, the TPR was comparable with that of PlasFlow, while the FPR was approximately 25~40% lower.” (Please refer to Line 1-2, Page 16 in the revised manuscript.)
(11). In Subsection “Evaluation in real metagenomic data”, the sentence “VirFinder and PPR-Meta were much better than VirSorter and identified 68.86% and 76.88% of the contigs, respectively, showing that PPR-Meta had the highest coverage of this data set” has been revised to “VirFinder and PPR-Meta were much better than VirSorter and identified 68.86% and 76.90% of the contigs, respectively, showing that PPR-Meta had the highest coverage of this data set.” (Refer to Line 21, Page 21 in the revised manuscript.)
(12). In Subsection “Evaluation in real metagenomic data”, the sentences “For PPR-Meta, total of 82.00% of the sequences were identified as MGEs, in which 49.16% were phages and 32.84% were plasmids. More than half of the sequences (64.72%) predicted as phages by PPR-Meta were also predicted as phages by VirFinder, and most of the sequences (74.72%) predicted as plasmids by PPR-Meta were also predicted by PlasFlow” have been revised
Source
© 2018 the Reviewer (CC BY 4.0).
Content of review 2, reviewed on April 01, 2019
In the revised version of their manuscript, the authors have adequately addressed all the points that I raised on the previous version. The manuscript is now clearer and easier to read. The change to the algorithm, mainly the removal of the FNN part, makes both the software and the idea clearer. Moving a scan window across longer sequences is also a great idea, and I see great possibilities in that approach for identifying prophages or chromosome-derived fragments in plasmid and phage genomes. The part describing the virome and plasmidome of the digestive tract is also an interesting addition, and that analysis could be expanded into a separate manuscript.
I still have some cosmetic comments regarding the manuscript:
- On page 33, line 15: this should be "Error Rate at Read Start".
- I think Figure 4 should be presented in better resolution.
- Please provide versions of software used, if possible (missing for most of the programs)
- Unfortunately, I can't agree with the statement: "Recently, many basecalling tools for the third-generation sequencing technology have been developed to help improve the accuracy over 99% [41], therefore the extremely high error rate on the raw data will not affect the usage of PPR-Meta" (page 19). The authors correctly refer to the work by Wick et al., recently published as a preprint (https://doi.org/10.1101/543439), but an accuracy of over 99% is not the raw read accuracy. It is the consensus accuracy, obtained when reads are first assembled into contigs and then polished using dedicated software. This sentence should be changed to something like the following (this is only a suggestion): Recently, many dedicated tools have been developed to help improve the consensus accuracy for the third-generation sequencing technology over 99% [41], therefore the extremely high error rate on the raw data should not affect the usage of PPR-Meta on assembled 3rd generation sequences.
- In the case of the sentence "PPR-Meta can also handle data from the third-generation sequencing technology, although it is designed primarily for the next-generation sequencing technology" (page 24): isn't 3rd generation sequencing also "next-generation"? The authors should consider changing "next-generation" to "2nd generation", or "3rd generation" to "single-molecule".
- The sentence "In the other hand, in order to identify sequences from low-abundance phages, which may fall into binning, we also need tools that can directly judge each fragment." (page 5) needs rephrasing.
Source
© 2019 the Reviewer (CC BY 4.0).
References
Fang, Z., Tan, J., Wu, S., Li, M., Xu, C., Xie, Z., Zhu, H. 2019. PPR-Meta: a tool for identifying phages and plasmids from metagenomic fragments using deep learning. GigaScience.