Content of review 1, reviewed on December 20, 2020

Dear authors,

"Mantis: flexible and consensus-driven genome annotation" is a manuscript presenting novel software for computational functional annotation of proteins called Mantis. The software is presented as flexible, adaptable, reproducible, fast and scalable to metagenomes. The main features described in the manuscript are the use of a Depth-First Search algorithm to search for the best combination of hits from which to transfer annotations to a given protein query, the combination of unspecific and taxa-specific HMMs along with other HMM references (PFAM, KOFAM, TIGRFAM), and the consensus-driven integration of annotations from the different resources, which can be easily customized.

At first sight, it could be seen as a lighter, more easily customizable version of eggNOG-mapper. However, there are substantial differences which make Mantis a very interesting software and a good contribution to the repertoire of functional annotation software. After testing Mantis, I would say that the main advantages I have found are that it is fast and that it is rather easy to set up and customize.

The manuscript itself focuses on i) describing how Mantis works, ii) the advantages of the DFS algorithm, iii) the advantages of using taxa-specific HMMs, and iv) the comparison of different HMM DBs and of Mantis with eggNOG-mapper.

In general, in my opinion, the manuscript is successful in some aspects: i) showing the advantages of using the DFS algorithm, ii) showing the advantages of using TSHMMs, iii) showing Mantis as an easy-to-customize software that is flexible regarding the use of different databases. There are other aspects for which I personally would need more explanation: i) what the authors mean by adaptable, if it is different from flexible, ii) what the authors mean by reproducible (which in general I would say is considered implicit for most computational tools of this kind), iii) how several steps of the Mantis annotation algorithm actually work (especially those related to inter-HMM and consensus-driven integration), iv) how it actually scales from genomes to metagenomes, v) how it actually compares to other tools. Also, there are some other more technical aspects that I consider should be addressed before recommending the manuscript for publication.

Next, I will try to elaborate in more detail on my concerns with some of the points mentioned above.

I will start with some very minor points that the authors should review, but which should be completely solved before publication. First, the list of keywords should perhaps be reviewed: "function" as a keyword is too generic in my opinion, and "protein", "function" and "annotation" could be merged into "protein function annotation". Also, I could not find an explanation of what NLP means. Please review the other included keywords as well. The same goes for the list of abbreviations: for example, I could not find NLP there, and I found that BPO is also not included. Please review all abbreviations, both in the main text and in the list.

Please also review the text for potential errors. For example, "whilst sing" --> "whilst using", or "herein after" --> "hereinafter". I am not a native English speaker either, so my review in this sense may not be the best; a professional language review could be good to improve the manuscript. I leave this decision to the authors, the editor and maybe a native English speaker from among the reviewers.

Regarding supplemental material, it has been a bit confusing for me. There are links throughout the text, but there are also other supplemental files at the end of the manuscript. Are all of them valid? The nomenclature of sections, tables, figures, etc. from the different materials should be unified, and all supplemental material provided in a common way. This is what I found:

Throughout the text:
Supplemental Table 1 is a link to https://htmlpreview.github.io/?https://raw.githubusercontent.com/wiki/PedroMTQ/mantis/Resources/tab_e_value.html
Supplemental Table 2 is a link to https://htmlpreview.github.io/?https://raw.githubusercontent.com/wiki/PedroMTQ/mantis/Resources/tab_uniprot_algorithms.html
Supplemental Table 3 is a link to https://htmlpreview.github.io/?https://raw.githubusercontent.com/wiki/PedroMTQ/mantis/Resources/tab_genomes.html

At the end of the manuscript:
Supplements.pdf
    Defining a similarity threshold (Figure 1)
    Annotating metagenomes
    0.1 Benchmarking annotation efficiency (Figure 2, Table 1, Table 2)
    Supplementary benchmarking against other PFA tools (Table 3, Table 4)
    Supplemental tables (Table 5, Table 6, Table 7, Table 8)
    References
Supplements.xlsx
    Mantis vs emapper
    DFS vs BPO vs heuristic
    Comparisons with Prokka, DeepEC, emapper, RAST

Some comments about the abstract. What do the authors mean by "adaptable"? How is this different from "flexible"? Also, I would say that "reproducible" is implicit for computational tools, in general terms. Why are the authors stressing this in the abstract as a feature specifically of Mantis? Also, the authors say that "Mantis is fast, annotating an average genome in 25-40 minutes", which I would prefer to read as "an average bacterial genome", for instance.

And some comments about the Background section. I personally would remove the first 2 sentences, but this is completely personal preference. Also, I am not sure I agree with the definition of protein function annotation given both in the abstract and in the Background: "protein function annotation(PFA), which is the identification of regions of interest (domains) in a sequence and assignment of biological function(s) to these regions."

Later, the authors say that: "We reviewed the implementation of three widely used PFA tools [13, 24, 14] and observed that the processing of candidate annotations (i.e. sequences or HMM profiles which are highly similar to the query sequence) is done by capturing only the most significant candidate between the references ("best prediction only", herein after called BPO)". That is: Prokka, eggNOG-mapper and InterProScan 5. And: "This classic PFA approach works well for single-domain proteins, but multi-domain proteins may have multiple putative predictions [30, 31, 32], whose location in the sequence may or may not overlap.". The previous 2 sentences, although true to some extent, are in my opinion an oversimplification. For example, stating that eggNOG-mapper picks the most significant candidate between the references is true, but only part of the story: eggNOG-mapper picks the most significant homolog as a seed to retrieve orthologs from which annotations are transferred. Therefore, a protein can be annotated with multiple domains which may or may not overlap, which I believe is also true for Prokka and InterProScan 5. So, although I think that the authors have a valid point here, I believe they should focus on the advantages of identifying multiple references, rather than presenting this as a mandatory requirement for multi-domain annotation. Actually, the 3 "reviewed" tools are very different, and if they are mentioned specifically as reviewed, a more precise description of their approaches, weaknesses or features which could be improved would be expected.

Then, regarding consensus integration of annotations the authors say that "This approach addresses three very relevant issues with PFA [34, 35, 51, 52]: over-annotation (through the use of overlapping but independent sources, thus obtaining a more reliable final annotation); under-annotation (through the use of multiple reference sources, which implicitly leads to a wider search space); and elimination of redundancy (through the creation of a consensus-driven annotation)." I agree that this approach would deal with under-annotation and with elimination of redundancy, due to integration of annotations from different sources. However, I fail to understand how this approach helps with over-annotation.

I will next comment on the sections in which how Mantis works is described.

First, regarding reference data, the authors say that eggNOG OGs are from eggNOG 5. However, while the TSHMMs and annotations downloaded by Mantis correspond to eggNOG 5 data, the unspecific HMMs are from eggNOG 4.5. Checking some lines of code:

./source/MANTIS_DB.py: eggnog_downloads_page_hmm = 'http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/NOG.hmm.tar.gz'
./source/MANTIS_DB.py: eggnog_downloads_page_annot = 'http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/NOG.annotations.tsv.gz'
./source/MANTIS_DB.py: eggnog_downloads_page = 'http://eggnog5.embl.de/download/latest/per_tax_level/'+str(taxon_id)+'/'
./source/MANTIS_DB.py: url = 'http://eggnogdb.embl.de/download/emapperdb-5.0.0/eggnog.db.gz'

This is important, and it will impact both the use of the tool by users and the benchmarks presented in the paper. Note that eggNOG 4.5 and eggNOG 5 orthologous groups are not cross-linked, and therefore annotations for unspecific NOGs will be lacking in Mantis output. For example, I annotated a protein with Mantis using unspecific NOGs and obtained ENOG4111x9y. This is an OG from eggNOG 4.5, as expected, and I got only the description as annotation ("uncharacterized protein yaho"), when I should also be getting annotation identifiers, GO terms for example. The equivalent OG in eggNOG 5 is ENOG502DN1T though. This can be checked by comparing OGs from http://eggnog45.embl.de/ and http://eggnog5.embl.de/
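
As a quick illustration, the mismatch can even be flagged from the OG identifier prefix alone (a hypothetical Python helper I wrote based on the identifier patterns above, where eggNOG 4.5 unspecific OGs look like ENOG41... and eggNOG 5 OGs like ENOG50...):

    def eggnog_version_of_og(og_id):
        # Heuristic based on the identifier examples above:
        # eggNOG 4.5 NOGs look like ENOG4111x9y, eggNOG 5 like ENOG502DN1T
        if og_id.startswith('ENOG41'):
            return '4.5'
        if og_id.startswith('ENOG50'):
            return '5'
        return 'unknown'

    print(eggnog_version_of_og('ENOG4111x9y'))  # 4.5 -> metadata not cross-linked
    print(eggnog_version_of_og('ENOG502DN1T'))  # 5   -> metadata available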

Besides that, regarding "TSHMMs metadata was extracted from the eggNOG SQL database", I would ask the authors to specify that this is an eggNOG-mapper DB, not an eggNOG one, as can be seen in the link above.

Also, the TSHMM databases are huge as a whole. In my opinion, including in Mantis an option to download only the TSHMMs of a specific taxon would be great and almost mandatory for the "flexibility" that the authors have as a goal and claim in the manuscript (which in other aspects, like customization of DBs, is achieved, in my opinion).

The Mantis workflow is described as consisting of 6 steps: i) sample pre-processing, where samples are split into chunks for parallelization. Also, input data should be proteins. Are CDS queries (fasta in nucleotide/DNA format) also accepted? ii) HMM profile-based homology search against each reference dataset using HMMER (hmmsearch). It is a bottom-up hierarchical search for TSHMMs, which is great. iii) intra-HMM reference hits processing.

"HMMER outputs a domtblout ?le [23], where each line corresponds to a hit/match between the reference dataset and the unknown protein sequence. The e-value within the HMMER command limits the available solution space to be analyzed in the posterior processing steps." Which is the e-value used by Mantis then? The e-value from "the HMMER command" or the e-value from the "domtblout" file? Note that e-value in "domtblout" is the "e-value of the overall sequence/profile comparison (including all domains)", which outputs also other values, like "c-evalue" and "i-evalue". (http://eddylab.org/software/hmmer/Userguide.pdf, pp70-71, see also pp 34-35). Therefore, if using the "e-value", either all the hits from such domain should be included for intra-HMM integration, or preferably some hits should be discarded using the "i-evalue", for example. Also note that as the score used to compare combinations through DFS includes e-value, if several repeated domains exist in a sequence maybe the i-evalue should be included in the score instead of the overall e-value several times, which would inflate the combination in which several domains are part of the sequence.

"should the DFS algorithm running time exceed 60 seconds, Mantis employs the previously described "heuristic" algorithm [30], which scales linearly and outputs an unique combination of hits." How many queries in average could be affected by this? Is the user wanred somehow if some results are produced with the heuristic algorithm due to this DFS time limit?

Are "output_annotation.tsv" files a product of this 3rd step?

iv) metadata integration. Annotation descriptions and identifiers are added to the respective hits. I guess this yields the "integrated_annotation.tsv" output files.

v) inter-HMM reference hits processing. I have many doubts about this step. First, in the "Mantis" section it is said that "During inter-HMM hits processing the DFS algorithm is again used to generate all the combinations of hits for all HMM sources." However, in "Methods - Using multiple reference datasets" this is neither confirmed nor further explained. Also, if DFS is used here, how are the e-values from the different references used to compute the scores? E-values depend on database size, in contrast to bit-scores. Also, query coverage will be very different by the very nature of the HMM profiles of the different databases. For example, PFAM is a database of domains, whereas eggNOG contains profiles which usually span a large section of the protein sequence.
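
To illustrate why raw e-values are not directly comparable across references: HMMER computes an e-value as the per-comparison p-value multiplied by the number of comparisons, so the same match quality yields different e-values against differently sized databases (illustrative numbers only):

    # e-value = p-value * number of comparisons, so a fixed-quality match
    # scores differently against differently sized references.
    pvalue = 1e-9                    # same underlying match significance
    evalue_small_db = pvalue * 2e4   # e.g. a Pfam-sized profile database
    evalue_large_db = pvalue * 5e5   # e.g. a large taxon-level eggNOG database
    print(evalue_small_db, evalue_large_db)  # 2e-05 vs 0.0005, 25x apart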

Moreover, it is said later that "Since several groups of consensus annotations may be generated, we evaluate their quality and select the best one, considering the following: percentage of the sequence covered by the hits in the consensus, the significance of the hits (e-value) in the consensus, significance of the reference datasets (customizable), and the number of different reference datasets in the consensus." "Since some sources are more specific than others, the user may also customize the weight given to each source during consensus generation [73]." (Methods - Reference data and customization)

Does this have to do with the DFS algorithm for inter-HMM integration? Also, how is the weighting of different references used? In the GitHub documentation it is explained that it is a [0-1] value, but is it just a priority value or is it included in the DFS scoring formula somehow? I fail to find any explanation of how this weighting is used.
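
One plausible reading, which I could not confirm from the text (the formula below is purely my guess), would be to fold the per-reference weight directly into the hit score used by the DFS:

    import math

    def weighted_hit_score(hit, weights):
        # hit: {'source': 'Pfam-A', 'coverage': 0.8, 'evalue': 1e-30}
        # weights: user-defined [0-1] priority per reference dataset
        significance = -math.log10(hit['evalue'])
        return weights.get(hit['source'], 1.0) * hit['coverage'] * significance

If instead the weight is only a tie-breaking priority between otherwise equivalent consensus groups, the manuscript should say so explicitly.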

vi) consensus generation. From the Background section: "However, to our knowledge, there is no tool for the dynamic generation of a consensus from multiple protein annotations." "We implemented a two-fold approach to build a consensus annotation, first by checking for any intersecting annotation identifiers and second by evaluating how similar the free-text annotation descriptions are." From the "Mantis" section: "all the combinations of hits are expanded and intersected (if possible), the best consensus combination of hits is then selected for each query sequence."

So the tool is "checking" and/or "evaluating" intersections of identifiers and similarity of descriptions. But I fail to understand what is the desired or expected output after the check is done. According to the text, one would expect to obtain the intersection of the identifiers, which actually does not make sense at all, since many identifiers would be discarded just because they are not part of one or the other reference. The same question regarding descriptions: several descriptions are compared in a pairwise manner and those pairs with similarity are kept and those without any similar description are discarded?

Again, in "Methods - using multiple reference datasets": "For the integration of multiple reference datasets, a two-fold text mining approach was used: 1. Consensus between identifiers; and 2. Consensus between the free-text annotation description." "The consensus between identifiers is calculated by identifying intersections between the different sources. Identifiers within the free-text annotation descriptions are extracted and used here."

And also, "If no consensus between identifiers is found, then we proceed with a consensus calculation between annotation descriptions." so I guess that similarity of descriptions is only evaluated when no intersection is found among all the annotation identifiers? If this is true, then multiple identical descriptions would be kept just because a single domain is common to both reference sources? My sincere apologies, but I am missing some key point for sure here. I would ask the authors to explain in more detail how this works.

Also, if I understood correctly, there can be several descriptions from a single source, for example if the right combination of hits for a query is 2 different PFAM domains with different descriptions. How is this handled? Are all descriptions compared in a pairwise manner, both intra- and inter-reference?

Regarding the similarity of descriptions, the authors say that "Alongside Mantis, we developed a standalone tool [58] that allows the use of multiple reference datasets through the generation of a consensus annotation. As we have shown in the supplement "Defining a similarity threshold", this tool has high specificity, thus, in the context of Mantis, it allows for the correct identification of similar free-text annotation descriptions."

I guess this "high specificity" and how the threshold is defined should be part of the main manuscript in the Methods section? Even if this is left in the supplemental, it should be explained how this "high specificity" is achieved, and also how the definition of 0.8 as threshold is not arbitrary, and it is not as good as 0.6, 0.7 or 0.9. Also they say (in supplemental) that: "In the current scenario, sensitivity is unimportant for two reasons: 1. the reference datasets from Mantis may contain data from Swiss-Prot (TPs inflation) 2. two annotations may completely agree in identifiers but, while describing the same function, be lexically different (FNs inflation)" which I fail to understand. They define "False-positives (FP) = similarity score is above the threshold and identifiers disagree", which they say in the introduction that it is the goal of the consensus-description annotation. Thus, if you define these as FPs, you are contradicting the goal you define in the introduction. Also, "False-negatives (FN) = similarity score is below the threshold and identifiers agree", even when FNs are not going to be processed if those annotations with common identifiers are not processed for consensus-description, it genuinely gives an idea of the FN rate you get, and thus the sensitivity of the method, which is relevant to assess the potential impact of the consensus-description method to expand the annotations from results without common identifiers.

Besides the above comments, I tried to understand how steps v and vi work by looking at some examples.

Example 1: 5 descriptions from a query:
description:Response Regulator
description:Two-component systems (sub1role)
description:Signal transduction (mainrole)
description:Transcriptional regulator
description:two-component system, sensor histidine kinase ChiS

Shouldn't at least 2 of the descriptions be merged?

Example 2: 7 descriptions from a query:
description:DNA metabolism (mainrole)
description:helicase
description:Restriction endonuclease
description:Type ISP C-terminal specificity domain
description:Restriction/modification (sub1role)
description:type iii restriction protein res subunit
description:Type III restriction enzyme, res subunit

Example 3: full annotation of a query:
query_ID NOGG_merged;Pfam-A Plug;ENOG410XQM1 2 6 | pfam:PF07715 description:tonB-dependent Receptor Plug description:TonB-dependent Receptor Plug Domain

Example 4: comparing hits with and without tax info (-od)

without -od

NP_416050.4 kofam_merged K15268 1 4 | cog:COG0697 kegg_ko:K15268 tcdb:2.A.7.3.2 description:O-acetylserine/cysteine efflux transporter

with -od "Escherichia coli" NP_416050.4 kofam_merged;NOGT561_merged K15268;3XP05 2 4 | bigg_reaction:iECIAI39_1322.ECIAI39_1\835 cog:COG0697 go:0000101 go:0003333 go:0003674 go:0005215 go:0005575 go:0005623 go\:0005886 go:0006810 go:0006811 go:0006812 go:0006820 go:0006865 go:0008150 go:0015562 \ go:0015711 go:0015804 go:0015849 go:0016020 go:0016021 go:0022857 go:0031224 go:0032973 \ go:0033228 go:0034220 go:0042883 go:0044425 go:0044464 go:0046942 go:0051179 go:005123\4 go:0055085 go:0071702 go:0071705 go:0071944 go:0072348 go:0098655 go:0098656 go:014\0115 go:1903712 go:1903825 go:1905039 kegg_brite:ko00000 kegg_brite:ko02000 kegg_ko:K15268 tcd\b:2.A.7.3.2 description:O-acetylserine/cysteine efflux transporter description:May be an export pump for cysteine and other\ metabolites of the cysteine pathway (such as N-acetyl-L-serine (S) and O- acetyl-L-serine (OAS)), and for other amino acids \and their metabolites

In this case, the annotation with -od seems richer. Besides that, there is some redundancy in identifiers (K15268) and possibly in descriptions also (cysteine transporter, export for cysteine).

Example 5: comparing hits with and without tax info (-od) (II)

without -od

NP_417950.1 kofam_merged;NOGG_merged COG0306;K16322 2 3 | enzyme_ec:3.4.14.10 enzyme_ec:5.4.2.12 bigg_reaction:iECABU_c1320.ECABU_c33930 bigg_reaction:iECO111_1330.ECO111_4301 bigg_reaction:iNJ661.Rv0545c bigg_reaction:iNJ661.Rv2281 bigg_reaction:iPC815.YPO3967 cog:COG0306 … GO terms … KEGG brite terms … kegg_ko:K01280 kegg_ko:K03306 kegg_ko:K03569 kegg_ko:K04043 kegg_ko:K14640 kegg_ko:K15633 kegg_ko:K16322 kegg_ko:K16331 … KEGG modules, pathways, ... tcdb:2.A.20.1 description:inorganic phosphate transmembrane transporter activity description:low-affinity inorganic phosphate transporter

with -od "Escherichia coli"

NP_417950.1 kofam_merged;NOGT561_merged K16322;3XMW3 2 3 | bigg_reaction:iECO111_1330.ECO111_4301 cog:COG0306 … GO terms ... kegg_brite:ko00000 kegg_brite:ko02000 kegg_ko:K16322 tcdb:2.A.20.1 description:low-affinity inorganic phosphate transporter description:Low-affinity inorganic phosphate transport. Can also transport arsenate

In this case, the annotation without -od seems richer. Again, the consensus over KEGG identifiers and descriptions seems somewhat limited.

Example 6: inter-HMM integration

from "output_annotation.tsv"

YP_025301.1 Pfam-A HOK_GEF PF01848.17 5.93440414507772e-22 49 8 43 2 42
YP_025301.1 kofam_merged K18921 - 1.3407357512953368e-23 49 4 46 42 89
YP_025301.1 NOGG_merged NOG.ENOG410ZT31.meta_raw - 2.1979274611398962e-27 49 4 46 56 102

Here all 3 hits overlap.

From "integrated_annotation.tsv"

Query HMM_file HMM_hit HMM_hit_ac evalue l qs qe ss se
YP_025301.1 Pfam-A HOK_GEF PF01848.17 5.93440414507772e-22 49 8 43 2 42 | pfam:PF01848 description:Hok/gef family
YP_025301.1 kofam_merged K18921 - 1.3407357512953368e-23 49 4 46 42 89 | kegg_ko:K18921 description:protein HokB
YP_025301.1 NOGG_merged ENOG410ZT31 - 2.1979274611398962e-27 49 4 46 56 102 | description:Hok/gef family

The final "consensus_annotation.tsv"

YP_025301.1 Pfam-A;NOGG_merged ENOG410ZT31;HOK_GEF 2 3 | pfam:PF01848 description:Hok/gef family

Why is the Pfam hit included here if it overlaps with the NOGG result? And why not the kofam hit? The descriptions of PFAM and NOGG match, and the e-value is better for NOGG. I would expect either PFAM or NOGG to be discarded, and, perhaps, kofam to be included due to its slightly different description.

Example 7: the same as above, but with -od "Escherichia coli"

From "integrated_annotation.tsv":

YP_025301.1 Pfam-A HOK_GEF PF01848.17 5.93440414507772e-22 49 8 43 2 42 | pfam:PF01848 description:Hok/gef family
YP_025301.1 kofam_merged K18921 - 1.3407357512953368e-23 49 4 46 42 89 | kegg_ko:K18921 description:protein HokB
YP_025301.1 NOGT561_merged 3XRAS - 1.4945906735751294e-15 49 6 46 7 49 | description:Hok/gef family

Final "consensus_annotation.tsv"

YP_025301.1 kofam_merged K18921 1 3 | kegg_ko:K18921 description:protein HokB

In this case only the kofam hit is kept (due to being the one with the best e-value?).

Example 8: From "output_annotations.tsv"

1000565.METUNv1_03812 tigrfam_merged TIGR00092 TIGR00092 1.1e-161 363 20 344 1 368
1000565.METUNv1_03812 kofam_merged K19788 - 2.7e-151 363 20 344 79 446
1000565.METUNv1_03812 Pfam-A MMR_HSR1 PF01926.24 3.5e-26 363 13 158 2 87
1000565.METUNv1_03812 Pfam-A YchF-GTPase_C PF06071.14 3.7e-42 363 284 357 1 84
1000565.METUNv1_03812 NOGG_merged NOG.COG0012.meta_raw - 4.7e-144 363 20 344 273 573

From "consensus_annotation.tsv"

1000565.METUNv1_03812 kofam_merged;tigrfam_merged;NOGG_merged COG0012;TIGR00092;K19788

The PFAM domains are discarded, whereas the kofam and NOGG results are included despite overlapping with TIGRFAM, which has the best e-value.

Example 9:

From "output_annotation.tsv"

362663.ECP_0061 kofam_merged K02336 - 0.0 783 41 743 96 876
362663.ECP_0061 Pfam-A DNA_pol_B PF00136.22 5e-47 783 385 744 22 414
362663.ECP_0061 NOGG_merged NOG.COG0417.clustalo_raw - 2.4e-127 783 70 742 872 1449
362663.ECP_0061 Pfam-A DNA_pol_B_exo1 PF03104.20 4.6e-10 783 197 290 241 337
362663.ECP_0061 tigrfam_merged TIGR00592 TIGR00592 8.4e-49 783 217 736 597 1141

From "consensus_annotation.tsv"

362663.ECP_0061 kofam_merged;NOGG_merged K02336;COG0417

In this case TIGRFAM and PFAM are discarded, whereas NOGG is kept despite overlapping with the kofam hit and having a worse e-value.

Overall, I have problems understanding steps v and vi: inter-HMM integration and consensus generation. In my opinion, the explanations of how the algorithm works should be improved and made more accurate and exhaustive, at least in the Methods section. It might be necessary to adjust some of the procedures/parameters to obtain more comprehensive and less redundant annotations, which should be one of the major advantages of Mantis according to the authors.

Now, I will comment on the benchmarks presented in the manuscript.

In Methods - Accessibility and Scaling: "Mantis is also scalable to metagenomes. For more details on performance see Annotating metagenomes in supplements." This is also said in the Abstract/Conclusions and at the end of the Background section.

I think that if the authors want to state that Mantis is scalable to metagenomes, they should provide data supporting this in the main text. According to the results in the supplements, it is indeed scalable, in my opinion. I ran a test with a sample of 100k queries and got rather slower results, although the analysis still finished in a decent amount of time (120946 secs → 33.60 hours).

Methods - Establishing a test environment: "True-Positives (TP) are evaluated via two main methods: i) identifiers match and ii) description match"

Each query can have multiple annotation items (identifiers and descriptions). Is each item that matches the reference a new TP? Or is a TP the annotation of the query as a whole? This should be further explained.

The same goes for FPs: "Annotations which do not match with the reference annotation are classified as False-Positives (FP)."

Is it the whole annotation, or can each of the items (identifiers and descriptions) be a FP or TP independently of the others?
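
To make the ambiguity concrete, these are the two readings I see (both are hypothetical interpretations of the Methods text, with made-up identifiers):

    def per_item_counts(predicted, reference):
        # Reading A: every annotation item is scored independently
        tp = len(predicted & reference)
        fp = len(predicted - reference)
        return tp, fp

    def per_query_match(predicted, reference):
        # Reading B: the query annotation as a whole is one TP
        # if any item intersects the reference annotation
        return bool(predicted & reference)

    pred = {'kegg_ko:K15268', 'go:0015562', 'tcdb:2.A.7.3.2'}
    ref = {'kegg_ko:K15268'}
    print(per_item_counts(pred, ref))  # (1, 2)
    print(per_query_match(pred, ref))  # True -- very different precision values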

How are FPs, TPs, etc. computed when benchmarking SwissProt proteins? What annotation identifiers are used from SwissProt? Or is only the description used?

"Whenever Mantis outputs a valid annotation but the reference is either of poor quality or non-existent, then this annotation is considered a potentially new annotation." (Methods - Establishing a test environment)

I think this sentence is a bit obscure. What is a "valid annotation" from Mantis, in comparison with a reference which is of "poor quality or non-existent", and what does the consideration of a "potentially new annotation" mean for FPs and TPs? Is it considered a TP, a FP, or neither? Shouldn't references of poor quality, or non-existent ones, just be removed from the benchmark?

"Annotation coverage is defined here as the number of Mantis' annotations divided by the total amount of protein sequences in a sample." (Methods - Establishing a test environment)

Is this the same as the percentage of queries which have received a (consensus) annotation?

"As seen in supplemental Table 1, being more permissive (by using a higher e-value threshold e.g. 1e - 3) resulted in a higher annotation coverage and a higher precision." (Analysis - Initial quality control - Function assignment e-value threshold)

It is surprising that a higher e-value threshold leads to higher precision: "This is due to Mantis' internal quality control in the form of the DFS hit-processing algorithm. Being too strict with the e-value threshold constricts the available solution space, resulting in a lower precision." Expanding the solution space would indeed yield a higher TP count, which could lead to higher precision if the FP count does not increase as much as the TP count. However, in supplements.xlsx it is shown (e-value tab) that the absolute number of FPs is reduced when using a higher e-value, which is much more surprising. Although this could be true, the authors should check whether this result is an artifact of the definitions and methods used to compute TPs/FPs, and discard that possibility. If the authors already did so, my apologies; maybe they could also further explain why this happens.

Initial Quality Control: "When comparing the hit processing algorithms we found that the DFS algorithm consistently outperformed the other algorithms, with an average precision 0.038 and 0.013 higher than the BPO and heuristic algorithms respectively."

Is this difference statistically significant?

"for homology search, Mantis uses HMMER [23] and eggNOG-mapper uses Diamond [22]." (Methods - Establishing a test environment)

This is not entirely true, since eggNOG-mapper can use HMMER and/or Diamond, depending on the eggNOG-mapper version being used. I would rather prefer "and for eggNOG-mapper we used the Diamond-based search".

"Mantis ran with the following command: python mantis run_mantis -t sample.faa -od "NCBI ID" eggNOG-mapper was executed with the following command: python emapper.py -i sample.faa -o output_folder -m diamond" (Methods - Establishing a test environment)

According to this, the comparison is performed with Mantis using -od but eggNOG-mapper without a taxonomy parameter. For the results to be comparable, both Mantis and eggNOG-mapper should be run with taxonomy (-od for Mantis, --tax_scope for eggNOG-mapper) and without it.

Also regarding the comparison of tools, I don't understand why comparisons with other tools have not been performed, or why those already done (included in the supplementary xlsx) are not included in the main text. Also, running Mantis through the CAFA benchmark would be highly recommended (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1835-8)

Finally, in my opinion precision (TP / (TP + FP)) is only part of the story, and besides "Annotation coverage", other metrics should be provided, especially some considering FNs (sensitivity), which is the other part of the story, I guess.

Lastly, just 2 comments about the GitHub. I would be grateful if the website would point out that NOGT and NOGG are from e...

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold or are you currently applying for any patents relating to the content of the manuscript?
Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
Do you have any other financial competing interests?
Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I am member of the main development team of eggNOG-mapper, which could be considered direct competition for Mantis, and which is compared to it in the manuscript.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews: (https://drive.google.com/file/d/1oo0jbHdBVyz-8plERbYNKDP4Quf_eWrA/view?usp=sharing)

Source

    © 2020 the Reviewer (CC BY 4.0).

Content of review 2, reviewed on April 22, 2021

Dear authors,

First of all, I would like to thank you for improving the manuscript, for providing the means to facilitate our tasks as reviewers, including the reorganization of the supplementary material, and for your patience in answering my questions and doubts.

I like the new additions, especially Figures 2 and 7. I believe that the Mantis algorithm can be understood better now, especially the inter-HMM integration and consensus steps, which I was struggling to understand in practice. I am also thankful for the many explanations regarding the use of HMMER e-values, the notes about the impact of the DFS time limit, and the description of the output files generated by Mantis. I hope all this will be of help to readers as well. In this regard, I think that the authors would improve the manuscript by reorganizing the Methods subsections related to sample selection and benchmarking, but just to facilitate readers interested in understanding better these specific aspects of the manuscript.

I also believe that the statements about other PFA tools are fairer now. Also, thank you for updating the eggNOG general HMMs to use eggNOG 5, and for making the references to eggNOG and eggNOG-mapper resources more accurate. Thank you also for changing the sentence about eggNOG-mapper homology search. I just didn't want the reader to think that eggNOG-mapper only uses Diamond, because nowadays that is not true, although the authors are completely right that at the moment they asked we were only giving support for Diamond search (for version 2). Note that hmm_mapper.py does not perform annotation, but emapper.py does; it is now again able to use the HMMER mode, after downloading and building the corresponding eggNOG 5 databases, of course. We still recommend using Diamond in most cases, though. I guess changing directions is a normal thing during support and the development cycle of any tool. I am surprised about the tax_scope though, since I thought it had been present since version 1? Since you very kindly did the tax scope analysis I won't even check. I just think that your decision improves the comparison and the taxa-based annotation discussion.

I also think that including Prokka increases the interest of the results for the audience, due to the broad use of this tool for PFA (especially in prokaryotes). It is a shame that other tools and the CAFA benchmark did not make it, but I thank the authors for their explanations of why they were left out.

I also thank you for acknowledging the proposed changes to the tool, which were made with the sole aim of enhancing Mantis' flexibility for users. I just regret that you didn't agree with allowing CDS as input, since it would be very useful for end users. Note that translating a CDS does not need any third-party tool, just a translation table (or Biopython, which could be considered another third-party package, though one of broad use in Python bioinformatics). I respect the authors' decision though.

Regarding reproducibility, I still disagree with the notion of "reproducible" used in the paper (e.g. "a versatile and reproducible tool" (lines 204-205); how can a tool be reproducible?). Actually, the paper by Mangul is about "requirements to promote installability and long-term archival stability of software tools", which is not a feature of Mantis but is related to the requirements that have been asked for by the editor (as you mention in the Discussion, with better criteria than in the introduction, in my opinion). Also, Snakemake (or Galaxy, or other tools) may facilitate reproducibility, as a conda environment or even git versioning could do, for example. I am not saying reproducibility is not an important topic. I just don't like the use of the term in the manuscript. But I of course respect the authors' point of view.

Regarding over-annotation, I still have doubts about whether you are really showing in the paper that this is true, or to what extent. In your benchmarks, you are measuring whether there is an intersection of a single ID with the references, not whether the absent IDs are TNs or FNs due to the consensus you carry out. As such, how can you be sure that "over-annotation is minimized" and that "these three annotations are more likely to be valid" (line 186), and not just that these 3 are valid while the other 2 are also good (e.g. ID3 and ID4 in Figure 7)? Would avoiding over-annotation only be true when you are correctly dealing with FPs being classified as TNs? I agree that "This is clearly a problem" and that consensus generation could be of help, but I am not sure to what extent Mantis "addresses" this "very relevant" issue (lines 176-177). Maybe some results regarding precision with and without consensus, or something similar, could help. Could you provide some evidence about this among the results or in the discussion?

Also, after better understanding the use of free-text descriptions ("Similar functional descriptions, unless exactly the same, are kept regardless of their text similarity (e.g. "glucose degradation" and "degrades glucose")", in your answers), I disagree that "Redundancy is eliminated by removing duplicate database IDs and/or extremely similar descriptions" (lines 187-189). I think it would be more accurate to say something like "Redundancy, which is a drawback inherent to consensus-driven annotation, is ameliorated by removing duplicate database IDs and/or identical descriptions."

Regarding the e-value, I would really appreciate the authors making clear the adjustment depending on sample size, since this has an impact on users' results and reproducibility. As you describe it in your answers, is the e-value going to depend on the dynamically generated chunks of the input sequences? (In this sense, does the e-value represent the odds of finding false positives in the HMM references, which are the actual databases where the query is searched?) Also, would the same query used as input alongside different queries (for example, a file with 10 sequences and a file with 1000 sequences, using different CPUs, or a different distribution in a cluster) lead to different e-values? Maybe this could be clarified in the "Multiple hits per protein" section, unless I am mistaken, in which case I would thank the authors very much for further clarification.

Minor things:

Abstract: What does "high-result ion" mean? Should it be "high-resolution"?

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold or are you currently applying for any patents relating to the content of the manuscript?
Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
Do you have any other financial competing interests?
Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews: We would like to thank the editor and reviewers for the new round of revision, we are overjoyed that our revised manuscript successfully addressed many of the concerns previously raised. We have organized this document in the same fashion, where each comment is marked as “RXCY”, X being the reviewer number and Y the comment number. R0 corresponds to the editor, R1 to the reviewer Karen Ross and R2 to the reviewer Carlos Cantalapiedra. In response to the reviewer comments, we have updated some figures and slightly modified the manuscript. All modifications are described in detail in this document.

R0

R0C1: the e-value: its precise meaning and how users handle this with their data Response: We thank the editor for this important comment. We have addressed this in the reviewer comments R1C2 and R2C6 and modified the manuscript accordingly.

R0C2: impact of the tool for redundancy removal and over-annotation avoidance. Response: We have edited the manuscript to address the comment R2C4.

R0C3: more details on the benchmarks Response: We have addressed the concerns that were raised in the comments R2C1 and R2C4.

R1

R1C,3,4,9:
Abstract: meaning of "high-result ion" data is unclear
Analysis: "(iii) how each reference data source's contribution to the final output"--change to "how each reference data source contributed to the final output"
Analysis: "(iv) impact of the consensus generation on annotations quality."--change "annotations" to "annotation"
In the "Using multiple reference datasources" section, should the 1 in "e.g., if a hit comes from Pfam, that has a weight of 1, and another from eggNOG, that has a weight of 0.8, HMMW would equal to 0.9+0.8 = 0.85)" be "0.9"?
Response: Thank you for these suggestions, we have amended the manuscript accordingly. Specifically, regarding the HMM weight, the reviewer is correct: since in our example Pfam has a weight of 1 and eggNOG of 0.8, the average should be 0.9 (i.e., changed to "(1+0.8)/2 = 0.9").

R1C2: Introduction: "depending on how strict the confidence threshold is, it may also increase the false-positives (over-annotation, due to a high confidence threshold) or false-negatives (under-annotation, due to a low confidence threshold)"--Is this backwards? It seems that a high confidence threshold would accept only the most confident predictions and therefore lead to false negatives, and vice versa. Response: We thank the reviewer for this question. Since the confidence threshold used throughout the manuscript corresponds to an e-value threshold, a higher threshold (e.g. 1e-3) would indeed produce more false-positives, whereas a lower threshold (e.g. 1e-6) would produce more false-negatives. This is of course due to the inverse nature of the e-value, i.e., the closer to 0 the better. However, we agree that this sentence (lines 72-79) might be somewhat confusing, and have therefore changed it to: The selection of reference HMMs is also critical, as PFA will ultimately be based on the available reference data. Whilst using unspecific HMMs to annotate a taxonomically classified sample may result in a fair amount of true-positives (correct annotations), depending on the confidence threshold used, it may also increase the rate of false-positives (over-annotation, due to a less strict confidence threshold) or false-negatives (under-annotation, due to a more strict confidence threshold).

R1C5: 5. Figure 3: The difference between heuristic and DFS is very small and only visible with the 2020 sample, so the statement in the legend "The DFS algorithm outperforms the other algorithms." seems like a bit of an overstatement. Response: We agree with the reviewer, therefore we have changed the figure legend to: “Overall, the DFS and heuristic algorithms achieve similar results, outperforming the BPO algorithm.”

R1C6: 6. Figure 4: Sometimes the dot for one of the three methods is not visible (e.g., DFS dot in Cryptococcus without taxonomy). I assume this is because it is hidden by one of the other dots; maybe you could offset one of the dots above or below the other so that it is visible? Response: Thank you for the suggestion, we added a slight offset to the dark grey and yellow circles.

R1C7: 7. Figure 5: Should the scale bar for the F1 score be spanning the radius of the circle? Currently, it does not. Response: We have rechecked the image, and, unless we misunderstand what the reviewer means, the F1 score scale seems to span the radius. Please see the image below, where we highlight that particular section of the image (we drew a square whose upper left vertex is fixed on the circle's center and whose bottom right vertex is fixed on the F1 score scale number 1).

R1C8: 8. Computational efficiency: Why were pseudo-random sequences used for the efficiency test? Response: Since the reference datasets likely contain UniProt sequences (which were used for the computational efficiency test) we used pseudo-random sequences in order to add a degree of randomness to the homology search. However, we do agree this might not be necessary, therefore we have now removed it from the manuscript (edited lines 437-445).

R2

R2C1: I think that the authors would improve the manuscript by reorganizing the Methods subsections related to sample selection and benchmarking, but just to facilitate readers interested in understanding better these specific aspects of the manuscript. Response: We thank the reviewer for the suggestion, we have moved the section “Sample selection” to before the section “Establishing a test environment” (previously it was after).

R2C2: I just regret that you didn't agree with allowing CDS as input, since it would be very useful for end users. Note that translating a CDS does not need any third-party tool, just a translation table (or Biopython, which could be considered another third-party package, though one of broad use in Python bioinformatics). I respect the authors' decision though. Response: We thank the reviewer for this suggestion. In the previous revision we misunderstood the reviewer's suggestion. We are currently implementing other enhancements, but we have added this to the to-do list for a next version.

R2C3: Regarding reproducibility, I still disagree with the notion of "reproducible" used in the paper (e.g. "a versatile and reproducible tool" (lines 204-205); how can a tool be reproducible?). Actually, the paper by Mangul is about "requirements to promote installability and long-term archival stability of software tools", which is not a feature of Mantis but is related to the requirements that have been asked for by the editor (as you mention in the Discussion, with better criteria than in the introduction, in my opinion). Also, Snakemake (or Galaxy, or other tools) may facilitate reproducibility, as a conda environment or even git versioning could do, for example. I am not saying reproducibility is not an important topic. I just don't like the use of the term in the manuscript. But I of course respect the authors' point of view. Response: We agree with the reviewer that there is an ongoing discussion about whether a tool can be reproducible and we respect the reviewer's point of view, but will stay with the current wording.

R2C4: Regarding over-annotation, I still have doubts about whether you are really showing in the paper that this is true, or to what extent. In your benchmarks, you are measuring whether there is an intersection of a single ID with the references, not whether the absent IDs are TNs or FNs due to the consensus you carry out. As such, how can you be sure that "over-annotation is minimized" and that "these three annotations are more likely to be valid" (line 186), and not just that these 3 are valid while the other 2 are also good (e.g. ID3 and ID4 in Figure 7)? Would avoiding over-annotation only be true when you are correctly dealing with FPs being classified as TNs? I agree that "This is clearly a problem" and that consensus generation could be of help, but I am not sure to what extent Mantis "addresses" this "very relevant" issue (lines 176-177). Maybe some results regarding precision with and without consensus, or something similar, could help. Could you provide some evidence about this among the results or in the discussion? Response: We thank the reviewer for this remark. We do provide results with and without consensus in the section "Impact of consensus generation" (second paragraph) and later provide a brief discussion on the benefit of using a consensus generation method; however, this does not directly address the reviewer's concern. Our approach to dealing with over-annotation is similar to a voting system, where the majority is more likely to be correct. As we have previously acknowledged, a benchmark for this cannot be easily implemented (for example, due to the heterogeneity of the reference data and reference annotations), therefore we have instead chosen to rephrase the sentence that mentions over-annotation: We have attempted to avoid over-annotation through the generation of a consensus-driven annotation, which identifies and merges annotations that are consistent (i.e., similar function) with each other (e.g., if three out of five independent sources point towards the same function and two others point towards other, unrelated functions, then these three annotations are more likely to be valid), and eliminating the remaining inconsistent annotations.

We have also moved the paragraph explaining how we address under-annotation, over-annotation, and redundancy to the discussion (now in lines 464-479).

R2C5: Also, after better understanding the use of free-text descriptions ("Similar functional descriptions, unless exactly the same, are kept regardless of their text similarity (e.g. "glucose degradation" and "degrades glucose")", in your answers), I disagree that "Redundancy is eliminated by removing duplicate database IDs and/or extremely similar descriptions" (lines 187-189). I think it would be more accurate to say something like "Redundancy, which is a drawback inherent to consensus-driven annotation, is ameliorated by removing duplicate database IDs and/or identical descriptions." Response: Thank you for the suggestion, we agree with it, therefore the sentence was replaced (lines 469-471).

R2C6: Regarding the e-value, I would really appreciate the authors making clear the adjustment depending on sample size, since this has an impact on users' results and reproducibility. As you describe it in your answers, is the e-value going to depend on the dynamically generated chunks of the input sequences? (In this sense, does the e-value represent the odds of finding false positives in the HMM references, which are the actual databases where the query is searched?) Also, would the same query used as input alongside different queries (for example, a file with 10 sequences and a file with 1000 sequences, using different CPUs, or a different distribution in a cluster) lead to different e-values? Maybe this could be clarified in the "Multiple hits per protein" section, unless I am mistaken, in which case I would thank the authors very much for further clarification. Response: Yes, the raw e-value will depend on the dynamically generated chunk size of the input sequences; however, during processing, it is scaled to the original sample size. As suggested by the reviewer, we added this sentence to the section "Accessibility and Scaling" (since it is here that we refer to sample splitting): To note that Mantis uses HMMER's hmmsearch for homology search, which outputs an e-value scaled to the sample/chunk size. Since Mantis splits the samples into chunks, during hit processing, the e-value is scaled to the original sample size.
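
In code form, the described correction amounts to the following (a minimal sketch, not the exact implementation; setting hmmsearch's -Z parameter to the full sample size would be the equivalent built-in mechanism):

    def scale_evalue(chunk_evalue, chunk_size, sample_size):
        # hmmsearch e-values are relative to the number of target sequences
        # it searched (the chunk); rescaling to the full sample is linear.
        return chunk_evalue * (sample_size / chunk_size)

    print(scale_evalue(1e-6, chunk_size=1000, sample_size=100000))  # 1e-4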

R2C7: Abstract: What does "high-result ion" mean? Should it be "high-resolution"? Response: Thank you for this note, we have corrected it.

Source

    © 2021 the Reviewer (CC BY 4.0).

References

    Queirós, P., Delogu, F., Hickl, O., May, P., Wilmes, P. Mantis: flexible and consensus-driven genome annotation. GigaScience.