Content of review 1, reviewed on February 06, 2015

Major Compulsory Revisions

  1. The main conclusion of this manuscript is that cross-validation is a misleading over-estimate of generalization performance for gene function prediction. This result is not very novel, but it could still be useful to see it reported in a new context; my primary concern is rather that the authors do not provide any real insight into why cross-validation is misleading. The closest they come is the claim that the effect is due to “the dynamic nature of the label distribution”, but they did not measure “dynamics”. As it stands this is a purely observational/descriptive work. The authors have to address this with new analyses.

The problem of inflated cross-validation performance relative to novel predictions was the topic of work from my group (Gillis and Pavlidis, PLoS Comp. Bio., 2012). A cross-validation result can be highly influenced by a type of outlier association we called critical edges. We concluded, “cross-validation performance will be a useless measure of the quality of new predictions unless it is first shown that … performance is not due to a single edge”. Most likely this is at play here. Kahanda et al.’s contribution would be strengthened by incorporating the direct test for critical edges. I’m not going to demand they do this exact test (I don’t want to seem that self-serving), but the authors have to do something to explain their results.
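
To make this concrete, the sketch below shows one way such a check could be run; it is not the authors' code, and the inputs (a weighted networkx graph over proteins and a set of proteins annotated with one GO term) are hypothetical placeholders. The idea is to re-evaluate a simple neighbor-voting score for a term after deleting each single edge that touches an annotated protein, and to flag terms whose apparent performance collapses when one edge is removed.

    # Minimal sketch of a direct critical-edge check (hypothetical inputs).
    # graph: weighted networkx graph over proteins; positives: set of proteins
    # annotated with one GO term.
    from sklearn.metrics import roc_auc_score

    def term_auroc(graph, positives):
        # Neighbor voting: each protein is scored by the weighted fraction of
        # its neighbors annotated with the term; its own label is never used.
        y_true, y_score = [], []
        for node in graph.nodes():
            w_all = sum(graph[node][nbr].get("weight", 1.0)
                        for nbr in graph.neighbors(node))
            w_pos = sum(graph[node][nbr].get("weight", 1.0)
                        for nbr in graph.neighbors(node) if nbr in positives)
            y_true.append(1 if node in positives else 0)
            y_score.append(w_pos / w_all if w_all > 0 else 0.0)
        return roc_auc_score(y_true, y_score)

    def critical_edge_drops(graph, positives):
        # Change in AUROC caused by removing each single edge incident to a
        # positive; a large drop for one edge flags a critical edge.
        base = term_auroc(graph, positives)
        drops = {}
        for u, v, data in list(graph.edges(data=True)):
            if u in positives or v in positives:
                graph.remove_edge(u, v)
                drops[(u, v)] = base - term_auroc(graph, positives)
                graph.add_edge(u, v, **data)
        return base, drops

The same loop could be run with the authors' own classifiers in place of the neighbor-voting score; the point is only that the single-edge effect is measured per GO term.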

  2. Similarly, Kahanda et al. further claim that predicting new functions for previously annotated proteins (NA) is harder than predicting functions for completely unannotated proteins (NP). They don’t provide any confidence intervals for the performance measures in the figures (please add them), but even taking the claim at face value, it is just an observation begging for an explanation. The distinction between NA and NP needs to be interpreted carefully; the authors should discuss this and ideally directly assess the reason for the differences.
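
On the request for confidence intervals: one standard option, assuming per-protein labels and prediction scores are available (the names below are hypothetical), is a bootstrap over target proteins, for example:

    # Minimal bootstrap-over-proteins sketch for a confidence interval on a
    # performance measure (AUROC here); y_true/y_score are hypothetical arrays
    # of per-protein labels and prediction scores for one GO term.
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        stats = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_true), len(y_true))  # resample proteins
            if len(np.unique(y_true[idx])) < 2:
                continue  # skip degenerate resamples containing a single class
            stats.append(roc_auc_score(y_true[idx], y_score[idx]))
        low, high = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        return low, high

Any of the protein-centric or term-centric measures reported in the manuscript could be substituted for AUROC here.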

Minor Essential Revisions

  1. The authors suggest that leveraging known annotations to predict new ones might help (page 6). Many function prediction methods already incorporate this idea by building “guilt” from annotation similarity (starting with King et al., Bioinformatics 2003). The authors should acknowledge the literature on this topic, or, if they have something else in mind, clarify what that is.

  2. Kahanda et al. use the term “GBA” to refer to the simplest possible GBA algorithm, but this is not the terminology we used in the paper of ours that they cite. As the term is more commonly used, all of the methods evaluated by Kahanda et al. are GBA methods. The neighbor-voting method is what we call Basic GBA (BGBA) in our work.
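
For clarity, here is roughly what I mean by basic neighbor voting, in matrix form and with illustrative names (A is a symmetric association matrix with zero diagonal; labels is a binary vector for one GO term); this is a sketch, not the notation of either paper:

    # Basic GBA (neighbor voting) sketch: each protein's score for a term is
    # the weighted fraction of its network neighbors carrying that term.
    import numpy as np

    def neighbor_voting(A, labels):
        A = np.asarray(A, dtype=float)
        labels = np.asarray(labels, dtype=float)
        degree = A.sum(axis=1)
        votes = A @ labels              # weighted votes from annotated neighbors
        scores = np.zeros_like(votes)
        nz = degree > 0
        scores[nz] = votes[nz] / degree[nz]
        return scores

There is no training step here; in cross-validation the held-out proteins are simply given zero entries in labels before the votes are computed.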

Level of interest: An article of limited interest
Quality of written English: Acceptable
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests.

Authors' response to reviews: (http://www.gigasciencejournal.com/imedia/2030968554168896_comment.pdf)

Source

    © 2015 the Reviewer (CC BY 4.0 - source).

Content of review 2, reviewed on May 05, 2015

I thank the authors for their responses and revisions, which are helpful and improve the manuscript. The expanded discussion of the difference between NA and NP is interesting. The explanations for the main result are consistent with previous observations. I still find the explanations to be shallow; it’s a “close look”, but not as close as it could (or should) be. If the authors do not want to dig deeper into these issues experimentally, they could at the very least discuss them, and (at a bare minimum) make all their predictions, data sets, and code publicly available so that others can explore these issues themselves.

Specifically, while the authors have described the phenomenon in more detail, they still haven’t explained exactly why it happens: why are the ML methods misled? They simply state that “machine learning methods rely on the assumption…” of consistent label distributions, but how does this come about? For example, how is the BGBA method influenced by the label distribution, given that it has no training step? What they really seem to mean is that “evaluation of machine learning methods relies on the assumption…”. But there is a likely explanation in the algorithms themselves, and in our hands the answer has to do with term prevalence and critical edges.

Thus there are two possible factors at play here, not mutually exclusive, that affect “actual” (generalization) performance versus CV performance and that could have been explored. I consider the corresponding experiments “discretionary” (in the review guidelines terminology); the paper is simply less impactful without the explicit tests. It would, however, be “compulsory” to at least bring some of these points into the discussion, assuming the authors acknowledge their relevance.

  1. The “label distribution” problem might be a version of the term prevalence issue that we have previously documented, in which always predicting common terms is a seemingly good (but actually useless) strategy. I didn’t think of this in my initial review, but it surfaced once the authors made their findings clearer in the revision (especially Figure S8). If the ML method is biased toward predicting common terms, then common terms will be over-predicted, and this can be quantified. For CAFA1 we saw hints of this effect at play (see PubMed 23630983), but we did not have access to the raw data required to test it directly. Figure S8 is compatible with this explanation, but the question of what is actually going on remains open.

We showed (PubMed 21364756) that correlations between term prevalence and node degree in the association (network-like) data explain much of this, and demonstrated it directly with simulations in which networks are created only from node degree information (see the section discussing “IPN”). The prediction for the current context is that the functions predictable from node degree alone are the ones that give high CV performance, show large discrepancies, and perform poorly outside of CV. A direct demonstration would get at the root of the matter. A discrepancy measure that retains the sign of the difference between p^tr and p^test might be informative as well. It also strikes me that the authors do not report a per-GO-term (or per-protein) measure of performance discrepancy for CV vs. NA/NP, focusing instead on comparisons of means.
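
To be concrete, the analysis I have in mind could look something like the sketch below; all input names are hypothetical stand-ins for quantities the authors already have. It scores each GO term by how well node degree alone ranks the annotated proteins, computes the signed discrepancy p^tr - p^test per term, and asks whether the degree-predictable terms are also the ones with the largest CV-versus-novel performance gap.

    # Sketch with hypothetical inputs (not the authors' data structures).
    #   degree:       (n,) array, network degree (or summed edge weight) per protein
    #   train_labels: dict, GO term -> binary (n,) annotation vector at training time
    #   test_labels:  dict, GO term -> binary (n,) vector over the new (NA/NP) annotations
    #   cv_perf:      dict, GO term -> the authors' cross-validation performance
    #   novel_perf:   dict, GO term -> performance on the novel annotations
    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import roc_auc_score

    def degree_bias_analysis(degree, train_labels, test_labels, cv_perf, novel_perf):
        terms = sorted(train_labels)
        # How well does ranking proteins by degree alone "predict" each term?
        deg_auc = np.array([roc_auc_score(train_labels[t], degree) for t in terms])
        # Signed discrepancy: positive means the term became rarer in the new data.
        disc = np.array([train_labels[t].mean() - test_labels[t].mean() for t in terms])
        # Per-term generalization gap between CV and novel-annotation performance.
        gap = np.array([cv_perf[t] - novel_perf[t] for t in terms])
        return {"degree_auc_vs_gap": spearmanr(deg_auc, gap),
                "degree_auc_vs_signed_discrepancy": spearmanr(deg_auc, disc)}

The same gap and discrepancy vectors could also be compared against simple term prevalence in the training annotations, which is essentially what the naïve predictor discussed below exploits.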

A somewhat strained but relevant example of prevalence bias masquerading as label distribution change is the post hoc removal of the GO term “protein binding” from the evaluation of CAFA1 (and then from CAFA2). The term was highly prevalent in the existing annotations, and this turned out to be the case for the actual CAFA1 targets as well: the label distribution was favorable for that term. CAFA contributors who predicted “protein binding” often did well in the assessment but in the end were knocked down because the label distribution changed: “protein binding” was removed from the assessment (see also PubMed 23630983 for our discussion).

I realize that the authors removed “protein binding” from the experiments presented here, and I say the example is strained because protein binding was not used at all in the assessment (predictions were not counted as false positives). But hypothetically, the organizers of CAFA could have enforced that the distribution of terms among the targets match the distribution in the past corpus more closely; would this have helped anybody in reality? The “naïve predictor” of CAFA, which simply predicts terms based on their prevalence, would again have been among the top performers (and it was, before “protein binding” was removed), and everybody would hopefully agree that such a situation would be useless for real-life function prediction. In light of all this, it seems misleading (or irrelevant) to say that the performance drop is due to “evolution of GO curation” (page 5). A stationary distribution of GO annotation would just hide the underlying problem.

As the authors seem to acknowledge at the end, the real problem is not the evaluation, nor the “evolution of GO”; it is the algorithms (the combination of the data and the ML method), which simply do not work well enough when put to a real-life task (sequence-similarity methods are the only ones that have been shown to generalize well enough for practical use; see PubMed 23630983 for discussion; it might also have been informative to see each data type analyzed separately, as that might explain why performance on NP was relatively acceptable). The authors propose using past annotations as an explicit feature in making predictions, which is perhaps reasonable if one is trying to tidy up GO annotations but probably useless for predicting genuinely new functions (and obviously useless for the NP task). It is hard to see how it does anything more than make the predictor even more dependent on past practice in the face of an “evolving” GO, making things worse. We had proposed such a method as a baseline in CAFA2 (Sedeno-Cortes et al., abstract presented at AFP 2014), but we treated it strictly as a “straw man” to help explain the apparent performance of other methods that might use such information. Again, see PubMed 23630983 for discussion of “post-dictions” in the context of CAFA1.

  2. The authors feel that critical edges have little role, but I don’t think they gave the issue enough consideration. The behavior on average is of little interest; what matters is that critical edges cause misestimation of generalization for any given task. The observation that removing single edges can increase as well as decrease performance is already present in our paper (Figure 4A of ref 21), which shows that a network with better CV performance than the entire network can be created by choosing the right edges (and thereby excluding other edges that decrease performance in the full network). In other words, we showed that CV performance is enormously misleading even when there are critical “negative” edges. The prediction for the current work is that the GO term predictions most affected by critical edges are among those that show poor generalization. Couldn’t this be readily tested from the data the authors have, if they have already identified the critical edges?
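
If per-term critical-edge effects have indeed been computed, the test is short; a sketch with hypothetical names is given below, where max_edge_effect would hold, for each GO term, the largest absolute single-edge change in that term’s CV performance.

    # Sketch (hypothetical names): are the GO terms most affected by single
    # edges also the ones that generalize worst?
    #   max_edge_effect:     dict, GO term -> largest absolute single-edge change in CV performance
    #   cv_perf, novel_perf: dict, GO term -> CV and novel-annotation performance
    import numpy as np
    from scipy.stats import spearmanr

    def critical_edges_vs_generalization(max_edge_effect, cv_perf, novel_perf):
        terms = sorted(max_edge_effect)
        effect = np.array([max_edge_effect[t] for t in terms])
        gap = np.array([cv_perf[t] - novel_perf[t] for t in terms])
        # A positive correlation would support the prediction above.
        return spearmanr(effect, gap)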

I hope the above makes it clearer how the authors, or someone else, could look into this further. It is possible that the prevalence bias described in point 1 above is enough to explain much of the effect, but without looking at the impact of critical edges on the GO terms being evaluated, the critical-edge explanation should not be dismissed. As we have described for term-centric evaluations, CV metrics that weigh sensitivity and specificity similarly (e.g., ROC curves) will be most misled by the prevalence bias, while evaluations that focus on specificity (e.g., precision-recall curves) will be most misled by critical edges.
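
A practical corollary is that both metric families could be reported per term, so readers can see which evaluation is being misled and by what; a minimal sketch, assuming per-protein labels and scores for a single term are available (hypothetical names):

    # Sketch: report both a sensitivity/specificity-style metric (AUROC) and a
    # precision-focused one (area under the precision-recall curve) per GO term.
    # y_true/y_score are hypothetical per-protein labels and scores for one term.
    from sklearn.metrics import roc_auc_score, average_precision_score

    def term_metrics(y_true, y_score):
        prevalence = float(sum(y_true)) / len(y_true)  # AUPRC of a random ranking
        return {"auroc": roc_auc_score(y_true, y_score),
                "auprc": average_precision_score(y_true, y_score),
                "prevalence": prevalence}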

Level of interest: An article whose findings are important to those with closely related research interests
Quality of written English: Acceptable
Statistical review: No, the manuscript does not need to be seen by a statistician.
Declaration of competing interests: I declare that I have no competing interests.

Authors' response to reviewers: (http://www.gigasciencejournal.com/imedia/2030968554168896_comment.pdf)

Source

    © 2015 the Reviewer (CC BY 4.0 - source).

References

    Kahanda, I., Funk, C. S., Ullah, F., Verspoor, K. M., Ben-Hur, A. 2015. A close look at protein function prediction evaluation protocols. GigaScience.