Content of review 1, reviewed on January 20, 2014

Major Compulsory Revisions

This article aims to inform users of the GO of ways in which changes to GO terms and annotations of proteins, both planned and "unplanned", might affect the conclusions they reach when they attempt to apply these terms and annotations to their own data. This is important and widely interesting.

In its current form, however, the article is not very helpful. It provides vague guidelines to help an outside user understand why an area of the GO might be a candidate for change, planned or otherwise: "The ontologies need to be refined constantly in order to keep up with the latest biological knowledge and to intersect appropriately with other ontologies. This may involve small-scale changes to update a definition or add parent or child terms, or it may be a more comprehensive project involving experts in the scientific community to assist a larger restructuring of specific parts of the ontologies." [ms. pp 4-5]. True enough, but how did those guidelines lead to a comprehensive restructuring and supplementation of GO terms related to heart and kidney development, and of the processes of apoptosis and cell cycle progression (rather than, say, lung and gonads, energy metabolism and motility)?

Is there any systematic strategy in place to identify areas of the GO in need of development and revision? What are the benchmarks used to ask in an objective way whether the revisions provide a better view of biology, and to test whether the revisions might have unintended off-target effects on other related domains of biology?

In particular, how would projections of GO annotations onto a new organism based on its predicted proteome differ now from the ones that would have been obtained before restructuring? How would an attempt to discover relationships among sets of genes found to be somatically mutated in tumors differ before and after? A user might be intrigued to note that after 54 tries the GOC is still struggling to find the right words to say what apoptosis is [ms. page 5; Figure 1], but how does this information help the user interpret recurrent mutations in caspase genes better?

Perhaps the point is that an attempt to account for all of biology will always be a work in progress, always needing revision as new data drive the development of new conceptual categories. In that case, perhaps what a user needs is some sort of benchmarking, such as model data sets that can be analyzed periodically and after any major change, planned or otherwise, to provide an empirical view of the impact of changes in GO structure and content on its view of a user's favorite gene family. An example of such benchmarking is in fact given in the article [page 8]. It would be helpful to know more concretely how the added kidney and heart terms changed protein annotation, and what the impact of that change is on, say, gene expression data sets generated for visceral tissues from mouse fetuses of various gestational ages. It would also be helpful to know whether this approach can be generalized both to cover more of the GO and to monitor effects of GO development on the results of GO-based data analyses in a continuing way.
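To make the benchmarking suggestion concrete, here is a minimal sketch of the kind of before-and-after comparison I have in mind. It is not taken from the manuscript: the function name, the gene names, and the term identifiers are all hypothetical placeholders, and a real benchmark would load two dated GO annotation releases rather than the toy dictionaries used here.

```python
from typing import Dict, Set

# Hypothetical representation: gene name -> set of term identifiers.
# In practice these would be parsed from two dated annotation releases.
Annotations = Dict[str, Set[str]]


def annotation_drift(before: Annotations, after: Annotations) -> Dict[str, float]:
    """Return the per-gene Jaccard similarity between two annotation snapshots.

    1.0 means the gene's term set is unchanged; 0.0 means no terms are shared.
    Genes present in only one snapshot are treated as having an empty term
    set in the other.
    """
    drift = {}
    for gene in sorted(set(before) | set(after)):
        old_terms = before.get(gene, set())
        new_terms = after.get(gene, set())
        union = old_terms | new_terms
        # Two empty sets are considered identical.
        drift[gene] = len(old_terms & new_terms) / len(union) if union else 1.0
    return drift


# Toy data only: placeholder genes and terms, not real GO content.
snapshot_before: Annotations = {
    "geneA": {"TERM_1", "TERM_2"},
    "geneB": {"TERM_3"},
}
snapshot_after: Annotations = {
    "geneA": {"TERM_1", "TERM_2", "TERM_4"},  # gained one term
    "geneB": {"TERM_3"},                      # unchanged
    "geneC": {"TERM_5"},                      # newly annotated
}

drift = annotation_drift(snapshot_before, snapshot_after)
for gene, score in drift.items():
    print(f"{gene}: {score:.2f}")
```

Run periodically on a fixed model gene set, a summary of this sort would show a user which of their genes of interest have been most affected by a restructuring, before they reinterpret any downstream enrichment results.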

Level of interest: An article of importance in its field

Quality of written English: Acceptable

Statistical review: No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests: I declare that I have no competing interests.

Source

    © 2014 the Reviewer (CC-BY 4.0).

Content of review 2, reviewed on February 20, 2014

In the end, I want a different paper than the one the authors have written.

As they say at the outset, GO is widely used to interpret data like results of gene expression surveys. What they don’t add, though it is manifestly true, is that many of these interpretations are uninformative or, sometimes, outright misleading because information from GO has been misapplied. It would thus be valuable to systematically identify features of GO that are hard to apply, describe ways to do it right, and show the importance of this extra thought and effort with before-and-after examples. A major feature of the paper is discussion of ways in which GO terms and annotation practices themselves are changing (improving), so here as well a systematic discussion of areas of change, illustrated concretely with examples to show how the changes in GO change (and improve) the interpretation of experimental data, would be of great value.

Instead we have anecdotes that illustrate some of the issues, and even these are often hedged with comments to the effect that while newer GO data are somehow better, analyses based on older data are still valid, without any standards being proposed to define “better” or “valid”.

At the same time, it’s crucial to emphasize that the problems of developing a high-quality, large data set, keeping it consistent and current, and ensuring that it is optimally applied by diverse users are universal. To take a local example, the Reactome pathway database represents transcription events and their modulation according to the availability of specific transcription factors. Even annotations that are five or ten years old are beginning to look a bit quaint, and a close examination of annotations will reveal that experts focused on different gene sets related to different biological processes have emphasized different aspects of their transcriptional regulation, and with different degrees of molecular detail. Here too a user who asks, “How does my favorite perturbation affect the network of events that make up reaction space?” will be hard put to get consistent answers, and traps for the unwary user are everywhere.

That said, I think these issues of consistency, accuracy, and completeness of representation, and the effects of inconsistencies on data analysis, need to be addressed head on. The paper by Huntley and colleagues is certainly a step in this direction; I wish it had done more.

Level of interest: An article whose findings are important to those with closely related research interests

Quality of written English: Acceptable

Statistical review: No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests: I declare that I have no competing interests.

Source

    © 2014 the Reviewer (CC-BY 4.0).

References

    Huntley, R. P., Sawford, T., Martin, M. J., O'Donovan, C. 2014. Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt. GigaScience.