Content of review 1, reviewed on June 18, 2014

Basic reporting

The method presented in this paper is of great interest for the field of comparative genomics, and even for bioinformatics in general. Today, generating an all-against-all BLAST comparison, a compulsory step for all large-scale orthology inference algorithms, is a real challenge because new genomes are released daily. The introduction describes this challenge clearly.

Consequently, I was very excited by the methodological innovations proposed by the authors and by the important computational improvements that their results demonstrate. Overall, I think this is a good manuscript, but some major points should be described in more detail before publication. I explain my concerns below.

Experimental design

My major issue is the lack of detail in the methods. As the manuscript currently stands, I would not be able to fully understand or reproduce the main steps of the new methodology, in particular the following points:

  1. Initial cluster building: How are the initial representative sequences chosen? Is the choice random? Are they sequences selected according to a minimum dissimilarity criterion? Is the same number of sequences taken per organism? No details are provided by the authors.

  2. Cluster extension: How exactly is cluster extension decided, and with which metrics?
    When and how is the end of a cluster's extension decided?

Such information should be added to the methods to make the cluster formation approach less obscure. Compared to the classical all-against-all BLAST, the main development of the new method seems to be this preliminary clustering step. Consequently, I was disappointed not to understand this step in detail. Similarly, when the new methodology is compared to kClust and UCLUST, the authors describe in detail the sequence similarity parameters used with these programs, but not those used for their own method.

Validity of the findings

It would be interesting to describe the content of the preliminary clusters. Do some clusters aggregate only very similar sequences while others contain more divergent proteins?

Concerning the false negative homologous pairs, some examples could be discussed to better highlight their origin. Are some of these low-scoring matches due to the nature of the sequence itself (short repetitive domains, low-complexity regions), or to the fact that the sequences are separated by a long evolutionary distance (basically, are there more false negatives between distant bacteria/fungi than between large divergent protein families in closely related organisms)? This is a complicated subject, and new results do not necessarily need to be produced for this point. But discussing the nature of the false negatives in the Discussion would be an interesting addition.

To complement the bacterial and fungal datasets, a third dataset showing the behaviour of the algorithm with hundreds of very distant organisms could be another addition.
The chosen datasets, fungal and bacterial proteomes, are nice examples, but no homology is generally found between large parts of their proteomes. Generally, 30% of their ORFs (and consequently proteins) are highly specific elements, not shared between genera (even between Saccharomyces strains, hundreds of ORFs are specific, see Stacia R. Engel and J. Michael Cherry, 2014; up to 30% between Candida and Saccharomyces, Edward L. Braun et al., 2000, Genome Res.). Consequently, if I understand it correctly, the sentence “This implies that half of homologous pairs have less than 30% sequence identity.” is obvious. What about studying a set of organisms that includes both this kind of proteome and the more conserved proteomes of animals (where few organism-specific sequences are observed)?

For instance, a mix of 100 fungal/plant/protist/animal proteomes (with balanced phyletic composition)? Indeed, the real problem today is the generation of an all-against-all BLAST for hundreds of proteomes, not 12.
The paper would be stronger if similar speed improvements were shown for such a dataset. I understand that generating such a dataset is a long process and could require weeks of calculations. Consequently, I ask the editor to consider this last point as a strong suggestion but not a required revision.

Comments for the author

I think that the results demonstrated in the manuscript are very promising. However, as the paper describes a new method based on a preliminary sequence clustering followed by an all-against-all comparison within each cluster, I strongly recommend a better description of this initial step, which seems to be the key to the large speed improvement.

Source

    © 2014 the Reviewer (CC BY 4.0).