Content of review 1, reviewed on July 10, 2014

Basic reporting

The manuscript is well written. The problem addressed is important in this field, and the basic idea is interesting.

I have only one minor comment:
In the figure legends, each figure should be given a title (a one-sentence description).

Experimental design

I have two comments regarding the scope of the research question considered here:

  1. The purpose of this work is somewhat unclear. I think one of the fundamental problems in all-against-all comparison is its quadratic time scaling (illustrated in the first sketch after these comments). However, although the authors repeatedly mention this point, and their method has some potential to address it, the problem of quadratic scaling is not clearly solved in this paper. Therefore, if this is the fundamental question of the paper, I must say that the experimental design is not adequate.

  2. It seems that their method is suitable for the rigorous Smith-Waterman algorithm but may not fit BLAST-like methods, in which the database must be indexed before searching: because the target database changes repeatedly in their method, it would require repeated re-indexing and reduce efficiency (illustrated in the second sketch after these comments). Given that BLAST is one of the most commonly used tools for this kind of analysis, the authors should discuss the applicability of their method to BLAST and, if it is not applicable, compare their Smith-Waterman-based method against a simple all-against-all BLAST search.
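
To make the scaling concern in comment 1 concrete, here is a minimal back-of-the-envelope sketch (my own illustration, not code from the manuscript): an all-against-all comparison of n sequences requires n(n-1)/2 pairwise alignments, so doubling the database roughly quadruples the work, and a constant-factor speedup alone does not change that.

    # Minimal illustration of quadratic scaling in all-against-all comparison
    # (reviewer's own sketch; the sequence counts are arbitrary examples).

    def pairwise_comparisons(n: int) -> int:
        """Number of pairwise alignments needed to compare n sequences all-against-all."""
        return n * (n - 1) // 2

    for n in (10_000, 20_000, 40_000):
        print(f"{n:>6} sequences -> {pairwise_comparisons(n):,} pairwise alignments")
    # Each doubling of n roughly quadruples the number of alignments.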
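
To illustrate the indexing overhead raised in comment 2: with NCBI BLAST+ (used here purely for illustration; the manuscript may target a different implementation), every change of the target set forces a fresh makeblastdb run before blastp can be called. A rough sketch follows; the file names are hypothetical.

    # Rough sketch of the re-indexing overhead when the target database keeps changing.
    # Assumes the NCBI BLAST+ tools (makeblastdb, blastp) are on PATH; file names are hypothetical.
    import subprocess

    def blast_round(query_fasta: str, target_fasta: str, out_tsv: str) -> None:
        db_prefix = target_fasta + "_db"
        # The target set differs in every round, so the index must be rebuilt each time;
        # a standard all-against-all BLAST search pays this indexing cost only once.
        subprocess.run(["makeblastdb", "-in", target_fasta, "-dbtype", "prot",
                        "-out", db_prefix], check=True)
        subprocess.run(["blastp", "-query", query_fasta, "-db", db_prefix,
                        "-outfmt", "6", "-out", out_tsv], check=True)

    for i, (q, t) in enumerate([("round0_query.fa", "round0_targets.fa"),
                                ("round1_query.fa", "round1_targets.fa")]):
        blast_round(q, t, f"round{i}_hits.tsv")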

One additional minor comment:
The definition of "Reduction in time" in Figs. 3 and 6 is somewhat unclear to me. I take "80% reduction" to be equivalent to "a relative computational time of 0.2", but the latter is more intuitive to me because it converts directly into the statement "5 times faster than the original" (a short worked example follows).
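
As a concrete example of the conversion I have in mind (my own arithmetic, not taken from the manuscript):

    # Converting "reduction in time" to relative time and speedup (reviewer's illustration).
    reduction = 0.80                  # "80% reduction in time"
    relative_time = 1.0 - reduction   # fraction of the original runtime -> 0.2
    speedup = 1.0 / relative_time     # how many times faster -> 5.0
    print(f"relative time = {relative_time:.1f}, i.e. {speedup:.0f}x faster than the original")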

Validity of the findings

I have two significant comments.

  1. The difference in the fraction of missing homologs among methods (Figs. 4 and 7) is clearly demonstrated, but its downstream effect is not. Since all-against-all searches are used as input to clustering methods (such as OMA), the method should also be evaluated on whether the clustering obtained from the modified all-against-all comparison is close enough to that obtained from the original all-against-all comparison (see the first sketch after these comments). The outcome will likely differ among the clustering methods to which it is applied (depending on the granularity of clustering, the treatment of multi-domain proteins, etc.), so the authors should discuss this point.

  2. As a multiple-representative strategy, the authors considered only a 3-representative strategy, which was excluded from their best strategy because of its inefficient runtime (page 9); simply discarding this strategy, however, leaves the first concern raised at the beginning of Section 2.2 (page 4) unresolved. In fact, I do not understand why they consider only a fixed number of representatives. Since it is natural for a larger cluster to have more representatives, why not consider a strategy that adds a new representative sequence only when it is not similar to any existing representative with a sufficiently high score (see the second sketch after these comments)?
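
Regarding comment 1, one simple way to quantify "close enough" would be to compare the cluster assignments obtained from the original and the modified all-against-all searches with an agreement measure such as the adjusted Rand index. A minimal sketch follows (my suggestion, not part of the manuscript; it assumes scikit-learn is available, and the label vectors are hypothetical placeholders for the cluster assignment of each protein under the two pipelines).

    # Comparing two clusterings of the same proteins (reviewer's suggestion).
    from sklearn.metrics import adjusted_rand_score

    clusters_full    = [0, 0, 1, 1, 2, 2, 2]   # from the original all-against-all search
    clusters_reduced = [0, 0, 1, 2, 2, 2, 2]   # from the representative-based search

    ari = adjusted_rand_score(clusters_full, clusters_reduced)
    print(f"adjusted Rand index = {ari:.3f}")  # 1.0 would mean identical clusterings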
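
Regarding comment 2, the adaptive strategy I have in mind is a simple greedy rule: scan the members of a cluster and promote a sequence to representative only if its best score against the current representatives falls below a threshold. A minimal sketch follows (my own illustration, not the authors' method; score() stands in for whatever pairwise alignment score the authors use, and the threshold is arbitrary).

    # Greedy, adaptive choice of representatives (reviewer's illustration, not the authors' method).
    # score(a, b) stands in for whatever pairwise alignment score the manuscript uses;
    # here it is a trivial placeholder so the sketch runs on its own.
    from typing import Callable, List

    def pick_representatives(members: List[str],
                             score: Callable[[str, str], float],
                             min_score: float) -> List[str]:
        representatives: List[str] = []
        for seq in members:
            # Promote seq only if no existing representative already covers it
            # with a sufficiently high score.
            if not any(score(seq, rep) >= min_score for rep in representatives):
                representatives.append(seq)
        return representatives

    # Toy usage: identical strings score 1.0, everything else 0.0.
    toy_score = lambda a, b: 1.0 if a == b else 0.0
    print(pick_representatives(["seqA", "seqA", "seqB", "seqC"], toy_score, min_score=0.5))
    # -> ['seqA', 'seqB', 'seqC']: larger, more diverse clusters naturally get more representatives.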

One additional minor comment:
In Fig. 5, I think displaying the fraction of missing pairs, rather than the number of missing pairs, in each score range on the y-axis would clarify the result.

Source

    © 2014 the Reviewer (CC-BY 4.0 - source).