Content of review 1, reviewed on June 24, 2014

Basic reporting

Start with the following definitions:
* "closed-reference" OTU picking attempts to maps each read to a sequence in a reference database and clusters reads that map to the same reference sequence

  • de novo" OTU picking runs a similarity-based clustering algorithm on the reads
  • open-reference" OTU picking first performs a closed-reference OTU picking step and runs a de novo OTU picking step on the reads that do not map to reference sequences
    This paper proposes "subsampled open-reference" OTU picking, by which the authors mean the following steps:

  • run closed-reference OTU picking, let U be the unmapped reads

  • run de novo OTU picking on a random subset of U, build a reference database of the resulting cluster centers
  • run closed-reference OTU picking on U using that newly built reference database
  • run de novo OTU picking on any reads not mapped in step 3.
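For concreteness, here is a minimal sketch of the four steps in Python. The helpers closed_ref_pick, de_novo_pick, and build_reference are hypothetical stand-ins for the uclust-based machinery, not actual QIIME calls:

```python
import random

def subsampled_open_reference(reads, reference, subsample_fraction=0.001):
    # Step 1: closed-reference picking against the full reference database;
    # `unmapped` plays the role of U above.
    mapped, unmapped = closed_ref_pick(reads, reference)

    # Step 2: de novo clustering of a random subset of U, then a new
    # reference database built from the resulting cluster centers.
    k = int(subsample_fraction * len(unmapped))
    seed_idx = set(random.sample(range(len(unmapped)), k))
    seeds = [unmapped[i] for i in seed_idx]
    new_reference = build_reference(de_novo_pick(seeds))

    # Step 3: closed-reference picking of the remaining reads against the
    # new reference; these reads can be partitioned freely, so this step
    # parallelizes.
    rest = [r for i, r in enumerate(unmapped) if i not in seed_idx]
    mapped_new, still_unmapped = closed_ref_pick(rest, new_reference)

    # Step 4: de novo clustering of whatever still fails to map.
    return mapped, mapped_new, de_novo_pick(still_unmapped)
```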

The idea is nice, and the paper is clearly written. The validations, given their scope, appear correctly executed. The authors have released their analysis as an IPython notebook, which is fantastic, and the method is available via the very popular QIIME analysis platform.

However, I have some reservations about the presentation of the work and the validations.

Presentation of the work:

The "subsampled open-reference" OTU picking can be factored into two distinct parts. First, there is the closed reference part of the algorithm, which is the same as classical open-reference OTU picking. Second, there is a de novo clustering step (steps 2-4 above) which first performs clustering on a random subset of the data, attempts to map the rest of the data to those clusters (which can be parallelized), and then performs de novo clustering on the rest.

It's worth making this split because the novelty of the paper lies in the second stage. This stage, however, is equivalent to the so-called "Buckshot algorithm" used to initialize k-means clustering, which was first described in a paper from Xerox PARC in 1992 [1]. The parallelization available to the Buckshot algorithm at the assignment phase is summarized in [2]: "The third phase of the Buckshot algorithm assigns the remaining documents according to their similarity to the centroids of the initial clusters. This step of the algorithm is trivially parallelized via data partitioning." I hope that the authors will describe their algorithm in terms of this previous work.
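To make the correspondence concrete, here is a minimal sketch of the Buckshot scheme, using scikit-learn's agglomerative clustering for the sample-clustering phase; this illustrates the idea from [1] and [2] and is not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def buckshot(X, k, rng=None):
    rng = rng or np.random.default_rng(0)
    n = len(X)
    # Phase 1: draw a random sample of size ~sqrt(k*n), small enough that
    # a quadratic-time clustering method is affordable.
    m = int(np.sqrt(k * n))
    idx = rng.choice(n, size=m, replace=False)
    # Phase 2: cluster the sample and take the cluster centroids as seeds.
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X[idx])
    centroids = np.stack([X[idx][labels == c].mean(axis=0) for c in range(k)])
    # Phase 3: assign every point to its nearest centroid; this is the
    # trivially data-parallel step quoted above.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)
```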

I also suggest a more neutral presentation of the idea: for example, "classical" open-reference OTU picking rather than the pejorative "legacy", and a softening of phrases such as "we recommend subsampled open-reference OTU picking as the standard OTU picking protocol in all cases where a reference collection is available." I think that a change of tone is important, especially given the level of validation done for the method, described next.

Experimental design

Validation:
As described above, the part of the algorithm that is new to microbial ecology is using a random subset to seed clustering, which we will call "steps 2-4". Because steps 2-4 in fact form a de novo clustering algorithm, I suggest that they be evaluated as such. I would suggest following [3] in their use of normalized mutual information to compare a de novo clustering with results from a closed-reference approach.
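The comparison itself is one line once both pipelines have assigned every read a cluster label; the labels below are toy placeholders:

```python
from sklearn.metrics import normalized_mutual_info_score

labels_closed_ref = [0, 0, 1, 1, 2, 2]  # e.g., closed-reference OTU ids per read
labels_steps_2_4 = [1, 1, 0, 0, 2, 2]   # e.g., "steps 2-4" de novo OTU ids per read
# NMI is invariant to label permutation; these two partitions score 1.0.
print(normalized_mutual_info_score(labels_closed_ref, labels_steps_2_4))
```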

Instead, the authors use an aggregate approach to show that results with the subsampled approach are similar to the results of other clustering techniques. The authors show, in fact, that there is a high level of correlation between all methods, even simply throwing away all of the reads that do not map to a reference sequence ("closed-reference" OTU picking). This in itself shows that the aggregate approach to evaluating clusters, as implemented here, is not sufficient to distinguish between methods of dealing with reads that do not map to the reference database. I suggest that a more appropriate way of exploring performance, if the authors really want to stay within an "aggregate" framework, is to ignore the reads that map to the original reference sequences and compare "steps 2-4" as a de novo clustering approach to uclust directly. This would also equalize the data set comparisons, which are currently confounded by varying representation of sequences in existing databases (see, for example, the very high levels of correlation in the moving-pictures data set, which doubtless come from the very good representation of human gut microbiome sequences in those databases).
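In code, the suggested protocol would look something like the following sketch, reusing the hypothetical helpers from above (subsampled_de_novo is a hypothetical wrapper standing for steps 2-4 run as a single de novo algorithm):

```python
from sklearn.metrics import normalized_mutual_info_score

# Every method shares the same first closed-reference step, so drop those
# reads and compare only the part of the pipelines that actually differs.
_, unmapped = closed_ref_pick(reads, reference)

labels_uclust = de_novo_pick(unmapped)            # plain uclust de novo
labels_subsampled = subsampled_de_novo(unmapped)  # steps 2-4 as one algorithm

# Head-to-head comparison on the hard reads only, via NMI as above or via
# the authors' aggregate measures computed on these OTUs alone.
score = normalized_mutual_info_score(labels_uclust, labels_subsampled)
```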

In addition, the authors are not consistent in their opinion of what matters in terms of OTU clustering results. For example, they say "We do recommend using the “slow” settings if clustering sequences to build reference OTUs (for example, as is performed when building the Greengenes reference OTU collection (McDonald, Price, et al. 2012)) because suboptimal OTU assignments can have further reaching consequences." This seems in conflict with the authors' insistence elsewhere that correlation of aggregate measures is sufficient to show that an OTU picking algorithm works well.

I must be confused by my reading of line 225, which states that the size of the random subsample used for cluster seeding does not impact the outcome: "This parameter will not affect results, only runtime." In the limit of taking this parameter to zero, the subsampled algorithm becomes classical open-reference OTU picking. The authors then continue by saying "optimizing this parameter is not simple", yet incongruously give a simple general recommendation rather than exploring the results by parameter regime.

Validity of the findings

No Comments

Comments for the author

Details:
28: I adore Paperpile and like that you've put in links, but you might want to note that it doesn't work at all on Firefox (and, I suspect, not on any non-Chrome-like browsers).
36-37: the way you write this makes it sound like you are specifically talking about UCLUST's strategy. Perhaps also include other approaches that have been taken to clustering or specify your scope?
230: "However, in these cases, the results are still highly correlated, and the runtime differences are typically low enough that there is no reason to use legacy open-reference OTU picking in favor of subsampled open-reference OTU picking." Here, the authors advocate for a randomized heuristic even when the full algorithm is actually faster. (!)
278: "the same biological conclusions are derived from": no, it's the same summary statistics.

Figure 1 is difficult to follow in that the diamonds refer to a per-query-sequence question, whereas the boxes refer to actions happening on a whole collection of sequences. I'd suggest describing the questions as filters, with groups of sequences getting redirected.

Table 1 shows "De novo" and "Legacy open reference" as using the same command. Is it correct to assume that "Legacy open reference" is not currently available through QIIME?

[1] Cutting, D. R., Karger, D. R., Pedersen, J. O., and Tukey, J. W. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual ACM SIGIR Conference (1992), pp. 318–329.
[2] Jensen, E. C., Beitzel, S. M., Pilotto, A. J., Goharian, N., and Frieder, O. Parallelizing the Buckshot algorithm for efficient document clustering. In CIKM '02: Proceedings of the Eleventh International Conference on Information and Knowledge Management (2002).
[3] Cai, Y. and Sun, Y. ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time. Nucleic Acids Res. 39, e95 (2011).

Source

    © 2014 the Reviewer (CC-BY 4.0 - source).

Content of review 2, reviewed on July 28, 2014

Basic reporting

The exposition of this paper is much improved in this revision.

I suggest making it clearer in the abstract that the intent is to enable community diversity analysis using this new method. For example, in the first sentence I suggest replacing "microbial community analysis" with "microbial diversity analysis."

Minor comment:

263: putting something between the two instances of "subsampled" would clarify things.

Experimental design

This paper meets the experimental design guidelines.

Validity of the findings

The authors have clearly shown that a certain type of diversity analysis is enabled for very large data sets by this method.

Comments for the author

Figure 1 is improved with the addition of an explanatory legend. I still think it a little strange that the most common outcome will be to have a data set go through both the "yes" and "no" directions for different sequences, but I think the intent is comprehensible.

Source

    © 2014 the Reviewer (CC-BY 4.0 - source).

References

    Rideout, J. R., He, Y., Navas-Molina, J. A., Walters, W. A., Ursell, L. K., Gibbons, S. M., Chase, J., McDonald, D., Gonzalez, A., Robbins-Pianka, A., Clemente, J. C., Gilbert, J. A., Huse, S. M., Zhou, H.-W., Knight, R., Caporaso, J. G. 2014. Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ, 2:e545.