Content of review 1, reviewed on June 18, 2014

Basic reporting

No comments.

Experimental design

No comments.

Validity of the findings

I have some concern over the clarity of the central message of the paper. The paper describes a modification to an existing algorithm within the QIIME package, which is used to process a particularly large dataset. The strong impression given is that speed has improved, but the evidence presented does not back up that impression. The revised algorithm is clearly a useful addition to the QIIME package and is to be welcomed, but its benefits over previous QIIME algorithms are ambiguous at best. Please see my general comments to the authors for further details of my argument.

Comments for the author

A modification to an existing algorithm provided by the QIIME package is presented which potentially allows for the analysis of much larger datasets than previously. This is illustrated by applying the algorithm to 15,000 samples sequenced for the Earth Microbiome Project. Through the analysis of three smaller datasets, the authors demonstrate that results are equivalent to those generated by earlier algorithms also provided within the QIIME package but make no overt claim as to the algorithm being better.

Overall the paper is well-written and should be published, as it describes a reasonable strategy for handling large datasets and gives useful hints for using the authors' QIIME package. But I have a few reservations which, if addressed, would assist the reader greatly:

Firstly, let us be clear: the new "subsampled open-reference OTU picking" algorithm is an adaptation of the authors' existing "open-reference OTU picking" algorithm. The change is that those reads which do not map to the reference database are now de-novo clustered in a parallel process rather than serially as before. This is achieved by de-novo clustering a subset of the sequences that fail to map to the original reference database (a serial process) then using these to supplement the reference database allowing for a further round of (parallelized) reference matching. This, the authors contend, provides "performance-optimisation" (and hence, by implication, speed improvement) without compromising quality.
The authors demonstrate quality is not compromised by showing that results with their new algorithm are equivalent to those obtained by other strategies within the QIIME package. But they don't demonstrate that the new method is necessarily quicker. Indeed, time measurements reveal an ambiguous picture, with the method sometimes performing better and sometimes performing worse than some of the existing methods.
The authors recognise this: "It is important to note that runtime is not always reduced with subsampled open-reference OTU picking" and point out that run time is dependent on a number of parameters.
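
The strategy summarised above can be made concrete with a minimal Python sketch. It is an illustration only and not the QIIME implementation: the function names, the position-by-position similarity measure, and the default parameter values are all hypothetical, and the final serial pass over reads that still fail to match (step 4) is an assumption of this sketch rather than something stated in the summary above.

```python
import random


def identity(a, b):
    """Toy similarity: fraction of matching positions (not a real aligner)."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def closed_reference(reads, refs, threshold):
    """Assign each read to the first reference it matches; collect failures.

    Every read is handled independently of the others, which is why this
    step could be split across processors.
    """
    hits, failures = {}, []
    for read in reads:
        match = next((name for name, seq in refs.items()
                      if identity(read, seq) >= threshold), None)
        if match is None:
            failures.append(read)
        else:
            hits[read] = match
    return hits, failures


def de_novo(reads, threshold, prefix):
    """Greedy serial clustering: a read that matches no existing centroid
    becomes a new centroid. Each decision depends on the earlier ones."""
    centroids = {}
    for read in reads:
        if not any(identity(read, c) >= threshold for c in centroids.values()):
            centroids[f"{prefix}{len(centroids)}"] = read
    return centroids


def subsampled_open_reference(reads, refs, fraction=0.5, threshold=0.9, seed=0):
    """Return a dict mapping each distinct read to an OTU name."""
    # Step 1: closed-reference pass against the original database.
    hits, failures = closed_reference(reads, refs, threshold)

    # Step 2: serial de novo clustering of a random subsample of the failures.
    rng = random.Random(seed)
    k = max(1, round(fraction * len(failures))) if failures else 0
    new_refs = de_novo(rng.sample(failures, k), threshold, "New")

    # Step 3: closed-reference pass of all failures against the new centroids.
    more_hits, leftovers = closed_reference(failures, new_refs, threshold)
    hits.update(more_hits)

    # Step 4 (assumption of this sketch): serial de novo pass on what is left.
    final_refs = de_novo(leftovers, threshold, "Leftover")
    last_hits, _ = closed_reference(leftovers, final_refs, threshold)
    hits.update(last_hits)
    return hits
```

In this sketch the only serial work is done on the subsample and on the leftovers, so the hypothetical `fraction` parameter controls how much work is moved into the independent matching passes. Whether that translates into a shorter wall-clock time depends on the data and on the number of processors available, which is the point at issue in this review.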

The upshot of all this is that the new method could be quicker but it just as easily might not be, depending on circumstances (e.g., more processors, character of the data). The only time that the method is clearly faster is when the 1% subsample setting is used with 29 processors and the 'fast' setting (ucrss_fast_029_s1).

The authors show that their new method is able to process 15,000 samples (using sufficient CPUs). What they don't show is how poorly the alternative methods compare on the same data, so we cannot use this feat as evidence that the new algorithm is a performance improvement over the old. To be fair, the authors do not explicitly claim this - only that their new algorithm can work with an extremely large dataset. But then probably so could their other algorithms.

Where the paper is most useful is in giving hints as to how to get the most out of the QIIME package as a whole. The new algorithm is a welcome addition but it doesn't necessarily perform faster than existing algorithms. The tips, however (for example, the value of pre-filtering), are far more performance-enhancing.

Overall, there is no evidence that the new algorithm is always faster - merely that it can do the job and might be faster under certain circumstances (contingent on various parameters). Fair enough. But I think the abstract and the text need to make this clearer rather than leave it to readers to make this discovery themselves, having been led by the choice of words to believe something more substantial. For example, the careful wording of the conclusions implies a greater reduction in runtime relative to previous methods than is actually the case. The reduced runtime of the subsampled open-reference OTU picking method relative to the legacy open-reference OTU picking method is what the improvement is all about. But the "vastly decreased runtime" compared with the de novo method is the same for both the new method and the legacy method - this is not a new improvement. In a similar vein, the use of the phrase "performance-optimized" is misleading and should at least be qualified at the earliest opportunity to avoid misinterpretation. Being more upfront about what is, and is not, claimed of the algorithm would help the reader greatly.

Final points:
I think it should be made clearer that all comparisons are made solely within the QIIME package. This is essentially a paper describing a new feature for QIIME - which is perfectly reasonable and welcome - but the authors ought to be more upfront about it.

I would ask the authors to consider whether the number of tables is justified to make one simple point: namely, that the method gives equivalent results to previous methods. I would suggest that the same point could be made by reducing the summary down to a few sentences without compromising the message (perhaps with one table to illustrate).

Lastly, I’m not sure the title adequately reflects the paper and I would recommend changing it to reflect the content better. For example, stating QIIME in the title would be very useful for those readers for whom this paper will be of genuine and welcome use.

Source

    © 2014 the Reviewer (CC-BY 4.0 - source).

Content of review 2, reviewed on August 06, 2014

Basic reporting

No comments

Experimental design

No comments

Validity of the findings

No comments

Comments for the author

The authors have now included an estimate for the improvement in running time when their new algorithm is used with their large (15,000 sample) dataset compared with their “classic” algorithm.

This greatly enhances the manuscript: it is now clear that the decrease in running time can be substantial when very large datasets are considered. This is an important finding. This is an algorithm for handling microbiome Big Data (I'm using this possibly over-used term in an attempt to stress the large increase in scale we're talking of here). Indeed, this surely should be the core message of the paper? Namely, with very large datasets - microbiome Big Data projects - the presented algorithm can provide real speed benefits.

For this reason I will recommend acceptance of the paper.

However, I remain doubtful over the clarity with which the authors' message is being delivered. The title troubles me: it fails to communicate what seems to be of key interest to any potential reader - namely, that a new algorithm has been implemented that can greatly speed up OTU clustering at scale without compromising on quality. As it is, the current title rather short-changes the paper, which is a shame.

I would urge the authors to reconsider the title - although I will not insist on this. I no longer think it is necessary to reference QIIME in the title - the authors' argument and changes to the text remove that concern - but something along the lines of, say, “New algorithm for faster reporting of OTU definitions in microbiome Big Data projects without loss of efficiency” would, I think, signal the right message to the reader.

The results and discussion have benefited from the authors' restructuring - placing the analysis of the EMP far more prominently - this is very welcome (after all, this is the exciting bit!). This helps to focus the reader on the core message of the paper. But the tables - their repetitious nature in terms of the point being made - do not, I think, help the authors in communicating their argument. Again I will not make this a condition for my acceptance but I would ask the authors to consider whether this information - which I fully accept will be useful to some readers - could be presented less prominently, e.g., as supplementary data, so as not to diminish their message for the majority of readers.

In summary, the changes the authors have made are very welcome and I am happy to recommend acceptance. However, I feel the clarity of the paper would benefit in the ways outlined above.

Source

    © 2014 the Reviewer (CC-BY 4.0 - source).

References

    Rideout, J. R., He, Y., Navas-Molina, J. A., Walters, W. A., Ursell, L. K., Gibbons, S. M., Chase, J., McDonald, D., Gonzalez, A., Robbins-Pianka, A., Clemente, J. C., Gilbert, J. A., Huse, S. M., Zhou, H.-W., Knight, R., Caporaso, J. G. 2014. Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ, 2.