Content of review 1, reviewed on March 21, 2013

The manuscript by Holland and Lynch describes the Sequence Squeeze competition to build the best compression algorithm for next-generation sequencing data. I was not a participant in the competition, but I followed it very closely because effective data compression is a critical need in the community today. The paper is generally well written and describes the operation of the competition clearly, although I was frustrated with the description of the winning entries. There is only a single paragraph describing the algorithms, and really only two sentences to say that it was a semi-reference-based approach. This would be the most interesting part of the paper, and it deserves expansion on how the algorithm operates. There are several interesting blog posts from Bonfield and other competitors that could be cited to describe it in more detail. It would also be very interesting to describe other approaches that were attempted and any general themes in what did or did not work.

The manuscript should also point out some of the tradeoffs of these approaches – most sequencing centers only use standard compression algorithms such as gzip or bzip2 because these algorithms are very robust and well understood. Will the data be silently corrupted by the new algorithms if the input data is not formatted as expected? Will Bonfield’s approaches be useful if different or multiple organisms are sequenced at once? Are there other complications users should be aware of?
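A minimal sketch of the kind of check that would expose silent corruption: compress, decompress, and compare checksums of the original and restored bytes. The helper names (`verified_roundtrip`, `lossy_compress`) and the toy FASTQ record are hypothetical and are not taken from any competition entry. The "lossy" codec is an invented stand-in that upper-cases its input before compressing, which mimics a tool that quietly normalises data it did not expect.

```python
import gzip
import hashlib


def sha256(data: bytes) -> str:
    """Hex digest used to compare the original and restored bytes."""
    return hashlib.sha256(data).hexdigest()


def verified_roundtrip(data: bytes, compress, decompress) -> bytes:
    """Compress `data`, decompress it again, and fail loudly on any change."""
    packed = compress(data)
    restored = decompress(packed)
    if sha256(restored) != sha256(data):
        raise ValueError("round trip changed the data")
    return packed


def lossy_compress(data: bytes) -> bytes:
    """Hypothetical codec that silently upper-cases its input (assumption)."""
    return gzip.compress(data.upper())


# Toy FASTQ record with lower-case (soft-masked) bases; illustrative only.
record = b"@read1\nACGTnnacgt\n+\nIIIIIIIIII\n"

# A standard gzip round trip preserves the bytes exactly.
packed = verified_roundtrip(record, gzip.compress, gzip.decompress)
print(f"gzip round trip OK, {len(packed)} bytes")

# The hypothetical lossy codec is caught by the checksum comparison.
try:
    verified_roundtrip(record, lossy_compress, gzip.decompress)
except ValueError as err:
    print(f"detected: {err}")
```

A check like this costs one extra decompression pass per file, which is one of the tradeoffs a sequencing center would weigh before adopting a specialised compressor.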

Other comments:

The title should be rewritten to reflect that this was a competition to compress sequencing data. Calling it a “Cloud-enabled Open Innovation” will be totally lost on potential readers who are not already aware of the Sequence Squeeze competition. Open innovation is meaningless in this context, and being cloud-enabled is not that relevant to the goals. I recommend something like “Sequence Squeeze: An open contest for sequence compression” or similar.

The conclusion that open competition is an effective motivator is not a new concept and should include citations to other examples – I remember similar exercises as an undergraduate some 15 years ago, and more recently the Assemblathon competitions have been very high-profile successes.

Table 1 is interesting but is overloaded with values. It would be more effective to display the winning entries for each of the categories, along with a one-sentence description of the algorithm. The table should also include a baseline compression algorithm (such as gzip or bzip2) for context. Interested readers can go to the Sequence Squeeze website for all the details, or perhaps the full results could be bundled with the manuscript and archived in the GigaScience cloud.
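As a rough illustration of what such a baseline row could contain, the sketch below computes gzip and bzip2 compression ratios with Python's standard library. The input is synthetic, randomly generated FASTQ-like text (an assumption made so the example is self-contained), so the ratios it prints say nothing about the competition data. The function names `make_fastq` and `baseline_ratios` are hypothetical.

```python
import bz2
import gzip
import random


def make_fastq(n_reads: int = 2000, read_len: int = 100, seed: int = 0) -> bytes:
    """Build synthetic FASTQ-like text (random bases and qualities)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        lines += [f"@read{i}", seq, "+", qual]
    return ("\n".join(lines) + "\n").encode("ascii")


def baseline_ratios(data: bytes) -> dict:
    """Return original size divided by compressed size for each baseline."""
    return {
        "gzip": len(data) / len(gzip.compress(data, compresslevel=9)),
        "bzip2": len(data) / len(bz2.compress(data, compresslevel=9)),
    }


data = make_fastq()
ratios = baseline_ratios(data)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

Reporting the same ratio for each winning entry next to these two rows would let readers judge at a glance how much is gained over the general-purpose tools.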

Indeed, considering the open nature of the contest, and the requirement that all of the entries be openly available without restriction, the code for executing the automated evaluation of entries should also be made openly available.

Level of interest: An article of importance in its field

Quality of written English: Acceptable

Declaration of competing interests: I declare that I have no competing interests

Source

    © 2013 the Reviewer (CC-BY 4.0 - source).

Content of review 2, reviewed on April 05, 2013

The revised paper reads really well and is ready to go as is.

Source

    © 2013 the Reviewer (CC-BY 4.0 - source).

References

    Holland, R. C. G., Lynch, N. 2013. Sequence squeeze: an open contest for sequence compression. GigaScience.