Review of The sequence and analysis of a Chinese pig genome

Content of review 1, reviewed on April 10, 2012

Major Compulsory Revisions

A brief explanation of the decision to base BIOM on JSON should be included, detailing the benefits/disadvantages that this brings vs other file formats.
An overview of the format, as specified at http://biom-format.org/documentation/biom_format.html would assist the reader in understanding the text better (e.g. the compression efficiencies of the sparse vs dense formats). At the very least, this format description should be linked from the text.
Analyses, paragraph 1 - re-write for clarity required
Authors state that the discrepancy in file sizes arises from "the matrix positions that must be stored with all counts in the sparse representation". It is not possible to understand why this is the case from the information currently provided in the text - see my previous point.
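To make the point concrete, here is a minimal sketch of the kind of illustration the text could include. It assumes a dense layout that writes every cell and a sparse layout that writes one [row, column, value] triple per non-zero cell. The function names and toy tables are hypothetical, and this is not the biom-format implementation.

```python
import json


def to_dense(table):
    """Dense layout: every cell is written, zeros included."""
    return [list(row) for row in table]


def to_sparse(table):
    """Sparse layout: one [row, column, value] triple per non-zero cell.

    Each stored count carries its two matrix positions, which is the
    overhead the authors refer to.
    """
    return [
        [i, j, value]
        for i, row in enumerate(table)
        for j, value in enumerate(row)
        if value != 0
    ]


def json_size(obj):
    """Number of characters in a compact JSON serialisation."""
    return len(json.dumps(obj, separators=(",", ":")))


# Hypothetical toy tables (rows and columns are arbitrary here).
low_density = [[0, 0, 0, 0, 5], [0, 0, 0, 0, 0], [0, 3, 0, 0, 0], [0, 0, 0, 0, 0]]
high_density = [[1, 2, 0, 4, 5], [6, 7, 8, 9, 1], [2, 0, 4, 5, 6], [7, 8, 9, 1, 2]]

for name, table in (("low density", low_density), ("high density", high_density)):
    print(name, "dense:", json_size(to_dense(table)), "sparse:", json_size(to_sparse(table)))
```

In this sketch the sparse form is smaller only while few cells are non-zero; once most cells hold counts, the stored positions make it larger than the dense form.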
Discussion, paragraphs 2 and 3
These paragraphs are highly repetitive when compared to the background/introduction section. They could also be merged together, with much more specific discussion of why BIOM in particular is a necessary step. For example, there is no mention of the challenges faced by any format if it is to be adopted by the wider research community (and how the authors propose to meet these challenges).

Minor Essential Revisions

Background, paragraph 2
I question the usefulness of Fig 1 and the Medline mining to illustrate the increasing numbers of categories of omics data. I'm not convinced that this is truly representative (isn't it reflecting, instead, scientists' penchant for leaping on a bandwagon regarding names? e.g. I wouldn't call the "kinome" or the "O-GlcNAc-ome" entirely new datatypes - they're a subset of the proteome, surely?). I would be satisfied with the authors simply asserting instead that there are increasing numbers of omic approaches to analysis, illustrated with some examples of newer omics data types. The authors could then remove reference to the MEDLINE mining.
Background, paragraph 3 - requires edit for clarity
A brief description of what a contingency table is would be beneficial, prior to the different omics examples. A potentially more readable sentence would start "Despite the different types of data involved in the various comparative omics techniques (e.g. metabolomics, proteomics or microarray-based transcriptome analyses), they all share an underlying, core data type: the sample by observation contingency table. A contingency table is... [brief explanation followed by omics examples, as in the text]."
Background, paragraph 2 - edit for clarity
Suggestion: "A common data format will facilitate the sharing and publication of comparative omics data and associated metadata, as well as improving the interoperability of comparative omics software. It will enable rapid advances in omics fields by allowing researchers to focus on data analysis instead of formatting data for transfer between different software packages or reimplementing existing analysis workflows to support their specific data types."
Background, paragraph 3 - edit for clarity and brevity
Suggestion: "However, many techniques are applicable across data types, for example rarefaction analyses (i.e. collector curves). These are frequently applied in microbiome studies to compare how the rate of incorporation of additional sequence observations affects the rate at which new OTUs are observed. This is done to determine whether an environment is approaching the point of being fully sampled (e.g. [14]). Similarly, they can also be applied in comparative genomics [...]" (etc). This whole paragraph could be more concisely written.
Background, paragraph 3 - requires clarification
In the sentence "A standard format [...] will support interoperability of these tools and facilitate development and adoption of future analysis pipelines..." it's not immediately clear what tools are being referred to and how exactly it will facilitate pipeline development.
Background, paragraphs 2 and 3 - re-write required to remove redundancy and improve readability
A few sentences in particular ("A common data format to facilitate sharing and publication of comparative omics data and associated metadata", "The inclusion of high-quality metadata in this format, for example as defined in the MIxS standards [13], is essential for enabling future meta-analyses." and "Additionally, the incorporation of sample and observation metadata allows convenient sharing and archiving of these data within a single file.") are saying related things about metadata. It would make sense to condense them into a single location in the text.
Background, paragraph 4 - requires clarification
Don't the authors mean "For example, differing representations of samples and observations as either rows or columns, and the mechanism for incorporating sample or observation metadata (if this is possible at all), cause the formats used by different software packages to be incompatible."? That is, the decision itself has nothing to do with compatibility.
Background, paragraph 5 - requires clarification
The sentence "The sparse representation of the QIIME OTU table with 6164 samples and 7082 OTUs (mentioned in the previous paragraph) contains 1% non-zero values in BIOM format and is over 14x smaller than the same data represented in tab-separated text (Supplementary File 1)." is confusing - surely both files contain 1% non-zero values?
Background, paragraph 5 - minor correction for readability
Suggestion: "This includes a format validator, a script to easily convert BIOM files to tab-separated text representations (useful when working with spreadsheet programs), and Python objects to support working with this data."
Box 2 legend - minor edit for clarity
Suggestion: "Comparison of QIIME OTU Table collapsing code with native QIIME OTU table data structures (Panels A-D) and biom-format Table objects with equivalent functionality. [...]"
Analyses section - re-write required for clarity
I would re-order aspects of paragraphs 1 and 2 to make it more readable. For example, describe the initial data set in the first paragraph (size of OTU tables, density range and median, file compression ratios). In the second paragraph, explain the patterns seen (e.g. explain/describe discrepancies in filesize and when each of the formats is most efficient for compression, incurred overheads with dense vs sparse representations, etc.). At the moment it's a bit of a jumble.
Analyses, paragraph 2 - minor edit for readability
Suggestion: "In the data set we analysed, the density ranges from 1.3% non-zero values to 49.8% non-zero values, with a median of 11.1%. The file compression ratio (tab-separated text file size divided by BIOM file size) increases with decreasing contingency table density for this data set (compression ratio = 0.2 × density^-0.8; R^2 = 0.9; Supplementary Figure 1)."
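As a quick sanity check on the fitted relationship quoted above, the sketch below evaluates it at the three densities mentioned. It reads the exponent as density raised to the power -0.8 and assumes that density is expressed as a fraction between 0 and 1; the manuscript text quoted here does not state the unit, so the absolute values are illustrative only. The function name is hypothetical.

```python
def predicted_compression_ratio(density, coefficient=0.2, exponent=-0.8):
    """Evaluate ratio = coefficient * density ** exponent.

    Assumption: `density` is the fraction of non-zero cells, in (0, 1].
    The ratio is tab-separated text size divided by BIOM file size.
    """
    if not 0 < density <= 1:
        raise ValueError("density must be a fraction in (0, 1]")
    return coefficient * density ** exponent


# Minimum, median and maximum densities quoted in the suggested sentence.
for density in (0.013, 0.111, 0.498):
    ratio = predicted_compression_ratio(density)
    print(f"density={density:.3f} -> predicted ratio={ratio:.2f}")
```

Under this reading the predicted ratio rises as density falls, which matches the trend described in the sentence.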
Discussion, paragraph 1 - minor edit
Suggestion: "[...]versions of Linux), and so they should be [...]"
Availability of software - minor edit
Suggestion: "It is available under GPL, and is free for all to use"

Discretionary Revisions

General: readability - try to keep sentences shorter.
General: repetitive in parts - "Collectively the ome-ome" crops up multiple times, and "useful for interacting with BIOM data in spreadsheet programs" effectively twice.
Background, paragraph 4 - edit for clarity
Defining "density" here, rather than later on in the "Analyses" section, makes more sense. Suggested edit: "Additionally, in many of these applications a majority of the values (frequently greater than 90%) in the contingency table are zero. The fraction of values in the table that are non-zero is defined as the "density"; thus, a matrix with a low number of non-zero values is said to have a low density."
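The suggested definition can be stated unambiguously in a few lines. This is a minimal sketch with a hypothetical function name and a made-up toy table, not code from the manuscript or from biom-format.

```python
def density(table):
    """Fraction of cells in a contingency table that are non-zero."""
    total_cells = sum(len(row) for row in table)
    if total_cells == 0:
        raise ValueError("table has no cells")
    nonzero_cells = sum(1 for row in table for value in row if value != 0)
    return nonzero_cells / total_cells


# Hypothetical 3 x 4 table with two non-zero counts.
toy_table = [
    [0, 0, 12, 0],
    [0, 0, 0, 0],
    [3, 0, 0, 0],
]
print(f"density = {density(toy_table):.3f}")
```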
Background, paragraph 4 - minor correction
Suggestion: "[...]marker gene survey OTU tables with many samples (such as the one presented in Supplementary Table 1)"
Background, paragraph 4 - a semantic query
In the sentence "[...] meaning that many of the values in the matrix [...] are zero", is it accurate to refer to these values as "zero" rather than null (i.e. no value/not observed), even if the figure "zero" is used in the file?
Suppl data, Box 1 - minor correction for readability
Suggestion: "Information on the data type (e.g., OTU Table, Ortholog Table, Metabolite Table) should be included, based on terms from a controlled vocabulary."
Background, paragraph 5 - minor correction, missing comma
Suggestion: "[...] and metadata in a single, standard file format, BIOM supports [...]"
Background, paragraph 5 - minor correction
Start a new paragraph just prior to the sentence beginning "To support the use of this file format..."
Data description, paragraph 1
It might be worth referring directly to an example QIIME-formatted file in the supplementary material (see also "Major Compulsory Revisions").
Analyses, paragraph 2 - minor edit
If density is defined earlier, suggest the sentence be changed to: "The magnitude of compression [...] is a function of the density of the contingency table".
Discussion, paragraph 1 - minor edit for readability
Suggestion: "[...] and to provide an efficient means for representing biological contingency tables in memory with associated convenient functionality for operating on those tables."
Discussion, paragraph 1 - minor edit for clarity
Suggestion: "The core BIOM development group will review these implementations and, if they are fully documented and tested, will add them to the biom-format repository (or grant the developers themselves direct access to the repository)."

Level of interest: An article of importance in its field

Quality of written English: Acceptable

Statistical review: No, the manuscript does not need to be seen by a statistician.

Declaration of competing interests: I have no competing interests

Source

References

Xiaodong, F., Yulian, M., Zhiyong, H., Yong, L., Lijuan, H., Yanfeng, Z., Yue, F., Yuanxin, C., Xuanting, J., Wei, Z., Xiaoqing, S., Zhiqiang, X., Lan, Y., Huan, L., Dingding, F., Likai, M., Lijie, R., Chuxin, L., Juan, W., Kui, L., Guangbiao, W., Shulin, Y., Liangxue, L., Guojie, Z., Yingrui, L., Jun, W., Lars, B., Huanming, Y., Jian, W., Shutang, F., Songgang, L., Yutao, D. 2012. The sequence and analysis of a Chinese pig genome. GigaScience.
