Content of review 1, reviewed on June 24, 2016

**Reviewers' summary and feedback**
We will review this data note in accordance with the Open Science review principles
http://f1000research.com/articles/3-271/v2

Principle 1: We will sign our names to our review

Principle 2: We will review with integrity

Principle 3: We will treat the review as a discourse with you; in particular, we will provide
constructive criticism

Principle 4: We will be ambassadors for the practice of open science

We have reviewed this data note with a focus on utility, accessibility and ease of reuse of the
published data.

The paper clearly outlines in the introduction what type of data is to be expected in the paper,
namely new data from the Personal Genomes Project with openly accessible haplotype and
phenotype data along with 100x coverage whole genome sequence produced using Complete
Genomics technology. It is, however, not clear from the content of the article:

How many individual samples are included in the dataset made available with this paper
(two different numbers are mentioned: 182 and 114).

Where and how to access the haplotype and phenotype data specifically within the datasets
provided.

We think that the content of this data note publication has a lot of potential for use and broad
application in genomics, and we recommend that the manuscript be accepted with major
revisions to the data description and data organisation to allow for clearer data interpretation and
immediate utility.

Recommendation: accept with revisions.

**Suggested revisions**

Give worked examples of how to extract and interpret the phasing data and phenotype data for an
individual from this data package. Although the haplink column is documented in the paper and
in the extensive format documentation given in the supplementary PDF (Standard sequencing service
data file formats v. 2.5), the special focus of this paper on the availability of haplotype and
phenotype data calls for clear examples of how to access and interpret these from the data given.

We compared var-GS000037812-ASM.tsv from the PGP website with the corresponding tsv file
from the GigaDB FTP site (var-GS000037812-ASM.tsv_with_wellcount_exc.txt) and found that
the two files differ in size (1.4G vs 2.3G) and in content: the GigaDB file has extra columns
(intervalId intervalAllele readCount wellCount
wellIDs exclusiveWellCount SharedWellCount MinExclusiveWellCountInThisLocus
MaxExclusiveWellCountInThisLocus). The paper does not state explicitly that the files
from the PGP site provide only a cut-down version of the primary results file from the assembly
for each individual. To avoid confusion, this should be stated explicitly when describing what
data is available from the PGP website.
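
The column comparison above can be reproduced with a short script. The following is a minimal sketch, not the procedure we necessarily followed: it assumes that both files mark their column header line with a leading `>` (as Complete Genomics var files do; whether the `_with_wellcount_exc.txt` file follows the same convention is an assumption), and the function names `read_columns` and `extra_columns` are hypothetical.

```python
def read_columns(lines):
    """Return the column names from the first line that starts with '>'.

    Assumption: metadata lines start with '#' and the column header
    line starts with '>', with names separated by tabs.
    """
    for line in lines:
        if line.startswith(">"):
            return line[1:].rstrip("\n").split("\t")
    raise ValueError("no column header line found")


def extra_columns(reference_lines, other_lines):
    """List columns present in other_lines but absent from reference_lines.

    The result keeps the column order of other_lines.
    """
    reference = set(read_columns(reference_lines))
    return [name for name in read_columns(other_lines) if name not in reference]
```

Both arguments can be open file handles, for example the PGP file as `reference_lines` and the GigaDB file as `other_lines`. Reading stops at the header line, so the multi-gigabyte bodies are never loaded.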

**Required revisions**

Provide a list of the IDs of PGP participants (e.g. hu786B4C) whose samples are included in this
data package, including cross-references between PGP IDs (e.g. hu786B4C) and GigaDB file name
IDs (e.g. var-GS000037983-ASM.tsv.bz2), for example as a supplementary table.
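
To illustrate the kind of table we have in mind, here is a minimal sketch that builds a tab-separated cross-reference from pairs of PGP ID and GigaDB file name. It assumes that the assembly ID is the `GS` followed by digits embedded in each file name; the function names and the column headings are hypothetical, and the actual pairing of participants to files can only come from the authors.

```python
import csv
import io
import re

# Assumption: assembly IDs look like "GS" followed by digits, e.g. GS000037983.
ASSEMBLY_ID = re.compile(r"GS\d+")


def assembly_id(file_name):
    """Extract the assembly ID embedded in a GigaDB file name."""
    match = ASSEMBLY_ID.search(file_name)
    if match is None:
        raise ValueError(f"no assembly ID in {file_name!r}")
    return match.group(0)


def cross_reference_table(pairs):
    """Return a TSV table from (pgp_id, file_name) pairs, sorted by PGP ID."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["pgp_id", "assembly_id", "gigadb_file"])
    for pgp_id, file_name in sorted(pairs):
        writer.writerow([pgp_id, assembly_id(file_name), file_name])
    return out.getvalue()
```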

The phenotype data that can be found by browsing the Personal Genomes website does not
contain a collection file of phenotype data corresponding to the data referenced in this paper
available via a unique DOI. We recommend that the specific phenotype data for this collection is
made available in GigaDB to be easily accessible along with the genomic data of this data
collection.

The link in the paper that points to dbGaP refers to a non-existent study, phs000905.v1.p1
(dbGaP reports "The requested study does not exist in the dbGaP system"). We suggest that the
paper be published only once this link is live, or that the link be removed from the paper. The
abstract says: "Within this manuscript a link is provided for the most up-to-date list of published
genomes from this project". It is not clear whether the authors are referring to the dbGaP link.
This needs to be clarified.

At the time of review, only 30 GS0000xxxxx-ASM.tgz files were available on the GigaDB FTP site.
We recommend that the paper not be published until the number of files made available
corresponds to the number of files reported in the paper.
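
A check of this kind is easy to automate once a listing of the FTP directory is at hand. The sketch below is illustrative only: it assumes the archive names follow the pattern `GS<digits>-ASM.tgz`, the function names are hypothetical, and the reported count must be supplied by the reader because the paper mentions more than one number.

```python
import re

# Assumption: assembly archives are named "GS" + digits + "-ASM.tgz".
ASSEMBLY_ARCHIVE = re.compile(r"GS\d+-ASM\.tgz")


def count_assembly_archives(file_names):
    """Count the names in a directory listing that are assembly archives."""
    return sum(1 for name in file_names if ASSEMBLY_ARCHIVE.fullmatch(name))


def missing_archives(file_names, reported_count):
    """Return how many archives are still missing relative to the reported count."""
    return reported_count - count_assembly_archives(file_names)
```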

We investigated the contents of one of the genome assembly files (GS000037812-ASM.tgz)
and found that its directories and files did not correspond to the example directory structure
given in figure 6: the directory MEI was missing. Assuming this is the directory structure for all
files provided, figure 6 should be updated to match the actual content.
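
The same comparison can be run across every archive with the Python standard library. This is a minimal sketch under stated assumptions: the list of expected directory names has to be transcribed from figure 6 by hand (only MEI is named in this review), and the function names are hypothetical.

```python
import tarfile


def directories_in_archive(archive):
    """Return every directory name that occurs in an open tar archive.

    Directories are taken both from explicit directory entries and from
    the parent path components of files, because a tar archive does not
    have to list its directories explicitly.
    """
    names = set()
    for member in archive.getmembers():
        parts = member.name.strip("/").split("/")
        if not member.isdir():
            parts = parts[:-1]  # drop the file name, keep its parent directories
        names.update(parts)
    return names


def missing_directories(archive, expected):
    """Return the expected directory names that do not occur in the archive."""
    return sorted(set(expected) - directories_in_archive(archive))
```

For a downloaded file, the archive would be opened with `tarfile.open(path)` and passed to `missing_directories` together with the names read off figure 6.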

The figures provided with the paper contain quality metrics for some of the samples; however,
the figure 2 caption does not indicate clearly how many individuals are represented by the
number of genomic libraries plotted. It is also unclear why 229 genomic libraries are plotted in
figure 2 while 233 libraries are plotted in figure 4, and how these libraries were selected from the
182 total individual samples. These figure captions need to be clarified.

Kind regards,
Fiona G Nielsen and Richard J Shaw

**Suggested minor revisions, typos etc**

Page 3 line 31
bracoded -> barcoded

Page 3 line 48
Both higher a higher amount -> Both a higher amount

Page 5 line 6
at each loci -> at each locus

Page 7 line 6
was calculated bases -> was calculated based

Page 7 line 11
Venn diagram of PGP variants overlap with -> Venn diagram of overlap of PGP variants with

Level of interest
Please indicate how interesting you found the manuscript:

An article of importance in its field.

Quality of written English
Please indicate the quality of language in the manuscript:

Acceptable.

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from an
organisation that may in any way gain or lose financially from the publication of this
manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the
manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that
holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests'
below. If your reply is yes to any, please give details below.

Both FGN and RJS have within the past five years worked for Illumina which has a financial
competing interest in genome sequencing technology and Illumina holds various patents related
to genome sequencing.

I agree to the open peer review policy of the journal. I understand that my name will be included
on my report to the authors and, if the manuscript is accepted for publication, my named report
including any attachments I upload will be posted on the website along with the authors'
responses. I agree for my report to be made available under an Open Access Creative Commons
CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
which I do not wish to be included in my named report can be included as confidential comments
to the editors, which will not be published.

I agree to the open peer review policy of the journal.

Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0148-z/13742_2016_148_AuthorComment_V1.pdf)


Source

    © 2016 the Reviewer (CC BY 4.0).

References

    Mao, Q., Ciotlos, S., Zhang, R. Y., Ball, M. P., Chin, R., Carnevali, P., Barua, N., Nguyen, S., Agarwal, M. R., Clegg, T., Connelly, A., Vandewege, W., Zaranek, A. W., Estep, P. W., Church, G. M., Drmanac, R., Peters, B. A. 2016. The whole genome sequences and experimentally phased haplotypes of over 100 personal genomes. GigaScience.