Content of review 1, reviewed on April 15, 2016

The article describes the sequencing, assembly and annotation of the olive tree (Olea europaea) genome. The assembly and annotation procedures are well described and the outputs seem to be of satisfactory quality, based on the metrics provided (at the moment, I have not had access to the final assembly and annotation: they will need to be public before final acceptance).

Other points will need to be addressed:

Sequencing:
The authors report the use of MiSeq 2x500 and 1x600 (line 68 + Table 1): since these modes are not typically supported on the MiSeq platform, can the authors explain why they chose such a strategy and how they proceeded?

Line 79: Can the authors provide some details about the fosmid library (155,000 clones): fragment size, size distribution? What is the probability that two fosmids overlap in a pool of 1600 fosmids?
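
To give an idea of the order of magnitude behind this question, here is a minimal sketch. It assumes fosmid positions are uniformly distributed along the expected 1.38 Gb genome, and it assumes an insert size of 40 kb, which is a typical value for fosmids and not a figure taken from the manuscript. Under these assumptions two random clones overlap with probability of about 2L/G. The function names are mine and purely illustrative.

```python
def pairwise_overlap_probability(insert_bp: float, genome_bp: float) -> float:
    """Probability that two randomly placed inserts of equal length overlap.

    Assumes uniformly distributed start positions and ignores end effects.
    """
    return min(1.0, 2.0 * insert_bp / genome_bp)


def expected_overlapping_pairs(pool_size: int, insert_bp: float, genome_bp: float) -> float:
    """Expected number of overlapping clone pairs within one pool."""
    n_pairs = pool_size * (pool_size - 1) / 2
    return n_pairs * pairwise_overlap_probability(insert_bp, genome_bp)


def fraction_of_clones_overlapping(pool_size: int, insert_bp: float, genome_bp: float) -> float:
    """Probability that a given clone overlaps at least one other clone in its pool."""
    p = pairwise_overlap_probability(insert_bp, genome_bp)
    return 1.0 - (1.0 - p) ** (pool_size - 1)


GENOME_BP = 1.38e9   # expected genome size quoted in the manuscript
INSERT_BP = 40_000   # ASSUMPTION: typical fosmid insert size, not from the manuscript
POOL_SIZE = 1600     # fosmids per pool, as in the manuscript

pairs = expected_overlapping_pairs(POOL_SIZE, INSERT_BP, GENOME_BP)
fraction = fraction_of_clones_overlapping(POOL_SIZE, INSERT_BP, GENOME_BP)
print(f"expected overlapping pairs per pool: {pairs:.1f}")
print(f"fraction of clones overlapping another clone: {fraction:.3f}")
```

With these assumed values, roughly 9% of the clones in a pool would overlap another clone, so overlaps within a pool may not be negligible. The actual insert size distribution is needed to settle this.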

Genome assembly:

Before going into the details of the procedure, it would be useful to explain the specific problem(s) of this genome (heterozygosity, as shown on the k-mer spectrum?...) and how the strategy chosen to sequence and assemble it (fosmids, ASM…) addresses them.

Line 116: the WGS assembly reaches a size of 1.94 Gb. Since the expected size is 1.38 Gb, can the authors comment on the 0.56 Gb difference? Does it correspond to heterozygous regions that were assembled separately?

Line 132: Is the average scaffold N50 (33,786) obtained for the 96 pools of 1600 fosmids comparable to the size of the fosmids? In other words, do you obtain ~1 scaffold per fosmid, or are the fosmids fragmented into several scaffolds?
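
To illustrate why the N50 alone does not answer this question, here is a small sketch using the usual definition of N50 and toy scaffold lengths. The 40 kb fosmid size and both toy pools are assumptions for illustration, not data from the manuscript.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Largest length L such that sequences of length >= L hold at least half of the total bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() needs at least one sequence length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    return ordered[-1]  # unreachable for positive lengths, kept for safety


# Toy pool A (ASSUMPTION): ten 40 kb fosmids, each assembled as a single scaffold.
intact_pool = [40_000] * 10

# Toy pool B (ASSUMPTION): the same ten fosmids, each split into two scaffolds.
fragmented_pool = [33_786, 6_214] * 10

print(n50(intact_pool), n50(fragmented_pool))
```

In the second toy pool every fosmid is fragmented, yet the N50 stays close to the fosmid size. Reporting the number of scaffolds per pool alongside the N50 would therefore be informative.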

Line 134: Since the "in-house" software (ASM) is a key step in the assembly process, it would be of interest to have it available online (if it is not published by the time the data note comes out).

Lines 136-137: The two rounds of ASM use overlaps of 2400 bp (e=0.015) and 4000 bp (e=0.10), respectively. Since some transposable elements fall within these size ranges, can you describe in more detail how repeated regions were handled?

The sentence "The input scaffold information was used in both steps for solving repeats and scaffolding" is not detailed enough: how was it used for solving repeats?

Lines 152-153: "The excess of assembled sequence is likely due to the presence of artificial duplications generated during the assembly process". By "artificial duplications", do you mean cases where two alleles were assembled separately in diploid regions?

Annotation:

The annotation procedure contains many steps and is not very straightforward to understand: it would be of great help to have a figure displaying a flow chart of the strategy as was provided for the assembly process.

Line 244: "our training step": can you describe this step? Is it the same as for geneid trainer? If not, it needs to be detailed.

Lines 245-250: Four different ab initio gene predictors were used, which seems a lot. Why was there a need for so many? What does each predictor contribute? Can you describe their strengths and weaknesses?

Line 258: Can you provide the relative weights that were used for the three different sources of evidence (transcripts, ab initio predictions and proteins)? In particular, what weight was given to ab initio predictors compared to transcripts and protein matches? Also, were all the ab initio predictors given the same weight?

Lines 271-278: "For this, we performed a BLASTP search […] with at least 50% of identity (Table 3)": Table 3 does not contain information relative to the BLASTP search, so it should be cited later in the text. However, a table containing the number of olive tree proteins that were found in each other genome, and the corresponding number of proteins in that genome (ratio 1:1 or different?), would be of interest, in order to identify genes split or fused in the annotation and to get a first idea of the presence of WGDs in the olive tree genome. I am obviously looking forward to the analyses!

Table 3: Table 3 is erroneous. The line corresponding to olive tree seems correct, except that I suspect the average coding sequence length is 1031 nt, not aa. But the lines corresponding to the other species are wrong: the average coding sequence length is expected to be around 400 aa, the average exon size around 250-350 nt, and the transcript span around 2 kb. As a matter of fact, does the column "average transcript length" correspond to the transcript span (including introns), or just to the sum of exons? The header should be corrected for clarity.
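
To make the distinction between these metrics explicit, here is a minimal sketch on a single toy transcript. The coordinates are invented, intervals are 1-based and inclusive, and the CDS is assumed to include the stop codon (as in GFF3); none of this is taken from the manuscript.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def interval_length(interval: Interval) -> int:
    start, end = interval
    return end - start + 1


def transcript_metrics(exons: List[Interval], cds: List[Interval]) -> Dict[str, int]:
    """Compute the length metrics that a table header should keep apart."""
    span_nt = max(end for _, end in exons) - min(start for start, _ in exons) + 1
    exon_sum_nt = sum(interval_length(e) for e in exons)
    cds_nt = sum(interval_length(c) for c in cds)
    # ASSUMPTION: the CDS includes the stop codon, so one codon is subtracted.
    protein_aa = cds_nt // 3 - 1
    return {
        "transcript_span_nt": span_nt,      # includes introns
        "sum_of_exons_nt": exon_sum_nt,     # mature transcript length
        "cds_nt": cds_nt,                   # coding sequence in nucleotides
        "protein_aa": protein_aa,           # coding sequence in amino acids
    }


# Toy transcript (ASSUMPTION): three exons, coding region inside them.
toy_exons = [(1, 300), (801, 1100), (1601, 2000)]
toy_cds = [(101, 300), (801, 1100), (1601, 1799)]

metrics = transcript_metrics(toy_exons, toy_cds)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

For this toy transcript the four values differ by up to an order of magnitude, which is why the unit (nt or aa) and the definition (span or sum of exons) need to appear in the column headers.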

The comments about Table 3 (lines 276-278) will need to be rewritten once the metrics are calculated correctly. I suspect that the olive tree genes will still appear shorter than those of other species (including the CDS part: ~40-50 aa is a significant difference), and this finding will need to be commented on (splits? truncations? real feature of Olea europaea?...).

ncRNA annotation: Since the RNA-Seq data were used to annotate ncRNAs, I suppose the RNA extracted and sequenced was total RNA, not polyA. This should be stated in the sequencing section (lines 89-94).

Level of interest

Please indicate how interesting you found the manuscript:
An article of importance in its field

Quality of written English

Please indicate the quality of language in the manuscript:
Acceptable

Declaration of competing interests

Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from
an organisation that may in any way gain or lose financially from the publication of this
manuscript, either now or in the future?

2. Do you hold any stocks or shares in an organisation that may in any way gain or lose
financially from the publication of this manuscript, either now or in the future?

3. Do you hold or are you currently applying for any patents relating to the content of the
manuscript?

4. Have you received reimbursements, fees, funding, or salary from an organization that
holds or has applied for patents relating to the content of the manuscript?

5. Do you have any other financial competing interests?

6. Do you have any non-financial competing interests in relation to this paper?

If you can answer no to all of the above, write 'I declare that I have no competing interests'
below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included
on my report to the authors and, if the manuscript is accepted for publication, my named report
including any attachments I upload will be posted on the website along with the authors'
responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments
to the editors, which will not be published.

I agree to the open peer review policy of the journal.

Authors' response to reviews: (http://www.gigasciencejournal.com/imedia/8611059132015732_comment.pdf)

Source

    © 2016 the Reviewer (CC BY 4.0).


Content of review 2, reviewed on May 27, 2016

I am happy with the answers (and modifications in the article) addressing my previous comments.
However, I still have a few questions about the new Table 5 and the preliminary analysis of gene duplications in Olea europaea.

First, since the number of predicted genes is high and the exon number per gene is a bit lower than in the other genomes, I would like to know how many monoexonic genes were annotated. Also, how many short genes (less than 30 aa, for instance)? Was a filter applied to remove small annotations with little supporting evidence? I have browsed the genome at http://denovo.cnag.cat/genomes/olive/browser_oe6 and noticed quite a few, although it is hard to quantify based on random browsing (see, for example, location Oe6_s00018:706201..765350).
What would Table 5 look like if a filter on monoexonic genes and/or short genes was applied to all the genomes analyzed?

The metrics displayed in Table 5 for O. europaea were calculated on the 56,349 protein-coding loci. I assume that for each locus, one transcript was chosen as representative? Which transcript was selected? The one with the longest ORF? This would need to be detailed.

The term "homologs" (column 7) is very vague: according to the text, a BLASTP search was performed with an e-value less than 0.01 and with at least 50% of identity. How many hits were kept for each protein? Only the best? (Or maybe best reciprocal hits, which would be more informative?)
If all the hits above the threshold were taken into account, the number of "homologs" in O. europaea seems very low. If a WGD has occurred after divergence from E. guttata, one would expect to get two "homologs" (with a similar %id) for one E. guttata gene. The ratio of proteins in one genome vs the other (1:1 or 1:2) would be very informative to infer a whole genome duplication. The paralogous genes analysis is interesting, but do the two genes in a paralogous gene pair in O. europaea share the same "best hit" in E. guttata?
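
The kind of summary I have in mind could be sketched as follows. This is only an illustration: the protein identifiers and bit scores are invented, the input is assumed to be a list of (query, subject, bit score) tuples already filtered on e-value and %id, and the function names are mine.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Hit = Tuple[str, str, float]  # (query, subject, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep, for each query, the subject with the highest bit score."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def copy_ratio_distribution(best: Dict[str, str]) -> Counter:
    """Count subject genes that are the best hit of 1, 2, 3... query proteins."""
    queries_per_subject = Counter(best.values())
    return Counter(queries_per_subject.values())


def reciprocal_best_hits(forward: Dict[str, str], reverse: Dict[str, str]) -> Set[Tuple[str, str]]:
    """Pairs (query, subject) that are each other's best hit."""
    return {(q, s) for q, s in forward.items() if reverse.get(s) == q}


def share_best_hit(gene_a: str, gene_b: str, best: Dict[str, str]) -> bool:
    """True if both members of a paralogous pair have the same best hit."""
    return gene_a in best and best.get(gene_a) == best.get(gene_b)


# Toy data (ASSUMPTION): hypothetical identifiers, not real annotations.
olive_vs_guttata = [("Oe1", "Eg1", 500.0), ("Oe2", "Eg1", 480.0),
                    ("Oe3", "Eg2", 300.0), ("Oe3", "Eg3", 250.0),
                    ("Oe4", "Eg3", 200.0)]
guttata_vs_olive = [("Eg1", "Oe1", 500.0), ("Eg1", "Oe2", 480.0),
                    ("Eg2", "Oe3", 300.0), ("Eg3", "Oe4", 200.0)]

forward = best_hits(olive_vs_guttata)
reverse = best_hits(guttata_vs_olive)
ratios = copy_ratio_distribution(forward)
rbh = reciprocal_best_hits(forward, reverse)
print(dict(ratios), sorted(rbh), share_best_hit("Oe1", "Oe2", forward))
```

A table of the number of E. guttata genes hit by one, two or more olive proteins, computed along these lines, would make a 1:2 pattern visible if it exists.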

Also, the paragraph needs to be re-read carefully since it contains a lot of typos (the ones I noticed are listed below):
Line 320: remove the "," after "comparisons"
Line 325: suing -> using
Line 332: remove the "," after "Although"
Line 334: leding -> leading

Level of interest

Please indicate how interesting you found the manuscript:
An article of importance in its field

Quality of written English

Please indicate the quality of language in the manuscript:
Acceptable

Declaration of competing interests

I declare that I have no competing interests.

I agree to the open peer review policy of the journal.

Authors' response to reviews: (http://www.gigasciencejournal.com/imedia/6086704982015732_comment.pdf)


Source

    © 2016 the Reviewer (CC BY 4.0).

References

    Fernando, C., Irene, J., Jessica, G., Damian, L., Marina, M., Emilio, C., Beatriz, G., Leonor, F., Paolo, R., Sophia, D., Marta, G., Manuel, S., Jose, L. G., G., G. I., Pablo, V., S., A. T., Toni, G. 2016. Genome sequence of the olive tree, Olea europaea. GigaScience.