Content of review 1, reviewed on August 30, 2016

Andrea Zuccolo reviewed the original version of this manuscript; this is their report from 25 July 2016:

In this manuscript, Rui Guan et al. present the results of the sequencing and characterization of the Ginkgo biloba genome. These data are highly valuable because the genomic characterization of Ginkgo biloba was long overdue, and this sequence fills a serious gap in the genomic resources available for comparative studies targeting gymnosperms. Indeed, as far as evolutionary analyses are concerned, the importance of these data goes beyond gymnosperms. As I said, the resource provided is valuable and of interest; however, in its present form the manuscript suffers from some issues that the authors should address.

Here are my major criticisms:

- Throughout the manuscript, including the supplemental figures and tables, there is a lack of detail in the description of the technical aspects of the work, mainly sequencing and assembly (see below for a more detailed list). In general, the whole methods section is somewhat cursory. The authors claim high contiguity of the assembly, as described by the N50 statistics for both contigs and scaffolds. These values are indeed remarkable and possibly represent the greatest value of this research. However, all the common statistics generally referred to as "genome assembly forensics" are missing. The authors should provide this information in an ad hoc "genome assembly quality assessment" paragraph. The mapping of EST sequences on the assembly alone clearly does not satisfy this requirement, since the information it provides is limited to the completeness of the assembly and says nothing about its contiguity.

- The authors should also compare the Ginkgo assembly statistics with those of other genome assembly projects targeting gymnosperms, and explain why their strategy was successful in providing contiguity figures at least two orders of magnitude larger than those characterizing the most curated assemblies (I am referring to version 4 of the P. glauca assembly described in Warren et al., 2015, The Plant Journal).

- Not all of the valuable resources presented and discussed in this manuscript are publicly available. I specify the missing ones in the following detailed comments.
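For readers less familiar with the contiguity statistic discussed in the major criticisms, the following is a minimal sketch of how an N50 value can be computed from a list of contig or scaffold lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the manuscript.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the running sum to half the total.
        if 2 * running >= total:
            return length


# Hypothetical toy assembly, not data from the manuscript.
example_lengths = [100, 80, 60, 40, 20]
print(n50(example_lengths))
```

N50 alone describes contiguity only; it says nothing about correctness, which is why additional assembly quality statistics are requested above.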

Detailed list of comments and criticisms:

Page 4, lines 14-16. I agree that the data provided could possibly aid a better annotation of other genomes. I am absolutely convinced that they will be helpful and valuable in evolutionary studies. However, I am skeptical about their contribution to the improvement of other gymnosperm genome assemblies. I say this because the evolutionary distances involved still encompass hundreds of millions of years, so it is difficult to envisage substantial and extended collinearity between Ginkgo and Pinus spp. or Picea spp.

Page 4, line 30: "analyses of structural variation". Considering the lack of a closely related sequenced organism for comparison, "structural variation" is definitely a misused term in this context. Even under the most general and liberal definition, the study of structural variation presented in the manuscript is limited at best to the analysis of tandem gene duplications. For these reasons this sentence is something of an overstatement. Please rewrite it according to my criticisms.

Page 5, line 24: "10.002 Gbp". A precision reaching the third decimal position is definitely not needed here.

Page 6, lines 10 and following. Here the authors should compare (and then discuss) the number of predicted genes in Ginkgo with similar predictions in other sequenced gymnosperm genomes.

Page 6 Line 22: I cannot find an explanation for the acronym HPD. Please add it.

Page 6, line 28: "5316 orthologous genes". In Figure 1 the number is 5116. Please correct the manuscript (or the figure).

Page 6, lines 37-49: in presenting the GO enrichment results, the authors should give a more detailed description than a plain listing of the highest-level GO terms. For instance, the enrichment of "Biotic stimulus response" terms is an interesting piece of information; however, some detail illustrating which biotic stimulus is affected would be more useful. The same goes for the MF and CC categories.

Page 6 line 54: "an exceptional proportion". Similar figures have been described in other gymnosperm genomes and in maize too. Because of this I wouldn't call this proportion "exceptional". I think that "remarkable" better depicts the situation here.

Page 6, Lines 58-59 "include two predominant superfamilies". Not surprisingly because these two LTR-RT superfamilies are the only ones found in plants...and so, since there are no other competitors they cannot be described as "predominant".

Page 6, lines 56-60. Please recheck the figures given in this paragraph because their sum does not seem to be correct. Gypsy: 63.5%, Copia: 20.4%. The total should be 83.9%; instead, 79.37% is presented.

Page 7, Line 1: Phylogenetic trees are mentioned. It would be important to make these trees (or at least the alignments used to build them) publicly available. Also the authors mentioned the "domains of reverse transcriptase": they should specify if the complete RT was used or just a tract. In this latter case they should specify which one and, again, make these sequences publicly available.

Page 7, lines 5-9. This comparison is not clear. In particular, how did the authors retrieve the data for the other species? Did they search other assemblies for these domains? Or did they retrieve them from supplemental Materials if available? Please specify and, if the latter case applies, cite the appropriate literature.

Page 7 lines 12-13 "Gene tree": which gene? That's the Phylogenetic tree for Ty3-gypsy elements...

Page 7, lines 12-59. I see some issues and missing information here. In particular:

- How did the authors define the clades they are referring to? I can see a clear tree topology; however, I expect some sort of statistical evaluation in support of it.
- Did the authors perform a bootstrap analysis?
- If so, with how many replicates?
- What is the bootstrap support for each of these clades?

Also, I would avoid referring to single clades as "left-most".

Page 7 line 27. Typo: P. aibes should be P. abies

Page 7, lines 35-37, from "...to maize was far more diverse..." on. I admit I have some difficulty grasping the meaning of this sentence. I suggest rephrasing it.

Page 7, lines 47-48. Regarding the higher conservation of Ty1-copia elements, there are papers that can be cited in support of this evidence for plants in general and for gymnosperms in particular. See for instance:

Wicker, T., Keller, B., 2007. Genome-wide comparative analysis of copia retrotransposons in Triticeae, rice, and Arabidopsis reveals conserved ancient evolutionary lineages and distinct dynamics of individual copia families. Genome Res. 17, 1072-1081.

Smykal, P., Kalendar, R., Ford, R., Macas, J., Griga, M., 2009. Evolutionary conserved lineage of Angela-family retrotransposons as a genome wide microsatellite repeat dispersal agent. Heredity 103, 157-167.

Moisy, C., Schulman, A.H., Kalendar, R., Buchmann, J.P., Pelsy, F., 2014. The Tvv1 retrotransposon family is conserved between plant genomes separated by over 100 million years. Theor. Appl. Genet. 127, 1223-1235.

Zuccolo et al., 2015. The Ty1-copia LTR retroelement family PARTC is highly conserved in conifers over 200 MY of evolution. Gene 568, 89-99.

Page 7, line 45: why should clade 1 be "the most conserved"?

Page 7, lines 50-51: Possibly I missed something here; however, why should clade 1 be the most basal, considering that this tree is not rooted?

Page 7, lines 53-54: "...remarkably less expansion...". Note that the phylogenetic trees have been built using alignments from the extant population of LTR-RTs. Because of this, there is no evidence pointing to "less expansion": it could well be the opposite, i.e. a "small retention"...

Page 7, line 57. Typo: P. aibes should be P. abies.

Page 8, lines 3-8. This is an important piece of information and as such it deserves a better description of the strategy used. For instance:

- How many complete LTR-RT elements were predicted?
- The sequences of these elements should be made publicly available (or at least their coordinates in the repeats gff3 file should be pointed out).
- Which strategy was used to infer their insertion times? I guess it was the one proposed by SanMiguel et al. in 1998; if so, please cite it properly.
- Most importantly: which mutation rate was used to translate the nucleotide distances into time?

On a side note, I somewhat disagree with the use of the term "burst" to describe an event spanning at least 8 My: you can simply say that most of the amplification occurred 16-24 Mya (please correct this throughout the text).
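To make the request about the mutation rate concrete, here is a minimal sketch of the usual LTR-based dating approach: the two LTRs of an element are identical at insertion, so their divergence K is converted to time as T = K / (2r). The Jukes-Cantor correction, the toy sequences and the rate value are assumptions chosen for illustration; they are not the authors' actual method or data.

```python
import math


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only columns where both sequences have an unambiguous base.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        raise ValueError("sequences too divergent for the JC69 correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_time_years(distance, rate_per_site_per_year):
    """T = K / (2r): both LTRs accumulate substitutions independently."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate_per_site_per_year)


# Hypothetical 5' and 3' LTR fragments (toy data) and an assumed rate.
ltr_5 = "ACGTACGTACGTACGTACGT"
ltr_3 = "ACGTACGAACGTACGTACGT"
assumed_rate = 2.2e-9  # substitutions per site per year, illustrative only
k = jc69_distance(ltr_5, ltr_3)
print(round(insertion_time_years(k, assumed_rate) / 1e6, 1), "My")
```

Because the estimated time scales inversely with r, the choice of rate directly determines where any amplification peak falls on the time axis.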

Page 8, lines 12-34: here (or in the discussion) please discuss also "Evolution of gene structure in the conifer Picea glauca: a comparative analysis of the impact of intron size" Stival Sena et al, BMC Plant Biology 2014.

Page 8, lines 24-31, from "The intron regions..." to "percentage of repeats": please rephrase this sentence, which, as it stands, does not convey a clear description of the results.

Page 8, lines 30-33. Is this sentence a general comment on the results, as the reference would suggest, or was the "preferential accumulation" demonstrated in this study for ginkgo genes?

Page 8, line 45. Please add 4DTV to the list of abbreviations or briefly explain it.

Page 8, line 56. Typo: Z. may should be Z. mays.

Page 9, line 42: "...in male is obviously..." I would replace "obviously" with "clearly" here.

Pages 9-11: please provide a more detailed description of the conditions under which RNA was extracted from the different tissues for RNA-seq. Also, specify whether any replicates were carried out. As a general comment and recommendation regarding this section of the manuscript, I would suggest properly cautioning the reader that all these data and evidence are interesting but absolutely preliminary, especially considering that the experimental design was intended to give a glimpse of the complexity of these data but, as it stands, is far from robust.

Page 10, line 37: please add the appropriate unit to the transcriptome data figures. It's FPKM here.

Page 12, lines 25-31: the authors should also add a comparison with the TE content estimates available for other gymnosperms. Also, in the case of rice it would be better to provide and cite the most recent estimate, which can be found in "The map-based sequence of the rice genome", Nature 436, 793-800, 2005.

Page 12, lines 56-58: the authors should also discuss the evidence proposed in "Early genome duplications in conifers and other seed plants", Li et al., Science Advances 2015.

Page 13 lines 6-9: Similarly to what I pointed out before, it's not clear how the data for Norway spruce were obtained. Without this information it is pointless to speculate about the differences that can be seen. Furthermore a comparison with data available for Loblolly pine would be interesting here. Note also that the figures provided in line 7 (1506 and 686) are different from those proposed at page 7 (2416 vs 1790). Please explain and clarify this discrepancy.

Page 13, line 16: what do the authors mean by "LTR/gene"?

Page 13 line 16: "might occur in ~3 mya". Indeed it occurred during an even shorter time frame. See: Baucom RS, Estill JC, Chaparro C, Upshaw N, Jogi A, et al. (2009) Exceptional Diversity, Non-Random Distribution, and Rapid Evolution of Retroelements in the B73 Maize Genome. PLoS Genet 5(11): e1000732. doi:10.1371/journal.pgen.1000732

Page 13, lines 18-22: From "The removal..." till "Norway spruce". Actually it is not the removal but the "lack of" efficient removal possibly leading to the huge genome of Norway spruce. Please reword the sentence accordingly.

Page 14, line 34: explain the meaning of the acronym ROS (or add it to the list of abbreviations)

Page 14, lines 42-44. Add the reference for the discovery and characterization of FLS2.

Page 14, line 48: state the number of these duplicated genes in A. thaliana and provide a reference for this information (or explain how it has been retrieved).

Page 15, lines 33-41. It is not clear where the comparisons discussed were described in results. I am referring in particular to Carica papaya that is mentioned for the first time in discussion. As it is, this sentence is not supported by the data collected and presented.

Page 15, lines 52-57 "For gymnosperm..." I suspect that a few words or an entire sentence are missing in this paragraph because, as it is, its meaning is quite obscure.

Page 17, line 22: "low quality bases": please use a Phred-like quality (Q) value to define "low quality bases".

Page 17, lines 47-52: please make the de novo repeat libraries publicly available. This would be extremely useful information for the scientific community.
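As an illustration of what a Phred-based definition could look like, the sketch below converts between Phred scores and error probabilities (Q = -10 log10 P) and reports the fraction of bases in a read falling below a cutoff. The Q20 threshold and the function names are assumptions for the example only; the manuscript's actual filtering criteria are not known here.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score Q to a base-calling error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def fraction_low_quality(qualities, threshold=20):
    """Fraction of bases in a read whose Phred score is below `threshold`.

    The default threshold of 20 (1% error) is an illustrative assumption.
    """
    if not qualities:
        raise ValueError("qualities must not be empty")
    return sum(q < threshold for q in qualities) / len(qualities)


# Hypothetical per-base quality scores for one read.
read_qualities = [10, 30, 30, 30]
print(fraction_low_quality(read_qualities))
```

Stating the cutoff as a Q value, together with the fraction of low-quality bases that triggers removal of a read, would make the filtering step reproducible.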

Page 18, line 31: "de novo genes". These are not necessarily new genes, they could be just highly diverged or simply species specific ones.

Page 19, lines 28-32 from "We filtered..." on. The sentence is quite confusing; please rephrase it.

Figure 1d: please briefly comment in the text on the strange placement of OSAT and ATHA mixed with Gb genes in cluster 1804.

Figure 3: explain in the figure legend the meaning of the colored stars.

Finally, the Abstract should be rewritten according to the changes made in the results, methods and discussion sections.

Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? If not, please specify what is required in your comments to the authors.
No.

Are the conclusions adequately supported by the data shown? If not, please explain in your comments to the authors.
No.

Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? If not, please specify what is required in your comments to the authors.
Yes.

Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used?
There are no statistics in the manuscript.

Quality of written English
Please indicate the quality of language in the manuscript:
Needs some language corrections before being published.

Declaration of competing interests
Please complete a declaration of competing interests, consider the following questions:

1. Have you in the past five years received reimbursements, fees, funding, or salary from an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organization that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this manuscript?
If you can answer no to all of the above, write ‘I declare that I have no competing interests’ below.
If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

I agree to the open peer review policy of the journal.

REVIEW 1, REVIEWED ON AUGUST 30, 2016

In the present version of the manuscript, the authors positively addressed most of the criticisms, suggestions and comments I provided in the previous review.

The manuscript quality has improved and, although some sections have been removed, it still describes enough data analysis to be considered a full research article. There are some issues to address, though. I list them here:

a) As indicated in the first round of review, the quality of the written English still needs some editing. I suggest contacting a professional editing service to address this point.

b) Page 5, paragraph "Genome annotation". Please better describe the "tandem repeats" identified; do they include SSRs?

c) Page 6: Supplementary figure 4 is not displayed in the supplementary figures file.

d) Page 7, paragraph "Evolution of LTR-RTs". "Phylogenetic trees suggested that..." Actually, the trees did not suggest this; rather, the similarity searches carried out by the authors retrieved this amount of data.

e) Page 7: the amount of LTR-RT related sequence data stated for the different species still remains a major issue for me. In particular, the authors claim that 2416 Ty1-copia and 1790 Ty3-gypsy elements were retrieved in P. abies. If I understood the authors' reply to my previous comments correctly, these data for P. abies came from a search of assembly version 1. However, a quick tblastn search of the v1 P. abies assembly, using as a query the Ty1-copia RT sequence indicated by the authors, gave about 11,900 positive hits when an e-value of 1e-5 was set as the significance threshold. Of these hits, more than 9,000 are longer than 95 AA residues. I suspect the very same is true for Ty3-gypsy. How can this inconsistency be explained?
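The kind of check described in comment e) can be sketched as follows: count the hits in a tabular BLAST report that pass an e-value cutoff, and among those the hits whose alignment exceeds a minimum length. The sketch assumes the 12 standard tab-separated columns of BLAST+ `-outfmt 6`; the function name and the toy hits are hypothetical and do not reproduce the search reported above.

```python
import io


def count_hits(lines, max_evalue=1e-5, min_length=95):
    """Count tabular BLAST hits passing an e-value cutoff, and those
    whose alignment length also exceeds `min_length` residues.

    Expects the 12 standard tab-separated columns of BLAST+ `-outfmt 6`.
    """
    significant = 0
    long_hits = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        length = int(fields[3])      # alignment length
        evalue = float(fields[10])   # expectation value
        if evalue <= max_evalue:
            significant += 1
            if length > min_length:
                long_hits += 1
    return significant, long_hits


# Hypothetical hits (toy data), not results from the P. abies assembly.
toy = io.StringIO(
    "rt_query\tscaf1\t62.0\t120\t40\t2\t1\t120\t500\t859\t1e-30\t110\n"
    "rt_query\tscaf2\t55.0\t80\t30\t3\t5\t84\t100\t339\t1e-12\t60\n"
    "rt_query\tscaf3\t30.0\t150\t90\t9\t1\t150\t10\t459\t0.5\t25\n"
)
print(count_hits(toy))
```

Reporting the cutoffs used alongside the counts would allow the figures for each species to be compared on equal terms.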

f) Page 8: "...to the most basal clade". Again, I reiterate what I said in the first round of review: it does not make sense to talk of a basal clade in an unrooted phylogenetic tree. The authors replied that they used "most basal" with the same meaning as "most conserved". This is not the case. So please correct the text accordingly, i.e. replace "basal-most clade 1" with "the most conserved clade 1".

g) Page 8, paragraph "TE insertions in introns". The authors provided a p-value (< 2.0e-6) but did not state which statistical test was used. Please state it.

h) Page 9, paragraph "gene duplications". To estimate the timing of WGD events, the authors used a mutation rate of 2.2e-9. This was calculated for LTR-RTs (Nystedt et al., 2013). The authors should instead use the rate of 0.68e-9 calculated for gymnosperm genes, as described in Buschiazzo et al., 2012 ("Slow but not low: genomic comparisons reveal slower evolutionary rate and higher dN/dS in conifers compared to angiosperms", BMC Evolutionary Biology).

i) Page 12, from "With comparison to Norway spruce..." on. Once more, in order to properly gauge the meaning of these figures, it is important to understand how these data were obtained and why they differ so strikingly from the evidence gathered by carrying out a simple tblastn search (see comment "e"). Furthermore, as a general comment regarding these comparisons involving different species, it is important to note that the differences seen are likely also due to the different metrics characterizing the various genome assemblies.
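To show why the choice of rate in comment h) matters, the sketch below applies the common relation T = D / (2r) to a single pairwise distance using the two rates quoted in that comment. The relation, the function name and the distance value are assumptions for illustration; the manuscript's actual distance measure and peak values are not reproduced here.

```python
def divergence_time_years(distance, rate_per_site_per_year):
    """Age of a duplication from a pairwise distance: T = D / (2 * r)."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate_per_site_per_year)


LTR_RATE = 2.2e-9    # rate quoted in comment h) as derived for LTR-RTs
GENE_RATE = 0.68e-9  # rate quoted in comment h) for gymnosperm genes

# Hypothetical distance peak for a duplication event (not from the manuscript).
peak_distance = 0.5

for label, rate in (("LTR-RT rate", LTR_RATE), ("gene rate", GENE_RATE)):
    age_my = divergence_time_years(peak_distance, rate) / 1e6
    print(f"{label}: {age_my:.0f} My")
```

Whatever the distance, switching from the LTR-RT rate to the gene rate multiplies the estimated age by 2.2 / 0.68, i.e. a bit more than three-fold.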

l) Page 13, lines 7-10. The sentence as it is is not clear. Please rewrite.

m) Page 17, lines 5-6. These data are valuable; however, the authors should caution the reader about the inherent limitations of a search for LTR-RTs based only on LTR_STRUC run under the (loose) default settings. In particular, a significant number of false positives is expected from such a search. Indeed, if the authors carry out a simple dot plot analysis of the longest and shortest putative LTR-RTs they identified, they will see in the first case a significant number of nested insertions and, in the latter, several tandemly arranged repeats misidentified by the program as LTRs. All of this is expected but, again, it should be pointed out.
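A simple dot plot check of the kind suggested in comment m) can be sketched as below: exact word matches between a sequence and itself are collected, and the share of matches lying off the main diagonal gives a rough signal of internal repetition such as tandem arrays. The word size, the exact-match criterion and the summary statistic are simplifying assumptions; dedicated dot plot tools use more refined scoring and a graphical display.

```python
def dot_plot(seq_a, seq_b, word=4):
    """Return the set of (i, j) positions where the `word`-length
    substring of seq_a starting at i exactly matches seq_b at j."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    dots = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], ()):
            dots.add((i, j))
    return dots


def off_diagonal_fraction(seq, word=4):
    """Share of self-comparison matches lying off the main diagonal.

    High values hint at internal repeats, e.g. tandem arrays.
    """
    dots = dot_plot(seq, seq, word)
    if not dots:
        return 0.0
    return sum(i != j for i, j in dots) / len(dots)


# Hypothetical toy sequences: a tandem array and a repeat-free fragment.
tandem = "ACGTTGCA" * 4
unique = "ACGTTGCAGA"
print(round(off_diagonal_fraction(tandem), 2), off_diagonal_fraction(unique))
```

Plotting the returned coordinates for a putative element shows tandem repeats as parallel off-diagonal lines, which is how arrays misidentified as LTRs can be spotted by eye.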

n) Page 17, line 9: "distance" should be "nucleotide distance".

o) Supplementary figure 6: add a legend.

 

Authors' response to reviews: https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0154-1/13742_2016_154_AuthorComment_V1.pdf

 


Source

    © 2016 the Reviewer (CC BY 4.0 - source).

Content of review 2, reviewed on October 21, 2016

Reviewer's report:
I'm satisfied with the authors' response to my comments.

Are the methods appropriate to the aims of the study, are they well described, and are
necessary controls included? If not, please specify what is required in your comments to the
authors.
Yes

Are the conclusions adequately supported by the data shown? If not, please explain in your
comments to the authors.
Yes

Does the manuscript adhere to the journal’s guidelines on minimum standards of
reporting? If not, please specify what is required in your comments to the authors.
Yes

Are you able to assess all statistics in the manuscript, including the appropriateness of
statistical tests used? (If an additional statistical review is recommended, please specify what
aspects require further assessment in your comments to the editors.)
There are no statistics in the manuscript.

Quality of written English
Please indicate the quality of language in the manuscript:
Acceptable

Declaration of competing interests
Please complete a declaration of competing interests, consider
the following questions:

1. Have you in the past five years received reimbursements, fees, funding, or salary from an
organization that may in any way gain or lose financially from the publication of this
manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organization that may in any way gain or lose
financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the
manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that holds
or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this manuscript?
If you can answer no to all of the above, write ‘I declare that I have no competing interests’ below.
If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included
on my report to the authors and, if the manuscript is accepted for publication, my named report
including any attachments I upload will be posted on the website along with the authors'
responses. I agree for my report to be made available under an Open Access Creative Commons
CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
which I do not wish to be included in my named report can be included as confidential comments
to the editors, which will not be published.

I agree to the open peer review policy of the journal.


Source

    © 2016 the Reviewer (CC BY 4.0 - source).

References

    Rui, G., Yunpeng, Z., He, Z., Guangyi, F., Xin, L., Wenbin, Z., Chengcheng, S., Jiahao, W., Weiqing, L., Xinming, L., Yuanyuan, F., Kailong, M., Lijun, Z., Fumin, Z., Zuhong, L., Ming-Yuen, L. S., Xun, X., Jian, W., Huanming, Y., Chengxin, F., Song, G., Wenbin, C. 2016. Draft genome of the living fossil Ginkgo biloba. GigaScience.