Abstract

Background: The ocean sunfish (Mola mola), which can grow up to a length of 2.7 m and weigh 2.3 tons, is the world's largest bony fish. It has an extremely fast growth rate and its endoskeleton is mainly composed of cartilage. Another unique feature of the sunfish is its lack of a caudal fin, which is replaced by a broad and stiff lobe that results in the characteristic truncated appearance of the fish.

Results: To gain insights into the genomic basis of these phenotypic traits, we sequenced the sunfish genome and performed a comparative analysis with other teleost genomes. Several sunfish genes involved in the growth hormone and insulin-like growth factor 1 (GH/IGF1) axis signalling pathway were found to be under positive selection or accelerated evolution, which might explain its fast growth rate and large body size. A number of genes associated with the extracellular matrix, some of which are involved in the regulation of bone and cartilage development, have also undergone positive selection or accelerated evolution. A comparison of the sunfish genome with that of the pufferfish (fugu), which has a caudal fin, revealed that the sunfish contains more homeobox (Hox) genes although both genomes contain seven Hox clusters. Thus, caudal fin loss in sunfish is not associated with the loss of a specific Hox gene.

Conclusions: Our analyses provide insights into the molecular basis of the fast growth rate and large size of the ocean sunfish. The high-quality genome assembly generated in this study should facilitate further studies of this 'natural mutant'.

Authors

Pan, Hailin;  Yu, Hao;  Ravi, Vydianathan;  Li, Cai;  Lee, Alison P.;  Lian, Michelle M.;  Tay, Boon-Hui;  Brenner, Sydney;  Wang, Jian;  Yang, Huanming;  Zhang, Guojie;  Venkatesh, Byrappa

    Overall this is a fascinating paper that outlines the insights from genomic analyses of one of the most enigmatic and charismatic fishes in the ocean, the Ocean sunfish (Mola mola). We have a number of suggestions for the authors that will lead to a more comprehensive manuscript. Please find our comments below.

    As a referee we ask that you assess the paper on its own merits. The following list of potential issues may be helpful.

    1. Is the rationale for collecting and analyzing the data well defined? Is the work carried out on a dataset that can be described as "large-scale" within the context of its field? Does it clearly describe the dataset and provide sufficient context for the reader to understand its potential uses? Does it properly describe previous work?

    The motivation for conducting this research was well described in a solidly written introduction. The raw data appears to be of high quality and coverage relative to other work in the field and certainly this data set will be of significant interest to the research community. The types of analyses carried out are creative and will be informative but require some changes to ensure their accuracy and replicability.

    2. Is it clear how data was collected and curated?

    Credit should be given for transparency and provision of all supporting information.
    It would be helpful if the sex of the individual sequenced was specified. Because the sample was
    obtained in 1998 it would also be useful to know the storage conditions of the blood prior to
    DNA extraction or, if the DNA was extracted at that time, the storage conditions of the DNA
    between extraction and sequencing. The method of DNA extraction should also be specified.

    3. Is it clear - and was a statement provided - on how data and analyses tools used in the study
    can be accessed?

    While we make every effort to make sure this information is available, we appreciate reviewers
    providing an extra eye to make absolutely certain that this information is clearly stated and
    properly available. Data availability and access to tools are essential for reproducibility and
    provide the best means for reuse.

    There are a few instances where it is not clear what tools were used to perform certain analyses.
    Please see detailed comments below.

    4. Are accession numbers given or links provided for data that, as a standard, should be
    submitted to a community approved public repository?

    Following community standards for data sharing is a requirement of the journal. Additionally,
    data sharing in the broadest possible manner expands the ways in which data and tools can be
    accessed and used.

    At the moment there are only tentative NCBI accession numbers given for the assembled
    genome (XXXXX). I can see that on NCBI a BioProject (Accession: PRJNA305960 ID: 305960)
    and BioSample (SAMN04335856) have already been registered which is good. It would be
    helpful if all of the different insert size libraries (additional file 1: table S1) were named in the
    table and when the raw reads are submitted to the SRA for easy cross-referencing.

    5. Is the data and software available in the public domain under a Creative Commons license?

    Note, that unless otherwise stated, data hosted in our database (GigaDB) is available under a
    CC0 waiver. Additionally, did the authors indicate where the software tools and relevant source
    code are available, under an appropriate Open Source Initiative compliant license? If the source
    code is currently not in a hosted repository, we can help authors copy it over to a GigaScience
    GitHub repository.

    6. Are the data sound and well controlled?

    If you feel that inappropriate controls have been used please say so, indicating the reasons for
    your concerns, and suggesting alternative controls where appropriate. If you feel that further
    experimental/clinical evidence is required for obtaining solid biological conclusions and
    substantiating the results, please provide details.

    Portions of the analysis, especially the definition of the gene family clusters, require more careful
    definition and clarification. The text regularly refers to 'single-copy' genes but often does not
    clearly define the criteria for classifying genes as 'single-copy' and perhaps as a consequence
    the text reads as internally inconsistent, with different analyses using different datasets of
    'single-copy' genes ranging from 1,690 to 3,738 to 10,660. Rather than calling each of these datasets
    'single-copy genes' a more specific descriptor should be used for the phylogenetic level at which
    homology was assessed and whether or not multiple paralogs are present in each case. For
    example the 1,690 gene set could be called 'single-copy ray-finned fish homologs' as this dataset
    should comprise only cases where gar and all teleost genomes contain only a single homolog,
    while the 10,660 gene set could be called simply 'teleost homologs' as this dataset is restricted to
    teleosts but includes cases where multiple paralogs (e.g. igf1ra, igf1rb) are present and therefore
    are not 'single-copy'. It is not clear to me if or where the 3,738 'single-copy orthologous' gene set
    was used or whether this gene set includes paralogs or not. I would additionally urge caution in
    describing genes as 'orthologous' where simple phenetic (i.e. BLAST-based) methods are used to
    classify them. Orthology has a very specific phylogenetic meaning and in the context of teleost
    genomes especially it is important to distinguish between orthologous and paralogous sequences.
    Where orthology and paralogy have not been assessed using phylogenetic methods the general
    term 'homology' should be used instead.

    7. Is the interpretation (Analysis and Discussion) well balanced and supported by the data?

    The interpretation should discuss the relevance of all the results in an unbiased manner. Are the
    interpretations overly positive or negative? Note that the authors may include opinions and
    speculations in an optional 'Potential Implications' section of the manuscript; thus, if there is
    material in other parts of the manuscript that you feel would be better suited in such a section,
    please state that. Conclusions drawn from the study should be valid and result directly from the
    data shown, with reference to other relevant work as applicable. Have the authors provided
    references wherever necessary?

    The authors are appropriately careful in drawing biological conclusions from their data and
    throughout the analysis and discussion always imply potential roles rather than implying direct
    causality.

    8. Are the methods appropriate, well described, and include sufficient details and supporting
    information to allow others to evaluate and replicate the work?

    Please remark on the suitability of the methods for the study.

    If statistical analyses have been carried out, please indicate if you feel they need to be assessed
    specifically by an additional reviewer with statistical expertise.

    In some cases more detailed descriptions of the methodology including parameters are needed.
    See details below.

    9. What are the strengths and weaknesses of the methods?

    Please comment on any improvements that could be made to the study design to enhance the
    quality of the results. If any additional experiments are required, please give details. If novel
    experimental techniques were used please pay special attention to their reliability and validity.

    In some instances methodological improvements during analysis seem to be necessary to meet
    minimum requirements for publication. Please see below for details.

    10. Have the authors followed best-practices in reporting standards?

    This is an essential component as ease of reproducibility and usability are key criteria for
    manuscript publication. Please note, the methodology sections should never contain "protocol
    available upon request" or "e-mail author for detailed protocol". Have the authors followed and
    used reporting checklists recommended by the Biosharing network and if the methods are
    amenable, have the authors used workflow management systems such as Galaxy, Taverna or one
    of the many related systems listed on MyExperiment? We can also host these in our Giga-Galaxy
    server if they currently do not have a home. We also encourage use of virtual machines and
    containers such as Docker. And the use and deposition of both wet-lab and computational
    protocols in a protocols repository like protocols.io.

    In some cases additional details are required, particularly for methodology during the annotation
    and homologous gene cluster building.

    11. Can the writing, organization, tables and figures be improved?

    Although the editorial team may also assess the quality of the written English, please do
    comment if you consider the standard is below that expected for a scientific publication.

    If the manuscript is organized in such a manner that it is illogical or not easily accessible to the
    reader please suggest improvements. Please provide feedback on whether the data are presented
    in the most appropriate manner; for example, is a table being used where a graph would give
    increased clarity? Do the figures appear to be genuine, i.e. without evidence of manipulation, and
    of a high enough quality to be published in their present form?

    The manuscript is clearly written. I have suggested moving analysis of bone-forming genes to
    the main text as it warrants attention. Some minor changes to figures have been recommended.
    Please see below for details.

    12. When revisions are requested.


    Reviewers may recommend revisions for any or all of the following reasons: the data require
    additional testing to ensure their quality, additional data are required to support the authors'
    conclusions; better justification is needed for the arguments based on existing data; or the clarity
    and/or coherence of the paper needs to be improved.


    Several changes and/or clarifications are necessary prior to being published. Please see below
    for details.

    13. Are there any ethical or competing interests issues you would like to raise?

    The study should adhere to ethical standards of scientific/medical research and the authors
    should declare that they have received ethics approval and/or patient consent for the study, where
    appropriate.

    Whilst we do not expect reviewers to delve into authors' competing interests, if you are aware of
    any issues that you do not think have been adequately addressed, please inform the Editorial
    office.

    No issues.

    Detailed Revision Requests

    Introduction

    63 "other tetraodontid fishes such as pufferfish, boxfish and triggerfish"

    KM1: This should be changed to tetraodontiform fishes to refer to the whole order and to avoid
    confusion with the family tetraodontidae (pufferfishes only).

    Genome assembly and annotation

    KM2: The number and sizes of the different libraries should be mentioned in the main text along
    with (an abbreviated version perhaps reporting only N50, contig number, scaffold number, and
    total size) of the assembly metrics - possibly in a brief concatenated figure combining S1, S2,
    and S3.

    KM3: It should be made clear in the text that the estimate of 134X coverage is based on a (later
    described) k-mer counting method of genome size estimation. Table S1 also says 131X
    coverage rather than 134X so whichever is correct should be used.

    KM4: My preference would also be that the coverage statistics reported in the main text should
    refer to the reads actually used to produce the assembly and not the discarded data i.e. table S2
    "statistics of clean reads" as this is a more accurate reflection of what was used for producing the
    assembly used in all downstream analyses, so 96X coverage rather than 131X.

    KM5: The number of reads from each library actually used to produce the assembly should be
    reported so it is clear how much of the "clean" 68.87Gb was used by the assembler and how
    much was discarded. If this isn't available as a direct output from SOAPdenovo, all of the clean
    reads should be realigned to the genome assembly (e.g. using bwa or bowtie) and it should be
    reported what proportion of the clean reads align uniquely and concordantly to the genome
    assembly. This will also give a good idea of the completeness of the assembly.
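
    To illustrate the summary requested in KM5, the following is a minimal sketch of how the
    proportion of reads aligning concordantly and with high mapping quality could be tallied from
    SAM records. It assumes the alignments are available as SAM text lines, and it uses a MAPQ
    threshold (min_mapq, a hypothetical default of 30) as a stand-in for "unique" alignment. MAPQ
    semantics differ between aligners, so the threshold is an assumption and not a recommendation.

```python
from typing import Iterable

# Flag bits as defined in the SAM specification.
FLAG_PROPER_PAIR = 0x2
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_SUPPLEMENTARY = 0x800


def concordant_unique_fraction(sam_lines: Iterable[str], min_mapq: int = 30) -> float:
    """Fraction of primary read records that are properly paired with MAPQ >= min_mapq.

    min_mapq is an assumed proxy for unique placement; adjust it for the aligner used.
    """
    total = 0
    passing = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2: bitwise FLAG
        mapq = int(fields[4])  # column 5: mapping quality
        if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if flag & FLAG_UNMAPPED:
            continue
        if (flag & FLAG_PROPER_PAIR) and mapq >= min_mapq:
            passing += 1
    return passing / total if total else 0.0


# Toy records (hypothetical read and scaffold names), not data from the manuscript.
toy_sam = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tscaffold1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r1\t147\tscaffold1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r3\t99\tscaffold2\t500\t3\t50M\t=\t700\t250\t*\t*",
]
fraction = concordant_unique_fraction(toy_sam)
print(f"concordant, high-MAPQ fraction: {fraction:.2f}")
```

    Reporting this fraction per library would show how much of each insert-size library is
    represented in the assembly.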

    KM6: The parameters used for the SOAPdenovo assembly need to be stated. Justification for
    why these parameters and not others were used should be given, even if you only decided on
    these parameters a posteriori after comparing assemblies. Did you try a range of parameters and
    compare assembly metrics? Did you try a range of assembly programs? If yes this should be
    stated and summarized as a supplementary table. If not then it needs to be made clear that you
    only produced one assembly and did not compare, but the parameters you used still need to be
    stated.

    KM7: The programs used for filtering, trimming and/or correcting the raw reads need to be
    stated along with the thresholds for calling a read or a base "low quality" and discarding it.

    KM8: More detailed results from the CEGMA analysis should be provided. Did you identify
    98.4% predicted 'full-length' proteins, or only partial proteins? Please report both values.
    Although I think CEGMA is still a useful tool, the authors should note that CEGMA is no longer
    maintained by the creators and they have released an alternative (BUSCO):
    http://www.acgt.me/blog/2015/5/18/goodbye-cegma-hello-busco

    KM9: Given that the assembly comprised 642Mb (88%) of an estimated 730Mb genome
    estimated by the authors using a kmer counting method, it would be useful to have some
    discussion of sunfish genome size estimated by other methods e.g. flow cytometry, see (Brainerd,
    E.L. et al., 2001. Patterns of genome size evolution in tetraodontiform fishes. Evolution, 55(11),
    pp. 2363-2368) and some personal communications by the authors themselves communicated in
    T. Ryan Gregory's genome size database (http://www.genomesize.com/) which both suggest
    even larger genome sizes for sunfish. A stringent realignment of the clean reads to the genome
    assembly should also give an idea of what proportion of the read data has been used by the
    assembly and what proportion has been discarded.
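
    The arithmetic behind the figures in KM9 can be sketched as follows. The sketch takes the
    assembly size (642 Mb), the k-mer based genome size estimate (730 Mb) and the clean read
    total (68.87 Gb, assumed here to mean gigabases) from this report. The function names are
    illustrative only, and the depth obtained depends on which genome size estimate is used as the
    denominator, so it need not reproduce the values in the supplementary tables.

```python
def assembled_fraction(assembly_bp: float, genome_bp: float) -> float:
    """Share of the estimated genome that is present in the assembly."""
    return assembly_bp / genome_bp


def sequencing_depth(total_read_bp: float, genome_bp: float) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return total_read_bp / genome_bp


ASSEMBLY_BP = 642e6      # assembly size quoted in KM9
KMER_GENOME_BP = 730e6   # k-mer based genome size estimate quoted in KM9
CLEAN_READ_BP = 68.87e9  # clean read total quoted in KM5 (assumed to be bases)

fraction = assembled_fraction(ASSEMBLY_BP, KMER_GENOME_BP)
depth = sequencing_depth(CLEAN_READ_BP, KMER_GENOME_BP)
print(f"assembled fraction of estimated genome: {fraction:.1%}")
print(f"clean-read depth on the k-mer estimate: {depth:.1f}X")
```

    Substituting a larger genome size, such as a flow cytometry estimate, lowers both values,
    which is why the choice of estimate should be discussed.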

    KM10: For the estimation of genome size using k-mer analysis, please state the tools used to
    make the calculation. How was the depth of 17mers counted? Is this an output of SOAPdenovo
    or another program like jellyfish?
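
    For readers unfamiliar with the approach, a common form of k-mer based genome size estimation
    divides the total number of k-mer occurrences by the depth at the main peak of the k-mer
    histogram. The sketch below assumes a histogram mapping depth to the number of distinct k-mers
    at that depth, and it excludes low-depth k-mers as probable sequencing errors (min_depth is a
    hypothetical cut-off). It is not necessarily the calculation the authors performed.

```python
from typing import Dict


def estimate_genome_size(kmer_histogram: Dict[int, int], min_depth: int = 3) -> float:
    """Estimate genome size as (total k-mer occurrences) / (peak k-mer depth).

    kmer_histogram maps a depth to the number of distinct k-mers seen at that depth.
    K-mers below min_depth are ignored as likely errors (an assumed cut-off).
    """
    usable = {depth: count for depth, count in kmer_histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The peak is the depth at which the most distinct k-mers occur.
    peak_depth = max(usable, key=usable.get)
    # Each distinct k-mer at depth d contributes d occurrences.
    total_kmers = sum(depth * count for depth, count in usable.items())
    return total_kmers / peak_depth


# Toy histogram, not the sunfish 17-mer data.
toy_histogram = {1: 1000, 2: 50, 19: 100, 20: 200, 21: 100}
size = estimate_genome_size(toy_histogram)
print(f"estimated genome size: {size:.0f} bp")
```

    Stating the k-mer counter, the error cut-off and the peak depth used would make the 730 Mb
    estimate reproducible.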

    97 "The sunfish genome comprises approximately 11% repetitive sequences,
    98 which is comparable to the repeat content of the fugu genome (Figure 1)."

    KM11: It could be made clearer in the main text if the figure of 11% refers to interspersed
    repeats only or is a combination including transposable elements, tandem repeats, and
    simple-sequence repeats. A breakdown of transposable element composition by type should be
    accessible from the RepeatMasker runs already carried out and would enhance this analysis and
    should be included in the supplementary data.

    99 "Using homology-based and de novo annotation methods, we predicted 19,605 protein-coding genes
    100 in the sunfish assembly"

    KM12: The type of homology-based and de novo annotation methods should be mentioned in the
    main text (i.e. tBLASTn against protein predictions from 5 genomes and AUGUSTUS). In the
    methods it should be described what the cut-off thresholds for tBLASTn alignments were and
    what criteria for annotating the sunfish homolog were used (i.e. where more than one protein
    aligned did you choose the one with the greatest length, %ID, E-Value?). Because the final gene
    set merged with GLEAN also contains AUGUSTUS predictions, please also report the sensitivity and
    specificity of the AUGUSTUS parameters chosen during the training.

    101 "Using a genome-wide set of 1,690 one-to-one
    102 orthologs in sunfish and seven other ray-finned fishes (fugu, Tetraodon, stickleback, medaka,
    103 tilapia, zebrafish and spotted gar), we reconstructed a phylogenetic tree and estimated the
    104 divergence times of various fish lineages using MCMCtree [8]."

    KM13: It needs to be clearly stated how this set of 1,690 one-to-one orthologs was chosen and
    verified. Ensembl is a large database with many types of export tools. Please specify the tools
    used and the thresholds/criteria used for defining one-to-one orthologs. Please also report the
    genome assembly and annotation version for each genome separately rather than the Ensembl
    release version. A supplementary file containing the gene names and accession numbers for
    each of the additional ray-finned fish genes and the corresponding sunfish gene model numbers
    used to form each cluster would be necessary to make this analysis reproducible.

    Figure 1

    KM14: The bootstrap support for each of the nodes in the tree should be reported on the figure.
    The figure (preferably) or at least the legend needs to specify which assembly and annotation
    version of each of the genomes reported are being used to source the values for genome size,
    repeat content, and number of genes. If the repeat content comes from your own analysis rather
    than the published genomes this should be made clear as well. The value of 1.3% for the spotted
    gar repeat content is very different from the reported value of 20% from the gar genome paper
    (Braasch, I. et al., 2016. The spotted gar genome illuminates vertebrate evolution and facilitates
    human-teleost comparisons. Nature Genetics, 48(4)) and this should be double-checked.

    Population size history.

    KM15: Having never carried out such analyses my expertise is limited here but I would
    appreciate a very brief explanation of the core methodology of PSMC in the text or methods and
    a brief justification of its use highlighting its potential strengths and weaknesses. Preferably cite
    one or two examples that show that PSMC analysis is appropriate for comparing genomes which
    diverged >50mya rather than 250 thousand years (over two orders of magnitude difference) as
    this seems like it might be problematic.

    Positively-selected and fast-evolving genes

    127 "Using a set of 10,660 one-to-one orthologues from five teleost species (sunfish, fugu,
    128 Tetraodon, medaka and zebrafish) we conducted positive selection analyses"

    KM16: Calling this 10,660 gene set 'one-to-one orthologues' is confusing as it contains multiple
    paralogs present in different quantities in different teleost genomes. It should be described how
    many sunfish paralogs are found in each case, and whether the subsequent selection analyses
    used the teleost 'a' or 'b' paralogy groups as the sunfish genes do not seem to be classified within
    the teleost 'a' or 'b' paralogy groups. For example, insulin growth factor 1 receptor (igf1r) is
    present as 2 paralogs in fugu, Tetraodon, medaka and zebrafish (igf1ra, igf1rb) but only one
    sunfish homolog (Sunfish09150) is reported in the selection analyses (Table S6, S7). Is this the
    ortholog of igf1ra or igf1rb? Table S8 suggests 2 copies of igf1r are found in sunfish and reports
    2 dN/dS values and LRT p-values but doesn't distinguish which is which. Furthermore, the LRT
    p-values reported in table S6 and S7 don't correspond with those reported in table S8 (5.78 x 10^-4
    for the one igf1r paralog presented in S6, and 3.64 x 10^-7, 2.3 x 10^-3 for the two igf1r paralogs
    presented in S8). It would help if the sunfish gene models were annotated with 'a' or 'b' if this
    has been assessed - and if orthology hasn't been assessed calling them (1 of 2) and (2 of 2) would
    be more appropriate. If different paralogs, rather than orthologs, were used in any alignments the
    dN/dS estimations and inferences of evolutionary rates are meaningless so it is crucial that the
    methods used to assess orthology are careful and clearly described.

    395 "We picked
    396 up genes whose likelihood values of H1 are significantly larger (LRT p-value of <0.05) than
    397 H0 and likelihood values of H2 are not significantly larger than H1."

    KM17: During the hypothesis testing it would also be more appropriate to select genes whose
    likelihood values of H1 (sunfish evolving independently from rest of the tree) are significantly
    greater than both H0 (all branches evolving at the same rate) and H2 (all branches evolving
    independently) before then sorting from this set which sunfish genes have a larger ω (dN/dS). It
    would also be interesting to report which sunfish genes have a lower ω as this might imply a
    greater amount of constraint.
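
    As a worked illustration of the test being discussed, the sketch below computes a likelihood
    ratio test between H0 and H1 and then records the direction of the rate difference. It assumes
    the two models differ by one free parameter, so the statistic is compared against a chi-square
    distribution with one degree of freedom. The function names and the log-likelihood and dN/dS
    values are hypothetical. Comparing H1 with H2 involves more degrees of freedom and would need a
    general chi-square function such as scipy.stats.chi2.sf.

```python
import math
from typing import Tuple


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (lnl_alt - lnl_null)


def chi2_sf_df1(stat: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    if stat <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(stat / 2.0))


def classify_gene(lnl_h0: float, lnl_h1: float, omega_sunfish: float,
                  omega_background: float, alpha: float = 0.05) -> Tuple[str, float]:
    """Report whether H1 fits better than H0 and, if so, in which direction."""
    p_value = chi2_sf_df1(lrt_statistic(lnl_h0, lnl_h1))
    if p_value >= alpha:
        return "no significant rate difference", p_value
    direction = "faster" if omega_sunfish > omega_background else "slower"
    return f"significantly {direction} in sunfish", p_value


# Hypothetical values for one gene, not taken from the supplementary tables.
label, p = classify_gene(lnl_h0=-1000.0, lnl_h1=-998.0,
                         omega_sunfish=0.10, omega_background=0.30)
print(label, f"(p = {p:.4f})")
```

    Separating the significance test from the direction of the difference makes it explicit that a
    significant H1 does not by itself mean a higher dN/dS in sunfish.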

    144 "Using the branch models in PAML [20], we found multiple genes in the
    145 GH/IGF1 axis (ghr1, igf1r, grb2, irs1, irs2, jak2, stat5, akt3) with significantly higher dN/dS
    146 values compared to other lineages, suggesting that these genes are evolving rapidly in the
    147 sunfish lineage"

    KM18: Contrary to the above statement, the authors are not reporting sunfish genes with
    significantly higher dN/dS than other lineages but rather sunfish genes for which hypothesis H1
    (sunfish genes evolving at a different rate from the rest of the tree) is a significantly better
    hypothesis than H0 (all branches evolving equally). There are also multiple examples (both
    paralogs of irs1, one of the paralogs of irs2, one of the paralogs of jak2, and stat5) where the
    dN/dS value in sunfish is actually lower than the background dN/dS implying the sunfish genes
    are actually evolving slower than the background.

    Table S8. Copy number and LRT p-values of sunfish genes in the GH/IGF-1 axis.

    KM19: This should be changed to "select genes in the GH/IGF-1 axis" as this is not a
    comprehensive list of genes involved in this pathway.

    131 "we identified 1117 genes that contained positively-selected sites
    132 specifically in sunfish (Additional file 3: Table S7)."

    KM20: The authors should report how many sites (either absolute number or proportion of
    coding sequence) appear to be under positive selection for each of these cases in their
    supplementary data. Could the authors please also clarify whether their claim that these 1117
    genes contained positively-selected sites specifically in sunfish means that the sites or that the
    genes show signs of positive selection only in sunfish?

    132 "Inspection of the fast-evolving and
    133 positively-selected gene sets revealed several interesting genes."

    KM21: 'Positively-selected genes' should be replaced with 'genes with positively selected sites' as
    none of the genes showed outright signs of positive selection (dN/dS > 1).

    KM22: Ideally the authors would perform a type of overrepresentation analysis using for
    example GO or KEGG pathway terms to determine without bias whether the GH/IGF pathway,
    ECM components, or bone formation for example turn up more or less frequently than expected
    at random in their set of 'rapidly-evolving' or 'positively-selected' genes. Otherwise it should be
    made clear that the authors specifically looked at genes in the GH/IGF pathway and ECM. For
    example "we examined genes in the GH/IGF pathway" rather than "inspection ... revealed" as this
    implies that these genes somehow stood out from the rest of the data - which might be the case
    but without an overrepresentation analysis it is not clear.
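
    One simple form of such an overrepresentation analysis is a one-sided hypergeometric test,
    sketched below. The background size (10,660 genes) and the selected set size (1117 genes) are
    the figures quoted in this report; the number of genes annotated to a pathway and the number of
    those found in the selected set are hypothetical placeholders. A full analysis over many GO or
    KEGG terms would also need a multiple-testing correction, which is not shown.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """P(X >= hits) when drawing `sample` genes without replacement from `population`
    genes, of which `annotated` carry the term of interest."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


BACKGROUND_GENES = 10660  # gene set size quoted in this report
SELECTED_GENES = 1117     # genes with positively-selected sites quoted in this report
PATHWAY_GENES = 40        # hypothetical number of background genes in the pathway
PATHWAY_HITS = 9          # hypothetical number of those genes in the selected set

p_enrich = hypergeom_enrichment_p(BACKGROUND_GENES, PATHWAY_GENES,
                                  SELECTED_GENES, PATHWAY_HITS)
print(f"one-sided enrichment p-value: {p_enrich:.4g}")
```

    Reporting such p-values for every term tested, and not only for the pathways highlighted in the
    text, would show whether the GH/IGF axis stands out from the rest of the data.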

    144 "we found multiple genes in the
    145 GH/IGF1 axis (ghr1, igf1r, grb2, irs1, irs2, jak2, stat5, akt3) with significantly higher dN/dS
    146 values compared to other lineages, suggesting that these genes are evolving rapidly in the
    147 sunfish lineage"

    KM23: Again here as I understand it the analysis tested whether there was a significant
    difference between H1 and H0, not whether there was a significant difference in dN/dS between
    sunfish and other lineages. If this is a separate analysis it should be clearly stated. Furthermore
    several dN/dS values reported for sunfish in table S8 are actually lower than the background
    reported.

    147 "We found that both copies of igf1r
    148 (igf1ra and igf1rb) are under positive selection in the sunfish (Figure 2, Additional file 1: Table
    149 S8)"

    KM24: Here please also replace "under positive selection" with "contain sites under positive
    selection". The same applies to ECM analysis. If you have indeed assessed orthology with
    igf1ra and igf1rb please make this clear in earlier methods sections and report orthology in table
    S7, S8 and elsewhere.

    190 "However, the sunfish
    191 possesses intact orthologues for most of these genes except for some SCPP genes (see
    192 Supplementary Material)"

    KM25: I find it disjointed that this analysis alone is described in supplementary materials. As it
    is integral to the motivation for conducting the study the analysis of bone forming genes should
    be included in the main text.

    Additional File 1
    "We identified orthologues for all the above genes in the ocean sunfish genome on (a)
    scaffold10.1, (b) scaffold39.1, (c) scaffold20.1, and (d) scaffold77.1, except Optc and Omd."

    KM26: Please state how you identified these homologs. Did you perform tblastn, or tblastx
    genome wide against your assembly and what did you use as your query sequences? What were
    the similarity thresholds you used?

    "We BLASTX-searched the ocean sunfish loci of (a) and (b) to identify Optc and Omd
    respectively, but did not identify these genes."

    KM27: Again, please clarify the type of BLAST algorithm you ran and the query and target
    sequences you used. The above statement implies you used blastx to run the sunfish scaffolds as
    a query against a database containing Optc and Omd protein sequences. Is this correct? Which
    species were the Optc and Omd proteins sourced from? What were the cutoff parameters used?

    "An alignment of Runx2 proteins shows that ocean sunfish Runx2 is highly conserved (e.g. its
    DNA-binding domain is perfectly conserved; its central and C-terminal domains look intact as
    well) (data not shown)."

    KM28: I have no reason to doubt this but if you are reporting it I suggest you show the data
    especially as your supplementary data is not restricted.

    KM29: The analysis of presence/absence of each of the target bone-formation related genes
    should be presented in a table (in either the main text or SI). In each case where homologs of
    bone-formation genes were found in sunfish the exact number of homologs found should be
    stated. E.g. "For Smad4, we identified up to four copies in ocean sunfish" is confusing and the
    exact number should be reported.

    200 "However, it has lost two P/Q-rich SCPP genes (fa93e10 and scpp7) that are conserved in the
    201 other two teleosts"

    KM30: Before concluding gene loss please make it clear if you have searched the whole genome
    assembly and not just the identified clusters for these genes, and whether you have also searched
    the raw genomic reads which may contain unassembled reads corresponding to the missing
    genes.

    KM31: Because of the complex duplication history of SCPP genes I would consider it essential
    to carefully assess homology of each of the genes in the P/Q-rich SCPP gene cluster with
    phylogenetic methods to ensure that scpp7 is indeed lost and that additional sunfish SCPP genes
    reported as scpp3b1 and scpp3b2 for example are not actually orthologs of scpp4, and that the
    reported pseudogene of scpp4 is not in fact scpp7.

    KM32: To confirm that scpp4 is indeed a pseudogene and that the insertion of the "T" is not a
    sequencing/assembly error please report the results of a read re-mapping to this locus to verify
    that the additional "T" is present in most raw reads which realign to this site.

    Hox genes

    KM33: In figure S3 a more appropriate or additional outgroup for analysis of Hox clusters in
    teleosts would be the spotted gar, which the authors have also previously used in their own
    analyses. See (Braasch, I. et al., 2016. The spotted gar genome illuminates vertebrate evolution
    and facilitates human-teleost comparisons. Nature Genetics, 48(4)). The figure would be
    improved if the authors marked the independent gene losses which occurred on each branch to
    highlight the differences in sunfish from other teleosts. It should also be reported what scaffold
    numbers in the sunfish assembly each Hox cluster corresponds to, in a similar fashion as reported
    for SCPP genes in Figure 4.

    Are the methods appropriate to the aims of the study, are they well described, and are
    necessary controls included?
    If not, please specify what is required in your comments to the authors.

    No

    Are the conclusions adequately supported by the data shown?
    If not, please explain in your comments to the authors.

    Yes

    Does the manuscript adhere to the journal's guidelines on minimum standards of reporting
    (http://resourcecms.springer.com/springercms/rest/v1/content/7117202/data/v1/Minimum+standards+of+reporting+checklist)?
    If not, please specify what is required in your comments to the authors.

    Yes

    Are you able to assess all statistics in the manuscript, including the appropriateness of
    statistical tests used?
    (If an additional statistical review is recommended, please specify what aspects require further
    assessment in your comments to the editors.)

    Yes, and I have assessed the statistics in my report.

    Quality of written English
    Please indicate the quality of language in the manuscript:

    Acceptable

    Declaration of competing interests
    Please complete a declaration of competing interests; consider the following questions:
    1. Have you in the past five years received reimbursements, fees, funding, or salary from an
    organization that may in any way gain or lose financially from the publication of this
    manuscript, either now or in the future?
    2. Do you hold any stocks or shares in an organization that may in any way gain or lose
    financially from the publication of this manuscript, either now or in the future?
    3. Do you hold or are you currently applying for any patents relating to the content of the
    manuscript?
    4. Have you received reimbursements, fees, funding, or salary from an organization that
    holds or has applied for patents relating to the content of the manuscript?
    5. Do you have any other financial competing interests?
    6. Do you have any non-financial competing interests in relation to this manuscript?
    If you can answer no to all of the above, write ‘I declare that I have no competing interests’
    below. If your reply is yes to any, please give details below.


    I declare that I have no competing interests.

    I agree to the open peer review policy of the journal. I understand that my name will be included
    on my report to the authors and, if the manuscript is accepted for publication, my named report
    including any attachments I upload will be posted on the website along with the authors'
    responses. I agree for my report to be made available under an Open Access Creative Commons
    CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
    which I do not wish to be included in my named report can be included as confidential comments
    to the editors, which will not be published.

    I agree to the open peer review policy of the journal.

    Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0144-3/13742_2016_144_AuthorComment_V1.pdf)


    Published in
    Ongoing discussion
  • Overall this is a fascinating paper that outlines the insights from genomic analyses of one of the most enigmatic and charismatic fishes in the ocean, the Ocean sunfish (Mola mola). We have a number of suggestions for the authors that will lead to a more comprehensive manuscript. Please find our comments below.

    As a referee we ask that you assess the paper on its own merits. The following list of potential issues may be helpful.

    1. Is the rationale for collecting and analyzing the data well defined? Is the work carried out on a dataset that can be described as "large-scale" within the context of its field? Does it clearly describe the dataset and provide sufficient context for the reader to understand its potential uses? Does it properly describe previous work?

    The motivation for conducting this research was well described in a solidly written introduction. The raw data appears to be of high quality and coverage relative to other work in the field and certainly this data set will be of significant interest to the research community. The types of analyses carried out are creative and will be informative but require some changes to ensure their accuracy and replicability.

    2. Is it clear how data was collected and curated?

    Credit should be given for transparency and provision of all supporting information.It would be helpful if the sex of the individual sequenced was specified. Because the sample wasobtained in 1998 it would also be useful to know the storage conditions of the blood prior toDNA extraction or, if the DNA was extracted at that time, the storage conditions of the DNAbetween extraction and sequencing. The method of DNA extraction should also be specified.

    3. Is it clear - and was a statement provided - on how data and analyses tools used in the studycan be accessed?

    While we make every effort to make sure this information is available, we appreciate reviewersproviding an extra eye to make absolutely certain that this information is clearly stated andproperly available. Data availability and access to tools are essential for reproducibility andprovide the best means for reuse.

    There are a few instances where it is not clear what tools were used to perform certain analyses.Please see detailed comments below.

    4. Are accession numbers given or links provided for data that, as a standard, should besubmitted to a community approved public repository?

    Following community standards for data sharing is a requirement of the journal. Additionally,data sharing in the broadest possible manner expands the ways in which data and tools can beaccessed and used.

    At the moment there are only tentative NCBI accession numbers given for the assembledgenome (XXXXX). I can see that on NCBI a BioProject (Accession: PRJNA305960 ID: 305960)and BioSample (SAMN04335856) have already been registered which is good. It would behelpful if all of the different insert size libraries (additional file 1: table S1) were named in thetable and when the raw reads are submitted to the SRA for easy cross-referencing.

    5. Is the data and software available in the public domain under a Creative Commons license?

    Note, that unless otherwise stated, data hosted in our database (GigaDB) is available under aCC0 waiver. Additionally, did the authors indicate where the software tools and relevant sourcecode are available, under an appropriate Open Source Initiative compliant license? If the sourcecode is currently not in a hosted repository, we can help authors copy it over to a GigaScienceGitHub repository.

    6. Are the data sound and well controlled?

    If you feel that inappropriate controls have been used please say so, indicating the reasons foryour concerns, and suggesting alternative controls where appropriate. If you feel that furtherexperimental/clinical evidence is required for obtaining solid biological conclusions andsubstantiating the results, please provide details.Portions of the analysis, especially the definition of the gene family clusters, require more carefuldefinition and clarification. The text regularly refers to 'single-copy' genes but often does notclearly define their criteria for classifying genes as 'single-copy' and perhaps as a consequencethe text reads as internally inconsistent, with different analyses using different datasets of 'singlecopy'genes ranging from 1,690 to 3,738 to 10,660. Rather than calling each of these datasets'single-copy genes' a more specific descriptor should be used for the phylogenetic level at whichhomology was assessed and whether or not multiple paralogs are present in each case. Forexample the 1,690 gene set could be called 'single-copy ray-finned fish homologs' as this datasetshould comprise only cases where gar and all teleost genomes contain only a single homolog,while the 10,660 gene set could be called simply 'teleost homologs' as this dataset is restricted toteleosts but includes cases where multiple paralogs (e.g. igfr1a, igfr1b) are present and thereforeare not 'single-copy'. It is not clear to me if or where the 3,738 'single-copy orthologous' gene setwas used or whether this gene set includes paralogs or not. I would additionally urge caution indescribing genes as 'orthologous' where simple phenetic (i.e. BLAST-based) methods are used toclassify them. 
Orthology has a very specific phylogenetic meaning and in the context of teleostgenomes especially it is important to distinguish between orthologous and paralogous sequences.Where orthology and paralogy have not been assessed using phylogenetic methods the generalterm 'homology' should be used instead.

    7. Is the interpretation (Analysis and Discussion) well balanced and supported by the data?

    The interpretation should discuss the relevance of all the results in an unbiased manner. Are theinterpretations overly positive or negative? Note that the authors may include opinions andspeculations in an optional 'Potential Implications' section of the manuscript; thus, if there ismaterial in other parts of the manuscript that you feel would be better suited in such a section,please state that. Conclusions drawn from the study should be valid and result directly from thedata shown, with reference to other relevant work as applicable. Have the authors providedreferences wherever necessary?

    The authors are appropriately careful in drawing biological conclusions from their data andthroughout the analysis and discussion always imply potential roles rather than implying directcausality.

    8. Are the methods appropriate, well described, and include sufficient details and supportinginformation to allow others to evaluate and replicate the work?

    Please remark on the suitability of the methods for the study.

    If statistical analyses have been carried out, please indicate if you feel they need to be assessedspecifically by an additional reviewer with statistical expertise.

    In some cases more detailed descriptions of the methodology including parameters are needed.See details below.9. What are the strengths and weaknesses of the methods?

    Please comment on any improvements that could be made to the study design to enhance thequality of the results. If any additional experiments are required, please give details. If novelexperimental techniques were used please pay special attention to their reliability and validity.In some instances methodological improvements during analysis seem to be necessary to meetminimum requirements for publication. Please see below for details.

    10. Have the authors followed best-practices in reporting standards?

    This is an essential component as ease of reproducibility and usability are key criteria formanuscript publication. Please note, the methodology sections should never contain "protocolavailable upon request" or "e-mail author for detailed protocol". Have the authors followed andused reporting checklists recommended by the Biosharing network and if the methods areamenable, have the authors used workflow management systems such as Galaxy, Taverna or oneof the many related systems listed on MyExperiment? We can also host these in our Giga-Galaxyserver if they currently do not have a home. We also encourage use of virtual machines andcontainers such as Docker. And the use and deposition of both wet-lab and computationalprotocols in a protocols repository like protocols.io.

    In some cases additional details are required, particularly for methodology during the annotationand homologous gene cluster building.

    11. Can the writing, organization, tables and figures be improved?

    Although the editorial team may also assess the quality of the written English, please docomment if you consider the standard is below that expected for a scientific publication.

    If the manuscript is organized in such a manner that it is illogical or not easily accessible to thereader please suggest improvements. Please provide feedback on whether the data are presentedin the most appropriate manner; for example, is a table being used where a graph would giveincreased clarity? Do the figures appear to be genuine, i.e. without evidence of manipulation, andof a high enough quality to be published in their present form?

    The manuscript is clearly written. I have suggested moving analysis of bone-forming genes tothe main text as it warrants attention. Some minor changes to figures have been recommended.Please see below for details.

    12. When revisions are requested.

    Reviewers may recommend revisions for any or all of the following reasons: the data requireadditional testing to ensure their quality, additional data are required to support the authors'conclusions; better justification is needed for the arguments based on existing data; or the clarityand/or coherence of the paper needs to be improved.

    Several changes and/or clarifications are necessary prior to being published. Please see belowfor details.

    13. Are there any ethical or competing interests issues you would like to raise?

    The study should adhere to ethical standards of scientific/medical research and the authorsshould declare that they have received ethics approval and/or patient consent for the study, whereappropriate.

    Whilst we do not expect reviewers to delve into authors' competing interests, if you are aware ofany issues that you do not think have been adequately addressed, please inform the Editorialoffice.

    No issues.

    Detailed Revision Requests

    Introduction

    63 "other tetraodontid fishes such as pufferfish, boxfish and triggerfish"

    KM1: This should be changed to tetraodontiform fishes to refer to the whole order and to avoidconfusion with the family tetraodontidae (pufferfishes only).Genome assembly and annotation

    KM2: The number and sizes of the different libraries should be mentioned in the main text alongwith (an abbreviated version perhaps reporting only N50, contig number, scaffold number, andtotal size) of the assembly metrics - possibly in a brief concatenated figure combining S1, S2,and S3.

    KM3: It should be made clear in the text that the estimate of 134X coverage is based on a (laterdescribed) k-mer counting method of genome size estimation. Table S1 also says 131Xcoverage rather than 134X so whichever is correct should be used.

    KM4: My preference would also be that the coverage statistics reported in the main text shouldrefer to the reads actually used to produce the assembly and not the discarded data i.e. table S2"statistics of clean reads" as this is a more accurate reflection of what was used for producing theassembly used in all downstream analyses, so 96X coverage rather than 131X.

    KM5: The number of reads from each library actually used to produce the assembly should bereported so it is clear how much of the "clean" 68.87Gb was used by the assembler and howmuch was discarded. If this isn't available as a direct output from SOAPdenovo, all of the cleanreads should be realigned to the genome assembly (e.g. using bwa or bowtie) and it should be reported what proportion of the clean reads align uniquely and concordantly to the genomeassembly. This will also give a good idea of the completeness of the assembly.

    KM6: The parameters used for the SOAPdenovo assembly need to be stated. Justification forwhy these parameters and not others were used should be given, even if you only decided onthese parameters a posteriori after comparing assemblies. Did you try a range of parameters andcompare assembly metrics? Did you try a range of assembly programs? If yes this should bestated and summarized as a supplementary table. If not then it needs to be made clear that youonly produced one assembly and did not compare, but the parameters you used still need to bestated.

    KM7: The programs used for filtering, trimming and/or correcting the raw reads need to bestated along with the thresholds for calling a read or a base "low quality" and discarding it.

    KM8: More detailed results from the CEGMA analysis should be provided. Did you identify98.4% predicted 'full-length' proteins, or only partial proteins? Please report both values.Although I think CEGMA is still a useful tool, the authors should note that CEGMA is no longermaintained by the creators and they have released an alternative (BUSCO):http://www.acgt.me/blog/2015/5/18/goodbye-cegma-hello-busco

    KM9: Given that the assembly comprised 642Mb (88%) of an estimated 730Mb genomeestimated by the authors using a kmer counting method, it would be useful to have somediscussion of sunfish genome size estimated by other methods e.g. flow cytometry, see (Rainerd,E.L.L.B. et al., 2001. Patterns of Genome Size Evolution in Tetraodontiform Fishes.55(11),pp.2363-2368) and some personal communications by the authors themselves communicated inT. Ryan Gregory's genome size database (http://www.genomesize.com/) which both suggesteven larger genome sizes for sunfish. A stringent realignment of the clean reads to the genomeassembly should also give an idea of what proportion of the read data has been used by theassembly and what proportion has been discarded.

    KM10:For the estimation of genome size using k-mer analysis, please state the tools used tomake the calculation. How was the depth of 17mers counted? Is this an output of SOAPdenovoor another program like jellyfish?

    97 "The sunfish genome comprises approximately 11% repetitive sequences,98 which is comparable to the repeat content of the fugu genome (Figure 1)."

    KM11:It could be made clearer in the main text if the figure of 11% refers to interspersedrepeats only or is a combination including transposable elements, tandem repeats, and simplesequencerepeats. A breakdown of transposable element composition by type should beaccessible from the RepeatMasker runs already carried out and would enhance this analysis andshould be included in the supplementary data.99 "Using homology-based and de novo annotation methods, we predicted 19,605 protein-codinggenes100 in the sunfish assembly"

    KM12:The type of homology-based and de novo annotation methods should be mentioned in themain text (i.e. tBLASTn against protein predictions from 5 genomes and AUGUSTUS). In themethods it should be described what the cut-off thresholds for tBLASTn alignments were andwhat criteria for annotating the sunfish homolog were used (i.e. where more than one proteinaligned did you choose the one with the greatest length, %ID, E-Value?) Because the final geneset merged with GLEAN also contains AUGUSTUS please also report the sensitivity andspecificity of the AUGUSTUS parameters chosen during the training.

    101 "Using a genome-wide set of 1,690 one-to-one102 orthologs in sunfish and seven other ray-finned fishes (fugu, Tetraodon, stickleback,medaka,103 tilapia, zebrafish and spotted gar), we reconstructed a phylogenetic tree and estimated the104 divergence times of various fish lineages using MCMCtree [8]."

    KM13:It needs to be clearly stated how this set of 1,690 one-to-one orthologs was chosen andverified. Ensembl is a large database with many types of export tools. Please specify the toolsused and the thresholds/criteria used for defining one-to-one orthologs. Please also report thegenome assembly and annotation version for each genome separately rather than the Ensemblrelease version. A supplementary file containing the gene names and accession numbers foreach of the additional ray-finned fish genes and the corresponding sunfish gene model numbersused to form each cluster would be necessary to make this analysis reproducible.

    Figure 1

    KM14:The bootstrap support for each of the nodes in the tree should be reported on the figure.The figure (preferably) or at least the legend needs to specify which assembly and annotationversion of each of the genomes reported are being used to source the values for genome size,repeat content, and number of genes. If the repeat content comes from your own analysis ratherthan the published genomes this should be made clear as well. The value of 1.3% for the spottedgar repeat content is very different from the reported value of 20% from the gar genome paper(Braasch, I. et al., 2016. The spotted gar genome illuminates vertebrate evolution and facilitateshuman-teleost comparisons. Nature Genetics, 48(4)) and this should be double-checked.

    Population size history.

    KM15:Having never carried out such analyses my expertise is limited here but I wouldappreciate a very brief explanation of the core methodology of PSMC in the text or methods and a brief justification of its use highlighting its potential strengths and weaknesses. Preferably citeone or two examples that show that PSMC analysis is appropriate for comparing genomes whichdiverged >50mya rather than 250 thousand years (over two orders of magnitude difference) asthis seems like it might be problematic.

    Positively-selected and fast-evolving genes127 "Using a set of 10,660 one-to-one orthologues from five teleost species (sunfish, fugu,128 Tetraodon, medaka and zebrafish) we conducted positive selection analyses"

    KM16:Calling this 10,660 gene set 'one-to-one orthologues' is confusing as it contains multipleparalogs present in different quantities in different teleost genomes. It should be described howmany sunfish paralogs are found in each case, and whether the subsequent selection analysesused the teleost 'a' or 'b' paralogy groups as the sunfish genes do not seem to be classified withinthe teleost 'a' or 'b' paralogy groups. For example, insulin growth factor 1 receptor (igf1r) ispresent as 2 paralogs in fugu, Tetraodon, medaka and zebrafish (igf1ra, igf1rb) but only onesunfish homolog (Sunfish09150) is reported in the selection analyses (Table S6, S7). Is this theortholog of igf1ra or igf1rb? Table S8 suggests 2 copies of igf1r are found in sunfish and reports2 dN/dS values and LRT p-values but doesn't distinguish which is which. Furthermore, the LRTp-values reported in table S6 and S7 don't correspond with those reported in table S8 (5.78x10-4for the one igf1r paralog presented in S6, and 3.64x10-7, 2.3 x10-3 for the two igf1r paralogspresented in S8). It would help if the sunfish gene models were annotated with 'a' or 'b' if thishas been assessed - and if orthology hasn't been assessed calling them (1 of 2) and (2 of 2) wouldbe more appropriate. If different paralogs, rather than orthologs, were used in any alignments thedN/dS estimations and inferences of evolutionary rates are meaningless so it is crucial that themethods used to assess orthology are careful and clearly described.395 "We picked396 up genes whose likelihood values of H1 are significantly larger (LRT p-value of <0.05) than397 H0 and likelihood values of H2 are not significantly larger than H1."

    KM17:During the hypothesis testing it would also be more appropriate to select genes whoselikelihood values of H1 (sunfish evolving independently from rest of the tree) are significantlygreater than both H0 (all branches evolving at the same rate) and H2 (all branches evolvingindependently) before then sorting from this set which sunfish genes have a larger . It wouldalso be interesting to report which sunfish genes have a lower as this might imply a greateramount of constraint.144 "Using the branch models in PAML [20], we found multiple genes in the145 GH/IGF1 axis (ghr1, igf1r, grb2, irs1, irs2, jak2, stat5, akt3) with significantly higher dN/dS146 values compared to other lineages, suggesting that these genes are evolving rapidly in the147 sunfish lineage"

    KM18:Contrary to the above statement, the authors are not reporting sunfish genes withsignificantly higher dN/dS than other lineages but rather sunfish genes for which hypothesis H1(sunfish genes evolving at a different rate from the rest of the tree) is a significantly betterhypothesis than H0 (all branches evolving equally). There are also multiple examples (bothparalogs of irs1, one of the paralogs of irs2, one of the paralogs of jak2, and stat5) where thedN/dS value in sunfish is actually lower than the background dN/dS implying the sunfish genesare actually evolving slower than the background.

    Table S8. Copy number and LRT p-values of sunfish genes in the GH/IGF-1 axis.

    KM19:This should be changed to "select genes in the GH/IGF-1 axis" as this is not acomprehensive list of genes involved in this pathway.131 "we identified 1117 genes that contained positively-selected sites132 specifically in sunfish (Additional file 3: Table S7)."

    KM20:The authors should report how many sites (either absolute number or proportion ofcoding sequence) appear to be under positive selection for each of these cases in theirsupplementary data. Could the authors please also clarify whether their claim that these 1117genes contained positively-selected sites specifically in sunfish means that the sites or that thegenes show signs of positive selection only in sunfish.132 "Inspection of the fast-evolving and133 positively-selected gene sets revealed several interesting genes."

    KM21:'Positively-selected genes' should be replaced with 'genes with positively selected sites' asnone of the genes showed outright signs of positive selection (dN/dS > 1).

    KM22:Ideally the authors would perform a type of overrepresentation analysis using forexample GO or KEGG pathway terms to determine without bias whether the GH/IGF pathway,ECM components, or bone formation for example turn up more or less frequently than expectedat random in their set of 'rapidly-evolving' or 'positively-selected' genes. Otherwise it should bemade clear that the authors specifically looked at genes in the GH/IGF pathway and ECM. Forexample "we examined genes in the GH/IGF pathway" rather than "inspectionrevealed" as thisimplies that these genes somehow stood out form the rest of the data - which might be the casebut without an overrepresentation analysis it is not clear.144 "we found multiple genes in the145 GH/IGF1 axis (ghr1, igf1r, grb2, irs1, irs2, jak2, stat5, akt3) with significantly higher dN/dS146 values compared to other lineages, suggesting that these genes are evolving rapidly in the147 sunfish lineage"

    KM23:Again here as I understand it the analysis tested whether there was a significantdifference between H1 and H0, not whether there was a significant difference in dN/dS betweensunfish and other lineages. If this is a separate analysis it should be clearly stated. Furthermoreseveral dN/dS values reported for sunfish in table S8 are actually lower than the backgroundreported.147 "We found that both copies of igf1r148 (igf1ra and igf1rb) are under positive selection in the sunfish (Figure 2, Additional file 1:Table 149 S8)"

    KM24:Here please also replace "under positive selection" with "contain sites under positiveselection". The same applies to ECM analysis. If you have indeed assessed orthology withigf1ra and igf1rb please make this clear in earlier methods sections and report orthology in tableS7, S8 and elsewhere.190 "However, the sunfish191 possesses intact orthologues for most of these genes except for some SCPP genes (see192 Supplementary Material)"

    KM25:I find it disjointed that this analysis alone is described in supplementary materials. As itis integral to the motivation for conducting the study the analysis of bone forming genes shouldbe included in the main text.

    Additional File 1"We identified orthologues for all the above genes in the ocean sunfish genome on (a)scaffold10.1, (b) scaffold39.1, (c) scaffold20.1, and (d) scaffold77.1, except Optc and Omd."

    KM26:Please state how you identified these homologs. Did you perform tblastn, or tblastxgenome wide against your assembly and what did you use as your query sequences? What werethe similarity thresholds you used?

    "We BLASTX-searched the ocean sunfish loci of (a) and (b) to identify Optc and Omdrespectively, but did not identify these genes."

    KM27:Again, please clarify the type of BLAST algorithm you ran and the query and targetsequences you used. The above statement implies you used blastx to run the sunfish scaffolds asa query against a database containing Optc and Omd protein sequences. Is this correct? Whichspecies were the Optc and Omd proteins sourced from? What were the cutoff parameters used?

    "An alignment of Runx2 proteins shows that ocean sunfish Runx2 is highly conserved (e.g. itsDNA-binding domain is perfectly conserved; its central and C-terminal domains look intact aswell) (data not shown)."

    KM28:I have no reason to doubt this but if you are reporting it I suggest you show the dataespecially as your supplementary data is not restricted.

    KM29:The analysis of presence/absence of each of the target bone-formation related genesshould be presented in a table (in either the main text or SI). In each case where homologs ofbone-formation genes were found in sunfish the exact number of homologs found should bestated. E.g. "For Smad4, we identified up to four copies in ocean sunfish" is confusing and theexact number should be reported.

    200 "However, it has lost two P/Q-rich SCPP genes (fa93e10 and scpp7) that are conserved inthe201 other two teleosts"

    KM30:Before concluding gene loss please make it clear if you have searched the whole genomeassembly and not just the identified clusters for these genes, and whether you have also searched the raw genomic reads which may contain unassembled reads corresponding to the missinggenes.

    KM31:Because of the complex duplication history of SCPP genes I would consider it essentialto carefully assess homology of each of the genes in the P/Q-rich SCPP gene cluster withphylogenetic methods to ensure that scpp7 is indeed lost and that additional sunfish SCPP genesreported as scpp3b1 and scpp3b2 for example are not actually orthologs of scpp4, and that thereported pseudogene of scpp4 is not in fact scpp7.

    KM32:To confirm that scpp4 is indeed a pseudogene and that the insertion of the "T" is not asequencing/assembly error please report the results of a read re-mapping to this locus to verifythat the additional "T" is present in most raw reads which realign to this site.

    Hox genesKM33:In figure S3 a more appropriate or additional outgroup for analysis of Hox clusters inteleosts would be the spotted gar, which the authors have also previously used in their ownanalyses. See (Braasch, I. et al., 2016. The spotted gar genome illuminates vertebrate evolutionand facilitates human-teleost comparisons. Nature Genetics, 48(4)). The figure would beameliorated if the authors marked the independent gene losses which occurred on each branch tohighlight the differences in sunfish from other teleosts. It should also be reported what scaffoldnumbers in the sunfish assembly each Hox cluster corresponds to, in a similar fashion as reportedfor SCPP genes in Figure 4.

    Are the methods appropriate to the aims of the study, are they well described, and arenecessary controls included?If not, please specify what is required in your comments to the authors.

    NoAre the conclusions adequately supported by the data shown?If not, please explain in your comments to the authors.

    Yes

    Does the manuscript adhere to the journal’s guidelines on <a href=’http://resourcecms.springer.com/springercms/rest/v1/content/7117202/data/v1/Minimum+standards+of+reporting+checklist’target='new'>minimum standards of reporting?</a>If not, please specify what is required in your comments to the authors.

    Yes

    Are you able to assess all statistics in the manuscript, including the appropriateness ofstatistical tests used?(If an additional statistical review is recommended, please specify what aspects require furtherassessment in your comments to the editors.)

    Yes, and I have assessed the statistics in my report.

    Quality of written EnglishPlease indicate the quality of language in the manuscript:

    Acceptable

    Declaration of competing interestsPlease complete a declaration of competing interests, consider the following questions:1. Have you in the past five years received reimbursements, fees, funding, or salary from anorganization that may in any way gain or lose financially from the publication of thismanuscript, either now or in the future?2. Do you hold any stocks or shares in an organization that may in any way gain or losefinancially from the publication of this manuscript, either now or in the future?3. Do you hold or are you currently applying for any patents relating to the content of themanuscript?4. Have you received reimbursements, fees, funding, or salary from an organization thatholds or has applied for patents relating to the content of the manuscript?5. Do you have any other financial competing interests?6. Do you have any non-financial competing interests in relation to this manuscript?If you can answer no to all of the above, write ‘I declare that I have no competing interests’below. If your reply is yes to any, please give details below.

    I declare that I have no competing interests.

    I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

    I agree to the open peer review policy of the journal.

    Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0144-3/13742_2016_144_AuthorComment_V1.pdf)


  • This manuscript characterizes the genomic properties of the ocean sunfish and provides insights into its phenotypic specialization. The primary product of the study, the genome assembly, is well prepared and exhibits very high completeness and continuity, largely thanks to the relatively small genome size and the relatively low frequency of repetitive elements in the genome. Overall, this study, supported by the high-quality genome assembly, should contribute to the advancement of actinopterygian fish genomics, and I recommend publication of this manuscript in GigaScience, provided that the points below are reconsidered to improve the manuscript.

    1. Some morphological characteristics, including the loss of the caudal fin, are mentioned in search of possible genomic causes. But little information is included in the manuscript regarding the developmental process of the unique body plan. Is this because of the lack of records of embryological development?

    2. Page 6 / line 101, 'show similarity to sequences in public databases' : similarity to nucleotide or protein sequences?

    3. Some parts of the description in 'Analyses' include highly speculative expressions such as 'may have led to the extremely fast growth rate.' in page 9 line 169. It is recommended to move such speculative expressions to 'Discussion'.

    4. Page 11 / line 207, 'complete loss fa93e10 and scpp7': insert 'of' between 'loss' and 'fa93e10', if I understand correctly

    5. Page 11 / line 211, 'an exact orthology' : Is 'exact' orthology opposed to 'non-exact' orthology? It can't be. Thus, remove 'exact' from this sentence.

    6. Has any phenotypic evolution been clearly shown to be attributed to a Hox gene loss? If that has not been shown before, it may not be justified to analyze Hox gene repertoire for identifying causes of the sunfish's unique morphology.

    7. Page 15 / line 280, 'identified 98.4% of CEGs': I wonder if this is a figure for 'Complete' or 'Partial' gene detection in the CEGMA result?

    Table S1: For mate pair libraries, the figures for the column 'insert size' should not be 'insert size' but something like 'mate distance'.

    Are the methods appropriate to the aims of the study, are they well described, and are
    necessary controls included?
    If not, please specify what is required in your comments to the authors.

    Yes

    Are the conclusions adequately supported by the data shown?
    If not, please explain in your comments to the authors.

    Yes

    Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting (http://resourcecms.springer.com/springercms/rest/v1/content/7117202/data/v1/Minimum+standards+of+reporting+checklist)?
    If not, please specify what is required in your comments to the authors.

    Yes

    Are you able to assess all statistics in the manuscript, including the appropriateness of
    statistical tests used?
    (If an additional statistical review is recommended, please specify what aspects require further
    assessment in your comments to the editors.)

    Yes, and I have assessed the statistics in my report.

    Quality of written English
    Please indicate the quality of language in the manuscript:
    Acceptable

    Declaration of competing interests

    Please complete a declaration of competing interests, consider the following questions:
    1. Have you in the past five years received reimbursements, fees, funding, or salary from an
    organization that may in any way gain or lose financially from the publication of this
    manuscript, either now or in the future?
    2. Do you hold any stocks or shares in an organization that may in any way gain or lose
    financially from the publication of this manuscript, either now or in the future?
    3. Do you hold or are you currently applying for any patents relating to the content of the
    manuscript?
    4. Have you received reimbursements, fees, funding, or salary from an organization that
    holds or has applied for patents relating to the content of the manuscript?
    5. Do you have any other financial competing interests?
    6. Do you have any non-financial competing interests in relation to this manuscript?
    If you can answer no to all of the above, write ‘I declare that I have no competing interests’
    below. If your reply is yes to any, please give details below.

    I declare that I have no competing interests.

    I agree to the open peer review policy of the journal. I understand that my name will be included
    on my report to the authors and, if the manuscript is accepted for publication, my named report
    including any attachments I upload will be posted on the website along with the authors'
    responses. I agree for my report to be made available under an Open Access Creative Commons
    CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments
    which I do not wish to be included in my named report can be included as confidential comments
    to the editors, which will not be published.

    I agree to the open peer review policy of the journal.

     

    Authors' response to reviews: (https://static-content.springer.com/openpeerreview/art%3A10.1186%2Fs13742-016-0144-3/13742_2016_144_AuthorComment_V1.pdf)

     


All peer review content displayed here is covered by a Creative Commons CC BY 4.0 license.