Content of review 1, reviewed on September 12, 2016

This manuscript describes the genome assembly of the intriguing Chinese clearhead icefish. Overall, the sequencing and assembly meet the standards for a genome based on Illumina technology, as do the annotation and validation.

There are a few issues I would like to ask the authors to clarify:

1) Is it correct that a full third of the original sequencing data was discarded (252.1 Gbp -> 169.0 Gbp)? I could not find the exact meaning of SOAPfilter settings. (I think this tool does not include k-mer-based error correction or read trimming?)
2) The reason I ask is that the genome size calculations (lines 97-101) are incorrect. Given N = 10.5 billion and k-depth = 20, it is easy to see how the 525 Mbp genome size was derived. However, the formula is not G = N/k-depth, and there should have been only 2 billion original reads, so this is clearly not the read number. Calculating N using the correct formula (line 98), I get 525 million = N * (125-17+1)/20, so N = 96 million, which is also nowhere near the (filtered) number of reads. Was a subset used? (Also note that the formula is only valid if all reads are of identical length; therefore, trimmed reads should be omitted.) In any case, a k-mer depth of only 20 must be incorrect (or based on a subset) in itself, as the genome coverage (Table 1) is 315x.
3) Line 106: it should be reported that the 536 Mbp in scaffolds contain 121.7 Mbp in gaps. Whether the assembly then still qualifies as high quality is debatable: this depends fully on whether the genome size is really expected to be 525 Mbp (in which case the assembly misses 21% of the genome - not high quality), or whether the genome size is actually much smaller and the gaps between contigs are artificially large because of uncertainties in read library insert sizes.
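
The arithmetic behind points 1) to 3) above can be checked with a short sketch. It uses only the figures quoted in those points (252.1 and 169.0 Gbp; read length 125, k = 17 and k-depth 20; 536, 121.7 and 525 Mbp). The function names are illustrative and are not taken from the manuscript or from any tool.

```python
def discarded_fraction(raw_gbp, kept_gbp):
    """Fraction of the raw sequencing data removed by filtering."""
    return (raw_gbp - kept_gbp) / raw_gbp


def reads_implied(genome_bp, read_len, k, k_depth):
    """Solve G = N * (L - k + 1) / k_depth for the read count N."""
    return genome_bp * k_depth / (read_len - k + 1)


def missing_fraction(scaffold_mbp, gap_mbp, expected_mbp):
    """Share of the expected genome not covered by ungapped sequence."""
    return 1 - (scaffold_mbp - gap_mbp) / expected_mbp


# Point 1: 252.1 Gbp raw -> 169.0 Gbp after filtering
print(f"discarded: {discarded_fraction(252.1, 169.0):.1%}")

# Point 2: read count implied by G = 525 Mbp, L = 125, k = 17, k-depth = 20
print(f"implied reads: {reads_implied(525e6, 125, 17, 20) / 1e6:.1f} million")

# Point 3: 536 Mbp in scaffolds, 121.7 Mbp of gaps, 525 Mbp expected
print(f"missing: {missing_fraction(536, 121.7, 525):.1%}")
```

With these inputs the three values come out at roughly one third, about 96 million reads, and about 21%, matching the numbers given in the comments.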

Typos:
Line 104: 'fulfilled' -> better 'filled'
Line 124: ReBase -> RepBase. Also, please fix the author list of the corresponding entry [19] in the references.
Line 142: 'six … genomes, including…' then lists all six.

Level of interest
Please indicate how interesting you found the manuscript:
An article whose findings are important to those with closely related research interests

Quality of written English
Please indicate the quality of language in the manuscript:
Acceptable

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold or are you currently applying for any patents relating to the content of the manuscript?
Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
Do you have any other financial competing interests?
Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.

Authors' response to reviews:

Reviewer #1: This manuscript describes the genome assembly of the intriguing Chinese clearhead icefish. Overall, the sequencing and assembly meet the standards for a genome based on Illumina technology, as do the annotation and validation.
There are a few issues I would like to ask the authors to clarify:
1) Is it correct that a full third of the original sequencing data was discarded (252.1 Gbp -> 169.0 Gbp)? I could not find the exact meaning of SOAPfilter settings. (I think this tool does not include k-mer-based error correction or read trimming?)

Answer: Yes, we are confident in the filtering steps and in the sequencing read data. We applied the necessary filtering steps to remove 3 and 1 low-quality bases at the left and right edges of the raw reads, respectively, and to discard duplicated reads produced by PCR during sequencing. Usually, around 20-30% of raw reads were removed in our previous reports (You et al. 2014, Nature Communications, 5:5594; Yang et al., 2015, BMC Biology, 14:1; Bian et al., 2016, Scientific Reports, 6:24501; Chen et al., 2016, GigaScience, 5:39; Lin et al., 2016, Nature, in press) when the same processes were applied. In the present work, the discarded percentage is slightly higher because we generated much more sequence (over 300×; without prior estimation of the genome size) than the necessary 150~200×; we therefore used more stringent parameters to obtain a better assembly. We estimated the genome size (~0.5 Gb) with the k-mer analysis (lines 85-96 and Table 1).

SOAPfilter includes a trimming step that can remove reads with low-quality bases at both edges. An entry in the SOAPfilter input file looks like: 150822_I178_FCC7F9KANXX_L2_WHPROfreDAABDLAAPEI-106_1.fq.gz 3 1 40 or
150822_I178_FCC7F9KANXX_L2_WHPROfreDAABDLAAPEI-106_2.fq.gz 3 1 10. Here, 150822_I178_FCC7F9KANXX_L2_WHPROfreDAABDLAAPEI-106_1.fq.gz is the read file, and the following 3 and 1 indicate that SOAPfilter will discard reads with 3 and 1 low-quality bases at the left and right edges, respectively. The last number, 40 or 10, indicates that SOAPfilter will discard reads with over 40% low-quality bases or with more than 10 Ns (undetermined bases).
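
To make the layout of such an entry concrete, the following is a minimal sketch that splits one line into its four whitespace-separated fields. It is not SOAPfilter code and does not reproduce its behaviour; the class name, the field names (path, left, right, threshold) and the function parse_entry are hypothetical labels chosen for illustration only.

```python
from typing import NamedTuple


class FilterEntry(NamedTuple):
    """One line of the input list described above (field names are hypothetical)."""
    path: str       # read file
    left: int       # number given for the left edge
    right: int      # number given for the right edge
    threshold: int  # last number on the line (40 or 10 in the examples)


def parse_entry(line: str) -> FilterEntry:
    """Split a whitespace-separated entry into a file name and three integers."""
    path, left, right, threshold = line.split()
    return FilterEntry(path, int(left), int(right), int(threshold))


entry = parse_entry(
    "150822_I178_FCC7F9KANXX_L2_WHPROfreDAABDLAAPEI-106_1.fq.gz 3 1 40"
)
print(entry.left, entry.right, entry.threshold)
```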


2) The reason I ask is that the genome size calculations (lines 97-101) are incorrect. Given N = 10.5 billion and k-depth = 20, it is easy to see how the 525 Mbp genome size was derived. However, the formula is not G = N/k-depth, and there should have been only 2 billion original reads, so this is clearly not the read number. Calculating N using the correct formula (line 98), I get 525 million = N * (125-17+1)/20, so N = 96 million, which is also nowhere near the (filtered) number of reads. Was a subset used? (Also note that the formula is only valid if all reads are of identical length; therefore, trimmed reads should be omitted.) In any case, a k-mer depth of only 20 must be incorrect (or based on a subset) in itself, as the genome coverage (Table 1) is 315x.

Answer: Thank you for the question, which may have arisen from our ambiguous statements regarding the N value used to calculate the estimated genome size. We therefore rewrote the related section in the revised manuscript (lines 92-96) to state this clearly.
In fact, the start positions of sequenced reads follow a Poisson distribution. When the read length (L) is far shorter than the genome size (L << G), the bases and k-mers can be thought to be generated by random processes, and their coverage depth will also follow Poisson distributions [Liu et al., 2013].
Based on the Poisson theory, we applied the following formula to calculate the genome size: G = K_num/K_depth = b_num/b_depth, where K_num is the total number of k-mers from the sequencing data, K_depth is the expected coverage depth of k-mers, b_num is the total number of bases, and b_depth is the expected coverage depth of bases. As one read of length L generates L-K+1 k-mers, K_num/b_num = (L-K+1)/L. In our manuscript (lines 92-96), K_num was 10,500,000,000 and K_depth was 20. Therefore, the estimated genome size is 525 Mb.
Reference: Liu B, Shi Y, Yuan J, Hu X, Zhang H, Li N, Li Z, Chen Y, Mu D, Fan W: Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. Quantitative Biology 2013, 35(s 1–3):62-67.
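
The relation G = K_num/K_depth = b_num/b_depth can be illustrated with a small numerical sketch. K_num and K_depth are the values given in the answer; the read length L = 125 and K = 17 are assumptions taken from the reviewer's comment, since they are not restated in the answer. The variable and function names are illustrative.

```python
K_NUM = 10_500_000_000  # total number of k-mers (from the answer)
K_DEPTH = 20            # expected k-mer coverage depth (from the answer)
READ_LEN = 125          # assumed read length L (from the reviewer's comment)
K = 17                  # assumed k-mer size K (from the reviewer's comment)


def genome_size(count, depth):
    """G = count / depth, for k-mers or for bases."""
    return count / depth


def kmer_to_base_scale(read_len, k):
    """b_num / K_num = L / (L - K + 1), since each read yields L - K + 1 k-mers."""
    return read_len / (read_len - k + 1)


scale = kmer_to_base_scale(READ_LEN, K)
b_num = K_NUM * scale      # total bases corresponding to K_NUM k-mers
b_depth = K_DEPTH * scale  # base coverage depth corresponding to K_DEPTH

g_from_kmers = genome_size(K_NUM, K_DEPTH)
g_from_bases = genome_size(b_num, b_depth)

print(f"G from k-mers: {g_from_kmers / 1e6:.0f} Mb")
print(f"G from bases:  {g_from_bases / 1e6:.0f} Mb")
print(f"b_num: {b_num / 1e9:.2f} Gb, b_depth: {b_depth:.1f}x")
```

Under these assumed values, both ratios give 525 Mb, and K_num corresponds to roughly 12 Gb of sequence. That is far below the 169.0 Gbp of filtered data mentioned in the reviewer's first point, which would suggest that only part of the data entered the k-mer count.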


3) Line 106: it should be reported that the 536 Mbp in scaffolds contain 121.7 Mbp in gaps. Whether the assembly then still qualifies as high quality is debatable: this depends fully on whether the genome size is really expected to be 525 Mbp (in which case the assembly misses 21% of the genome - not high quality), or whether the genome size is actually much smaller and the gaps between contigs are artificially large because of uncertainties in read library insert sizes.

Answer: Thanks for your comment. Our assembly has a long contig at 17.2 kb, which is sufficient for further genome analyses, including annotation and evolutionary discussion. However, “high quality” should be reworded; hence, we changed it to “a draft genome”.
This is the first version of the clearhead icefish genome assembly, and we will sequence more with PacBio to improve the assembly quality for this valuable fish.


Typos:
Line 104: 'fulfilled' -> better 'filled'
Line 124: ReBase -> RepBase. Also, please fix the author list of the corresponding entry [19] in the references.
Line 142: 'six … genomes, including…' then lists all six.

Answer: Thanks for your nice suggestions. We revised these sentences according to your advice.



Reviewer #2: The paper presented here provides an efficient combination of different programs used to characterise the genome of the Chinese clearhead icefish.
I can only regret that the basic results provided here are far too succinct for the reviewers, and later the readers, to appreciate them fully. As an example, we are left with a final average value for the total number of transposable elements in this genome, but with no idea of the proportions of each TE subdivision. No evolutionary values are provided. Teleost fish are known to have undergone an extra round of whole genome duplication; this was neither searched for nor discussed. Synteny has not been considered either.
The authors put forward the methods they use, but they provide only minimal details of the results.
Why wasn't the homology annotation done using tilapia and platyfish?
Having such a genome in hand could have been an opportunity for more results, to enhance the interest of this publication. As an example, some phylogenetic analyses of key genes.

Answer: Thanks for your instructive suggestions. We agree that providing a detailed classification of repeat sequences and adding sections on synteny blocks and a phylogenetic tree will increase readers' interest. Therefore, we provide detailed descriptions of these three areas (lines 115-126 & 166-206) and the related Table 2 and Figures 1 & 2 in the revised manuscript.
On the other hand, we selected six representative species (Danio rerio, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, Esox lucius and Gasterosteus aculeatus) to perform the homology annotation. As reported in our previous genome papers (mudskipper: You et al., 2014; channel catfish: Chen et al., 2016), genome data from these six species were sufficient for gene annotation. Therefore, the final predicted gene set in our present icefish work, evaluated with the BUSCO software, was indeed relatively complete.
Reference:
1.You X, Bian C, Zan Q, Xu X, Liu X, Chen J, Wang J, Qiu Y, Li W, Zhang X et al: Mudskipper genomes provide insights into the terrestrial adaptation of amphibious fishes. Nature communications 2014, 5:5594.
2.Chen X, Zhong L, Bian C, Xu P, Qiu Y, You X, Zhang S, Huang Y, Li J, Wang M et al: High-quality genome assembly of channel catfish, Ictalurus punctatus. GigaScience 2016, 5(1):39.


Minor typos:
l15 Missing blank space
l42 no plural at cavefish (fishes only used when an exact number is provided)
l124 missing "p" in RepBase
l128 missing "o" in MaxPeriod
l151 we emplyed GLEAN (no need of article before GLEAN)
The bibliography has many entries with missing names (e.g. refs 19 and 39) or layout problems (as in refs 20, 21 and 23).

Answer: Thanks for your advice. We have corrected these issues according to your instructions.

 


Source

    © 2016 the Reviewer (CC BY 4.0).

Content of review 2, reviewed on October 19, 2016

I thank the authors for responding to the minor issues I raised, and for their inclusion of additional evolutionary analyses. I am satisfied with their responses, however I still have a few minor suggestions:

- Regarding the genome size calculation, thank you for the clarification. As I read the formula, it is now clear that this calculation is based on a subset of the data. This is of course fine (and wise), but to avoid confusion I would suggest explicitly stating this in line 88 or 95.
- Regarding the assembly size, I think the total gap length could be included in table 1.

As for the evolutionary analyses, these are interesting and appropriate, however in both cases I suggest rephrasing the final lines summarizing the results, as these are now slightly ambiguous.

- Line 185/186: 'the close relationship between clearhead icefish and zebrafish & medaka'. Zebrafish and medaka are themselves not closely related fish species at all. Perhaps it is better to simply state that the data demonstrate the phylogenetic position of the clearhead icefish.

-Line 206: 'clearhead icefish also experienced the WGD, and it appeared more recently than medaka'. This could be read as a more recent WGD than in the case of medaka, when it is of course the same WGD (at exactly the same time).

Level of interest
Please indicate how interesting you found the manuscript:
An article of importance in its field.

Quality of written English
Please indicate the quality of language in the manuscript:
Acceptable

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
Do you hold or are you currently applying for any patents relating to the content of the manuscript?
Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
Do you have any other financial competing interests?
Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.

Authors' response to reviews:

Reviewer #1: I thank the authors for responding to the minor issues I raised, and for their inclusion of additional evolutionary analyses. I am satisfied with their responses, however I still have a few minor suggestions:
- Regarding the genome size calculation, thank you for the clarification. As I read the formula, it is now clear that this calculation is based on a subset of the data. This is of course fine (and wise), but to avoid confusion I would suggest explicitly stating this in line 88 or 95.

Answer: Thanks for your nice suggestion. We corrected these sentences in our revised manuscript (lines 89-94).

- Regarding the assembly size, I think the total gap length could be included in table 1.

Answer: Yes, it is included in the revised Table 1.

As for the evolutionary analyses, these are interesting and appropriate, however in both cases I suggest rephrasing the final lines summarizing the results, as these are now slightly ambiguous.
- Line 185/186: 'the close relationship between clearhead icefish and zebrafish & medaka'. Zebrafish and medaka are themselves not closely related fish species at all. Perhaps it is better to simply state that the data demonstrate the phylogenetic position of the clearhead icefish.

Answer: Thanks for your advice. We modified the description (lines 185-186) of phylogenetic tree according to your suggestion.

-Line 206: 'clearhead icefish also experienced the WGD, and it appeared more recently than medaka'. This could be read as a more recent WGD than in the case of medaka, when it is of course the same WGD (at exactly the same time).

Answer: Thanks for your comment. We revised the sentence on the WGD event (lines 204-206). We also used Nile tilapia, instead of medaka, as the reference for calculating 4DTV distances, because Nile tilapia has a more complete gene set.



Reviewer #2: I do not have much to say about the quality of the work done here by the authors to generate the assembly of another fish genome, this one among the Stomatii, a taxonomic position that has not yet been investigated.
The sentence in lines 185-186 is not well written. The phylogeny does not match the statement. Perhaps the authors should name the taxonomic groups rather than two species, which distorts their meaning.

Answer: Thanks for your suggestion. We changed the species names to their corresponding names of taxonomic groups in the phylogenetic tree (Figure 1).

Sentence in lines 204-206: this sentence shows that the authors have been misled. There was only one whole genome duplication event at the base of the teleost lineage. The different average rates of evolution observed in the different species by gene comparison have a different meaning than the one stated here. In this case, it is more suitable to rephrase this sentence. See similar comparisons made in the trout genome paper by Berthelot et al. 2014. Therefore, Fig. 2 is just a glimpse of what can be done, and this interpretation is too sparse and misleading.
Besides, recently published genomes have more finely tuned gene descriptions, as sequences are better annotated and more genes are described. Oreochromis niloticus, the Nile tilapia, is a good example. In contrast, the medaka genome lacks several genes found in several other fish species, sometimes key genes.
One suggestion would be to determine the number of duplicates originating from the teleost fish duplication, compared with what has been described in other species.

Answer: Thanks for your instructive suggestion. We used Nile tilapia as the reference to perform the WGD analysis again. We also revised the related sentence describing the WGD event of clearhead icefish (lines 204-206). On the other hand, the 4DTV distribution is an established method for identifying WGD events. It has been widely used in many genome papers, such as those on common carp [1] and Japanese lamprey [2]. Our new 4DTV result shows that the clearhead icefish experienced the same WGD event as Nile tilapia.
[1] Xu P, Zhang X, Wang X, Li J, Liu G, Kuang Y, Xu J, Zheng X, Ren L, Wang G et al: Genome sequence and genetic diversity of the common carp, Cyprinus carpio. Nature genetics 2014, 46(11):1212-1219.
[2] Mehta T K, Ravi V, Yamasaki S, et al. Evidence for at least six Hox clusters in the Japanese lamprey (Lethenteron japonicum). Proceedings of the National Academy of Sciences, 2013, 110(40): 16044-16049.


What does TRF stand for in Table 2?

Answer: TRF stands for Tandem Repeats Finder. We provided its full name in the revised Table 2.

Line 184 typo in the word "sequence".
several "&" instead of "and".

Answer: Thanks for your advice. We have corrected these errors according to your instructions.


Source

    © 2016 the Reviewer (CC BY 4.0).

References

    Kai, L., Dongpo, X., Jia, L., Chao, B., Jinrong, D., Yanfeng, Z., Minying, Z., Xinxin, Y., Yang, Y., Jieming, C., Hui, Y., Gangchun, X., Di-an, F., Jun, Q., Shulun, J., Jie, H., Junmin, X., Qiong, S., Zhiyong, Z., Pao, X. 2017. Whole genome sequencing of Chinese clearhead icefish, Protosalanx hyalocranius. GigaScience.