Content of review 1, reviewed on November 19, 2017

This manuscript presented the genome assembly and annotation of Ammopiptanthus nanus. This dataset could add more resources for plant genomics, especially for the dry land plant research. Below are my detailed concerns:

1) As a genomics resource, the authors didn't release the data properly. Even though the raw sequencing reads are in SRA, many future data users will look for the genome assembly and annotation. A genome browser and FTP site will be requested. A reference genome requires further improvement and maintenance. I would suggest that the authors set up a genome browser or deposit the genome in a large database such as NCBI or Ensembl Plants. In this way, users would have better access to the genome. On the other hand, feedback from the users could help to improve this genome.

2) As a manuscript, it is not well organized. There are seven tables, five of which have only one row apart from the header. These numbers have already been described in the text. These tables should be reorganized.

3) It doesn't make a lot of sense to have a main figure that only shows the appearance of the plant. As the authors are trying to introduce this species, an evolutionary species tree would be more informative.

4) The last paragraph of the background (Line 61-63): "Most of the de novo assemblies of plant genomes reported recently have been performed using the next generation sequencing technologies such as Illumina or 454 sequencing platforms" is too subjective, as the authors would like to highlight that their genome was sequenced with PacBio. Recently, a number of complex plant genomes have also been sequenced with third-generation technologies: maize, sunflower, Oropetium thomaeum, rice, and Chenopodium quinoa. The authors should not ignore these high-quality genomes. They should be included in the introduction.

5) After the genome assembly, the contigs were only polished by Pilon using short reads. I would also recommend using PacBio reads to correct the base calls, as Illumina reads may not be able to cover all the contigs due to alignment and sequencing bias. The number of bases that were checked and corrected should be mentioned, too.

6) The authors deployed several approaches for gene annotation. Figure 2 is not very informative. A flowchart figure describing how the gene annotation was done would be better. I am also concerned that only default parameters of the gene predictors were used for ab initio prediction. Many of these default prediction parameters are not suitable for plants. The predictors require trained models to generate accurate gene models. A comparative analysis with closely related species would be strong evidence of the quality of the gene models. With the current data, there is no evaluation of the quality of gene models.

Level of interest Please indicate how interesting you found the manuscript:
An article whose findings are important to those with closely related research interests

Quality of written English Please indicate the quality of language in the manuscript:
Not suitable for publication unless extensively edited

Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.
I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.

Authors' response to reviews: Reviewer #1: This report describes a standard, PacBio-based sequence and assembly for Ammopiptanthus. The underlying sequencing technology is reasonable, apparently involving DNA template of acceptable length and quality and leading to sequence reads and depth suitable for whole genome assembly. Subsequent analyses (k-mer, repeat, genome feature annotation) are standard with methods described in sufficient detail. Accompanying files are typical and sufficient for most follow-up analyses researchers may pursue. However, there are numerous English grammar errors in the text that should be corrected. Response: The manuscript has been improved with the help of an English editing service, Editage.

Citation(s) supporting the statement that the genus is interesting and characterized in terms of abiotic stress tolerance are missing (that is, plant physiology research, not just transcriptomic analyses). Response: We have added three additional references on plant physiology research to the reference section.

Non-genomic research supporting the very low level of heterozygosity should accompany the value reported here. Finally, locations in the text where accompanying data files are available should be highlighted within the text. Response: We have deposited the related data according to GigaScience's requirements, and we have added the NCBI SRA accession numbers of the related data at the corresponding locations in the text.

Reviewer #2: This manuscript presented the genome assembly and annotation of Ammopiptanthus nanus. This dataset could add more resources for plant genomics, especially for the dry land plant research. Below are my detailed concerns:

1) As a genomics resource, the authors didn't release the data properly. Even though the raw sequencing reads are in SRA, many future data users will look for the genome assembly and annotation. A genome browser and FTP site will be requested. A reference genome requires further improvement and maintenance. I would suggest that the authors set up a genome browser or deposit the genome in a large database such as NCBI or Ensembl Plants. In this way, users would have better access to the genome. On the other hand, feedback from the users could help to improve this genome. Response: We have deposited the related data according to GigaScience's requirements.

2) As a manuscript, it is not well organized. There are seven tables, five of which have only one row apart from the header. These numbers have already been described in the text. These tables should be reorganized. Response: Thank you for your suggestion. The tables with one row were deleted (Table 1) or moved to the supplementary material (Table 2, 4, 6, 7 and 9).

3) It doesn't make a lot of sense to have a main figure that only shows the appearance of the plant. As the authors are trying to introduce this species, an evolutionary species tree would be more informative. Response: The picture of Ammopiptanthus nanus will be kept according to the editor’s advice.

4) The last paragraph of the background (Line 61-63): "Most of the de novo assemblies of plant genomes reported recently have been performed using the next generation sequencing technologies such as Illumina or 454 sequencing platforms" is too subjective, as the authors would like to highlight that their genome was sequenced with PacBio. Recently, a number of complex plant genomes have also been sequenced with third-generation technologies: maize, sunflower, Oropetium thomaeum, rice, and Chenopodium quinoa. The authors should not ignore these high-quality genomes. They should be included in the introduction. Response: In the introduction, we have listed several plant species whose genomes were recently sequenced using the third-generation PacBio sequencing platform.

5) After the genome assembly, the contigs were only polished by Pilon using short reads. I would also recommend using PacBio reads to correct the base calls, as Illumina reads may not be able to cover all the contigs due to alignment and sequencing bias. The number of bases that were checked and corrected should be mentioned, too. Response: When PacBio or Illumina data are used to correct a genome, the same strategy is applied: reads are mapped to the genome, and SNPs and indels are corrected. However, after Quiver correction, many errors remain in the genome because the error rate of the subreads is high. For example, PacBio data were used to correct the gorilla genome in the first step, after which the QV of the genome was 30. Illumina data were then used to correct the genome, and the authors stated that “After error correction, we estimate that Susie3 has less than one error per 5000 bp (QV > 35)” (David Gordon, 2016).

In total, 56 Gb of Illumina data (~70-fold coverage) was used to correct the final A. nanus genome over multiple rounds. The data covered 99% of the assembled genome, so the data set is large enough to cover nearly the whole genome at high depth.

In the newest Pilon version (https://github.com/broadinstitute/pilon/releases), which was the version used to polish the A. nanus assembly in our study, the parameter “--fix bases” was supplied to correct both SNPs and indels. Furthermore, in the statistics of the final assembly, the numbers of SNPs and indels were very low. A table of the error state of the genome is provided below. In addition, all the genome assessment results showed that the final genome quality is good. In summary, we think our workflow can achieve a good assembly without Quiver or Arrow correction.

Table 1. The error rate of the A. nanus genome after Pilon correction

Round     snp_num   insert_num   delete_num   error_ratio (%)
Round1    99,239    25,118       83,385       0.03
Round2    2,734     1,736        4,637        0.00
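As a reading aid for Table 1, the per-round totals can be tallied with a short Python sketch. The counts are copied from the table; the dictionary layout and the function name are illustrative assumptions and are not part of the authors' workflow.

```python
# Counts copied from Table 1 (corrections reported after each Pilon round).
# The data structure and names below are hypothetical, for illustration only.
rounds = {
    "Round1": {"snp": 99_239, "insertion": 25_118, "deletion": 83_385},
    "Round2": {"snp": 2_734, "insertion": 1_736, "deletion": 4_637},
}


def total_corrections(counts):
    """Sum SNP, insertion, and deletion counts for one polishing round."""
    return sum(counts.values())


totals = {name: total_corrections(counts) for name, counts in rounds.items()}

# Ratio of round-1 to round-2 totals: how much the second round had left to fix.
fold_reduction = totals["Round1"] / totals["Round2"]

for name, total in totals.items():
    print(f"{name}: {total:,} corrected sites")
print(f"Round1/Round2 ratio: {fold_reduction:.1f}")
```

The drop between rounds is what the response relies on when it argues that short-read polishing had largely converged; it says nothing about longer indels that short reads cannot resolve.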

6) The authors deployed several approaches for gene annotation. Figure 2 is not very informative. A flowchart figure describing how the gene annotation was done would be better. I am also concerned that only default parameters of the gene predictors were used for ab initio prediction. Many of these default prediction parameters are not suitable for plants. The predictors require trained models to generate accurate gene models. A comparative analysis with closely related species would be strong evidence of the quality of the gene models. With the current data, there is no evaluation of the quality of gene models. Response: Figure 2 has been replaced by a new Venn diagram plotted using UpSetR (Figure S2), and the new figure clearly depicts the integration of gene predictions from the three approaches.

Augustus, Genscan, GlimmerHMM, GeneID, and SNAP were used for ab initio gene prediction, and all of them were initially trained with the Arabidopsis gene model, which is used for gene prediction in almost all plants. For the Augustus prediction, we used the Arabidopsis gene model as the initial gene model, and the PASA gene models were also used as an initial gene model for training. The gene model with the higher precision and specificity was then selected. Finally, the ab initio-based, homolog-based, and transcriptome-based gene predictions were integrated using EVM. Models with an EVM score greater than 1000, a length >= 300 bp, and a full CDS length that is a multiple of 3 were considered genes.
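The three acceptance criteria stated in the response above (EVM score above 1000, length of at least 300 bp, CDS length divisible by 3) can be written as a small filter. This is a minimal sketch: the `GeneModel` record, its field names, and the example values are hypothetical and do not reflect EVM's actual output format or the authors' scripts.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    """Hypothetical record for one integrated gene model (not EVM's real format)."""
    gene_id: str
    evm_score: float
    length_bp: int
    cds_length_bp: int


def passes_filters(model, min_score=1000.0, min_length_bp=300):
    """Apply the three criteria from the response: score, length, CDS frame."""
    return (
        model.evm_score > min_score              # "EVM score value more than 1000"
        and model.length_bp >= min_length_bp     # "length >= 300 bp"
        and model.cds_length_bp % 3 == 0         # full CDS length is a multiple of 3
    )


# Made-up example models, for illustration only.
candidates = [
    GeneModel("g1", evm_score=2500.0, length_bp=1200, cds_length_bp=900),
    GeneModel("g2", evm_score=800.0, length_bp=1500, cds_length_bp=1200),   # low score
    GeneModel("g3", evm_score=3000.0, length_bp=250, cds_length_bp=240),    # too short
    GeneModel("g4", evm_score=1500.0, length_bp=700, cds_length_bp=500),    # CDS not in frame
]

accepted = [m.gene_id for m in candidates if passes_filters(m)]
print(accepted)
```

Whether the 300 bp threshold applies to the gene span or to the CDS is not stated in the response; the sketch treats it as the model length and keeps the CDS check separate.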

We used the transcriptome data to assess the quality of the gene models by mapping the data to the whole genome using Tophat (Table S5).

Reviewer #3: The article (Manuscript#: GIGA-D-17-00264) entitled "Long-read sequencing and de novo genome assembly of Ammopiptanthus nanus, a desert shrub", by Fei Gao and collaborators delivers "a de novo assembly of the rare broad-leaved shrub Ammopiptanthus nanus". The work presents the completeness of the assembled genome as well as its gene and repeat annotation. Its major finding is the high amount of repetitive elements, and the authors suggest their work as a valuable source for comparative genomics analysis in the family of legumes. I agree with most of what the authors state but have the following suggestions:

1) The size of the genome was estimated via k-mer distribution. Figure S1 shows a major and a minor peak. I agree with the authors that the highest peak represents the diploid genome given the low amounts of BUSCO duplicates. Nevertheless, an in-depth description of the repetitive peak (better, the contigs associated with it) would be interesting and should be added to the manuscript. Please describe how many contigs and how much total sequence are duplicated (showing k-mers from the second peak) and elaborate on the annotated genes/repeats you can find there. Consider moving the k-mer histogram and maybe a graphical (GO enrichment based) summary of duplicated genes/repeats to the main text as a figure. Response: Thank you for your kind advice. As far as we know, few genes are located in the repeat regions, and we predicted the gene models using the masked genome. The types of repeat elements and their contents were obtained in the repeat annotation step and are supplied in Table S4. Multi-copy genes do exist, but few high-copy genes are likely to be counted in the second peak when their sequences are cut into k-mers, because their sequences are quite short compared with the whole genome.

2) The genome was assembled with Canu and polished with Pilon. Could the authors explain why the initial Canu assembly was not polished with Arrow prior to Pilon polishing? A lot of medium-sized InDels usually remain in the assembly if only Pilon is used for assembly polishing. The final genome release should include an Arrow polishing step. The genome assembly should be deposited in one of the public databases. Response: Please see our response to question 5 of Reviewer #2. The genome assembly was deposited in GigaDB according to GigaScience's requirements.

3) Repeats and genes were predicted with various tools. The text is missing details on which models the ab initio gene predictors were used with (e.g., Augustus, SNAP). Please describe how these models were generated. Response: The ab initio gene predictors include Augustus, GeneID, Genscan, GlimmerHMM and SNAP. All of these software packages were trained using the Arabidopsis gene model before gene prediction. For Augustus, the PASA gene model was also used as an initial gene model for training. The following descriptions have been added in our manuscript: “and all these software packages were trained using the Arabidopsis gene model before gene prediction. For gene prediction using Augustus, besides the Arabidopsis's gene model, the PASA's gene model was also used as initial gene model for training. Finally, the best gene model with higher accuracy and specificity was used. Quality evaluation of gene models was conducted by aligning transcriptome sequences to the whole genome assembly using Tophat (Table S5).”.

4) Line 145-148: Please remove the term "functionally annotated" and replace it with something appropriate (e.g., classified into families and predicting domains and important sites). The tools mentioned annotate evolutionarily conserved domains or lift over putative functions via homology-based methods, but none of them is able to annotate genes functionally. Response: The term "functionally annotated" was replaced by “classified into families according to their putative functions”.

5) Please add a short section/sentence about the gene models that were chosen by EVM. Simply indicate which was the preferred gene prediction tool/model that EVM selected from. Response: The following descriptions have been added in our manuscript.

“Higher weights were assigned to the PASA predicted transcripts from unigenes and GeMoMa predicted homologous transcripts than to the ab initio predicted transcripts when conducting the EVM integration.”

6) Please indicate whether default mapping parameters were used to assess the genome completeness via short read mappings and adopt the quality assessment method from (Bickhart et al. 2017, Nat Genet.; Jain et al. 2017, bioRxiv). Response: Default mapping parameters were used to assess the genome completeness via Illumina short read mappings. We adopted an assessment method similar to that of Bickhart et al. (2017, Nat Genet.) to assess the genome completeness via short read mappings. We used BWA (0.7.10-r789) to map the Illumina data to the genome using default parameters. The Q30 of our data is 91.67% and the proportion of properly mapped reads is up to 98%. We also used BUSCO and EST to assess the genome, and the scores are 92.22% and 100%, respectively. All assessment results indicated that the completeness of our genome is high.

7) Please indicate that the data used for Pilon polishing is the same that is used during the completeness assessment and state any eventual bias. Response: The same Illumina data were used for the Pilon polishing and the genome completeness assessment. The bias details are supplied in Table S8.

8) Please replace Figure 2 with either a scaled Venn diagram or, better, an UpSetR plot to ease interpretation of the gene prediction integration. Further consider moving Figure 2 to the supplement since Table 3 is more than enough. Response: Thank you for your advice. We replaced Figure 2 with a new Venn diagram plotted using UpSetR, and the new figure was moved to the supplementary file as Figure S2.

Source

    © 2017 the Reviewer (CC BY 4.0).

Content of review 2, reviewed on January 18, 2018

The authors have clarified important issues and questions raised by the reviewers and added helpful citations.

Minor points:
1. Even though the authors stated that they have deposited the data as GigaScience requested, an additional genome browser would still be very helpful for users to navigate this genome as a reference. I would suggest that the authors work with other plant genomics databases (PlantDB, Gramene and so on) to make their genome more accessible to users.

2. Line 27-28: BUSCO can only estimate the completeness of gene space. The statement of "the genome completeness" is inaccurate. Please revise.

3. Line 65: N50 is only a statistic to measure assemblies. Please revise this sentence to "these assemblies generally contain very fragmented sequences", or something similar.

Level of interest Please indicate how interesting you found the manuscript:
An article whose findings are important to those with closely related research interests

Quality of written English Please indicate the quality of language in the manuscript:
Acceptable

Declaration of competing interests
I declare that I have no competing interests.

I agree to the open peer review policy of the journal.

Authors' response to reviews: Reviewer #2: The authors have clarified important issues and questions raised by the reviewers and added helpful citations.

Minor points: 1. Even though the authors stated that they have deposited the data as Gigascience requested, an additional genome browser will be still very helpful for users to navigate this genome as a reference. I would suggest the authors to work with other plant genomics databases (PlantDB, Gramene and so on) to make their genome more accessible to users.

Reply: Thank you for your advice. We will contact the staff of PlantGDB or other genomics databases to further release our genome data after acceptance of this article.

2. Line 27-28: BUSCO can only estimate the completeness of gene space. The statement of "the genome completeness" is inaccurate. Please revise.

Reply: The statement "the genome completeness was evaluated by BUSCO" was replaced by “the gene annotation completeness was evaluated by BUSCO”.

3. Line 65: N50 is only a statistic to measure assemblies. Please revise this sentence to "these assemblies generally contain very fragmented sequences", or something similar.

Reply: The statement "these assemblies generally have low N50 values and a large number of contigs" was replaced by “these assemblies generally contain very fragmented sequences”.

Reviewer #3: The revised manuscript addressed most of my comments properly. Two major and one minor concerns however remain open.

2) The genome was assembled with Canu and polished with Pilon. Could the authors explain why the initial Canu assembly was not polished with Arrow prior to Pilon polishing? A lot of medium-sized InDels usually remain in the assembly if only Pilon is used for assembly polishing. The final genome release should include an Arrow polishing step.

The authors' response to my comment and to a similar one from reviewer 2 (5th comment) is not satisfying. Yes, long read polishing does not increase assembly quality to the same extent as a final short read polishing. But long read polishing removes spurious insertions and deletions at a length where short reads fail. The authors obviously did not understand that each polishing method addresses different types of assembly errors, and therefore both methods should be applied in modern genome projects, beginning with long reads and ending with short reads. The manuscript should specifically state that long read polishing was not applied, and that longer spurious insertions and deletions introduced during assembly might not be corrected.

Reply: Thank you for your constructive advice. We have re-run the assembly polishing, adding a polishing step with Arrow prior to Pilon polishing. All subsequent analyses, including the repeat annotation, the gene prediction, and the assessment of the genome assembly, were also conducted again. We have updated the related dataset and the statistical tables.

6) Please indicate whether default mapping parameters were used to assess the genome completeness via short read mappings and adopt the quality assessment method from (Bickhart et al. 2017, Nat Genet.; Jain et al. 2017, bioRxiv).

The second concern is the error evaluation analysis the authors applied. They did not do what I asked for and/or misunderstood the Bickhart et al. 2017, Nat Genet. reference I added. The authors simply reported re-mapping ratios instead of calling SNPs with the re-mappings and inferring a Q-value as pointed out by Bickhart et al. The manuscript should contain a sentence stating that only completeness was tested and that the accuracy of the assembly was not specifically assessed or, and that would be my preferred option, the authors should re-run the analysis. A simple workflow can be taken from https://github.com/fbemm/onefc-oneasm/wiki/Assembly-Validation.

Reply: The SNP-based assembly quality assessment was performed following the method provided by Bickhart et al., and the erroneous bases in the genome assembly were identified using the variant calling software FreeBayes with default parameters. The QV of our genome assembly was calculated to be 38.95, which shows that the base-level quality of the genome is good.
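For readers unfamiliar with the QV figure quoted in the reply, the Phred-scaled relationship between error rate and quality value can be sketched in a few lines of Python. This is only the standard Phred formula; how the authors counted erroneous bases from the FreeBayes calls, and which bases they treated as assessed, is not described here, so the function names and inputs are assumptions.

```python
import math


def phred_qv(error_bases, assessed_bases):
    """Phred-scaled quality value: QV = -10 * log10(errors / assessed bases)."""
    if assessed_bases <= 0:
        raise ValueError("assessed_bases must be positive")
    if error_bases <= 0:
        return math.inf  # no observed errors: QV is unbounded
    return -10.0 * math.log10(error_bases / assessed_bases)


def error_rate_from_qv(qv):
    """Invert the Phred formula: per-base error rate implied by a QV."""
    return 10.0 ** (-qv / 10.0)


# "Less than one error per 5000 bp" from the gorilla example in the first-round response.
print(round(phred_qv(1, 5000), 2))

# Per-base error rate implied by the QV of 38.95 reported in the reply.
print(error_rate_from_qv(38.95))
```

Under this formula, one error per 5000 bp corresponds to a QV just under 37, and a QV of 38.95 corresponds to roughly one erroneous base in 7,850.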

3) The size of the genome was estimated via k-mer distribution. Figure S1 shows a major and a minor peak. I agree with the authors that the highest peak represents the diploid genome given the low amounts of BUSCO duplicates. Nevertheless, an in-depth description of the repetitive peak (better, the contigs associated with it) would be interesting and should be added to the manuscript. Please describe how many contigs and how much total sequence are duplicated (showing k-mers from the second peak) and elaborate on the annotated genes/repeats you can find there. Consider moving the k-mer histogram and maybe a graphical (GO enrichment based) summary of duplicated genes/repeats to the main text as a figure.

The authors misunderstood my request. The idea was to isolate k-mers from the k-mer distribution (this can simply be done with Jellyfish dump by specifying a low and a high cutoff that fit the 2nd peak) and use them as baits to isolate contigs that contain these k-mers (this can be done using "bbduk2 in=contigs.fasta out=baited.fasta ref=2nd-peak-kmers-fasta hdist=0 mm=f"). It would have been interesting to see whether these elements are a) properly assembled, and thus appear on contigs together with regions from the first peak, or b) simply collapsed during the assembly. On top of that, a repeat/annotation analysis could have revealed the nature of the contigs and thereby helped to understand any functional implication of this partial genome duplication. It is up to the authors to add that analysis, but it would certainly have improved the understanding of the A. nanus genome.

Reply: Thank you for your advice, but we met some difficulties in conducting the analysis you recommend and failed to complete the analysis.
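The baiting idea described in the comment above can be illustrated with a small, self-contained Python sketch. It does not call Jellyfish or BBDuk and is not a substitute for them on a real genome; it only shows the logic on toy data: keep k-mers whose counts fall inside the second peak, then report contigs that contain any of them on either strand. All function names, sequences, counts, and cutoffs are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def select_peak_kmers(kmer_counts, low, high):
    """Keep k-mers whose count lies within [low, high], i.e. inside the chosen peak."""
    return {kmer.upper() for kmer, count in kmer_counts.items() if low <= count <= high}


def bait_contigs(contigs, bait_kmers):
    """Return {contig name: number of bait k-mer hits} for contigs with at least one hit."""
    if not bait_kmers:
        return {}
    k = len(next(iter(bait_kmers)))
    # Search both strands by adding the reverse complement of every bait.
    baits = bait_kmers | {reverse_complement(kmer) for kmer in bait_kmers}
    hits = {}
    for name, seq in contigs.items():
        seq = seq.upper()
        n_hits = sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in baits)
        if n_hits:
            hits[name] = n_hits
    return hits


# Toy input: counts as they might appear in a k-mer count dump (values are made up).
toy_counts = {"AAAC": 40, "TTTT": 5}
toy_contigs = {"contig1": "GGGTTTGG", "contig2": "CCCCCCCC"}

second_peak = select_peak_kmers(toy_counts, low=30, high=50)
baited = bait_contigs(toy_contigs, second_peak)
print(baited)
```

On real data, the fraction of each baited contig covered by second-peak k-mers would help distinguish the two cases the reviewer raises: duplicated regions assembled alongside first-peak sequence versus contigs made up almost entirely of collapsed copies.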

Source

    © 2018 the Reviewer (CC BY 4.0).

Content of review 3, reviewed on May 17, 2018

Declaration of competing interests
I declare that I have no competing interests.

I agree to the open peer review policy of the journal.

Source

    © 2018 the Reviewer (CC BY 4.0).

References

    Fei, G., Xue, W., Xuming, L., Mingyue, X., Huayun, L., Merhaba, A., Huigai, S., Shanjun, W., Jinchao, F., Yijun, Z. 2018. Long-read sequencing and de novo genome assembly of Ammopiptanthus nanus, a desert shrub. GigaScience.
