Review of Improved de novo genome assembly and analysis of the Chinese cucurbit Siraitia grosvenorii, also known as monk fruit or luo-han-guo

Content of review 1, reviewed on January 09, 2018

This is a nicely written data note about an important genome.
MAJOR COMMENTS
I would like these comments addressed before the paper is accepted for publication.

1. The English must be improved, especially singular/plural verbs such as in this sentence on line 112: " ...the alignment files WAS manipulated...". I suggest that the authors ask a native English speaker to proof-read the paper.

2. I have a few concerns about the experimental design and methods. First, quality of the assembled consensus was evaluated by mapping Illumina RNAseq reads to the consensus. Naturally only reads containing few differences would map, yielding a biased consensus quality measurement. The real consensus quality is likely lower than the authors estimated. Instead I suggest estimating the consensus quality of the assembly by mapping the assembly to the contigs from the previous Illumina-only assembly and evaluating the fidelity of long (10Kb+) mutual best matches.

3. I would also like to see how the BUSCO results improved compared to the initial Illumina-only assembly.

MINOR COMMENTS/SUGGESTIONS
Authors do not have to satisfy these comments for publication -- these are merely suggestions.

One other reason I am concerned about the consensus quality is that the genome is not inbred, and 73x total PacBio coverage (which works out to about 37x per haplotype) may not be enough to generate high enough consensus quality in regions of high heterozygosity from PacBio-only data. I would recommend getting some 60-100x whole genome Illumina data for the same sample and polishing the assembly with Pilon. Also, for the same reason, using only 25x of the corrected reads may not be optimal -- I suspect assembly contiguity could be better if 35 or 40x of the longest corrected reads are used.

Level of interest Please indicate how interesting you found the manuscript:
An article whose findings are important to those with closely related research interests

Quality of written English Please indicate the quality of language in the manuscript:
Needs some language corrections before being published

Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.
I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.

Authors' response to reviews: Reviewer #1 (Major comments):

98:"This genome size was slightly larger than the estimated 420 Mb [8], which was probably due to the high genome heterozygosity." - A k-mer analysis or SNP density analysis should be done and included in the manuscript to substantiate this assertion.

Yes. We used over 100x of additional resequencing reads for k-mer analysis with KmerGenie. The sampled histogram and the fit for the best k value showed a heterozygous peak, which substantiates that assertion. In addition, high genome heterozygosity is observed in monk fruit, as the species is dioecious.

99: Was the genome assembly polished after assembly to correct sequencing errors? This is normally done for PacBio assemblies and should be included in the methods if it was done.

Yes. We performed the assembly using Quiver with raw PacBio RSII H5 files, and polished the assembly using over 100x whole genome Illumina short reads. The polished assembly and annotations have been uploaded to GigaDB.

105/Table 3: 13.9% missing BUSCOs seems high for a high coverage PacBio assembly. How does this compare to the original assembly by Itkin et al.?

We analyzed the genome completeness after the genome polishing described above, and the missing BUSCOs declined to 8.1%. We were not able to compare our assembly to the original assembly by Itkin et al., because we could not obtain it. We sent e-mails to the corresponding author and to the PNAS editorial office requesting the assembly, but the authors did not provide it. To compare our assembly with the original one by Itkin et al. indirectly, we aligned both our resequencing short reads and their released whole genome short reads to our assembly using the BWA mem program and estimated the average base error rates. As Table 5 in the manuscript shows, the rates were below 1E-3 for both datasets, which suggests a high-quality assembly. The difference in base error rates between our resequencing data and the data released earlier is probably due to differences between the varieties.

Reviewer #1 (Minor comments):

We thank the reviewer for the suggestions on the English language. We have corrected these issues one by one as suggested, and we have sent the revised manuscript to native English speakers for language editing.

20: platforms “Platfroms” has been revised as “platforms”.

63: is a useful resource “Useful resources” has been revised as “a useful resource”.

Table 1: fix units in the table, they are correct in the text. We have checked the units in Table 1, and there is no inconsistency with the text.

84: C after The Chinese symbol has been revised as suggested.

87: an insert size “Insersion size” has been revised as “an insert size”.

94: This sentence was somewhat confusing. I recommend rewriting it so it is clearer, e.g. : "25x coverage of the longest corrected reads was extracted with Perl scripts and assembled" This sentence has been revised as “25x coverage of the longest corrected reads was extracted with Perl scripts and assembled”.

110: All 15 RNA-seq libraries were mapped to the assembly This sentence has been revised as “All 15 RNA-seq libraries were mapped to the assembly”.

115: low quality variants “Variations” has been revised as “variants”.

116: unique “Uniq” has been revised as “unique”.

117: "error rate was calculated as the ration of double variation (1/1 and 1/2) number" - This is very confusing and needs to be rewritten. This sentence has been revised as “error rate was calculated as the average number of single-nucleotide polymorphisms (SNP) and indels that appear at both alleles (labeled as 1/1 and 1/2 in Table 5) per base”.
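The revised definition above can be sketched in a few lines of Python. This is only an illustration of the stated ratio, not the authors' pipeline: the function names `count_non_reference_calls` and `base_error_rate` are hypothetical, the input is assumed to be a single-sample VCF, and the denominator is assumed to be the number of assembly bases covered by the mapped reads.

```python
def count_non_reference_calls(vcf_lines, genotypes=("1/1", "1/2")):
    """Count VCF records whose sample genotype is 1/1 or 1/2.

    Assumes a single-sample VCF: column 9 is FORMAT, column 10 the sample.
    """
    count = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no genotype column present
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        sample_values = fields[9].split(":")
        # Treat phased (|) and unphased (/) genotypes alike.
        genotype = sample_values[format_keys.index("GT")].replace("|", "/")
        if genotype in genotypes:
            count += 1
    return count


def base_error_rate(n_calls, covered_bases):
    """Error rate as variant calls on both alleles per covered base."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return n_calls / covered_bases


# Toy records for illustration only (not data from the manuscript).
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample",
    "ctg1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:30",
    "ctg1\t250\t.\tC\tT\t60\tPASS\t.\tGT:DP\t1/1:28",
    "ctg1\t900\t.\tG\tA,T\t55\tPASS\t.\tGT:DP\t1/2:31",
]
n_calls = count_non_reference_calls(toy_vcf)
print(n_calls, base_error_rate(n_calls, covered_bases=4000))
```

Heterozygous 0/1 calls are excluded because they are compatible with a correct assembly of one haplotype, whereas 1/1 and 1/2 calls mean that neither read-supported allele matches the assembled base.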

127: "the S. grosvenorii genome sequences were subjected to 3 gene" - the S. grosvenorii genome assembly was annotated using 3 This sentence has been revised as “the S. grosvenorii genome was annotated using 3 gene prediction pipelines”.

133: "with a repeat masked genome, while repeat masking was done by RepeatMasker." - with the repeat masked genome. This sentence has been revised as “with the repeat masked genome”.

134: "from Hisat2 to transcriptome with the assembly as reference," - from HISAT2 using the assembly as the reference - correct other instances of Hisat2 to HISAT2 This sentence has been revised as “from HISAT2 using the assembly as the reference”, and all “Hisat2” have been corrected.

140 (and others): "non-redundant database" : be more specific such as NCBI non-redundant protein database (nr) “Non-redundant database” has been revised as “NCBI non-redundant protein database (nr)”.

Reviewer #2 (Major comments):

1. The English must be improved, especially singular/plural verbs such as in this sentence on line 112: " ...the alignment files WAS manipulated...". I suggest that the authors ask a native English speaker to proof-read the paper.

Yes. This sentence has been revised as “the alignment files were manipulated” and we have sent the revised manuscript to native English speakers for language editing.

2. I have a few concerns about the experimental design and methods. First, quality of the assembled consensus was evaluated by mapping Illumina RNAseq reads to the consensus. Naturally only reads containing few differences would map, yielding a biased consensus quality measurement. The real consensus quality is likely lower than the authors estimated. Instead I suggest estimating the consensus quality of the assembly by mapping the assembly to the contigs from the previous Illumina-only assembly and evaluating the fidelity of long (10Kb+) mutual best matches.

We were not able to compare our assembly to the Illumina-only assembly, because we could not obtain it. We sent e-mails to the corresponding author and to the PNAS editorial office requesting the assembly, but the authors did not provide it. The evaluation by mapping RNA-Seq reads to the consensus was indeed biased, so we instead assessed genome quality by mapping our resequencing short reads and the whole genome short reads released earlier to the assembly. The two datasets covered 92.99% and 90.79% of the genome assembly, so we believe that this evaluation is able to estimate the accuracy of our assembly.

3. I would also like to see how the BUSCO results improved compared to the initial Illumina-only assembly.

We analyzed the genome completeness after the genome polishing described above, and the missing BUSCOs declined to 8.1%. We were not able to compare our assembly to the original assembly by Itkin et al., because we could not obtain it. We sent e-mails to the corresponding author and to the PNAS editorial office requesting the assembly, but the authors did not provide it. To compare our assembly with the original one by Itkin et al. indirectly, we aligned both our resequencing short reads and their released whole genome short reads to our assembly using the BWA mem program and estimated the average base error rates. As Table 5 in the manuscript shows, the rates were below 1E-3 for both datasets, which suggests a high-quality assembly. The difference in base error rates between our resequencing data and the data released earlier is probably due to differences between the varieties.

Reviewer #2 (Minor comments):

Authors do not have to satisfy these comments for publication -- these are merely suggestions. One other reason I am concerned about the consensus quality is that the genome is not inbred, and 73x total PacBio coverage (which works out to about 37x per haplotype) may not be enough to generate high enough consensus quality in regions of high heterozygosity from PacBio-only data. I would recommend getting some 60-100x whole genome Illumina data for the same sample and polishing the assembly with Pilon.

We thank the reviewer for this suggestion. We have obtained 50G (over 100x) of whole genome Illumina short reads for the variety Qingpiguo and used this dataset to polish the assembly, and the genome quality has improved to a certain extent.
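The coverage figures quoted in this exchange follow from simple arithmetic, sketched below in Python. The function names are hypothetical, "50G" is assumed to mean 50 gigabases of sequence, and the genome sizes used are the 420 Mb estimate and the 467,072,951 bp assembly size mentioned elsewhere in this correspondence.

```python
def sequencing_depth(total_bases, genome_size):
    """Average depth of coverage: sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def per_haplotype_depth(total_depth, ploidy=2):
    """Depth available to each haplotype of a heterozygous genome."""
    return total_depth / ploidy


# 73x total PacBio coverage split over two haplotypes of a diploid genome.
print(per_haplotype_depth(73))

# Assumption: "50G" means 50 gigabases of Illumina reads.
illumina_bases = 50_000_000_000
print(sequencing_depth(illumina_bases, 467_072_951))  # against assembly size
print(sequencing_depth(illumina_bases, 420_000_000))  # against 420 Mb estimate
```

Under these assumptions both genome sizes give a depth above 100x, which is consistent with the "over 100x" stated in the response.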

Also for the same reason using only 25x of the corrected reads may not be optimal -- I suspect assembly contiguity could be better if 35 or 40x of the longest corrected reads are used.

We did in fact try assembling the genome with different amounts of corrected long reads. The 25x dataset gave the best result, as the resulting assembly had a larger total size and a longer contig N50.

                      Corrected_40X_long_reads   Corrected_25X_long_reads
Number_of_contigs     4,282                      4,128
Total_size(bp)        465,219,980                467,072,951
Contig_N50(bp)        349,315                    433,684
Longest_contig(bp)    7,653,141                  7,657,852
GC_content            33.60%                     33.57%
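For readers unfamiliar with the contig N50 metric used in this comparison, a minimal Python sketch of the standard definition follows: N50 is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. The function name `contig_n50` and the toy lengths are illustrative and not taken from the manuscript.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until the
    running sum reaches at least half of the total assembly size.
    """
    lengths = sorted(lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


# Toy example: total size is 20, so half is 10; 6 + 5 = 11 reaches it.
print(contig_n50([2, 3, 4, 5, 6]))
```

A higher N50 at a similar total size, as in the 25x column, indicates that more of the assembly sits in long contigs.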

Content of review 2, reviewed on April 06, 2018

Thank you for the revisions. One further comment that I have is that, since the Illumina-only assembly could not be obtained, you could assemble your Illumina paired-end reads yourself with MaSuRCA or SOAPdenovo2 and use big contigs from that assembly (>10000bp) to measure the consensus quality of your genome. Illumina-only assemblies are fast and relatively easy to produce.
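The contig selection step in this suggestion can be illustrated with a short Python sketch that keeps only contigs longer than 10,000 bp from a FASTA file. It covers the filtering step only, not the assembly or the alignment back to the PacBio consensus, and the names `read_fasta` and `long_contigs` are hypothetical helpers rather than part of MaSuRCA or SOAPdenovo2.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def long_contigs(records, min_length=10_000):
    """Keep contigs strictly longer than min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) > min_length]


# Toy input for illustration; a real run would iterate over an open file.
toy_fasta = [">ctgA demo", "A" * 6000, "C" * 6000, ">ctgB", "G" * 500]
for name, seq in long_contigs(read_fasta(toy_fasta)):
    print(name, len(seq))
```

The retained contigs would then be aligned to the long-read assembly with a whole-genome aligner, and the identity of the long mutual best matches used as the consensus quality estimate.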

Level of interest Please indicate how interesting you found the manuscript:
An article of importance in its field

Quality of written English Please indicate the quality of language in the manuscript:
Acceptable

Authors' response to reviews:

We thank reviewer 2 for the suggestion to assemble long contigs from the paired-end reads and use them to measure the consensus quality of our final assembly. We did not use this method to assess the genome quality because we had already measured the sequence accuracy by mapping paired-end reads to the assembly, an approach that is widely used in genome quality assessment.

References

Mian, X., Xue, H., Hang, H., Renbo, Y., Gang, Z., Xiping, J., Beijiu, C., Wang, D. X. 2018. Improved de novo genome assembly and analysis of the Chinese cucurbit Siraitia grosvenorii, also known as monk fruit or luo-han-guo. GigaScience.
