Review of Significantly improving the quality of genome assemblies through curation

Content of review 1, reviewed on September 28, 2020

The authors provide a concise overview of issues that arise during efforts to establish a reference quality genome sequence assembly, especially for organisms with complex genomes. The relevant published literature and software sources are cited. Whilst the authors' own infrastructure for reviewing and correcting genome assemblies is an in-house bespoke system that is not portable they describe the key processes involved in reviewing and assessing genome assemblies. This brief editorial / review provides a useful checklist for groups generating genome assemblies. Whilst the generation of the primary sequence data from which a first pass contig level assembly can be built is readily within the capacity of well-founded and funded research groups, the conversion of the resulting contigs into a high quality chromosome level assembly requires time and skill. This review provides a useful guide to navigating this transition and those who aspire to contribute to the growing resource of high quality reference genomes would be well served by reading this guide. This guide is largely set in the context of the current widely adopted paradigm of single pseudo-haploid representations of an organism's genome. As some of the errors that the procedures described in this paper seek to address concern the challenges of resolving an individual's different haplotypes some comment on graph based genome approaches to capture rather than 'resolve' such haplotypic differences would be appropriate.

Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.

Authors' response to reviews: Reviewer #1: The authors provide a concise overview of issues that arise during efforts to establish a reference quality genome sequence assembly, especially for organisms with complex genomes. The relevant published literature and software sources are cited. Whilst the authors' own infrastructure for reviewing and correcting genome assemblies is an in-house bespoke system that is not portable they describe the key processes involved in reviewing and assessing genome assemblies. This brief editorial / review provides a useful checklist for groups generating genome assemblies. Whilst the generation of the primary sequence data from which a first pass contig level assembly can be built is readily within the capacity of well-founded and funded research groups, the conversion of the resulting contigs into a high quality chromosome level assembly requires time and skill. This review provides a useful guide to navigating this transition and those who aspire to contribute to the growing resource of high quality reference genomes would be well served by reading this guide. This guide is largely set in the context of the current widely adopted paradigm of single pseudo-haploid representations of an organism's genome. As some of the errors that the procedures described in this paper seek to address concern the challenges of resolving an individual's different haplotypes some comment on graph based genome approaches to capture rather than 'resolve' such haplotypic differences would be appropriate.

Thanks very much to reviewer #1 for the kind assessment of our manuscript. Our curation recommendations are not restricted to pseudo-haplotype-based assemblies, but are also used for haplotype resolved assemblies, where both haplotypes are curated. Full haplotype resolution has its own challenges and these are described in detail in https://www.biorxiv.org/content/10.1101/2020.05.22.110833v1.abstract as cited in the manuscript.

For graph-based assemblies, the current workflow of, e.g., the Human Pangenome Project (https://humanpangenome.org/), in which we are participating, is based on generating fully haplotype-resolved assemblies first, which are subsequently used to build a graph. We are therefore applying the curation process as described. We remain vigilant regarding future requirements and adapt constantly.

For the current manuscript we adapted the conclusions as follows in order to highlight the general suitability of the curation recommendations and hope this addresses the concern:

“Our experiences in curating partially and fully haplotype-resolved genome assemblies for GRC, VGP and DToL have driven improvements in assembly software (e.g. purge_dups [15], salsa2 [52]), assembly pipelines (VGP, DToL) and assembly assessment tools (e.g. Asset [32,38]). Genome assembly generation is a fast-moving field and we are constantly adapting the curation software and processes to include novel data types and novel ways of generating assemblies whilst being conscious of the need to maximise throughput. This ensures ongoing involvement of assembly curation in high-throughput projects to produce the best possible data for the community to base their research upon.”

Reviewer #2: The authors provide much welcome guidelines and recommendations for genome assembly curation derived from their experience curating hundreds of assemblies. Their recommendations are clear but the manuscript could benefit from more examples of what misassembly signals look like in different technologies. The authors mention that gEVAL is tied into their local infrastructure and not portable, but the original gEVAL manuscript mentions that it is downloadable for use with any organism. It should be made more clear why gEVAL cannot be used. If gEVAL indeed cannot be used outside of their group, it would be nice to see how similar views could be generated with publicly available tools. Finally, I think that it would be hugely beneficial for readers to have a workflow figure with their recommendations incorporated from the initial coherence check to final ordering and orientation.

Many thanks to reviewer #2 for the thoughtful comments and suggestions and the detailed corrections.

Concerns:

1) Their recommendations are clear but the manuscript could benefit from more examples of what misassembly signals look like in different technologies.

We have extended Figure 1 (now Fig. 2) to include misassembly signatures detectable in HiC 2D maps, in addition to the already presented signals from read coverage, BioNano maps and synteny analyses. We hope this widens the information from gEVAL-based misassembly information to useful instructions for assessing HiC maps that can be generated with a variety of publicly accessible methods.

2) It should be made more clear why gEVAL cannot be used.

When we published the gEVAL paper in 2013, gEVAL was a database for reference genome assemblies maintained by the Genome Reference Consortium. As such it was publicly accessible, and code and database content were offered for download. Whilst the gEVAL browser is still publicly accessible, and the plugins and data described in the publication can be downloaded, the Ensembl version 93 code gEVAL is built on is not publicly available anymore and this is sadly outside our influence. The 2013 publication pertains to all data provided at https://geval.sanger.ac.uk/.

gEVAL has moved on over nearly a decade and has evolved from a low-throughput vehicle for reference curation to a high-throughput, fully automated assembly analyser that takes its strength from being totally integrated into the institute’s data infrastructure, allowing immediate data retrieval from multiple sources. It is an essential part of the overall assembly pipeline and is not promoted as free-standing software. Disentangling it to make it publicly available is not possible without additional workforce that we are not funded for. All gEVAL databases we build for the assemblies we are curating are publicly available at vgp-geval.sanger.ac.uk/index.html. This site and its sister site mentioned above are both accessible from geval.org.uk.

The current manuscript does NOT focus on gEVAL as a software package, but rather on the process of assembly curation and its importance for generating high quality assemblies. We describe what we have successfully applied for our purposes whilst fully disclosing the logic around assembly curation and the tools publicly available to design an assembly curation pipeline that fits the requirements of the respective user.

We have amended the manuscript to hopefully explain this better without taking up too much space and distracting from the core message on curation rather than software:

“gEVAL is tied into our local infrastructure and as such sadly not portable, yet fully publicly accessible at geval.org.uk.”

“The pipeline that GRIT deploys has much evolved since its first implementation [10], and is now so closely tied into the Wellcome Sanger Institute’s internal data structure that it cannot be ported, but is described here as an example of a successful implementation that mixes automated and manual processes and significantly improves genome assemblies in a time and resource sensitive way that allows its use within high-throughput projects. All assembly projects loaded into gEVAL are publicly accessible at geval.org.uk.”

The gEVAL functionality can be largely replicated with any tool that visualises sequence and accepts sequence annotation overlays. In the manuscript, we recommend using ASSET, as it also provides the multi-data analyses that are the core of gEVAL. We have extended the text to make this clearer by adding:

“ASSET evaluates multiple data types in parallel and is therefore an excellent tool to assess and visualise potential misassemblies [32].”

3) Finally, I think that it would be hugely beneficial for readers to have a workflow figure with their recommendations incorporated from the initial coherence check to final ordering and orientation.

Thank you for this excellent suggestion, we completely agree and have provided a workflow (Fig. 1) to summarise our recommendations for assembly curation.

Specific comments:

line 100 - extra period at end of sentence

removed

line 106 - spell out Segmental Duplication Assembler.

done

line 113 - comma after "For polishing"

inserted

line 117 - clarify that they can be assembled independently from the raw reads used for genome assembly.

amended: “They can be assembled independently from the raw reads, e.g. using the mitoVGP pipeline”

line 118-119 - This is confusing, it was just stated above that the organelle genome must be included for polishing and now this says to process it independently.

changed from “Contigs/scaffolds that represent the organelle genomes should be identified and processed independently of the primary, nuclear assembly.” to “Contigs/scaffolds that represent the organelle genomes should be identified and submitted as such to the INSDC archives.” to specify that the different handling applies to the submission process.

line 208 - typo "gata"

corrected to “data”

line 223 - provide a link to a public code repository with the nextflow pipeline

This pipeline is intricately intertwined with the Sanger infrastructure and it would require additional staff to rewrite it to make it publicly useable. A similar public pipeline already exists, and we have added it to the manuscript:

“Before being loaded into gEVAL, all assemblies are run through a nextflow [39] pipeline that performs contamination detection and separation or removal as described in Table 1, combined with removal of trailing Ns [39]. Brief manual checking of the results prevents the erroneous removal of regions likely derived from horizontal gene transfer. This pipeline was inspired by the contamination checking process conducted by Genbank [40].”
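
As an illustration only of the "removal of trailing Ns" step mentioned in the quoted passage, a minimal Python sketch follows. It is not the authors' nextflow pipeline, which is not public; the function names, the in-memory dictionary of sequences and the decision to drop sequences that consist solely of Ns are all assumptions made for this example.

```python
def strip_trailing_ns(sequence: str) -> str:
    """Remove any run of N/n characters from the end of a sequence.

    Internal gap Ns are left untouched; only the trailing run is removed.
    """
    return sequence.rstrip("Nn")


def clean_records(records: dict) -> dict:
    """Apply trailing-N removal to every sequence in a name -> sequence mapping.

    Assumption for this sketch: sequences that become empty after trimming
    (i.e. consisted only of Ns) are dropped from the result.
    """
    cleaned = {}
    for name, seq in records.items():
        trimmed = strip_trailing_ns(seq)
        if trimmed:
            cleaned[name] = trimmed
    return cleaned


# Hypothetical scaffold names and toy sequences, for illustration only.
example = {"scaffold_1": "ACGTNNACGTNNNN", "scaffold_2": "NNNN"}
result = clean_records(example)
```

In this sketch the internal gap in scaffold_1 is preserved, since gaps inside a scaffold usually carry meaning, while the terminal run of Ns is removed.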

Figure 1: This example is a little confusing. It looks like some of the bionano maps agree with the join and span the drop in pacbio read coverage.

The confusion was likely caused by the lack of annotation on the in silico digest tracks and the BioNano map alignments’ colour scheme. We have extended the feature track annotation to the in silico tracks and added further explanations to the figure legend (yellow = aligned, beige = not aligned BioNano map).

References

Kerstin, H., William, C., Joanna, C., Sarah, P., Damon-Lee, P., Ying, S., James, T., Alan, T., Jonathan, W. Significantly improving the quality of genome assemblies through curation. GigaScience.
