Review of A single chromosome assembly of Bacteroides fragilis strain BE1 from Illumina and MinION nanopore sequencing data

Content of review 1, reviewed on September 11, 2015

This manuscript describes an easy workflow for generating a complete bacterial genome assembly from a hybrid dataset of Illumina MiSeq and Oxford Nanopore MinION reads. It is the first such publication and therefore of interest to researchers wishing to perform such assemblies. The paper is a useful guide on how to run hybrid assemblies and how to finish the genomes with available tools. The methods used can be considered standard for such data and are available to all researchers. The manuscript is well written.

There is no ground truth for the genome sequence of this species, thus non-reference-based methods are needed to validate the assembly. The authors validate their single-contig assembly by mapping the reads back to the assembly, and report some basic mapping metrics. The assembly is compared to the sequence of a closely related species with a reference genome in GenBank, and regions of difference are reported to have been manually inspected. However, we feel that a more thorough documentation of the validity of the genome assembly is warranted. Suggestions include showing the alignment to the NC_003228 genome (e.g. using Mauve or mummerplot) and showing alignments of the reads at the regions of difference. This would demonstrate why the hybrid assembly is so useful, and it would give a little more information on the specifics of BE1. Could a round of base polishing (using a tool such as Pilon) be used to validate the per-base accuracy, arguing that few bases should be changed by such a program if the assembly is of high quality?

One of our concerns with the manuscript as it stands is the lack of sufficient detail to allow the results to be fully reproduced and the pipeline to be reused by other researchers. Version numbers are mentioned for a few of the programs used, but not for all. Exact commands are also missing for many programs, for example for Trimmomatic, poRe, SPAdes, SSPACE-LongRead, GapFiller, Prokka, bwa, samtools, and LAST.

One of us attempted to reproduce the main results nonetheless, using educated guesses for the different parameters, and was able to perform many steps, but not always with the same outcome. During the process, the following was noted:

- There are seven md5 checksum files in the MinION dataset that do not have a corresponding fast5 file.
- The reviewer struggled to get the poRe dependencies installed due to factors beyond his control, and decided to use poretools (poretools.readthedocs.org) instead to extract the 2D reads in fastq format. The command used was:
    poretools fastq --type 2D FAA37759_GB2974_MAP005_20150423__2D_basecalling_v1.14_2D/ >FAA37759_GB2974_MAP005_20150423_2D.fastq
- Details on the Trimmomatic command were missing, and a best guess based on information from the manuscript ("Sequencing adapters were removed, as were bases less than Q20. Any reads less than 126bp in length after trimming were discarded.") did not result in an identical number of trimmed reads (656976 instead of 898420 reads). The command used was:
    java -jar trimmomatic-0.33.jar PE -threads 24 -phred33 ERR973713_1.fastq.gz ERR973713_2.fastq.gz ERR973713_forward_paired.fq.gz ERR973713_forward_unpaired.fq.gz ERR973713_reverse_paired.fq.gz ERR973713_reverse_unpaired.fq.gz ILLUMINACLIP:/path/to/Trimmomatic-0.33/adapters/NexteraPE-PE.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:126
- Based on the last author's blog post (https://biomickwatson.wordpress.com/2015/08/23/assembling-b-fragilis-from-minion-and-illumina-data/), an assembly was performed using untrimmed reads, with the following SPAdes command (version 3.5.0):
    spades.py -o spades_fragilis_raw_ilmn_2D -t 16 -1 ERR973713_1.fastq.gz -2 ERR973713_2.fastq.gz --nanopore FAA37759_GB2974_MAP005_20150423_2D.fastq
  This resulted in an assembly whose five longest scaffolds had lengths of 3980468, 827231, 362398, 13363 and 5146 bp, respectively. With the exception of the second largest scaffold (a 6 bp difference), these lengths are identical to those reported in the paper.
- The manuscript mentions "removal of short and/or low-coverage contigs" but does not describe the cutoffs, or how the coverage was obtained. No attempt was made to determine the per-scaffold coverage other than using the numbers reported in the scaffold IDs.
- Scaffolding using SSPACE-LongRead resulted in a single scaffold of 5188980 bp after the first round. This is in contrast to what was described in the manuscript, where a second scaffolding round was needed to achieve that. Also, the scaffold obtained was 23 bp longer. The command run was:
    perl SSPACE-LongRead.pl -c top_five_scaffolds.fasta -p FAA37759_GB2974_MAP005_20150423_2D.fastq
- Two gaps remained, both just over 300 bp, but neither gap filling nor annotation was attempted.
- Reads were mapped using bwa 0.7.12 and comparable mapping statistics were obtained (using samtools flagstat).
- No attempts were made to reproduce figures 2 and 3.

Finally, the authors may wish to compare their method with the one described in http://www.biomedcentral.com/1471-2164/16/327

The GigaScience website suggests we look at the following aspects, which we reproduce with our comments below each of them:

Is the rationale for collecting and analyzing the data well defined?
Yes, although the dataset cannot really be called "large-scale" within the context of its field.

Is it clear how data was collected and curated?
Yes.

Is it clear - and was a statement provided - on how data and analyses tools used in the study can be accessed?
Not completely, see above.

Are accession numbers given or links provided for data that, as a standard, should be submitted to a community approved public repository?
Yes.

Is the data available in the public domain under a Creative Commons license?
The ENA seems to have waived rights (https://www.ebi.ac.uk/ena/standards-and-policies), so tentatively yes.

Are the data sound and well controlled?
One can argue about controls for this type of data, but having the corresponding Illumina data from the same samples suffices in our opinion.

Is the interpretation (Analysis and Discussion) well balanced and supported by the data?
Yes.

Are the methods appropriate, well described, and include sufficient details and supporting information to allow others to evaluate and replicate the work?
No, see comments above.

What are the strengths and weaknesses of the methods?
Possible improvements are discussed above. The paper describes the strengths very well. The method does not have any inherent weaknesses.

Have the authors followed best-practices in reporting standards?
The authors have not employed any of the checklists or workflow management systems described under this point.

Can the writing, organization, tables and figures be improved?
Figures are discussed above. The writing and organisation suffice.

-------------

Minor and more detailed comments (using page numbering as at the bottom of the provided PDF):

- p. 3 First sentence: "... a major cause soft tissue infections." Please add "of".
- p. 3 "Illumina's higher-throughput sequencers produce up to 1.8 terabases of sequence per run" Please specify which instrument is meant (HiSeq X), as the NextSeq 500 could also be considered a higher-throughput sequencer.
- p. 3 "Whilst PacBio assemblies are of higher quality, they come at approximately 3-4 times the cost". Cost compared to what approach?
- p. 4 "By using a hairpin adapter, each molecule is read twice" Can this be made clearer, e.g. "By attaching a hairpin adapter to one end of the target molecules during library preparation, each molecule is read twice"?
- p. 4 Which Vrije Universiteit, from which city? There is more than one Vrije Universiteit in the world...
- p. 6 "...primed with sequencing buffer then 220ng of freshly prepared library diluted in sequencing…" Should there be a comma after 'buffer'?
- p. 7 "Mapping statistics were calculated using count-errors.py [28], modified slightly to work with our read IDs." Please provide a copy of the final count-errors.py script.
- p. 8 "The 2D alignment lengths were all approximately equal to the read length, albeit with a slight tendency for the alignment length to be greater than 2D sequence length" Please explain why. Is this due to deletions in the MinION reads?
- p. 9 "the assembly was created using free, open-source bioinformatics tools" Is SSPACE-LongRead really open-source software? It is free for academic use, but not open source, as far as I can tell...
- Figures: it would help the reader if you could provide the code (probably written in R) that was used to generate figures 1-3.

Signed: Lex Nederbragt and Thomas Haverkamp, Centre for Ecological and Evolutionary Synthesis (CEES), Dept. of Biosciences, University of Oslo, Norway

Level of interest
Please indicate how interesting you found the manuscript: An article whose findings are important to those with closely related research interests

Quality of written English
Please indicate the quality of language in the manuscript: Acceptable

Declaration of competing interests
Please complete a declaration of competing interests, considering the following questions:
1. Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
2. Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?
3. Do you hold or are you currently applying for any patents relating to the content of the manuscript?
4. Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?
5. Do you have any other financial competing interests?
6. Do you have any non-financial competing interests in relation to this paper?
If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.

I declare that I have no competing interests

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
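
As an illustration of the level of detail requested above for the "removal of short and/or low-coverage contigs" step, the following minimal Python sketch shows how explicit cutoffs could be applied and reported. It is not the authors' method. The cutoff values, the coverage figures, the sixth scaffold and the function names are all hypothetical, and the sketch assumes SPAdes-style headers of the form NODE_<n>_length_<bp>_cov_<x>. Only the five scaffold lengths are taken from the reproduction attempt described in this review.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumption: SPAdes-style scaffold headers, e.g. "NODE_1_length_3980468_cov_25.1_ID_1".
HEADER_RE = re.compile(r"NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")


class Scaffold(NamedTuple):
    name: str
    length: int      # scaffold length in bp, as reported in the header
    coverage: float  # k-mer coverage, as reported in the header


def parse_headers(headers: Iterable[str]) -> List[Scaffold]:
    """Extract length and coverage from each FASTA header line."""
    scaffolds = []
    for header in headers:
        match = HEADER_RE.search(header)
        if match is None:
            raise ValueError(f"unrecognised header: {header!r}")
        name = header.lstrip(">").strip()
        scaffolds.append(Scaffold(name, int(match.group(2)), float(match.group(3))))
    return scaffolds


def filter_scaffolds(scaffolds: Iterable[Scaffold],
                     min_length: int,
                     min_coverage: float) -> List[Scaffold]:
    """Keep scaffolds that meet both the length and the coverage cutoff."""
    return [s for s in scaffolds
            if s.length >= min_length and s.coverage >= min_coverage]


# Hypothetical cutoffs; the manuscript does not state the values used.
MIN_LENGTH = 1000
MIN_COVERAGE = 5.0

# Lengths of the first five scaffolds come from the review text.
# All coverage values and the sixth scaffold are invented for illustration.
EXAMPLE_HEADERS = [
    ">NODE_1_length_3980468_cov_25.1_ID_1",
    ">NODE_2_length_827231_cov_24.8_ID_3",
    ">NODE_3_length_362398_cov_26.0_ID_5",
    ">NODE_4_length_13363_cov_30.2_ID_7",
    ">NODE_5_length_5146_cov_28.4_ID_9",
    ">NODE_6_length_800_cov_1.5_ID_11",
]

if __name__ == "__main__":
    kept = filter_scaffolds(parse_headers(EXAMPLE_HEADERS), MIN_LENGTH, MIN_COVERAGE)
    for scaffold in kept:
        print(scaffold.name, scaffold.length, scaffold.coverage)
```

Reporting the two cutoffs alongside the command lines would let others reproduce which scaffolds were carried forward to the scaffolding step.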

References

Risse, J., Thomson, M., Patrick, S., Blakely, G., Koutsovoulos, G., Blaxter, M., Watson, M. 2015. A single chromosome assembly of Bacteroides fragilis strain BE1 from Illumina and MinION nanopore sequencing data. GigaScience.
