Review of NxRepair: error correction in de novo sequence assembly using Nextera mate pairs

Content of review 1, reviewed on January 14, 2015

Basic reporting

- The introduction refers to "assembly errors" but does not distinguish between types of error, such as SNPs, indels, or contig-joining mistakes
- The terms "insert size", "mate pair" and "paired end" are not explained; many readers may not be familiar with these concepts
- No reference for "Nextera Mate Pair" is given
- The existing tools REAPR and ALE are described, and a "Bayesian" method is mentioned, but no motivation is given for mentioning it
- The phrase "de novo" should be italicised
- "de Bruijn Graph" should use a lowercase "graph"
- A space is missing at "W is 200 bases"
- The interval [i-W, i+W] contains 2W+1 positions, not 2W as reported
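
The last point can be checked with a short sketch. It assumes the interval is closed at both ends and indexed by integer base positions; the function name `window_size` is hypothetical and not taken from the manuscript.

```python
def window_size(i: int, W: int) -> int:
    """Number of integer positions in the closed interval [i - W, i + W]."""
    # range() excludes its upper bound, so add 1 to include position i + W.
    return len(range(i - W, i + W + 1))


# With W = 200 as in the manuscript, the window covers 401 positions, not 400.
print(window_size(1000, 200))
```

If the authors intend a half-open interval such as [i-W, i+W), the size is 2W, but this should then be stated explicitly.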

Experimental design

- It is not clear whether you re-sequenced exactly the same strains as the NCBI reference genomes, or where these strains were obtained.
- Versions of software (BWA, SAMtools, etc.) need to be reported
- BWA was used with default parameters, which produces many partially mapped reads and alternative mappings. It is unclear how NxRepair handled these.
- It should be made clearer that you are using the same reads for both assembly and post-assembly correction
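
To make the question about partial and alternative mappings concrete, here is a minimal sketch of one possible filter. It is not a description of what NxRepair does. The FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary) and the CIGAR clipping operators (S, H) follow the SAM specification; the function names and the thresholds `min_mapq` and `max_clip` are assumptions chosen for illustration.

```python
import re

UNMAPPED = 0x4         # SAM FLAG bit: read is unmapped
SECONDARY = 0x100      # SAM FLAG bit: secondary (alternative) alignment
SUPPLEMENTARY = 0x800  # SAM FLAG bit: supplementary (chimeric) alignment

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipped_fraction(cigar: str) -> float:
    """Fraction of the read that is soft- or hard-clipped (partially mapped)."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Operators that consume read bases, plus hard clips removed from SEQ.
    total = sum(n for n, op in ops if op in "MIS=XH")
    clipped = sum(n for n, op in ops if op in "SH")
    # A missing CIGAR ("*") gives total == 0; treat the read as fully clipped.
    return clipped / total if total else 1.0


def is_usable(flag: int, mapq: int, cigar: str,
              min_mapq: int = 20, max_clip: float = 0.1) -> bool:
    """Keep only confident, primary, mostly full-length alignments.

    min_mapq and max_clip are illustrative thresholds, not recommended values.
    """
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    if mapq < min_mapq:
        return False
    return clipped_fraction(cigar) <= max_clip
```

Stating whether a comparable filter was applied, and with which thresholds, would answer the point above.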

Validity of the findings

- The sequencing data is only available on Illumina BaseSpace. This needs to be rectified by depositing the reads in a study on NCBI SRA or in ENA so they are guaranteed to remain publicly available.
- Table 1 can be improved by adding the full species name, the genome size, and the global mate pair statistics that were estimated
- Some measure of the yield, quality and average read length (after clipping) should be provided
- It is claimed that NxRepair fixed 6 of 9 genomes, but Table 1 shows changes to only 3 of the 9 genomes.

Comments for the author

- Could this method be incorporated into SPAdes? SPAdes already re-aligns the reads with BWA to correct some errors, so adding a mate pair (MP) consistency check would be good.
- Do you really need the interval tree data structure, or could the statistics you need be computed in a single pass?
- The use of a uniform distribution for the non-MP reads was interesting. I would have thought most non-MP reads are shadow paired-end (PE) reads, so their distribution would be Gaussian with a low mean and a smaller standard deviation, rather than uniform.
- The trimming step when breaking an identified misassembly concerns me. Does this mean a chunk of genomic DNA is removed from the final result, so that genes could be lost?
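
As a rough illustration of the single-pass question above, the sketch below computes, for every position in a contig, the number of spanning mate pairs and their mean insert size using difference arrays and one cumulative sum, with no interval tree. It assumes each pair is given as a closed 0-based interval (start, end) within the contig; the function name `spanning_stats` is hypothetical, and whether these summary statistics are sufficient depends on exactly what NxRepair evaluates per window.

```python
from itertools import accumulate


def spanning_stats(pairs, contig_len):
    """Per-position count and mean insert size of spanning mate pairs.

    pairs: iterable of (start, end), closed 0-based intervals with
           0 <= start <= end < contig_len (an assumption of this sketch).
    """
    count_diff = [0] * (contig_len + 1)
    size_diff = [0] * (contig_len + 1)
    for start, end in pairs:
        insert = end - start + 1
        # A pair contributes from position start up to and including end.
        count_diff[start] += 1
        count_diff[end + 1] -= 1
        size_diff[start] += insert
        size_diff[end + 1] -= insert
    # One cumulative sum turns the per-pair deltas into per-position totals.
    counts = list(accumulate(count_diff))[:contig_len]
    sums = list(accumulate(size_diff))[:contig_len]
    means = [s / c if c else None for s, c in zip(sums, counts)]
    return counts, means


counts, means = spanning_stats([(0, 4), (2, 6)], contig_len=8)
print(counts)  # [1, 1, 2, 2, 2, 1, 1, 0]
```

The cost is linear in the number of pairs plus the contig length. An interval tree may still be justified if the per-window computation needs the individual pairs rather than sums over them.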
