Content of review 1, reviewed on September 09, 2015

Manuscript: OPTIMA: Sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis

Authors: D Verzotto, ASM Teo, AM Hillmer and N Nagarajan

Brief description: The authors develop an alignment algorithm and a statistical scoring procedure to compare restriction-like single-molecule maps to sequence. Such computations are central in computational pipelines for high-throughput mapping, such as optical mapping. The new alignment method, called OPTIMA, uses a seed-and-extend glocal scheme to align continuous-valued mapping data, scoring an observed map against an in silico digestion of a reference genome. The scoring procedure uses Z-score statistics computed from the population of feasible alignments to evaluate characteristics of an optimal alignment. Numerical experiments compare OPTIMA to competing approaches.

Comments:

I enjoyed this generally well-written and interesting article on genome alignment. I think the general problem is important and the proposed solution, as far as I can decode it, is well reasoned and supported by useful numerical experiments. My comments are primarily for the purpose of clarification.

1. Gentig is usually described as an assembly algorithm, which of course involves the alignment primitive (e.g. Mendelowitz, Lee, and Mihai Pop. "Computational methods for optical mapping." GigaScience 3.1 (2014): 33, which might be good to cite.) Insofar as Gentig is being used as an aligner in the present context, this point should be emphasized.

2. The operation of the statistical procedure is not clear. It appears that each Rmap is scored for alignment against all feasible matches in the reference, building a population of multiple feature scores that drive Z-score construction. The alignment with the optimal score need not be the one with the most significant Z-score.

a. The first question is to clarify that this is so, i.e. that the optimal scoring alignment need not be the reported one.
b. A clear rationale for using the population of sub-optimal feasible alignments as the reference population for significance/uniqueness is not provided.
c. Rmap-specific final Z-scores are computed from `orthogonal' (presumably `independent') component Z-scores on the number of matches, the number of cut errors, and the WHT sizing-error statistic. If such component Z-scores are independent, then the combination indicated by equation (6) would be straightforward. It seems possible that the components are in fact correlated: if so, the combination suggested by (6) will not have the standard normal null distribution, and would not be the ideal way to combine the scores. Can the authors confirm at least approximate independence in some realistic setting?
d. At the bottom of page 6, the authors write: ``We start by identifying a set F of orthogonal features, with respect to random alignments, that are expected to follow the Normal distribution (e.g. under the law of large numbers)''
The meaning of `random alignment' is not clear, but it seems to mean alignment of the fixed Rmap to various feasible locations in the reference. Again (cf. b), this calibration is not well justified.
Also, since the claim concerns a limiting Normal distribution, might the central limit theorem, rather than the law of large numbers, be the relevant result?
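
To make the concern in (c) concrete, here is a minimal sketch. It assumes, purely for illustration, that the component Z-scores are combined by an equal-weight Stouffer-style sum, Z = (Z_1 + ... + Z_k)/sqrt(k); equation (6) of the manuscript is not reproduced here and may take a different form. The function name and the correlation value 0.3 are hypothetical. Under this assumption, the null variance of the combined score is w^T R w, where R is the correlation matrix of the components, and it equals 1 only when R is the identity.

```python
import numpy as np


def combined_z_null_variance(corr):
    """Null variance of Z = sum(Z_i) / sqrt(k) for unit-variance components.

    corr: k x k correlation matrix of the component Z-scores.
    ASSUMPTION: equal-weight Stouffer-style combination, used here only as
    an illustration; it is not taken from the manuscript's equation (6).
    """
    corr = np.asarray(corr, dtype=float)
    k = corr.shape[0]
    w = np.full(k, 1.0 / np.sqrt(k))  # equal weights with w.w = 1
    return float(w @ corr @ w)


# Three components (e.g. matches, cut errors, sizing error).
independent = np.eye(3)

rho = 0.3  # hypothetical pairwise correlation between components
correlated = np.full((3, 3), rho) + (1.0 - rho) * np.eye(3)

var_indep = combined_z_null_variance(independent)  # identity -> variance 1
var_corr = combined_z_null_variance(correlated)    # 1 + (k - 1) * rho
```

With three components and a pairwise correlation of 0.3, the combined score has null variance 1 + 2(0.3) = 1.6 rather than 1, so p-values read off a standard normal would be anti-conservative. If the components were jointly normal, dividing the combined score by the square root of this variance would restore unit variance.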

3. The authors repeatedly justify their statistical calculation by indicating that it does not require expensive permutation, evidently in reference to the Sarkar et al. procedure for identifying significant optimal alignments. The emphasis is not warranted. On the one hand, the Sarkar et al. procedure is not particularly expensive computationally, as it uses only a handful of genome permutations plus regression to adjust optimal scores for map effects (alignments are very fast in comparison to assemblies, for example). On the other hand, the authors' calculations do not make specific comparisons to the Sarkar et al. procedure, and so the computational benefits of OPTIMA's statistics over that procedure have not been established in this report.

4. It would be helpful if the authors clarified the relationship between the alignment contributions in the present manuscript and the contributions previously reported in Verzotto et al., RECOMB-Seq 2015.

5. `glocal' may be a buzzword in the alignment field, but I think it ought to be defined.

Authors' response to reviews: (http://www.gigasciencejournal.com/imedia/1163627124192524_comment.pdf)


The reviewed version of the manuscript can be seen here:
http://www.gigasciencejournal.com/imedia/8481523751718841_manuscript.pdf
All revised versions are also available:
Draft - http://www.gigasciencejournal.com/imedia/8481523751718841_manuscript.pdf

Source

    © 2015 the Reviewer (CC BY 4.0).

References

    Verzotto, D., Teo, A. S. M., Hillmer, A. M., Nagarajan, N. 2016. OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis. GigaScience.