##### Content of review 1, reviewed on February 21, 2014

Paper: Applying compressed sensing to genome-wide association studies by Shashaank Vattikuti, James J Lee, Christopher C Chang, Stephen D H Hsu and Carson C Chow

I am not a geneticist, so I may well use terminology that betrays a neophyte.

The authors follow the current state of the art in GWAS analysis in order to establish a connection between SNPs and traits (height, ...) in a large population. The problem is set up as a simple linear algebra problem:

A x = b (1)

where A represents the SNPs of a population, with values 0, 1, or 2; b represents the traits found in the population (here height); and x represents the regression coefficients linking the elements of the genome (A) to the traits (b).

A has many more columns than rows, which makes this system of linear equations highly underdetermined; as such, there is an infinite number of solutions x. The authors take the viewpoint that, in order to produce a reasonable connection between traits (height, ...) and SNPs (gene loci), x must be sparse; in fact, it must be the sparsest.

Instead of using least squares or even marginal regression, which do not take into account the sparsity of the regression coefficients, the authors cast the problem as a LASSO and use an L1-minimization, sparsity-seeking solver to find the sparsest solution of this underdetermined system of linear equations.
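To make this concrete, here is a minimal, self-contained sketch (not the authors' pipeline; the dimensions, regularization weight, and the `ista` helper are my own illustrative choices) of recovering a sparse x from a noiseless underdetermined Ax = b via an L1 solver, using plain iterative soft-thresholding:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy underdetermined system: n = 60 "subjects", p = 200 "SNPs", s = 5 causal loci.
n, p, s = 60, 200, 5
A = rng.standard_normal((n, p)) / np.sqrt(n)       # standardized sensing matrix
x_true = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
x_true[support] = rng.choice([-1.0, 1.0], size=s)  # effects drawn from {-1, +1}
b = A @ x_true                                     # noiseless "traits"

def ista(A, b, lam=0.01, n_iter=3000):
    """Iterative soft-thresholding for min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                  # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - A.T @ (A @ x - b) / L              # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-threshold
    return x

x_hat = ista(A, b)
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))  # small relative error
```

Because n comfortably exceeds s log p here, the L1 solution essentially coincides with the sparse ground truth, which least squares (which returns a dense minimum-norm solution) would not.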

Some historical perspective first: since 2004, the field of compressive sensing has drawn much interest from the engineering and science communities because it has clarified the connection between underdetermined systems of linear equations (a typical problem in what is nowadays called Big Data), the sparsest solutions to such systems, and polynomial-time reconstruction solvers. Before 2004, such solvers were thought to be combinatorial in nature, and where polynomial-time solvers did exist, little was known about their applicability (i.e., the types of linear systems to which they could be applied). With the arrival of polynomial-time solvers and the attendant applicable measurement ensembles (sensing matrices), sharp phase transitions [a] were discovered in the parameter space of these problems, delimiting the region between full and no recovery of the sparsest solutions. Equivalently, these phase transitions delimit the regions between polynomial-time and combinatorial solvers.

Because of the size of the problems, searching for the sparsest solution yields two possibilities:

- Combinatorial solvers: these are simply not foreseeable investigation tools for very large GWAS problems that link gene loci to a genetic trait (height, ...).

- L1 solvers and their limitations: all is not lost, however, as the authors can use results from the deeper high-dimensional combinatorial geometry surfacing in convex programming (L1 solvers) [a] to evaluate, using model (1), the population sampling requirements and the number of regression coefficients in GWAS. It should be said that the phase transitions arise from the assumption of sparsity on the solution being sought. While least squares or marginal regression might be more convenient methods, they do not enforce solution simplicity.
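The phase transition itself can be seen in a toy sweep (all parameters are purely illustrative, and a simple iterative soft-thresholding LASSO solver stands in for the paper's solver): below a critical sparsity s the support is recovered every time, and above it recovery collapses.

```python
import numpy as np

rng = np.random.default_rng(1)

def ista(A, b, lam=0.02, n_iter=1500):
    # Plain iterative soft-thresholding as a stand-in L1 (LASSO) solver.
    L = np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - A.T @ (A @ x - b) / L
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return x

n, p, trials = 40, 100, 3
success = {}
for s in (2, 6, 12, 24, 36):
    wins = 0
    for _ in range(trials):
        A = rng.standard_normal((n, p)) / np.sqrt(n)
        x0 = np.zeros(p)
        supp = rng.choice(p, size=s, replace=False)
        x0[supp] = rng.choice([-1.0, 1.0], size=s)
        x_hat = ista(A, A @ x0)
        top = np.argsort(np.abs(x_hat))[-s:]   # declare the s largest entries "selected"
        wins += set(top) == set(supp)
    success[s] = wins / trials
print(success)   # recovery rate drops sharply as s crosses the transition
```

The location of the transition as a function of n/p and s/n is exactly what the Donoho-Tanner results [a] predict.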

However, because model (1) may not be optimal for a number of reasons ("Missing genotypes were replaced with 0's," for instance, is not a simple "calibration" issue), the exact recovery observed in compressive sensing is not observed here. This can be seen readily in Figure 1: in compressive sensing, the blue region of Figure 1 would correspond to perfect recovery, whereas in this paper only a very large subset of the solution is recovered (up to 80 percent of the amplitude). One can nonetheless readily observe a relatively sharp phase transition in Figure 1, and the authors inspect new measures of recovery beyond NE (the mu P-value, FPR, and PPV) in order to evaluate whether a solution is close to an optimal one. In the paper, those parameters are shown to follow the sharp phase transitions of the ideal (compressive sensing) setting. Of particular importance is the mu P-value (the median P-value of the L1-selected nonzeros), because it can be computed without knowing the solution in advance, as is the case in most GWAS investigations.
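For readers unfamiliar with these recovery measures, here is a small sketch of NE, FPR, and PPV in their standard forms (the `recovery_metrics` helper is hypothetical, and the paper's exact conventions may differ slightly):

```python
import numpy as np

def recovery_metrics(x_true, x_hat, tol=1e-6):
    """NE, FPR and PPV for an estimated coefficient vector (standard definitions)."""
    true_set = np.abs(x_true) > tol          # true nonzeros
    sel = np.abs(x_hat) > tol                # nonzeros selected by the solver
    tp = np.sum(sel & true_set)              # true positives
    fp = np.sum(sel & ~true_set)             # false positives
    fpr = fp / max(np.sum(~true_set), 1)     # false positive rate
    ppv = tp / max(np.sum(sel), 1)           # positive predictive value
    ne = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)  # normalized error
    return ne, fpr, ppv

# Tiny worked example: one spurious selection alongside two true nonzeros.
ne, fpr, ppv = recovery_metrics(np.array([0.0, 1.0, 0.0, -1.0, 0.0]),
                                np.array([0.0, 0.9, 0.1, -1.1, 0.0]))
print(ne, fpr, ppv)  # fpr = 1/3, ppv = 2/3
```

Unlike NE, FPR, and PPV, the mu P-value does not require x_true, which is why it matters for real GWAS data.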

The results with heritability less than 1 are consistent with the noisy phase transitions (multiplicative noise) found in the literature (the first occurrence of which is in [b]). One wonders whether newer approaches such as [c] could help in this setting.

The fact that the phase transition can be observed in a "dirtier" setting than plain compressive sensing is new and is likely to lead to sharper modelling in future GWAS work. This paper and its findings are, as far as I can tell, absolutely novel and, in my humble view, constitute a gateway, as they ought to provide future direction for GWAS studies.

This paper makes a connection between a central problem in the large datasets found in GWAS and deep high-dimensional combinatorial geometry, and establishes how some of these very high-dimensional problems ought to be approached in the future.

There will be a "before" and "after" this paper.

[a]. Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing, David L. Donoho, Jared Tanner, Phil. Trans. R. Soc. A 2009 367, doi: 10.1098/rsta.2009.0152, published 5 October 2009

<http://rsta.royalsocietypublishing.org/content/367/1906/4273.full.pdf+html>

[b]. Compressive Radar Imaging Using White Stochastic Waveforms by Mahesh C. Shastry, Ram M. Narayanan and Muralidhar Rangaswamy. http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5592367&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D5592367

[c]. Statistical Estimation and Testing via the Sorted L1 Norm, M. Bogdan, E. van den Berg, W. Su, and E. J. Candès, http://statweb.stanford.edu/~candes/SortedL1/

**Major Compulsory Revisions**

none.

**Minor Essential Revisions**

**Discretionary Revisions**

The authors should be clearer about the sets of values taken by the elements of matrix A, of x, and of b after the different normalizations undertaken. For each of the examples (height, chromosome 22) there should be a clear description of every element of the linear algebra problem.

The mu P-value parameter is very important in this paper, but there is no proper definition of it. There should be one.

Could the authors clarify why they chose a value of 0.5 for h^2?

The paper should probably include this, in my view, central reference:

[a]. Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing, David L. Donoho, Jared Tanner, Phil. Trans. R. Soc. A 2009 367, doi: 10.1098/rsta.2009.0152, published 5 October 2009

http://rsta.royalsocietypublishing.org/content/367/1906/4273.full.pdf+html

- In the text one can read: "generalized Proposition 1 to several non-normal distributions [33]"

Reference [33] refers to deterministic constructions of the measurement matrices; it should be noted that [a] already exhibits a very diverse set of random measurement ensembles that fulfill the phase transition mentioned in the paper. The authors probably need to make sure the reader does not confuse the distribution of the measurement/sensing matrix with the distribution of the underlying unknown vector.

- In the text one can read: "Missing genotypes were replaced with 0's after standardization."

I think the authors need to state that this assumption is really a good avenue for improvement in future work, as there is really no expectation that 0 is a good answer.

- In the text one can read: "While possibly overly conservative, a distance of 500 kb was judged close enough for one SNP to be a proxy for another."

Isn't this criterion more stringent, not more conservative?

- in the text, one can read "The magnitudes of the s nonzeros in x were chosen from the set -1; 1"

Shouldn't the magnitudes range over the closed (square-bracket) interval [-1, 1] rather than being restricted to only the two values {-1, 1}? See comment 1.

- In the text, one can read "The datasets were merged but not subjected to imputation. The SNP genotype matrix (A) consisted of 12,464 subjects and 693,385 SNPs. SNPs were coded by their minor allele, resulting in values of 0, 1, or 2. Each column of A was standardized to have mean zero and variance unity. Missing genotypes were replaced with 0's after standardization.

We simulated phenotypes according to Equation 1, rescaling each term to leave the phenotypic variance equal to unity and the variance of the breeding values in Ax to match the target heritability. The magnitudes of the s nonzeros in x were chosen from the set -1; 1."

I think it is important that these data become available with the paper.
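For concreteness, a toy version of the quoted simulation protocol might look like the following (the dimensions, MAF range, and missingness rate are my own illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy scale: n subjects, p SNPs coded 0/1/2, s causal SNPs, target heritability h2.
n, p, s, h2 = 500, 2000, 20, 0.5
maf = rng.uniform(0.05, 0.5, size=p)                 # minor-allele frequencies
A = rng.binomial(2, maf, size=(n, p)).astype(float)  # genotypes, minor-allele coded

# Standardize each column to mean 0, variance 1, then zero out "missing" entries
# (mirroring "missing genotypes were replaced with 0's after standardization").
A = (A - A.mean(axis=0)) / A.std(axis=0)
missing = rng.random((n, p)) < 0.01                  # 1% missingness, for illustration
A[missing] = 0.0

# Sparse effects drawn from {-1, +1}, rescaled so the breeding values Ax have
# variance h2 and the total phenotypic variance is 1 (noise supplies 1 - h2).
x = np.zeros(p)
supp = rng.choice(p, size=s, replace=False)
x[supp] = rng.choice([-1.0, 1.0], size=s)
g = A @ x
g *= np.sqrt(h2) / g.std()
y = g + rng.standard_normal(n) * np.sqrt(1.0 - h2)   # simulated phenotype
print(np.var(g), np.var(y))
```

Sharing the actual A, x, and y used in the paper would let readers reproduce Figures 1-6 directly rather than approximate them this way.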

- In the text, one can read " The datasets were merged but not subjected to imputation. The SNP genotype matrix (A) consisted of 12,464 subjects and 693,385 SNPs. SNPs were coded by their minor allele, resulting in values of 0, 1, or 2. Each column of A was standardized to have mean zero and variance unity. Missing genotypes were replaced with 0's after standardization."

Unless I am reading this wrong, the datasets are in fact subject to imputation, since missing genotypes were replaced with 0's. If I misunderstood the text, the authors probably need to clarify those sentences.

- In the text, one can read: "The null PPV* derived from randomly chosen SNPs, however, was smaller than the observed PPV* (Figure 6A); this was consistent with the detection of some true signal."

The authors probably need one more sentence to explain this better.

- In the text, one can read "may show a similar phase transition—although CS theory suggests that, among convex optimization methods, those within the L1 class are closest to the optimal combinatorial L0 search."

As stated elsewhere in the paper, this phase transition also depends on the statistics of the unknown x, and other solvers, such as AMP solvers, can help recover those sharper phase transitions. There is still a fierce ongoing debate, as I type this, on the applicability of AMP solvers to a wide variety of sensing matrices.

- In the text, one can read "In one set of simulations, both methods clearly outperformed ridge regression (a non-L1 method), which exhibited no phase transition away from poor performance. What this suggests to us is that the factor s log p appearing in many CS theorems is fundamental and cannot be circumvented. Fortunate fine-tuning of the prior distribution in a Bayesian method may shrink the constant factor, but the requirement that n > s will persist."

This is probably a little pessimistic: by using additional prior information on x (such as block sparsity, etc.), this requirement may fall; this is the subject of current investigation in compressive sensing.

**Level of interest**: An exceptional article

**Quality of written English**: Acceptable

**Statistical review**: Yes, but I do not feel adequately qualified to assess the statistics.

**Declaration of competing interests**: I declare that I have no competing interests

##### References

Vattikuti, S., Lee, J. J., Chang, C. C., Hsu, S. D. H., Chow, C. C. 2014. Applying compressed sensing to genome-wide association studies. GigaScience.