Content of review 1, reviewed on March 12, 2014

The paper presents a compressive sensing approach to analyzing genomic data in genome wide association studies.

The main focus is on studying the relationship between the genetic architecture (number of loci) and sample size required to detect the loci via L1-regularized regression (the lasso). The authors predict the number of samples required to identify the associated SNPs for a trait as a function of the number of associated SNPs.

Overall, the paper is clearly written, technically sound and interesting. Some comments are below.

Major (essential):

  1. The authors simulate all SNPs with non-zero effect to have the same effect size on the phenotype (+/- 1). In reality, different SNPs have different effect sizes, so the authors' results might be less valid for the effect-size distributions encountered in practice. A simulation with a more realistic effect-size distribution (possibly with effect sizes depending on allele frequencies) would be beneficial here.

  2. Besides the L1 approach, the authors study the marginal regression (MR) approach, currently perhaps the most popular, in which the phenotype is regressed on each SNP one at a time. However, no direct comparison between the two methods is made. Since the design matrix X is close to orthogonal (because there is LD only between nearby SNPs), I would expect the two approaches to yield similar results. A main question is power: which approach needs fewer samples to recover the true non-zero coefficients (I would expect the lasso), and by what factor? Is the gain negligible or substantial? I think that such a comparison (either theoretical or empirical) would greatly benefit the paper.

  3. The paper's benefit to the community would be greatly increased if the authors provided the code and data used for their simulations, so that others can reproduce the authors' results, simulate under different parameter values, apply the method and benchmark measures proposed by the authors to new datasets, etc.
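
To make major points 1 and 2 concrete, the following is a minimal sketch of the kind of empirical comparison meant, not the authors' pipeline. Everything in it is an assumption chosen for illustration: independent SNPs (no LD), minor allele frequencies drawn uniformly from [0.05, 0.5], Gaussian effect sizes at the causal SNPs, unit noise variance, a plain coordinate-descent lasso, and the regularization value sqrt(2 log(p) / n). The function names (`simulate`, `marginal_scores`, `lasso_cd`, `compare`) are hypothetical.

```python
import math
import random


def simulate(n, p, s, rng):
    """Simulate standardized genotypes for p independent SNPs and a phenotype.

    Assumptions: MAF ~ Uniform(0.05, 0.5), Gaussian effects at s causal SNPs,
    unit-variance Gaussian noise.
    """
    cols = []
    for _ in range(p):
        maf = rng.uniform(0.05, 0.5)
        # Genotype = number of minor alleles (0, 1 or 2).
        g = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
        mean = sum(g) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in g) / n) or 1.0
        cols.append([(v - mean) / sd for v in g])
    support = rng.sample(range(p), s)
    beta = {j: rng.gauss(0.0, 1.0) for j in support}  # heterogeneous effects
    y = [sum(b * cols[j][i] for j, b in beta.items()) + rng.gauss(0.0, 1.0)
         for i in range(n)]
    y_mean = sum(y) / n
    return cols, [v - y_mean for v in y], set(support)


def marginal_scores(cols, y):
    """Absolute marginal regression coefficient of y on each standardized SNP."""
    n = len(y)
    return [abs(sum(x * v for x, v in zip(col, y))) / n for col in cols]


def soft_threshold(z, t):
    """Shrink z towards zero by t."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def lasso_cd(cols, y, lam, sweeps=50):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    Assumes every column is standardized so that its squared norm equals n.
    """
    n = len(y)
    beta = [0.0] * len(cols)
    resid = list(y)
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            rho = sum(x * r for x, r in zip(col, resid)) / n + beta[j]
            new = soft_threshold(rho, lam)
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - delta * x for r, x in zip(resid, col)]
                beta[j] = new
    return beta


def compare(n, p, s, seed=0):
    """Count true causal SNPs found by the lasso and by the top-s MR scores."""
    rng = random.Random(seed)
    cols, y, truth = simulate(n, p, s, rng)
    lam = math.sqrt(2.0 * math.log(p) / n)  # assumed choice for unit noise
    beta = lasso_cd(cols, y, lam)
    lasso_hits = {j for j, b in enumerate(beta) if b != 0.0}
    scores = marginal_scores(cols, y)
    mr_hits = set(sorted(range(p), key=scores.__getitem__, reverse=True)[:s])
    return {"lasso_true": len(lasso_hits & truth),
            "lasso_selected": len(lasso_hits),
            "mr_true": len(mr_hits & truth),
            "mr_selected": len(mr_hits)}


print(compare(n=100, p=150, s=5))
```

A real power comparison would repeat this over many seeds and sample sizes, match the two methods at a common false-positive rate (here MR is simply given the true number of causal SNPs), and use genotypes with realistic LD.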

Minor (essential):

  1. The authors do not explain in detail the theoretical recipe provided by CS theory for choosing the regularization parameter lambda (page 14). What is the role of lambda_min and lambda_max, and how should one choose between them? What is the choice of lambda trying to optimize: prediction accuracy, a guarantee of recovery of the non-zero values, or something else?

  2. Population structure is a common problem in GWAS analysis; it may confound the analysis and yield false positives. It is usually dealt with by principal component corrections or by mixed-effects models when testing each SNP individually (the MR approach). Can the same be done with the lasso approach, or are there difficulties? This point should be addressed in order to apply the lasso approach successfully to real GWAS.

  3. Page 5, top: why would markers with very low frequency make the matrix coherent? Please explain.
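
For reference on minor point 3, here is a small sketch of the quantity in question, assuming the usual CS definition of mutual coherence as the largest absolute normalized inner product between two distinct columns (computed here on centered genotype columns). The toy genotype vectors and the function names are hypothetical. It illustrates one possible mechanism only, offered as an assumption rather than as the authors' explanation: two rare markers that happen to be carried by the same few individuals produce nearly identical columns.

```python
import math


def centered(col):
    """Subtract the column mean."""
    mean = sum(col) / len(col)
    return [v - mean for v in col]


def mutual_coherence(cols):
    """Largest absolute cosine between any two distinct centered columns."""
    vecs = [centered(c) for c in cols]
    norms = [math.sqrt(sum(v * v for v in vec)) for vec in vecs]
    best = 0.0
    for a in range(len(vecs)):
        for b in range(a + 1, len(vecs)):
            if norms[a] == 0.0 or norms[b] == 0.0:
                continue  # a monomorphic marker has no defined direction
            dot = sum(x * y for x, y in zip(vecs[a], vecs[b]))
            best = max(best, abs(dot) / (norms[a] * norms[b]))
    return best


# Two common markers typed in six individuals (toy data).
common = [[0, 1, 2, 0, 1, 2], [0, 0, 1, 1, 2, 2]]
# Two singletons that happen to be carried by the same individual (toy data).
rare = [[1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]

print(mutual_coherence(common))
print(mutual_coherence(rare))
```

In this toy case the common markers have coherence 0.5, while the two singletons have coherence 1 up to rounding. Whether this is the effect the authors have in mind on page 5 is exactly what the comment asks them to clarify.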

Discretionary:

  1. The authors show that the LD structure makes parameter identification harder, as it induces correlations between nearby SNPs, which makes fine mapping hard. But they do not discuss the effect on prediction, that is, predicting the phenotype of an individual; I would expect the effect to be the opposite. More generally, the authors do not discuss prediction at all (except that, I think, the regularization parameter lambda is chosen to optimize prediction error), which is fine but should be more explicitly stated and reasoned. There is a vast literature, including some that the authors cite (e.g. Visscher's papers), in which people try to predict the phenotype of an individual from the genotype and/or estimate the heritability. How would the lasso perform in terms of prediction? Are there also phase transitions with respect to this measure? (Adding a prediction measure to the four estimation and identification measures studied could perhaps help.)
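
As an illustration of the prediction measure suggested above, here is a minimal sketch, assuming the measure is the squared Pearson correlation between observed and predicted phenotypes in individuals not used for fitting. This is one common choice and not something taken from the manuscript. The names `predict` and `prediction_r2` are hypothetical.

```python
def predict(beta, rows):
    """Predicted phenotype for each individual: sum over SNPs of beta_j * x_ij."""
    return [sum(b * x for b, x in zip(beta, row)) for row in rows]


def prediction_r2(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted phenotypes."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (q - mean_p) for t, q in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((q - mean_p) ** 2 for q in y_pred)
    if var_t == 0.0 or var_p == 0.0:
        return 0.0  # a constant vector carries no predictive information
    return cov * cov / (var_t * var_p)


# Toy held-out individuals (rows) and hypothetical fitted coefficients.
held_out_genotypes = [[1, 5, 1], [0, 3, 1], [2, 0, 0]]
fitted_beta = [1.0, 0.0, -2.0]
print(predict(fitted_beta, held_out_genotypes))
```

Tracking this quantity as a function of sample size, alongside the four estimation and identification measures, would show whether prediction accuracy also changes abruptly at the phase transition. Note that the squared correlation ignores the sign and scale of the predictions, so a measure such as 1 - SSE/SST may be preferable if calibration matters.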

Level of interest: An article of importance in its field

Quality of written English: Acceptable

Statistical review: Yes, and I have assessed the statistics in my report.

Declaration of competing interests: I declare that I have no competing interests

Source

    © 2014 the Reviewer (CC-BY 3.0 - source).

References

    Vattikuti, S., Lee, J. J., Chang, C. C., Hsu, S. D. H., Chow, C. C. 2014. Applying compressed sensing to genome-wide association studies. GigaScience.