Content of review 1, reviewed on March 13, 2018

In "BLINK: A Package for Next Level of Genome Wide Association Studies with Both Individuals and Markers in Millions" by Huang et al., the authors provide a detailed background on the evolution of GWAS approaches that improve computation time and statistical power. They then present a new algorithm, called BLINK, that addresses both issues. BLINK is built on their previous GWAS implementation, FarmCPU, but replaces REML with a BIC algorithm and includes LD information to remove the previous assumption that QTNs are evenly distributed across the genome (which they are not!). The article addresses an ongoing scientific problem of great importance.

Major. The datasets used for testing are not clearly explained. There is no description of the mouse and pig datasets, for example. At the very least, a brief description is warranted so that the reader does not have to read another paper to get any sense of genetic context. Which maize dataset was used? Is this the NAM population? If so, it needs to be made clearer how this population was constructed. The human dataset needs more context. Are these Asian Americans? What is the population structure in a general sense?

Why was the human dataset replicated in such a severe manner? Each population group is amplified perfectly up to 10 times to make a bigger dataset. Wouldn't it be "better" to amplify with variation?

The authors compare two of their workflows (BLINK and FarmCPU) against PLINK. The authors clearly explain that there are newer and more powerful options than PLINK. Please provide a rationale as to why only PLINK was compared.

P8L4-5 Why does PLINK exhibit "strongly inflated P values"? Is there evidence to support this statement?

P8L10-13 I don't understand why BLINK is better than FarmCPU as it detected 9/14 QTNs. Does this imply that FarmCPU (upon which BLINK is built) is ineffective? Also, were other genes cloned in maize that control flowering time? What is the possible number of flowering time genes that the 3 "true positives" were drawn from? FOAM found "1003 genes". Did BLINK associate any of these loci?

The compute time discussions do not provide sufficient descriptions of resources to be of any meaning. Define things like the CPU that contained a "core(s)". e.g. Was the "single-core CPU" on P10 an 8088 or an I7? Which Mac Pro? Which Linux kernels and distros were tested? Etc.

Minor. Please change the title to explain "millions". Do you need to rearrange or insert "in the" before "millions"?

P6L12 What multiple test correction was used? You might want to define what you mean by LD in this paragraph.

The authors might want to remove the "Big Data" phrase. This is not really a big data study.

P9L4 Change "FAOM" to "FOAM".

P9L1 Why was +-50KB chosen as a flowering time gene interval? Change "upper" to "up".

Fig3. What are the axes on the right side insets?

P12L18 I don't know what a PCC of 70% means.

P8L22-28 Is FOAM a product of the authors as it is not cited?

Declaration of competing interests Please complete a declaration of competing interests, considering the following questions: Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future? Do you hold or are you currently applying for any patents relating to the content of the manuscript? Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript? Do you have any other financial competing interests? Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below.
I declare that I have no competing interests.

I agree to the open peer review policy of the journal. I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
I agree to the open peer review policy of the journal.

Authors' response to reviews:

Reviewer #1

I'm primarily qualified to review the comparisons with PLINK. I'll briefly remark that using linkage disequilibrium information to refine FarmCPU's bin method makes sense to me; and I'm not currently qualified to assess whether this use of BIC to replace REML is sound, so it's crucial for another reviewer to make that judgment.

Response: Thank you for reviewing our manuscript. We appreciate your comment that it makes sense to refine our previous FarmCPU method by replacing bins with linkage disequilibrium information in our new method, BLINK.

The computation time comparisons against PLINK 1.9 GLM are clearly fair. (The results could have been made to look better by comparing against PLINK 1.07.) A method that has an implementation that's twice as fast as PLINK 1.9 and also has better support for multicore systems is probably fast enough for 2018. The PLINK 2.0 alpha codebase has some additional improvements to GLM speed which the authors may find to be worth borrowing (single-core speed is >50 times faster than PLINK 1.9 in some cases, and of course it adds the multithreading that was missing from 1.9's basic GLM), but this probably isn't urgent.

Response: Thank you for your comments. You are correct that PLINK 1.9 significantly improved computing speed compared with PLINK 1.07. In our previous study on FarmCPU, we found that PLINK 1.07 was even slower than FarmCPU. In our current study, we found that PLINK 1.9 was faster than FarmCPU and BLINK was faster than PLINK 1.9. It is definitely worth borrowing the GLM improvements from PLINK 2.0 for BLINK, although this is not urgent, as you indicated. We really appreciate your suggestion.

However, the claim that 28% of 397,323 tested maize SNPs had PLINK p-values below the Bonferroni threshold suggests that the PLINK GLM procedure was not used properly. More precisely, (i) while the stated computational complexity for PLINK's GLM procedure included a c^2 term, it is not clear whether top principal components were actually included as covariates in the maize test, even though PLINK 1.9 has a built-in function to compute them, and (ii) it is also unclear what QC steps were performed before the GLM run; in particular, was the sample set pruned of close relationships? While I would expect a suitable fixed-effects model to produce fewer false positives than properly employed PLINK GLM, I've only seen numbers like "112,998/397,323 genome-wide significant" when improper QC and/or lack of principal component covariates were involved.
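The figures quoted in this comment can be checked with a few lines of arithmetic. The sketch below recomputes the fraction of significant SNPs and a Bonferroni cutoff; the family-wise alpha of 0.05 is an assumption, since the comment does not state the level that was used.

```python
# Counts quoted in the reviewer's comment.
N_TESTED = 397_323
N_SIGNIFICANT = 112_998

# Assumption: family-wise error rate of 0.05 (not stated in the review).
ALPHA = 0.05


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff under a Bonferroni correction."""
    return alpha / n_tests


def fraction_significant(n_significant: int, n_tests: int) -> float:
    """Share of tested markers that passed the cutoff."""
    return n_significant / n_tests


threshold = bonferroni_threshold(ALPHA, N_TESTED)
fraction = fraction_significant(N_SIGNIFICANT, N_TESTED)

print(f"Bonferroni threshold: {threshold:.3e}")
print(f"Fraction below threshold: {fraction:.1%}")
```

A fraction of this size is far above what a well-calibrated test would produce, which is why the comment asks about covariates and QC.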

Response: Thank you for pointing out the issue. Great minds think alike. Reviewer #2 also pointed out the same issue. We did not make it very clear whether the inflation of P values was due to inappropriate usage of the PLINK GLM model or due to a limitation of the method itself. First, we used PLINK to conduct PCA (Principal Component Analysis) on all the markers. The first two PCs and their product were fitted as covariates in screening all the markers. Obviously, this did not solve the inflation problem. Second, we did conduct QC before GWAS, including excluding SNPs with MAF less than 0.05. You pointed out a critical issue about pruning samples. We feel it might solve the inflation problem if an appropriate pruning method is identified. In some datasets, particularly from human, elimination of closely related individuals such as full or half sibs could eliminate inflation with a GLM model with either PCs or structure (Q matrix) fitted as covariates. The lines in the maize dataset not only have strong population structure, such as tropical, stiff stalk and non-stiff stalk populations, but also have strong sub (family) structure. As their relationships vary continuously from unrelated to closely related, it is not trivial to set a threshold for pruning. Instead, incorporating their kinship (K) in a mixed linear model with fixed and random effects solved the problem well, as shown by Jianming Yu and colleagues (Nature Genetics, 2005, https://www.nature.com/articles/ng1702). That research demonstrated that GLM suffered from severe inflation of P values even when the Q matrix was fitted as covariates. The Q+K method completely solved the problem with improved statistical power. We have revised our manuscript to make this clearer. Please see the highlighted section on page 7.
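The MAF filter mentioned in this response can be illustrated with a short sketch. It assumes genotypes are coded as 0, 1 or 2 copies of one allele with no missing values; the function names and toy data are hypothetical and are not taken from PLINK or BLINK.

```python
from typing import Dict, List, Sequence


def minor_allele_frequency(genotypes: Sequence[int]) -> float:
    """MAF for one SNP with genotypes coded as 0, 1 or 2 allele copies."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1 - freq)


def filter_by_maf(snps: Dict[str, Sequence[int]], min_maf: float = 0.05) -> List[str]:
    """Names of SNPs whose MAF is at least min_maf."""
    return [name for name, g in snps.items() if minor_allele_frequency(g) >= min_maf]


# Hypothetical toy data: 20 individuals per SNP.
snps = {
    "snp_common": [0, 1, 2, 1, 0, 1, 2, 0, 1, 1] * 2,
    "snp_rare": [0] * 19 + [1],
}
print(filter_by_maf(snps))
```

Here `snp_rare` has a MAF of 1/40 and is dropped at the 0.05 cutoff, while `snp_common` is kept.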

On the flip side, if your method is harder to misuse than most existing methods, that is a legitimate practical advantage.

Response: Thank you for raising this point. It was our objective to develop a method that is hard to misuse. We have revised our manuscript to make this clearer. Please see the last paragraph of the Introduction (highlighted).

In any case, in a situation like this, it is critical to post the scripts, or at least the key PLINK and BLINK command lines, that were used in the comparison. I could not find these in the Supplementary Material or GitHub repositories, though I did find that the demo_data/PLINK/myData.fam file in the BLINK-C repository was not valid (instead of distinct sample IDs, the first two columns were always "0 0", "1 1", or "2 2").

Response: Thank you for your suggestion. A comparison of the PLINK and BLINK command lines has been added to the supplementary material (Table S3). The mistakes in the demo data have also been fixed.
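The .fam problem described by the reviewer (repeated values in the first two columns instead of distinct sample IDs) is easy to screen for. A minimal sketch follows, assuming the usual whitespace-delimited .fam layout in which the first two columns are the family ID and the within-family ID; the helper name and the sample lines are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def duplicate_sample_ids(fam_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (FID, IID) pairs that appear on more than one line."""
    counts = Counter()
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[(fields[0], fields[1])] += 1
    return sorted(pair for pair, n in counts.items() if n > 1)


# Hypothetical lines mimicking the problem described above.
bad = ["0 0 0 0 0 -9", "1 1 0 0 0 -9", "0 0 0 0 0 -9", "2 2 0 0 0 -9"]
good = ["fam1 s1 0 0 0 -9", "fam1 s2 0 0 0 -9"]

print(duplicate_sample_ids(bad))   # [('0', '0')]
print(duplicate_sample_ids(good))  # []
```

An empty result means every sample has a distinct ID pair; any other result lists the pairs that need to be fixed before running the comparison.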

I also find it odd that the only other statistical power comparison was with FarmCPU; there was no comparison against e.g. the most recent version of BOLT-LMM or FaST-LMM. This is partly excused by the presence of such comparisons in the recent FarmCPU paper from the same lab, but they really still should be in here for completeness.

Response: We agree that the whole picture is clearer when all of these are in one place. As you suggested, we have added BOLT-LMM to the comparison for completeness. Please see Fig 2&3, S4-9 Fig.

Reviewer #2

In "BLINK: A Package for Next Level of Genome Wide Association Studies with Both Individuals and Markers in Millions" by Huang et al., the authors provide a detailed background on the evolution of GWAS approaches that improve computation time and statistical power. They then present a new algorithm, called BLINK, that addresses both issues. BLINK is built on their previous GWAS implementation, FarmCPU, but replaces REML with a BIC algorithm and includes LD information to remove the previous assumption that QTNs are evenly distributed across the genome (which they are not!). The article addresses an ongoing scientific problem of great importance.

Response: Thank you for summarizing our findings and their significance to the scientific community. We have addressed each of your remaining concerns. Please read the details of our responses under each of your comments.

Response: Thank you for pointing out the problem. More description of the datasets has been added to the text. The population structure of all the datasets is illustrated in Fig S1.

Why was the human dataset replicated in such a severe manner? Each population group is amplified perfectly up to 10 times to make a bigger dataset. Wouldn't it be "better" to amplify with variation?

Response: Thank you for your suggestion. We did exactly as you suggested. Although the conclusions stay the same, we replaced the original results with the ones you suggested (Fig 5).

Why was the real genotype and phenotype data simulated? Why not test on real data and see if you reproduce previous findings? The authors do this a little bit with the maize analysis. Please explain the rationale for not using real data when it is clearly available. The authors compare two of their workflows (BLINK and FarmCPU) against PLINK. The authors clearly explain that there are newer and more powerful options than PLINK. Please provide a rationale as to why only PLINK was compared.

Response: We are sorry that we were not clear on these issues. First, we did not simulate genotypes. All the genotypes are real. Second, we tested our method on both simulated and real phenotypes, as you noticed. The simulated phenotypes allow us to validate true positives and false negatives because we know where the simulated genes are. The real phenotypes allow us to test whether the new method works on real phenotypes, which may have complexity that we were not able to simulate. To validate the findings from the new method, we conducted the enrichment study. We compared our new method with both GLM, which has the most efficient theoretical computing time, and FarmCPU, which was the most recently developed method with high statistical power (Fig 2 and 3). As also suggested by Reviewer #1, we added BOLT-LMM to the comparison in our revised manuscript. We have revised our manuscript to clarify all the issues you mentioned.

P8L4-5 Why does PLINK exhibit "strongly inflated P values"? Is there evidence to support this statement?

Response: Great minds think alike. We are very glad that you and Reviewer #1 pointed out the same issue. Please read our corresponding response (the third response for Reviewer 1). Please excuse us for avoiding duplication.

Response: BLINK is better than FarmCPU because BLINK eliminates FarmCPU's assumption that genes are evenly distributed across a genome, which is rarely true. From this point of view, FarmCPU is less effective than BLINK. Regarding cloned flowering time genes, ZmCCT, ZCN8, and VGT1 are the three genes cloned so far for maize flowering time. The 3 true positives were from seven candidate genes listed by maizegdb.org. Among the 49 loci detected by BLINK, 12 overlapped with the FOAM 1003 genes (Fig. 4). We have revised our manuscript for clarification. Please see the highlighted section on page 8.

The compute time discussions do not provide sufficient descriptions of resources to be of any meaning. Define things like the CPU that contained a "core(s)". e.g. Was the "single-core CPU" on P10 an 8088 or an I7? Which Mac Pro? Which Linux kernels and distros were tested? Etc.

Response: Details of the test computers and their OS versions have been added (Table S2).

Minor. Please change the title to explain "millions". Do you need to rearrange or insert "in the" before "millions"?

Response: “in the” has been inserted before “millions”.

P6L12 What multiple test correction was used? You might want to define what you mean by LD in this paragraph.

Response: The method and threshold of multiple test correction have been added. LD here means correlation; the text has been revised.

The authors might want to remove the "Big Data" phrase. This is not really a big data study.

Response: “Big data” has been revised to “big dataset”.

P9L4 Change "FAOM" to "FOAM".

Response: This has been revised.

P9L1 Why was +-50KB chosen as a flowering time gene interval? Change "upper" to "up".

Response: Because 100 Kb is a moderate size for a gene interval in maize. “upper” has been changed to “up”.

Fig3. What are the axes on the right side insets?

Response: Observed and expected -log10 P value.

P12L18 I don't know what a PCC of 70% means.

Response: It refers to the threshold for filtering correlated SNPs. This part has been edited to make it clearer.

P8L22-28 Is FOAM a product of the authors as it is not cited?

Response: FOAM has been correctly cited.

Reviewer #3 (David Ries, PhD)

Review for "BLINK: A Package for Next Level of Genome Wide Association Studies with both Individuals and Markers in Millions" by Huang et al., submitted to GigaScience. In this publication, the authors describe a method to identify phenotype-controlling traits by applying genome-wide association studies (GWAS) to large sets of phenotype-associated marker data. The method consists of an algorithm named Bayesian-information and Linkage-disequilibrium Iteratively Nested Keyway (BLINK), which improves upon an already published algorithm, FarmCPU, by replacing the computationally expensive random effect model (REM) with a computationally efficient fixed effect model (FEM) and by eliminating the assumption of randomly distributed Quantitative Trait Nucleotides (QTNs) through the introduction of linkage disequilibrium (LD) information. Furthermore, a command line program (i.e. the implementation of the named algorithm) for applying the algorithm to suitable data is presented and tested. The two advantages, in comparison to other presented methods, lie in a better ratio of false positives to false negatives among the identified phenotype-associated markers, as well as a significant improvement in computing time. To validate the reduction in false positives and false negatives, the authors tested BLINK on already published datasets where an optimal outcome is known. Computing time was tested on synthetic datasets of different sizes and compared to two alternative published methods. Applying GWAS to identify genomic regions correlated to phenotypic traits has become a standard. Nonetheless, the ever-increasing sizes and numbers of genotyping datasets and analyzed genomes make computational methods necessary that can cope with large datasets and reduce the amount of biological follow-up analysis.
Therefore, a package like BLINK, which allows GWAS to be performed on large datasets on a standard desktop computer in a short time while also increasing the power of the analysis, is a valuable contribution to the work of scientists in many fields.

Although I find the idea, implementation and proof of concept compelling, in my opinion there are some essential points that need to be improved upon:

Response: Thank you, Dr. Ries, for precisely summarizing our findings and their significance to our research community. We have addressed each of your remaining concerns. Please read the details of our responses under each of your comments.

Major revision: The authors used published datasets to test and validate their algorithm. Some of these datasets are hard to find, starting from the information given in the paper. For example, I tried to redo the analysis for the maize flowering time data. A URL was given, which I followed. There were a number of datasets to be found and it was not possible for me to find out which one was used in the paper, at least not without some detective work. An accession number or name of the dataset should be included in the manuscript, especially since this is required by the "Editorial policies & reporting standards" of GigaScience.

Response: Again, great minds think alike! Thank you for pointing out the same issue that Reviewer #2 pointed out. We have added more description of the datasets.

The same goes for the synthetic dataset to test the computing time of BLINK. This dataset should be made accessible as well as the code that was used to create this dataset.

Response: A function for creating the synthetic dataset has been added to BLINK so that a reader can generate it. R demo code that uses the BLINK demo data to generate the synthetic dataset is provided on GitHub (https://github.com/Menggg/BLINK/blob/master/synthetic_genotype_data.R). Please read the highlighted section on page 15.

The results section "Association and enrichment on real phenotype" describes how BLINK outperforms two published methods. Specifically, it is stated that BLINK, applied to a published dataset, identifies flowering time associated SNPs more accurately and more exhaustively. I don't see this from the data shown in Figure 3. Although it is true that BLINK (and FarmCPU) have a superior false positive rate in comparison to PLINK, and BLINK does identify SNPs in close proximity to the cloned genes, BLINK also presents a considerable number of unrelated SNPs with high p-values. It cannot a priori be expected that these SNPs are not false positives, and in that case BLINK would be less powerful than FarmCPU. This should be discussed and the results should be substantiated further. As I understand it, the analysis of the second dataset (the original study should be cited here) is intended to further explain how most of the SNPs identified by BLINK are actually correlated to the phenotype. This point needs, in my opinion, to be made much clearer.

Response: We have edited the “Association and enrichment on real phenotype” section for more clarification. In the GWAS on maize flowering time, BLINK detected 49 SNPs and FarmCPU detected 14 SNPs. Nine SNPs overlapped between BLINK and FarmCPU. BLINK detected 40 SNPs that were not detected by FarmCPU. FarmCPU detected 5 SNPs that were not detected by BLINK. An important question is whether there is support from another study (FOAM) for the 9 common SNPs, the 40 BLINK unique SNPs and the 5 FarmCPU unique SNPs. One of the 5 FarmCPU unique SNPs overlapped with FOAM genes. However, among the 9 common SNPs, 4 overlapped with FOAM genes. The chance of having 5 or more overlapping SNPs is less than 1% if the 9 SNPs are randomly selected from a genome, which suggests support from the FOAM study. Similarly, among the 40 BLINK unique SNPs, 8 overlapped with FOAM genes. The chance of having 8 or more overlapping SNPs is less than 3% if the 40 SNPs are randomly selected from a genome, which again suggests support from the FOAM study. Please read the highlighted paragraphs on pages 8 and 9.
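The enrichment argument in this response is a tail probability: how likely is it that k or more of n randomly placed SNPs fall in FOAM gene intervals? A minimal sketch with a binomial model follows. The per-SNP overlap probability `p_overlap` is a hypothetical placeholder, and the binomial model itself is an assumption; the response states neither the value nor the exact test the authors used.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical chance that one randomly placed SNP overlaps a FOAM gene
# interval; the real value depends on how much of the genome those
# intervals cover and is not given in the response.
p_overlap = 0.1

# (number of SNPs, number overlapping FOAM genes) as quoted in the response.
for n_snps, n_overlap in [(9, 4), (40, 8)]:
    tail = binomial_tail(n_snps, n_overlap, p_overlap)
    print(f"n={n_snps}, k>={n_overlap}: P = {tail:.4f}")
```

The printed values depend entirely on the assumed `p_overlap`, so they should not be read as a reproduction of the percentages quoted in the response.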

Minor revision: Redundancy in the methods section "Genotype and phenotype Data" makes this section tedious to read. In "Power, Type I error and FDR", the authors state that they used varying bin sizes, but only report results for 10 KB bin sizes. It would be informative to also show the results of the other bin sizes.

Response: The content of the “Genotype and phenotype data” section has been edited. The power, type I error and FDR results are now reported for different bin sizes.

Parts of "BLINK Procedure" are copied from the "IDEA" section. Maybe this redundancy can be removed by merging "BLINK Procedure" with "IDEA".

Response: The IDEA part has been merged with the BLINK Procedure part.

Many points that are made in the last paragraph of the "Limitations" section are not limitations at all and should thus not be stated there. The first paragraph of this section contains valuable information about how the algorithm iteratively decides on a p-value for SNP selection. This information should go to the Methods section, where the functionality of BLINK is described.

Response: The first paragraph has been moved to the Methods section. The “Limitation” section has been renamed.

In the results section "Observed computing time", the authors state that they tested parallelization on computer systems, including Linux and Mac. This implies that further systems have been tested. Results for these systems are not shown, so they should not be mentioned.

Response: This statement has been removed.

A workflow describing the analysis of the datasets would be helpful, including the command-line commands and parameters used.

Response: The R code and command line code have been added as a supplementary file (Table S2).

Thus, I recommend revision for this manuscript, due to the data availability, as well as improved clarity of some parts. I would also recommend that a reviewer with a stronger background in statistics check the theoretical background.

Response: Dr. Ries, thank you for your recommendations, which improved the readability of our manuscript.

Content of review 2, reviewed on June 26, 2018

The authors have addressed my primary concerns. Please double check for typos. Here are two examples:

ABSTRACT: "Big dataset" tense is wrong in several places in the abstract. P2L12 "belonging sub-populations" is missing a "to"

Authors' response to reviews: (https://drive.google.com/open?id=1jtJ40A4LiNtIcgP4swEVF8h9p7DxeyfX)

References

Huang, M., Liu, X., Zhou, Y., Summers, R. M., Zhang, Z. BLINK: a package for the next level of genome-wide association studies with both individuals and markers in the millions. GigaScience.

Pre-publication Review of

BLINK: a package for the next level of genome-wide association studies with both individuals and markers in the millions

Reviewed On March 13, 2018, and June 26, 2018
