This note tackles an important, interesting and challenging question, that of obtaining significance thresholds for phenotypes mapped in populations with unequal relatedness. The work reported here is of high quality. The results provide interesting insights into the problem, and raise questions that will require further work to be answered.
The simulation study performed for this note shows that, in a population that includes individuals with different levels of relatedness, permutations can provide reliable significance thresholds if the phenotypes are mapped with a statistical model that takes polygenic variation into account. This is demonstrated on phenotypes simulated to have polygenic variation and normally distributed errors. The authors further investigate the robustness of their finding by simulating phenotypes with non-normally distributed errors, and by simulating different relatedness patterns and allele frequencies.
The demonstration that permutations can provide reliable thresholds when the phenotype is mapped with a model that takes polygenic variation into account is conceptually interesting, and the deduction that the permutations can be analysed with a simple linear model with no random polygenic term is well done too. However, it seems to me that the main interest of this strategy resides in its applicability to real data. Since the authors investigate the robustness of their finding across different scenarios, I believe they share this view. If so, I reiterate and clarify the main comment of my previous review (1) and give a suggestion (2):
(1) Hereafter I will call “larger effect QTLs” QTLs with effect sizes large enough to be detectable, in this study explaining 2-5% of the phenotypic variation (based on the effect size chosen for the power study). By “larger effect QTLs” I do not mean major QTLs such as the MHC for immune phenotypes. Confounding from “larger effect QTLs” in structured populations is established (see, for example, Valdar et al. and Segura et al.). It is not clear, though, how well mixed models, which primarily account for polygenic variation, correct for the confounding due to correlations with “larger effect QTLs”. Therefore, it is not clear whether permutations will provide reliable significance thresholds for phenotypes with such QTLs. Since phenotypes with “larger effect QTLs” are the type of phenotypes one can hope to identify QTLs for, it is important to consider their case and establish whether the strategy suggested by the authors is appropriate.
In my previous review, I requested that the authors, in their study of the type I error rate and power, simulate phenotypes with “larger effect QTLs” in addition to polygenic variation. I acknowledged that this would require new simulations (possibly with different definitions of true and false positives, type I error rate and power, to accommodate simulation and detection of QTLs on the same chromosome, where confounding is worst). If the authors decided not to undertake this, I suggested they add a comment to warn the reader of the limitations of their study and therefore of the generality of their conclusion. The authors added the following comment in the discussion: “In addition, a major QTL may result in false positives due to uncontrolled confounding between the QTL and a scanning locus. In such a case,
incorporating major QTLs as covariates in the model may address this concern (Valdar et al., 2009; Segura et al., 2012)”. This comment addresses the case of a major QTL whose presence is evident, in which case it can easily be included as a covariate in the model. That was not the point of my review, which may not have been precise enough about effect sizes. I suggest the comment in the discussion be modified to acknowledge that mixed models might not correct for confounding from “larger effect QTLs” well enough for permutations to provide reliable thresholds.
(please note: “larger effect QTLs” is not a suggestion of phrasing for the note, but I could not find a better one)
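To make the kind of confounding I have in mind concrete, here is a minimal sketch (all numbers are placeholders of my own, not values from the manuscript): a scanning marker correlated with a “larger effect” QTL picks up the QTL’s signal even though the marker itself has no effect, and conditioning on the QTL genotype removes the inflation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: a "larger effect" QTL and a scanning marker that
# shares 80% of its genotypes with it (hence is strongly correlated).
g_qtl = rng.binomial(2, 0.5, n).astype(float)
shared = rng.random(n) < 0.8
g_scan = np.where(shared, g_qtl, rng.binomial(2, 0.5, n))

# Phenotype driven by the QTL only; the scanning marker has no effect.
y = 0.5 * g_qtl + rng.normal(0.0, 1.0, n)

def t_stat(y, x, covar=None):
    """t-statistic of x in an OLS fit of y on an intercept,
    an optional covariate, and x."""
    cols = [np.ones(len(y))]
    if covar is not None:
        cols.append(covar)
    cols.append(x)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[-1, -1])
    return beta[-1] / se

t_unadjusted = t_stat(y, g_scan)             # inflated by the linked QTL
t_adjusted = t_stat(y, g_scan, covar=g_qtl)  # QTL as covariate absorbs it
```

In this toy setup the unadjusted statistic at the scanning marker is large despite the marker having no direct effect. Whether a polygenic random term alone achieves a comparable correction, without the QTL as an explicit covariate, is exactly the open question.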
(2) In my previous review I also suggested (minor point) that the authors investigate larger sibships to support their claim that permutations are appropriate under different population structures. The results they provide in their response show that permutations control the type I error rate correctly at 2 out of 3 significance thresholds. An interesting addition to the manuscript is an analysis with different allele frequencies at the founder generation. In that case permutations also fail to control the false positive rate at 2 out of 3 significance thresholds, and things get worse when the residuals are non-normally distributed. “Fail” here means that the type I error rate is significantly different (p-value < 0.05 for the failures mentioned above) from the expected level. However, it is difficult to get an idea of the magnitude of the problem, because the results are scattered between Table 2S (impact of simulating non-normally distributed residuals), Table 1 (DAF), and the response to my review (sibship size). I suggest the results from the simplest scenario (polygenic variation + normally distributed errors with sigma in {0.7, 1, 1.5}) be reported in one paragraph and in Figure 1, and that the results for the alternative scenarios (non-normally distributed errors, DAF, different population structures) be gathered in another paragraph, so that the robustness of the strategy can be examined easily. This is a minor reorganization and should not make the note longer. Simplifying Figure 1 by considering only the normally distributed residuals would also make it easier to read (less dense). For comparison purposes, the standard deviation of the residuals (or the proportion of phenotypic variation explained by the polygenic term) used for Table 1 should be indicated.
Other points raised in my previous review:
- In my previous review (minor point), I also wished to get an idea of the complexity and organization of the relatedness pattern in the population. The authors provided a heat map of the relatedness for 100 animals. However, I could not relate the heat map to the pedigree, since there was no information about which animals were siblings/half-sibs/cousins etc., nor could I see the magnitude of the differences in relatedness, because no scale was provided. The heat map looks quite homogeneously green.
- My point on non-normally distributed phenotypes and their simulation through non-normally distributed residuals is indeed out of the scope of this note, so I won’t discuss it further here.
New important point:
The fact that the significance thresholds obtained by parametric bootstrap and unrestricted permutations are very similar to each other, yet very different from those obtained with the other two methods (when the phenotypes are not mapped with mixed models), is puzzling, as is the fact that the thresholds obtained by parametric bootstrap are very different from those expected (again, when the phenotypes are not mapped with mixed models). I previously, and wrongly, assumed that the model for bootstrapping was a simple sibship model rather than one based on the full pedigree. While the results were not too surprising to me under that assumption, they are now. I explain below why these results are surprising to me:
The phenotypes simulated by parametric bootstrap to calculate significance thresholds are simulated in the same way as the phenotypes to be analysed, except for their polygenic component. Indeed, the polygenic component of the former is drawn from a multivariate normal distribution with covariance matrix G (from the pedigree), while the polygenic component of the latter, as I understand from the supplemental section “Simulation details”, was simulated by adding two components: one drawn from a multivariate normal distribution with covariance matrix G, and one simulated from one marker in every five on chromosomes 11 to 20. The use of both components to simulate polygenic variation is unusual (I believe it was not used in Cheng et al., Genetics 2010), and, in light of the failure of parametric bootstrap to provide adequate thresholds (see below), might not be without consequences. It should therefore be explained. It also follows that the sentence “The phenotype was generated such that polygenic variation approximately accounted for 56%, 46%, or 32% of the total phenotypic variation” (page 2 line 5 main text) is not quite right (these figures cover only the first component).
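To make my reading of the two-component simulation concrete, here is a minimal sketch; the pedigree (replaced by a simple block sibship structure), the marker counts, and the effect sizes are all placeholders of my own, not values from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_poly_markers = 200, 40

# Hypothetical stand-in for the pedigree relationship matrix G:
# 20 sibships of 10, relatedness 0.5 within a sibship.
block = np.full((10, 10), 0.5) + 0.5 * np.eye(10)
G = np.kron(np.eye(20), block)

sigma_g, sigma_e = 1.0, 1.0

# Component 1: polygenic effect drawn from MVN(0, sigma_g^2 * G).
u = np.linalg.cholesky(sigma_g**2 * G) @ rng.normal(size=n)

# Component 2: many small "polygenic QTLs", one effect per simulated
# marker (standing in for one marker in five on chromosomes 11-20).
markers = rng.binomial(2, 0.5, size=(n, n_poly_markers)).astype(float)
beta = rng.normal(0.0, 0.1, n_poly_markers)
q = markers @ beta

y = u + q + rng.normal(0.0, sigma_e, n)

# The variance share of the MVN component alone understates the total
# polygenic contribution once the marker-based QTLs are added.
share_mvn = np.var(u) / np.var(y)
share_total = np.var(u + q) / np.var(y)
```

The point of the sketch is the last two lines: the bootstrap model draws its polygenic component as `u` alone, while the analysed phenotypes contain `u + q`, and the quoted variance percentages describe only `u`.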
The inability of parametric bootstrap to control for false positives therefore seems to reside in the modeling of polygenic variation. The authors should attempt to explain the failure of parametric bootstrap (the sentence “The permutation (or bootstrap) test largely dissolves the confounding” is not satisfactory). I suggest the authors investigate the correlation between genotypic similarity (based on the markers used to simulate polygenic QTLs) and the coefficients of G, so that, if the correlation is poor, the failure of parametric bootstrap can be attributed to poor modeling of the simulated polygenic variation by the coefficients of G (which could be due to the markers not capturing relatedness correctly, or to the coefficients being inaccurate).
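The diagnostic I have in mind could be as simple as the following sketch (the GRM construction shown is one common choice, not necessarily the one the authors would use): restrict `markers` to the markers used to simulate the polygenic QTLs, and correlate the resulting similarity coefficients with those of G.

```python
import numpy as np

def grm_pedigree_correlation(markers, G):
    """Correlation between the off-diagonal entries of a marker-based
    genomic relationship matrix and those of the pedigree-based G.
    `markers` is an (individuals x markers) array of 0/1/2 genotypes."""
    p = markers.mean(axis=0) / 2.0                   # allele frequencies
    Z = (markers - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))
    K = Z @ Z.T / markers.shape[1]                   # VanRaden-style GRM
    iu = np.triu_indices_from(G, k=1)                # off-diagonal pairs
    return np.corrcoef(K[iu], G[iu])[0, 1]
```

A low correlation would indicate that G poorly captures the similarity actually induced by the simulated polygenic QTLs, which would go some way toward explaining the bootstrap thresholds.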
New minor points:
- The sentence “The take-home message is that if the model is appropriate for a genome-wide scan, we may ignore the random polygenic effect to reduce computation when performing permutation tests to estimate the significance threshold.” is not clear or precise enough: if the phenotype is mapped with a model that appropriately controls for confounding, the permutations can be analysed with a model with no random polygenic term. Since this interesting point is stated only once, it really needs to be clear; otherwise the reader might miss it through misunderstanding (as I did on my first reading of this note).
- The section “Simulation Details” is not structured or precise enough: the fact that the genotype sets are simulated by gene dropping should come after the choice of the number of chromosomes and markers, and before moving on to the phenotypes. Since it is said that equation (1) is used to generate the phenotypes, it should be indicated whether covariate effects were simulated or not, and indicating when a detectable QTL was simulated (i.e. for the power study but not for the study of the type I error rates) would make things clearer. Finally, it should be stressed that a second component of polygenic variation (in addition to the random term with G as covariance matrix) was added by simulating polygenic QTLs.
- It should be made clear in the main text that power is misleading unless the type I error rate is controlled. Looking at Figure 1 and Table 1, the reader will pay attention to the values for power no matter what the type I error rate is.
- “Computational approximation” part: why “computational”? Also, this part is lengthy in its explanation of why the variance components need to be estimated, and the important point that the variance components are estimated under the null model does not stand out. This paragraph would be best placed in context at the end of the “Statistical Model” part.
- Bootstrap part of the supplemental material: it is not clear what model is fitted to estimate the variance components when the phenotypes studied are simulated with a detectable QTL (power study): is the detectable QTL part of the model or not? The phrase “Under the hypothesis of no QTL” is ambiguous. Moreover, I do not understand the following sentence: “We can generate a bootstrap sample in a similar way when polygenic variation is ignored”.
- I believe the “pooling procedure” part is much less important and interesting than other points that could be detailed in the supplemental material, and so could be omitted if space were an issue there as well.
- Page 2, line 17: “heritability of the QTL” should be changed to “proportion of phenotypic variation explained”, as discussed with reviewer #2.
- Page 3, line 1: “different allele (A/a) frequencies at the founder generation: 3/1 in F26 vs 1/3 in F34”. At the founder generation, or at F26 and F34?
- Suppl. material, page 3, Permutation tests: at the end of paragraph 2, it could be repeated that unrestricted permutations were used for this note.
- It could be indicated in the title or legend of Figure 1 that it corresponds to the simulations with DAF.
- Figure 1: “relatedness (not) ignored” instead of “relateness (not) ignored”.