Bayesian hierarchical models allow ecologists to account for uncertainty and make inference at multiple scales. However, hierarchical models are often computationally intensive to fit, especially with large datasets, and researchers face trade-offs between capturing ecological complexity in statistical models and implementing these models.We present a recursive Bayesian computing (RB) method that can be used to fit Bayesian models efficiently in sequential MCMC stages to ease computation and streamline hierarchical inference. We also introduce transformation-assisted RB (TARB) to create unsupervised MCMC algorithms and improve interpretability of parameters. We demonstrate TARB by fitting a hierarchical animal movement model to obtain inference about individual- and population-level migratory characteristics.Our recursive procedure reduced computation time for fitting our hierarchical movement model by half compared to fitting the model with a single MCMC algorithm. We obtained the same inference fitting our model using TARB as we obtained fitting the model with a single algorithm.For complex ecological statistical models, like those for animal movement, multi-species systems, or large spatial and temporal scales, the computational demands of fitting models with conventional computing techniques can limit model specification, thus hindering scientific discovery. Transformation-assisted RB is one of the most accessible methods for reducing these limitations, enabling us to implement new statistical models and advance our understanding of complex ecological phenomena.

# Hierarchical computing for hierarchical models in ecology

##### Review badges

**4 pre-pub reviews

**0 post-pub reviews

# Hierarchical computing for hierarchical models in ecology

Published in Methods in Ecology and Evolution on November 05, 2020

##### Web of Science (Free Access)

##### Abstract

##### Authors

McCaslin, Hanna M.; Feuka, Abigail B.; Hooten, Mevin B.

##### Publons users who've claimed - I am an author

No Publons users have claimed this paper.

##### Contributors on Publons

- 1 reviewer

- Contribute
- pre-publication peer review (FINAL ROUND)
****##### Decision Letter

2020/09/2424-Sep-2020

MEE-20-06-476.R1 Hierarchical computing for hierarchical models in ecology

Dear Ms Hanna McCaslin,

It is a pleasure to accept your manuscript entitled "Hierarchical computing for hierarchical models in ecology" in its current form for publication in Methods in Ecology and Evolution. The comments of the reviewers who reviewed your manuscript are included below. Final instructions for your manuscript, and some promotion options, can be found at the end of this email.

Thank you for your fine contribution. On behalf of all Editors of Methods in Ecology and Evolution, I look forward to your continued contributions to the Journal.

Sincerely,

Professor Robert B. O'Hara

Senior Editor, Methods in Ecology and EvolutionReply to:

Ms India Stephenson

Methods in Ecology and Evolution Editorial Office

coordinator@methodsinecologyandevolution.orgWhy not become a member of the British Ecological Society? https://www.britishecologicalsociety.org/jointhebes

Associate Editor Comments to Author:

Associate Editor

Comments to the Author:

Thank you for making the changes - the reviewers and I are all happy, and I think this is much clearer.I would suggest one small change around l71: I think you should be clear that the example is a simple example to help explain the process (and that it could, of course, be fitted in a single MCMC run, but you want to use it to illustrate TARB).

Reviewer(s)' Comments to Author:

Reviewer: 1Comments to the Corresponding Author

The authors have improved their previous draft, and I have only minor suggestions to add.

Lines 78-83: Explanations for the hyperparameters in equation 3 and 4 are missing. “IG” is an inverse Gamma prior?

Lines 81-83: An explanation/reference should be added why this model cannot be sampled using Gibbs updates.

Line 90: Why would you still need to tune the updates for each logit(pj) by hand in the first stage? I think readers would appreciate a short explanation here.

Figure 1: Please check: in panel A, “Data” do not represent the hierarchical structure y_i_j. Suggest to either use the two relevant index in panel A or clarify in the legend that you refer to partitioned data in both panel A and B.

Lines 105-106: Please check, I found the reference to your cheatgrass example her a bit out of context.

Line 218 Delete redundant “the”.

Line 248-253: It is still difficult to understand how the computation gain can be explained, as you run the first stage of the recursive approach on 8 and 15 cores compared to the single algorithm approach? Feasible to add computation time for the recursive approach on a single core?Reviewer: 2

Comments to the Corresponding Author

The authors have responded to my earlier critiques appropriately, and I have no further comments.##### Decision letter by

##### Cite this decision letter

##### Reviewer report

2020/09/17The authors have responded to my earlier critiques appropriately, and I have no further comments.

##### Reviewed by

##### Cite this review

##### Reviewer report

2020/09/14The authors have improved their previous draft, and I have only minor suggestions to add.

Lines 78-83: Explanations for the hyperparameters in equation 3 and 4 are missing. “IG” is an inverse Gamma prior?

Lines 81-83: An explanation/reference should be added why this model cannot be sampled using Gibbs updates.

Line 90: Why would you still need to tune the updates for each logit(pj) by hand in the first stage? I think readers would appreciate a short explanation here.

Figure 1: Please check: in panel A, “Data” do not represent the hierarchical structure y_i_j. Suggest to either use the two relevant index in panel A or clarify in the legend that you refer to partitioned data in both panel A and B.

Lines 105-106: Please check, I found the reference to your cheatgrass example her a bit out of context.

Line 218 Delete redundant “the”.

Line 248-253: It is still difficult to understand how the computation gain can be explained, as you run the first stage of the recursive approach on 8 and 15 cores compared to the single algorithm approach? Feasible to add computation time for the recursive approach on a single core?##### Reviewed by

##### Cite this review

##### Author Response

2020/09/01Dear Editor,

Please find our revised manuscript entitled “Hierarchical computing for hierarchical models in ecology ” submitted for review in Methods in Ecology and Evolution.

We appreciate the thoughtful reviews and helpful comments on the previous version of our manuscript. We have revised our manuscript to be more accessible to ecologists and have strengthened our justification and explanation of the approach, including incorporating an example and sample code to illustrate our approach to a broader audience. Below please find reviewer comments and our responses.

Thank you for considering this manuscript. We look forward to hearing from you.

Sincerely,

Hanna McCaslin, Abigail Feuka, and Mevin HootenAssociate Editor Comments to the Author:

I think the topic is interesting, and this can become a really interesting manuscript. But, based on the reviewers' comments, I think it needs more work to make it accessible. I agree with reviewer 1 that the justification for using TARB rather than RB isn't clear. As far as I can see, the acceptance probability just requires calculating densities at the points in the parameter space, i.e. you just need to plug in the values in eqn. 6. So I'm not sure why these need to be analytically tractable. I assume I'm missing something.

** Authors’ response: To calculate the acceptance probability, we need to evaluate the PDF of \theta, rather than just sample from this distribution. Obtaining a sample from the posterior distribution of \theta is indeed trivial given that we have a sample from the posterior of g(\theta). However, to calculate the densities in eqn. 6, we need the PDF of \theta, which must be calculated as shown in eqn. 7. This is a common pitfall associated with MCMC and transformation, and we have added a discussion of this distinction and why it is important to this part of the manuscript (lines 152-155).Some other comments:

1. it might help to have a familiar example (e.g. a simple 2-level GLMM) as an example, so the reader can follow the use of TARB without having to spend a lot of effort understanding the example. I think it's fine if this is a toy example. It could also be used to explore how well the methods work (which the reviewers are concerned about).

** Authors’ response: Thanks for this feedback; based on this suggestion and that of the reviewers’, we have added a simple example in the introduction and methods to motivate and illustrate TARB, and have added a short tutorial for implementing TARB based on this example as supplement.At times the description could be clearer if you distinguish the densities from the first and second stage models. In particular, there are two densities for theta_j.

** Authors’ response: We have revised the methods and application to be more clear about which stage and which posterior distribution we are referring to.It might also help to describe [theta_j] in the first level models as a pseudo-prior, to emphasise the point that it isn't a true prior because it will be integrated out. This terminology is used in a slightly different context (e.g. Carlin & Chib, DOI: 10.1111/j.2517-6161.1995.tb02042.x), but with a similar aim.

** Authors’ response: To avoid confusion with the Carlin and Chib terminology, we use “temporary prior” because it is just a placeholder that facilitates the proposals in the second stage.

Reviewer(s)' Comments to Author:

Reviewer: 1Comments to the Corresponding Author

The study “Hierarchical computing for hierarchical models in ecology” applies recursive Bayesian computation to an ecological data set and aims to make this computational tool accessible to ecologists. This is a very timely study and the topic should be of much interest to a broad audience of ecologists that struggle to fit hierarchical models to large and complex data sets because of computational constraints. I found this an impressive and well written manuscript. However, working at the interface of field ecology and hierarchical modelling myself, I found myself spending a relatively long time reading some details, yet I was not fully convinced after reading whether this approach is indeed readily applicable to a large range of problems. I thus wonder whether some more explanations would be helpful to make the work more accessible to a broad audience. Perhaps the TARB approach introduced from line 113 onwards can be better explained if adding some more details about assumptions etc and perhaps also linking this to Bayes filter ecologist might be more familiar with?

**Authors’ response: We have added an overview of the recursive Bayesian computing approach in the introduction and have also provided a simple example that will help readers understand the concept more quickly. We return to the simple example in the beginning of the methods to illuminate the procedure benefit of the TARB approach. We also focused much of our revision on illustrating why TARB is advantageous over RB and have added more discussion to show how it is applicable to a large range of problems. In our experience, most ecologists who use Bayesian statistics are aware that the posterior from a previous analysis can be used as the prior for an analysis based on new data. However, this is rarely done in practice nor while adhering to the exact first stage posterior. Hooten et al. (2020) go into this concept in detail, so we don’t dwell on it in this manuscript. Instead, we focus on recursive computing for fitting hierarchical models in a way that requires less supervision of the algorithms.Also, I would find it interesting to learn more about how this approach can be applied to sparse data and unequal sample sizes among groups. How would you for example weight first stage estimates according to partition-level uncertainty in the second stage?

**Authors’ response: Thanks for these questions; this highlights one of the things that is so slick about this method! Unequal sample sizes or sparse partition-level data aren’t a problem for fitting models with this approach. Each partition is treated as an independent entity in the first stage, so they can each have different sample sizes. However, if all else in the model remains the same partition to partition (i.e., the same priors and hyperpriors), then the uncertainty captured by the posterior distributions for each partition-level model will reflect the sample size, for example, stage 1 models fit with fewer data will likely have more uncertainty than those fit with lots of data. Then, when we sample from these posterior distributions in the second stage, this uncertainty will carry through and partition-level parameters with more uncertainty will borrow strength from those with more data. Thus, weighting the first-stage estimates is built into the technique. This is part of what makes it so useful from a meta-analysis or online updating perspective, in which sample sizes will likely be different.Does the TARB approach results in less precision in parameter estimates at either first or second stage? What kind of hierarchical models should not be done with TARB? Perhaps worth to add some discussion here that may foster the understanding of the approach?

**Authors’ response: In general, the first stage parameter estimates will be less precise than in the second stage, because these estimates are not taking into account the shrinkage induced by the population-level parameter estimates. In the second stage of the model fitting process, the individual-level parameter estimates “borrow strength” from the population-level parameters, which will tend to shrink the parameter estimates. As TARB is just an efficient approach to fitting hierarchical models, rather than a class of models, the fit isn’t considered complete until both stages have been computed, so the stage 1 parameter estimates can be viewed as intermediate results, but should not be used for inference. We included them in Fig. 1 only to illustrate the shrinkage that occurs between the 2 stages. We’ve clarified this point in the manuscript. Also, we agree that it is useful to include a discussion of when TARB is not appropriate, and we have incorporated this into our discussion (lines 340-351).I much appreciate the amount of work that apparently went into this manuscript. The authors have chosen a relatively complex movement model for demonstration while providing some other helpful examples of ecological data previously analysed with hierarchical models. If feasible, I think that providing some TARB model code and output (perhaps as a supplementary tutorial) for one of the most accessible data sets such as the harbor seal counts would make the approach much more accessible to a broad audience.

Overall, an interesting and stimulating manuscript!

**Authors’ response: Thank you for the suggestion. We have added this supplementary piece for the new presence/absence example and have provided code for this example and our movement application in a Github repository accessed by the link in the acknowledgements.SPECIFIC COMMENTS:

Line 12: Can you provide more details for the statement “reduced computation time for fitting our hierarchical movement model by half”? If the initial model is fitted with an extremely slow conventional MCMC algorithms, half the time might be of less interest than in the initial model is state-of-the-art modelling?

**Authors’ response: While improved computational efficiency often occurs when using TARB, it is most helpful in reducing the tuning associated with MCMC algorithms. This is especially true for large hierarchical models where one might usually have to individually tune dozens or hundreds of individual-level parameters to achieve convergence. Therefore, TARB may allow the user to fit models that would otherwise be challenging to fit using conventional MCMC or automated software for which they have no control over the tuning. We have added a discussion of TARB compared to other state-of-the-art approaches for computing hierarchical models (lines 328-339).Line 27-29: Would it be possible to explain in few words or with a short example how individual telemetry data related to an species-level parameter of interest?

**Authors’ response: We have added a short example to clarify how telemetry data connect to a population-level parameter of interest.Line 55-58: Recursive computing is generally associated with sequential data but some of the hierarchical model application in ecology you mention above are not fitted to sequential data? Perhaps worth to clarify whether your approach is limited to sequential data or not?

**Authors’ response: Thank you, we have made this clarification in the manuscript – our approach is not limited to sequential data, but rather can be applied to any data that can be partitioned so that the individual partitions are independent conditional on the process and parameter models.Line 66: The examples (“(e.g., partitioned by individuals, sites, or species)”) are redundant with

** Authors’ response: (now line 104) We have reorganized how we discuss the data partitions in the introduction and methods and have removed this redundancyLines 68: Replace “population-level parameters” with “group-level parameters” (simply to avoid confusion around the word ‘population’?). Perhaps worth to consider also introducing the term ‘hyperparameter’ here?

** Authors’ response: (now line 102) We have updated “population-level” to “group-level” when speaking about hierarchical models in general, to match the usage of Gelman & Hill 2006 (Data analysis using regression and multilevel/hierarchical models). The most appropriate wording for this level of the model structure depends on the application, so in some places, we use more specific terms like “study-wide” parameters for the group-level parameters, but have been more clear about what we are referring to in these cases. We have found that the use of “hyperparameter” and “hyperprior” can sometimes lead to more confusion, because not everyone interprets these words in the same way, so we have avoided the use of this terminology.Line 79: ”temperature in blue tits”? Unclear.

**Authors’ response: (now line 66) We updated this phrase to be more clear in the context in which we referred to it.Line 94-95: Please check: I found the wording “specifying priors [theta_j] for the partition-level models” somewhat confusing: you specify priors for the partition-level parameter theta_j but perhaps it is not clear to all readers that “[theta_j]” refers to a prior?

**Authors’ response: Thanks for pointing out that our notation here may not be familiar to all readers; we have added a line to clarify that we use brackets to refer to probability distributions and the main reference associated with that notation (line 105).Line 98: “remaining parameters” are all “population-level parameters”?

**Authors’ response: (now line 120) This is correct, and we updated this wording.Line 108: Does “first stage” correspond to ‘partition-level’ terminology as used before? Suggest to make clear somewhere in the text how first/second stage link to partition/population levels.

**Authors’ response: The word “partition” refers to how we split the data rather than a level of the hierarchical model or computing stage. We have updated our use of ‘partition-level’ and ‘group-level’ for consistency throughout.Line 108: Perhaps explain ‘importance update’ in a few words (in parenthesis)?

**Authors’ response: (now line 131) We have added a reference to importance sampling for the interested reader (Geweke 1989, Econometrica, doi: 10.2307/1913710).Lines 113-131 The transformation and justification of the Jacobian prior, i.e. the TARB approach, is in my opinion incompletely described to make this accessible to ecologists and need some more detailed description. Also, shouldn’t aspects such as sequential data and underlying Markov processes be mentioned here? I think it need to be made clear here how the assumptions are met for data that resemble observations from underlying Markov process (e.g, movement records or population fluctuations) versus independent samples (e.g. independent records of species occurrence or behavior in different environments)?

**Authors’ response: We have revised our explanation of transformation and change of variables using the Jacobian (now beginning on line 129) so that it is more accessible for ecologists. The dependence in the data commonly used in hierarchical models is typically accounted for in the latent process (e.g., Markovian temporal dependence) and doesn’t present any issues with assumptions.Line 114: Perhaps some more information are warranted to explain how the transformation function g is determined and why as transformation should be used? Also, worth to mention the vector size here?

**Authors’ response: Thanks for this suggestion. We have rewritten this paragraph (lines 129-146) to more clearly explain g and why using the transformation can be advantageous. Additionally, we removed our incomplete explanation of vector size in reference to g because this detail contributed to this paragraph being difficult to follow, and isn’t essential to conveying the main point of the paragraph.Line 148: Explain I in equation 11?

**Authors’ response: We have added the definition of I as an identity matrix in the text following equation 15, where it first appears (line 189-190).Line 149: What landscape covariates were used in your study?

** Authors’ response: In our example, we treated the latitude of the observation as the only covariate in the potential function. However, we realize that referring to this as a “landscape covariate” is confusing so we have removed the reference to the more general case that could include landscape covariates (line 187).Line 154: Confusing to me and perhaps worth some clarification: shouldn’t ‘delta_p rather than sigma relate to movement speed?

**Authors’ response: Thanks for this question. In SDE models like this one, both delta p and sigma are related to movement speed. We have clarified that the potential function also influences movement speed (line 188).Line 179: For how many days (i.e. sequence length)?

**Authors’ response: We changed the temporal reference to the case study data from “fall 2018” to the specific dates of observations to report the timespan of the telemetry observations (line 179).Line 224: You present posterior estimates for first stage estimates but wouldn’t it be as interesting to present those from the second stage?

**Authors’ response: We present the first stage estimates alongside the second stage estimates for the partition-level parameters in Fig. 1 to show the shrinkage in these parameters induced by the group-level parameters, not to suggest that these were the estimates relevant for inference. We also have updated the manuscript to more clearly distinguish between the individual-level estimates and first-stage posteriors to clarify between these concepts (e.g., line 258).Legend Figure 1: Check "n = 15" and "n = 1675 telemetry locations from J = 15". Line 224:

**Authors’ response: Thanks for catching this; we’ve corrected “n=15” to “J=15”Line 250: Please check: the wording “to fit the full hierarchical model…” is confusing because you talk about fitting a TARB?

**Authors’ response: (now line 293) We have updated the language to be clear that we were referring to fitting the second stage algorithm.Reviewer: 2

Comments to the Corresponding Author

Review of "Hierarchical Computing for Hierarchical Models in Ecology"The authors introduce transformation-assisted recursive Bayesian computing (TARB) as an alternative to a standard MCMC algorithm for Bayesian hierarchical models. The method substantially reduces computational time while resulting in equivalent inference. The paper is well-written and introduces the quantitative ecology audience to a method that can enable researchers to fit more complex models without the impediments of computational limitations. Specific comments are below.

This may be personal preference, but I find it helpful to introduce the specific applications and model earlier in the manuscript. Equations (1)-(3) generally define what type of model TARB works for, but it was helpful for me to read

the application section first to provide relevant context throughout the methods description.

**Authors’ response: Thanks for this suggestion. We have added a simple example to demonstrate the motivation, utility, and implementation of TARB throughout the introduction and methods to provide context.In equation (2), do the \theta_j have to be independent conditional on \psi? Do all \theta_j have to have the same distribution [\theta|\psi], or can equation (2) be indexed [\theta_j | \psi]?

**Authors’ response: We have added the subscript j to theta for clarity (now Eq. 6).Lines 101-106: The authors sample the theta_j^

*randomly with replacement to reduce autocorrelation since the MH ratio assumes the proposed values are independent. What practical steps can be taken to ensure the autocorrelation is sufficiently low? Should researchers thin the stage 1 chain and look at acf plots? If the stage 1 chain has high autocorrelation, which can be the case in complex problems, what impact does that have on the MH ratio?**Authors’ response: (now line 127-128) As long as the first stage chain is long enough, sampling with replacement results in proposals that are uncorrelated. Alternative approaches involve thinning the first stage chain (Hooten and Hefley, 2019, Bringing Bayesian Models to Life), but overall sampling with replacement performs better and was recommended by Lunn et al. 2013 (10.1111/rssc.12007) and Hooten et al. 2020 (doi: 10.1080/00031305.2019.1665584).Lines 107-118: I found this paragraph difficult to follow and had to re-read several times. I understand the desire to have conjugacy in the first stage prior. It is less clear why the authors want the full model to use transformed parameters that have a Gaussian distribution. Why Gaussian? Is this just so the population level parameters are specified through the mean?

**Authors’ response: We appreciate this feedback, and have rewritten this paragraph to be more clear about why transformation is useful, and why we were focused on Gaussian distributions (lines 129-146). We referred to Gaussian parameters here because these are conventional in cases like random effects in GLMMs and because this is the distribution ecologists are often most comfortable working with and that is often easiest to specify if we are aiming to use more informative priors. The TARB approach is much more general however and could be applied regardless of distribution. We have clarified this in the manuscript.Line 131: Any considerations other than conjugacy for the temporary first stage prior? For example, is there a benefit to choosing a temporary prior that is "similar to" the marginal prior that would result from the full model after integrating out the hyperparameters?

**Authors’ response: Good point. The first stage prior affects the proposal for the second stage, so one could identify “optimal” priors to use in the first stage that yield the most efficient algorithms. We have found that diffuse temporary priors in the first stage perform well in a variety of examples.When is RB or TARB not worth the extra effort? Is the computational gain less dramatic with small to medium J? What if the dimension of theta_j is "large" (p>n) as is the case in say a model with spatial random effects? What happens if the theta_j are highly correlated in the full model posterior? Does this impact convergence diagnostics (e.g. effective sample size)?

**Authors’ response: As the partition-level sample size increases toward infinity, the first stage posteriors will shrink around the point estimates and in this case, the hierarchical model is not warranted because one could just use the first stage point estimates as data. However, we have found that when the partition sample sizes are relatively small and unequal, these types of recursive computing procedures are helpful. Also, in situations where tuning is needed for more than a few parameters, the TARB method is particularly helpful because it may not even be feasible to tune all of the parameters separately.Which parts of the algorithm contribute to the computational gain? Is it primarily the ability to parallelize the step 1 Markov chains? I don't know if the comparison to a single MCMC is totally fair since you performed all updates sequentially even though the full conditionals for some model parameters would still be independent and could benefit from parallelization.

**Authors’ response: Parallelization definitely contributes to the computational gain when the first stage models are nontrivial, but the other factors are that TARB alleviates tuning all of the random effects. The resulting algorithms are often faster computationally and less tedious for the researcher because the first-stage Gibbs updates do not need to be tuned like MH updates and >50% of the proposals aren’t being thrown out, and each partition-level model in stage one only needs to refer to the data from that partition. Additionally, stage 2 doesn’t rely on the data directly which can especially influence the computational gains when there are large amounts of response and covariate data or the data model is complicated.Line 175: I recommend the authors clarify that u_1 and u_2 are known, and I would specify these values here so the reader doesn't have to go to the appendix to find them.

**Authors’ response: Thanks for pointing this out, we have updated the text to reflect that u_1 and u_2 are fixed values (line 213).In Figure 1, I am surprised at how similar the posteriors are for stage 1 and stage 2. This is related to my previous comment, but is this because the parameters are not highly correlated in the full posterior? Also, I think the authors should explicitly state that the first stage posterior is only used to get proposals for stage 2, but that it isn't directly used for inference. I worry some readers might see Figure 1 and think they can use the stage 1 results for inference.

**Authors’ response: Yes, in this particular example the first stage posteriors are similar to the full posteriors. This is not always the case however, and we still need the second stage to get the correct population-level inference. We have clarified that the first stage is only temporary and not to be used for inference, nor will it always be similar to the second stage results.The intended audience would greatly benefit by including sample code as a supplementary file.

**Authors’ response: We have now provided all code to reproduce our analyses in R in a Github repository linked in the acknowledgements.

##### Author response by

##### Cite this author response

- pre-publication peer review (ROUND 1)
****##### Decision Letter

2020/08/0606-Aug-2020

MEE-20-06-476 Hierarchical computing for hierarchical models in ecology

Dear Ms Hanna McCaslin,

Thank you for submitting your manuscript to Methods in Ecology and Evolution. I have now received the reviewers' reports and a recommendation from the Associate Editor who handled the review process. Copies of their reports are included below. This manuscript has the potential to make a valuable contribution to the area, but there are a number of significant concerns that need to be addressed. I have considered your paper in light of the comments received and I would like to invite you to prepare a major revision. The main issue surrounds making your work more accessible to ecologists, so they can use the approach with their own problems.

In your revision, please make sure that you take full account of the above comments and those made in the reports below. Please note that Methods in Ecology and Evolution does not automatically accept papers after revision, and an invitation to revise a manuscript does not represent commitment to eventual publication on our part. We will reject revised manuscripts if they are returned without satisfactory responses to the reviewers' comments. When returning the revised paper, please show point-by-point how you have dealt with the various comments in the appropriate section of the submission form.

Please return your revision by 17-Sep-2020. If you need longer, please let us know so we can update our system accordingly. Before resubmitting your manuscript, please read through the resubmission instructions below.

We look forward to hearing from you in due course.

Sincerely,

Professor Robert B. O'Hara

Senior Editor, Methods in Ecology and EvolutionReply to:

Ms India Stephenson

Methods in Ecology and Evolution Editorial Office

coordinator@methodsinecologyandevolution.orgAssociate Editor Comments to Author:

Associate Editor

Comments to the Author:

I think the topic is interesting, and this can become a really interesting manuscript. But, based on the reviewers' comments, I think it needs more work to make it accessible. I agree with reviewer 1 that the justification for using TARB rather than RB isn't clear. As far as I can see, the acceptance probability just requires calculating densities at the points in the parameter space, i.e. you just need to plug in the values in eqn. 6. So I'm not sure why these need to be analytically tractable. I assume I'm missing something.Some other comments:

1. it might help to have a familiar example (e.g. a simple 2-level GLMM) as an example, so the reader can follow the use of TARB without having to spend a lot of effort understanding the example. I think it's fine if this is a toy example. It could also be used to explore how well the methods work (which the reviewers are concerned about).

2. At times the description could be clearer if you distinguish the densities from the first and second stage models. In particular, there are two densities for theta_j.

3. It might also help to describe [theta_j] in the first level models as a pseudo-prior, to emphasise the point that it isn't a true prior because it will be integrated out. This terminology is used in a slightly different context (e.g. Carlin & Chib, DOI: 10.1111/j.2517-6161.1995.tb02042.x), but with a similar aim.Reviewer(s)' Comments to Author:

Reviewer: 1Comments to the Corresponding Author

The study “Hierarchical computing for hierarchical models in ecology” applies recursive Bayesian computation to an ecological data set and aims to make this computational tool accessible to ecologists. This is a very timely study and the topic should be of much interest to a broad audience of ecologists that struggle to fit hierarchical models to large and complex data sets because of computational constraints. I found this an impressive and well written manuscript. However, working at the interface of field ecology and hierarchical modelling myself, I found myself spending a relatively long time reading some details, yet I was not fully convinced after reading whether this approach is indeed readily applicable to a large range of problems. I thus wonder whether some more explanations would be helpful to make the work more accessible to a broad audience. Perhaps the TARB approach introduced from line 113 onwards can be better explained if adding some more details about assumptions etc and perhaps also linking this to Bayes filter ecologist might be more familiar with?

Also, I would find it interesting to learn mor about how this approach can be applied to sparse data and unequal sample sizes among groups. How would you for example weight first stage estimates according to partition-level uncertainty in the second stage? Does the TARB approach results in less precision in parameter estimates at either first or second stage? What kind of hierarchical models should not be done with TARB? Perhaps worth to add some discussion here that may foster the understanding of the approach?

I much appreciate the amount of work that apparently went into this manuscript. The authors have chosen a relatively complex movement model for demonstration while providing some other helpful examples of ecological data previously analysed with hierarchical models. If feasible, I think that providing some TARB model code and output (perhaps as a supplementary tutorial) for one of the most accessible data sets such as the harbor seal counts would make the approach much more accessible to a broad audience.

Overall, an interesting and stimulating manuscript!SPECIFIC COMMENTS:

Line 12: Can you provide more details for the statement “reduced computation time for fitting our hierarchical movement model by half”? If the initial model is fitted with an extremely slow conventional MCMC algorithms, half the time might be of less interest than in the initial model is state-of-the-art modelling?

Line 27-29: Would it be possible to explain in few words or with a short example how individual telemetry data related to an species-level parameter of interest?

Line 55-58: Recursive computing is generally associated with sequential data but some of the hierarchical model application in ecology you mention above are not fitted to sequential data? Perhaps worth to clarify whether your approach is limited to sequential data or not?

Line 66: The examples (“(e.g., partitioned by individuals, sites, or species)”) are redundant with

Lines 68: Replace “population-level parameters” with “group-level parameters” (simply to avoid confusion around the word ‘population’?). Perhaps worth to consider also introducing the term ‘hyperparameter’ here?

Line 79: ”temperature in blue tits”? Unclear.

Line 94-95: Please check: I found the wording “specifying priors [theta_j] for the partition-level models” somewhat confusing: you specify priors for the partition-level parameter theta_j but perhaps it is not clear to all readers that “[theta_j]” refers to a prior?

Line 98: “remaining parameters” are all “population-level parameters”?

Line 108: Does “first stage” correspond to ‘partition-level’ terminology as used before? Suggest to make clear somewhere in the text how first/second stage link to partition/population levels.

Line 108: Perhaps explain ‘importance update’ in a few words (in parenthesis)?

Lines 113-131 The transformation and justification of the Jacobian prior, i.e. the TARB approach, is in my opinion incompletely described to make this accessible to ecologists and need some more detailed description. Also, shouldn’t aspects such as sequential data and underlying Markov processes be mentioned here? I think it need to be made clear here how the assumptions are met for data that resemble observations from underlying Markov process (e.g, movement records or population fluctuations) versus independent samples (e.g. independent records of species occurrence or behavior in different environments)?

Line 114: Perhaps some more information are warranted to explain how the transformation function g is determined and why as transformation should be used? Also, worth to mention the vector size here?

Line 148: Explain I in equation 11?

Line 149: What landscape covariates were used in your study?

Line 154: Confusing to me and perhaps worth some clarification: shouldn’t ‘delta_p rather than sigma relate to movement speed?

Line 179: For how many days (i.e. sequence length)?

Line 224: You present posterior estimates for first stage estimates but wouldn’t it be as interesting to present those from the second stage?

Legend Figure 1: Check "n = 15" and "n = 1675 telemetry locations from J = 15". Line 224:

Line 250: Please check: the wording “to fit the full hierarchical model…” is confusing because you talk about fitting a TARB?Reviewer: 2

Comments to the Corresponding Author

Review of "Hierarchical Computing for Hierarchical Models in Ecology"The authors introduce transformation-assisted recursive Bayesian computing (TARB) as an alternative to a standard MCMC algorithm for Bayesian hierarchical models. The method substantially reduces computational time while resulting in equivalent inference. The paper is well-written and introduces the quantitative ecology audience to a method that can enable researchers to fit more complex models without the impediments of computational limitations. Specific comments are below.

This may be personal preference, but I find it helpful to introduce the specific applications and model earlier in the manuscript. Equations (1)-(3) generally define what type of model TARB works for, but it was helpful for me to read

the application section first to provide relevant context throughout the methods description.In equation (2), do the \theta_j have to be independent conditional on \psi? Do all \theta_j have to have the same distribution [\theta|\psi], or can equation (2) be indexed [\theta_j | \psi]?

Lines 101-106: The authors sample the theta_j^* randomly with replacement to reduce autocorrelation since the MH ratio assumes the proposed values are independent. What practical steps can be taken to ensure the autocorrelation is sufficiently low? Should researchers thin the stage 1 chain and look at acf plots? If the stage 1 chain has high autocorrelation, which can be the case in complex problems, what impact does that have on the MH ratio?

Lines 107-118: I found this paragraph difficult to follow and had to re-read several times. I understand the desire to have conjugacy in the first stage prior. It is less clear why the authors want the full model to use transformed parameters that have a Gaussian distribution. Why Gaussian? Is this just so the population level parameters are specified through the mean?

Line 131: Any considerations other than conjugacy for the temporary first stage prior? For example, is there a benefit to choosing a temporary prior that is "similar to" the marginal prior that would result from the full model after integrating out the hyperparameters?

When is RB or TARB not worth the extra effort? Is the computational gain less dramatic with small to medium J? What if the dimension of theta_j is "large" (p>n) as is the case in say a model with spatial random effects? What happens if the theta_j are highly correlated in the full model posterior? Does this impact convergence diagnostics (e.g. effective sample size)?

Which parts of the algorithm contribute to the computational gain? Is it primarily the ability to parallelize the step 1 Markov chains? I don't know if the comparison to a single MCMC is totally fair since you performed all updates sequentially even though the full conditionals for some model parameters would still be independent and could benefit from parallelization.

Line 175: I recommend the authors clarify that u_1 and u_2 are known, and I would specify these values here so the reader doesn't have to go to the appendix to find them.

In Figure 1, I am surprised at how similar the posteriors are for stage 1 and stage 2. This is related to my previous comment, but is this because the parameters are not highly correlated in the full posterior? Also, I think the authors should explicitly state that the first stage posterior is only used to get proposals for stage 2, but that it isn't directly used for inference. I worry some readers might see Figure 1 and think they can use the stage 1 results for inference.

The intended audience would greatly benefit by including sample code as a supplementary file.

##### Decision letter by

##### Cite this decision letter

##### Reviewer report

2020/07/29Review of "Hierarchical Computing for Hierarchical Models in Ecology"

The authors introduce transformation-assisted recursive Bayesian computing (TARB) as an alternative to a standard MCMC algorithm for Bayesian hierarchical models. The method substantially reduces computational time while resulting in equivalent inference. The paper is well-written and introduces the quantitative ecology audience to a method that can enable researchers to fit more complex models without the impediments of computational limitations. Specific comments are below.

This may be personal preference, but I find it helpful to introduce the specific applications and model earlier in the manuscript. Equations (1)-(3) generally define what type of model TARB works for, but it was helpful for me to read

the application section first to provide relevant context throughout the methods description.In equation (2), do the \theta_j have to be independent conditional on \psi? Do all \theta_j have to have the same distribution [\theta|\psi], or can equation (2) be indexed [\theta_j | \psi]?

Lines 101-106: The authors sample the theta_j^* randomly with replacement to reduce autocorrelation since the MH ratio assumes the proposed values are independent. What practical steps can be taken to ensure the autocorrelation is sufficiently low? Should researchers thin the stage 1 chain and look at acf plots? If the stage 1 chain has high autocorrelation, which can be the case in complex problems, what impact does that have on the MH ratio?

Lines 107-118: I found this paragraph difficult to follow and had to re-read several times. I understand the desire to have conjugacy in the first stage prior. It is less clear why the authors want the full model to use transformed parameters that have a Gaussian distribution. Why Gaussian? Is this just so the population level parameters are specified through the mean?

Line 131: Any considerations other than conjugacy for the temporary first stage prior? For example, is there a benefit to choosing a temporary prior that is "similar to" the marginal prior that would result from the full model after integrating out the hyperparameters?

When is RB or TARB not worth the extra effort? Is the computational gain less dramatic with small to medium J? What if the dimension of theta_j is "large" (p>n) as is the case in say a model with spatial random effects? What happens if the theta_j are highly correlated in the full model posterior? Does this impact convergence diagnostics (e.g. effective sample size)?

Which parts of the algorithm contribute to the computational gain? Is it primarily the ability to parallelize the step 1 Markov chains? I don't know if the comparison to a single MCMC is totally fair since you performed all updates sequentially even though the full conditionals for some model parameters would still be independent and could benefit from parallelization.

Line 175: I recommend the authors clarify that u_1 and u_2 are known, and I would specify these values here so the reader doesn't have to go to the appendix to find them.

In Figure 1, I am surprised at how similar the posteriors are for stage 1 and stage 2. This is related to my previous comment, but is this because the parameters are not highly correlated in the full posterior? Also, I think the authors should explicitly state that the first stage posterior is only used to get proposals for stage 2, but that it isn't directly used for inference. I worry some readers might see Figure 1 and think they can use the stage 1 results for inference.

The intended audience would greatly benefit by including sample code as a supplementary file.

##### Reviewed by

##### Cite this review

##### Reviewer report

2020/07/03The study “Hierarchical computing for hierarchical models in ecology” applies recursive Bayesian computation to an ecological data set and aims to make this computational tool accessible to ecologists. This is a very timely study and the topic should be of much interest to a broad audience of ecologists that struggle to fit hierarchical models to large and complex data sets because of computational constraints. I found this an impressive and well written manuscript. However, working at the interface of field ecology and hierarchical modelling myself, I found myself spending a relatively long time reading some details, yet I was not fully convinced after reading whether this approach is indeed readily applicable to a large range of problems. I thus wonder whether some more explanations would be helpful to make the work more accessible to a broad audience. Perhaps the TARB approach introduced from line 113 onwards can be better explained if adding some more details about assumptions etc and perhaps also linking this to Bayes filter ecologist might be more familiar with?

Also, I would find it interesting to learn mor about how this approach can be applied to sparse data and unequal sample sizes among groups. How would you for example weight first stage estimates according to partition-level uncertainty in the second stage? Does the TARB approach results in less precision in parameter estimates at either first or second stage? What kind of hierarchical models should not be done with TARB? Perhaps worth to add some discussion here that may foster the understanding of the approach?

I much appreciate the amount of work that apparently went into this manuscript. The authors have chosen a relatively complex movement model for demonstration while providing some other helpful examples of ecological data previously analysed with hierarchical models. If feasible, I think that providing some TARB model code and output (perhaps as a supplementary tutorial) for one of the most accessible data sets such as the harbor seal counts would make the approach much more accessible to a broad audience.

Overall, an interesting and stimulating manuscript!SPECIFIC COMMENTS:

Line 12: Can you provide more details for the statement “reduced computation time for fitting our hierarchical movement model by half”? If the initial model is fitted with an extremely slow conventional MCMC algorithms, half the time might be of less interest than in the initial model is state-of-the-art modelling?

Line 27-29: Would it be possible to explain in few words or with a short example how individual telemetry data related to an species-level parameter of interest?

Line 55-58: Recursive computing is generally associated with sequential data but some of the hierarchical model application in ecology you mention above are not fitted to sequential data? Perhaps worth to clarify whether your approach is limited to sequential data or not?

Line 66: The examples (“(e.g., partitioned by individuals, sites, or species)”) are redundant with

Lines 68: Replace “population-level parameters” with “group-level parameters” (simply to avoid confusion around the word ‘population’?). Perhaps worth to consider also introducing the term ‘hyperparameter’ here?

Line 79: ”temperature in blue tits”? Unclear.

Line 94-95: Please check: I found the wording “specifying priors [theta_j] for the partition-level models” somewhat confusing: you specify priors for the partition-level parameter theta_j but perhaps it is not clear to all readers that “[theta_j]” refers to a prior?

Line 98: “remaining parameters” are all “population-level parameters”?

Line 108: Does “first stage” correspond to ‘partition-level’ terminology as used before? Suggest to make clear somewhere in the text how first/second stage link to partition/population levels.

Line 108: Perhaps explain ‘importance update’ in a few words (in parenthesis)?

Lines 113-131 The transformation and justification of the Jacobian prior, i.e. the TARB approach, is in my opinion incompletely described to make this accessible to ecologists and need some more detailed description. Also, shouldn’t aspects such as sequential data and underlying Markov processes be mentioned here? I think it need to be made clear here how the assumptions are met for data that resemble observations from underlying Markov process (e.g, movement records or population fluctuations) versus independent samples (e.g. independent records of species occurrence or behavior in different environments)?

Line 114: Perhaps some more information are warranted to explain how the transformation function g is determined and why as transformation should be used? Also, worth to mention the vector size here?

Line 148: Explain I in equation 11?

Line 149: What landscape covariates were used in your study?

Line 154: Confusing to me and perhaps worth some clarification: shouldn’t ‘delta_p rather than sigma relate to movement speed?

Line 179: For how many days (i.e. sequence length)?

Line 224: You present posterior estimates for first stage estimates but wouldn’t it be as interesting to present those from the second stage?

Legend Figure 1: Check "n = 15" and "n = 1675 telemetry locations from J = 15". Line 224:

Line 250: Please check: the wording “to fit the full hierarchical model…” is confusing because you talk about fitting a TARB?##### Reviewed by

##### Cite this review

**All peer review content displayed here is covered by a Creative Commons CC BY 4.0 license.**