3 Bayesian inference and INLA

3.1 Bayesian inference

Bayesian hierarchical models are often used to model spatial and spatio-temporal data. These models allow great flexibility in how estimates borrow strength across space and time, improving estimation and prediction of the underlying model features. In a Bayesian approach, a probability distribution \(\pi(\boldsymbol{y}|\boldsymbol{\theta})\), called the likelihood, is specified for the observed data \(\boldsymbol{y} = (y_1, \ldots, y_n)\) given a vector of unknown parameters \(\boldsymbol{\theta}\). Then, a prior distribution \(\pi(\boldsymbol{\theta}|\boldsymbol{\eta})\) is assigned to \(\boldsymbol{\theta}\), where \(\boldsymbol{\eta}\) is a vector of hyperparameters. The prior distribution for \(\boldsymbol{\theta}\) represents the knowledge about \(\boldsymbol{\theta}\) before obtaining the data \(\boldsymbol{y}\). If \(\boldsymbol{\eta}\) is not known, a fully Bayesian approach specifies a hyperprior distribution for \(\boldsymbol{\eta}\). Alternatively, an empirical Bayes approach plugs in an estimate of \(\boldsymbol{\eta}\) as if \(\boldsymbol{\eta}\) were known. Assuming that \(\boldsymbol{\eta}\) is known, inference concerning \(\boldsymbol{\theta}\) is based on the posterior distribution of \(\boldsymbol{\theta}\), which is obtained via Bayes' theorem as

\[\pi(\boldsymbol{\theta}|\boldsymbol{y}) = \frac{\pi(\boldsymbol{y}, \boldsymbol{\theta})}{\pi(\boldsymbol{y})} = \frac{\pi(\boldsymbol{y}|\boldsymbol{\theta}) \pi(\boldsymbol{\theta})}{\int \pi(\boldsymbol{y}|\boldsymbol{\theta}) \pi(\boldsymbol{\theta}) d\boldsymbol{\theta}}.\]

The denominator \(\pi(\boldsymbol{y}) = \int \pi(\boldsymbol{y}|\boldsymbol{\theta}) \pi(\boldsymbol{\theta}) d\boldsymbol{\theta}\) is the marginal likelihood of the data \(\boldsymbol{y}\). It is free of \(\boldsymbol{\theta}\) and acts as a normalizing constant that does not affect the shape of the posterior distribution. Thus, the posterior distribution is often expressed as

\[\pi(\boldsymbol{\theta}|\boldsymbol{y}) \propto \pi(\boldsymbol{y}|\boldsymbol{\theta}) \pi(\boldsymbol{\theta}).\]
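As a concrete illustration of this proportionality, consider a hypothetical conjugate example: a Binomial likelihood with a Beta(\(a, b\)) prior on the success probability \(\theta\), for which the posterior is Beta(\(a + y, b + n - y\)) in closed form. The following R sketch (with made-up values of \(a\), \(b\), \(n\) and \(y\)) compares the prior and posterior densities.

```r
a <- 2; b <- 2          # prior hyperparameters (made-up values)
n <- 20; y <- 14        # observed successes out of n trials
theta <- seq(0, 1, length.out = 200)
prior <- dbeta(theta, a, b)
likelihood <- dbinom(y, n, theta)
posterior <- dbeta(theta, a + y, b + n - y)   # Beta(a + y, b + n - y)
# The product likelihood * prior has the same shape as the posterior,
# differing only by the normalizing constant pi(y)
plot(theta, posterior, type = "l", xlab = expression(theta), ylab = "Density")
lines(theta, prior, lty = 2)  # prior shown dashed for comparison
```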

Bayesian methods make it possible to incorporate prior beliefs into the model and provide a formal mechanism for updating this prior information in light of the data. In contrast to frequentist methods, Bayesian methods provide credible intervals on parameters and probability values on hypotheses that are in line with common sense interpretations. Moreover, Bayesian methods can handle complex models that are difficult to fit with classical methods, such as models involving repeated measures, missing data, or multivariate outcomes.

One principal difficulty in applying Bayesian methods is the calculation of the posterior \(\pi(\boldsymbol{\theta}|\boldsymbol{y})\), which usually involves high-dimensional integration that is generally not tractable in closed form. Thus, even when the likelihood and the prior distribution have closed-form expressions, the posterior distribution may not. Markov chain Monte Carlo (MCMC) methods have traditionally been used to solve this problem, and user-friendly software packages such as WinBUGS (Lunn et al. 2000), JAGS (Plummer 2019) and Stan (Stan Development Team 2019) have facilitated the use of Bayesian inference with MCMC in many scientific fields. MCMC methods work by generating a sample of values \(\{ \boldsymbol{\theta}^{(g)},\ g = 1, \ldots, G \}\) from a convergent Markov chain whose stationary distribution is the posterior \(\pi(\boldsymbol{\theta}|\boldsymbol{y})\). Empirical summaries of the sampled \(\boldsymbol{\theta}^{(g)}\) values may then be used to summarize the posterior distribution of the parameters of interest. For example, we might use the sample mean to estimate the posterior mean

\[\widehat{E(\theta_i | \boldsymbol{y} )} = \frac{1}{G} \sum_{g=1}^G \theta_i^{(g)},\]

and the sample variance to estimate the posterior variance

\[\widehat{Var(\theta_i | \boldsymbol{y} )} = \frac{1}{G-1} \sum_{g=1}^G (\theta_i^{(g)} - \widehat{E(\theta_i | \boldsymbol{y} )})^2.\]
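In R, these summaries reduce to sample statistics of the draws. A minimal sketch, using simulated values in place of real MCMC output:

```r
# Sketch: summarizing draws of a single parameter theta_i.
# 'draws' is a placeholder for the MCMC samples theta_i^(1), ..., theta_i^(G).
G <- 5000
draws <- rnorm(G, mean = 1.5, sd = 0.3)       # stands in for real MCMC output
post_mean <- mean(draws)                       # estimate of E(theta_i | y)
post_var  <- var(draws)                        # uses the 1/(G-1) scaling above
post_ci   <- quantile(draws, c(0.025, 0.975))  # 95% credible interval
```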

MCMC methods require the use of diagnostics to decide when the sampling chains have reached the stationary distribution, that is, the posterior distribution. One easy way to assess convergence is to examine the traceplot, a plot of the parameter value against the iteration number, to see how well the chain is mixing, that is, moving around the parameter space. Sample autocorrelations are also useful, since high autocorrelation indicates that the algorithm explores the posterior distribution slowly, which impedes convergence. The Geweke diagnostic (Geweke 1992) compares the means of the first and last parts of the chain to see whether the two parts come from the same distribution. It is also common to assess convergence by running a small number of parallel chains initialized at different starting values. Traceplots are then examined to see if there is a point after which all chains seem to overlap. Diagnostics can also be used to assess whether the variation within and between the chains coincides (Gelman and Rubin 1992).
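These checks are conveniently available in the coda R package. A minimal sketch, using simulated draws in place of output from two real chains:

```r
library(coda)
# Placeholder draws standing in for samples from two chains
# started at different initial values.
set.seed(1)
chain1 <- mcmc(rnorm(5000))
chain2 <- mcmc(rnorm(5000))
traceplot(chain1)                       # visual check of mixing
autocorr.plot(chain1)                   # sample autocorrelations
geweke.diag(chain1)                     # Geweke (1992) diagnostic
gelman.diag(mcmc.list(chain1, chain2))  # Gelman-Rubin (1992) diagnostic
```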

MCMC methods have made a great impact on statistical practice by making Bayesian inference possible for complex models. However, they are sampling methods that are extremely computationally demanding and present a wide range of problems in terms of convergence. Integrated nested Laplace approximation (INLA) is a computationally less intensive alternative to MCMC designed to perform approximate Bayesian inference in latent Gaussian models (Håvard Rue, Martino, and Chopin 2009). This is a very wide and flexible class of models, ranging from generalized linear mixed models to spatial and spatio-temporal models. INLA uses a combination of analytical approximations and numerical algorithms for sparse matrices to approximate the posterior distributions with closed-form expressions. This allows faster inference and avoids problems of sample convergence and mixing, which makes it feasible to fit large datasets and to explore alternative models. Examples of big health data applications using INLA include Shaddick, Thomas, and Green (2018), which produces global estimates of fine particulate matter ambient pollution; Moraga et al. (2015), which predicts lymphatic filariasis prevalence in sub-Saharan Africa; and Osgood-Zimmerman et al. (2018), which maps child growth failure in Africa. INLA can be applied easily via the R package R-INLA (Havard Rue et al. 2021). The INLA website http://www.r-inla.org includes documentation, examples, and other resources about INLA and the R-INLA package. Below we provide an introduction to INLA, and Chapter 4 provides an introduction to the R-INLA package as well as examples of how to use it.
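As a first taste of the interface (covered in detail in Chapter 4), a minimal sketch of fitting a linear model with R-INLA on simulated data might look as follows; the data frame and variable names are made up for illustration.

```r
library(INLA)
# Simulated data: a simple linear relationship with Gaussian noise
set.seed(1)
df <- data.frame(x = runif(100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 0.5)
# Fit the model; inla() returns approximate posterior marginals
res <- inla(y ~ x, family = "gaussian", data = df)
res$summary.fixed  # posterior summaries of the intercept and slope
```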

3.2 Integrated nested Laplace approximation

Integrated nested Laplace approximation (INLA) makes it possible to perform approximate Bayesian inference in latent Gaussian models such as generalized linear mixed models and spatial and spatio-temporal models. Specifically, models are of the form

\[y_i|\boldsymbol{x}, \boldsymbol{\theta} \sim \pi(y_i|x_i, \boldsymbol{\theta}),\ i=1,\ldots,n,\] \[\boldsymbol{x}|\boldsymbol{\theta} \sim N(\boldsymbol{\mu(\theta)}, \boldsymbol{Q(\theta)}^{-1}),\] \[\boldsymbol{\theta} \sim \pi(\boldsymbol{\theta}),\] where \(\boldsymbol{y}\) are the observed data, \(\boldsymbol{x}\) represents a Gaussian field, and \(\boldsymbol{\theta}\) are hyperparameters. \(\boldsymbol{\mu(\theta)}\) is the mean and \(\boldsymbol{Q(\theta)}\) is the precision matrix (i.e., the inverse of the covariance matrix) of the latent Gaussian field \(\boldsymbol{x}\). Here \(\boldsymbol{y}\) and \(\boldsymbol{x}\) can be high-dimensional. However, to produce fast inferences, the dimension of the hyperparameter vector \(\boldsymbol{\theta}\) should be small because approximations are computed using numerical integration over the hyperparameter space.

Observations \(y_i\) are, in many situations, assumed to belong to an exponential family with mean \(\mu_i = g^{-1}(\eta_i)\), where \(g(\cdot)\) is a link function. The linear predictor \(\eta_i\) accounts for the effects of various covariates in an additive way

\[\eta_i = \alpha + \sum_{k=1}^{n_{\beta}} \beta_k z_{ki} + \sum_{j=1}^{n_f} f^{(j)}(u_{ji}).\] Here, \(\alpha\) is the intercept, the \(\{ \beta_k \}\)’s quantify the linear effects of covariates \(\{ z_{ki} \}\) on the response, and the \(\{ f^{(j)}(\cdot) \}\)’s are a set of random effects defined in terms of some covariates \(\{ u_{ji}\}\). This formulation accommodates a wide range of models, including spatial and spatio-temporal models, thanks to the very different forms that the \(\{f^{(j)}\}\) functions can take.
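In R-INLA, this linear predictor translates directly into a model formula in which linear terms enter as usual and each \(f^{(j)}\) becomes an f() term. A hypothetical sketch (the variable names, and the neighborhood graph g, are assumptions for illustration):

```r
# Linear effects of z1 and z2, an independent random effect indexed by id,
# and a Besag spatial random effect over regions with neighborhood graph g
formula <- y ~ 1 + z1 + z2 +
  f(id, model = "iid") +
  f(region, model = "besag", graph = g)
```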

INLA uses a combination of analytical approximations and numerical integration to obtain approximated posterior distributions of the parameters. These posteriors can then be post-processed to compute quantities of interest such as posterior expectations and quantiles. Let \(\boldsymbol{x}=(\alpha,\{\beta_k\},\{f^{(j)}\})|\boldsymbol{\theta}\sim N(\boldsymbol{\mu(\theta)},\boldsymbol{Q(\theta)}^{-1})\) denote the vector of latent Gaussian variables, and let \(\boldsymbol{\theta}\) denote the vector of hyperparameters, which are not necessarily Gaussian. INLA computes accurate and fast approximations to the posterior marginals of the components of the latent Gaussian variables

\[\pi(x_i|\boldsymbol{y}),\ i=1,\ldots,n,\] as well as the posterior marginals for the hyperparameters of the Gaussian latent model

\[\pi(\theta_j|\boldsymbol{y}),\ j=1,\ldots,\text{dim}(\boldsymbol{\theta}).\]

The posterior marginals of each element \(x_i\) of the latent field \(\boldsymbol{x}\) are given by

\[\pi(x_i|\boldsymbol{y})=\int \pi(x_i|\boldsymbol{\theta},\boldsymbol{y})\pi(\boldsymbol{\theta}|\boldsymbol{y})d\boldsymbol{\theta},\] and the posterior marginals for the hyperparameters can be written as

\[\pi(\theta_j|\boldsymbol{y})=\int \pi(\boldsymbol{\theta}|\boldsymbol{y})d\boldsymbol{\theta}_{-j}.\] This nested formulation is used to approximate \(\pi(x_i|\boldsymbol{y})\) by combining analytical approximations to the full conditionals \(\pi(x_i | \boldsymbol{\theta},\boldsymbol{y} )\) and \(\pi(\boldsymbol{\theta} |\boldsymbol{y} )\) with numerical integration routines to integrate out \(\boldsymbol{\theta}\). Similarly, \(\pi(\theta_j|\boldsymbol{y})\) is obtained by approximating \(\pi(\boldsymbol{\theta}|\boldsymbol{y})\) and integrating out \(\boldsymbol{\theta}_{-j}\). Specifically, the posterior density of the hyperparameters is approximated using a Gaussian approximation to the posterior of the latent field, \(\tilde \pi_G(\boldsymbol{x}|\boldsymbol{\theta},\boldsymbol{y})\), evaluated at its mode, \(\boldsymbol{x}^*(\boldsymbol{\theta}) = \arg\max_{\boldsymbol{x}}\, \tilde \pi_G(\boldsymbol{x}|\boldsymbol{\theta},\boldsymbol{y})\),

\[\tilde \pi(\boldsymbol{\theta}|\boldsymbol{y}) \propto \left.\frac{\pi(\boldsymbol{x},\boldsymbol{\theta},\boldsymbol{y})}{\tilde \pi_G(\boldsymbol{x}|\boldsymbol{\theta},\boldsymbol{y})}\right|_{\boldsymbol{x}=\boldsymbol{x}^*(\boldsymbol{\theta})}.\] Then, INLA constructs the following nested approximations:

\[\tilde \pi(x_i|\boldsymbol{y})=\int \tilde \pi(x_i|\boldsymbol{\theta},\boldsymbol{y})\tilde \pi(\boldsymbol{\theta}|\boldsymbol{y})d\boldsymbol{\theta},\qquad \tilde \pi(\theta_j|\boldsymbol{y})=\int \tilde \pi(\boldsymbol{\theta}|\boldsymbol{y})d\boldsymbol{\theta}_{-j}.\]

Finally, these integrals can be computed numerically with respect to \(\boldsymbol{\theta}\) as finite sums over a set of selected hyperparameter values

\[\tilde \pi(x_i|\boldsymbol{y})=\sum_k \tilde \pi(x_i|\boldsymbol{\theta_k},\boldsymbol{y})\tilde \pi(\boldsymbol{\theta_k}|\boldsymbol{y})\times \Delta_k,\]

\[\tilde \pi(\theta_j|\boldsymbol{y})=\sum_l \tilde \pi(\boldsymbol{\theta^*_l}|\boldsymbol{y})\times \Delta^*_l,\]

where \(\Delta_k\) (\(\Delta^*_l\)) denotes the area weight corresponding to \(\boldsymbol{\theta_k}\) (\(\boldsymbol{\theta^*_l}\)).
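To make the mechanics of these sums concrete, the following toy R sketch applies the same grid scheme to a deliberately simple model (a single observation \(y \sim N(x, 1)\) with latent \(x \sim N(0, 1/\theta)\) and a Gamma(1, 1) prior on \(\theta\)), where \(\pi(x|\theta,y)\) and the marginal likelihood are available in closed form. This only illustrates the integration step; INLA itself obtains these conditionals via Laplace-type approximations.

```r
y <- 1.2
theta_grid <- seq(0.05, 5, by = 0.05)  # grid points theta_k
delta <- 0.05                          # area weights Delta_k
# pi(theta | y) on the grid: marginal likelihood y | theta ~ N(0, 1 + 1/theta)
# times the Gamma(1, 1) prior, normalized so the grid integrates to one
post_theta <- dnorm(y, 0, sqrt(1 + 1 / theta_grid)) * dgamma(theta_grid, 1, 1)
post_theta <- post_theta / sum(post_theta * delta)
# pi(x | theta_k, y) is Gaussian with precision 1 + theta_k and mean y/(1 + theta_k)
x <- seq(-3, 4, length.out = 200)
cond_x <- sapply(theta_grid, function(th) {
  v <- 1 / (1 + th)
  dnorm(x, mean = y * v, sd = sqrt(v))
})
# pi(x | y) ~= sum_k pi(x | theta_k, y) pi(theta_k | y) Delta_k
marg_x <- cond_x %*% (post_theta * delta)
plot(x, marg_x, type = "l", ylab = "Approximate posterior density")
```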

The approximations to the posterior marginals of the \(x_i\)’s conditioned on selected values \(\boldsymbol{\theta_k}\), \(\tilde \pi(x_i|\boldsymbol{\theta_k},\boldsymbol{y})\), can be obtained using a Gaussian, a Laplace, or a simplified Laplace approximation. The simplest and fastest solution is to use the Gaussian approximation derived from \(\tilde \pi_G(\boldsymbol{x}|\boldsymbol{\theta},\boldsymbol{y})\). However, in some situations this approximation produces errors in the location and fails to capture skewness. The Laplace approximation is preferable to the Gaussian approximation, but it is relatively costly. The simplified Laplace approximation (the default option in the R-INLA package) has a lower computational cost and satisfactorily remedies the location and skewness inaccuracies of the Gaussian approximation.
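In R-INLA, the approximation strategy can be selected through the control.inla argument of inla(); a minimal sketch, reusing the kind of simulated data shown earlier:

```r
library(INLA)
set.seed(1)
df <- data.frame(x = runif(100))
df$y <- 2 + 3 * df$x + rnorm(100, sd = 0.5)
# strategy can be "gaussian", "simplified.laplace" (the default) or "laplace"
res <- inla(y ~ x, family = "gaussian", data = df,
            control.inla = list(strategy = "laplace"))
```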