Approximations to the Log-Likelihood Function in the Nonlinear Mixed-Effects Model

In Approximations to the Log-Likelihood Function in the Nonlinear Mixed-Effects Model, Pinheiro and Bates discuss four approximations to the log-likelihood function: the linear mixed effects (LME) approximation, the Laplace approximation, importance sampling, and adaptive Gaussian quadrature. Throughout the post, I will apply these methods to an example. Let’s get the data set-up out of the way first.

We assume we have a dataset consisting of $N$ observations of: a scalar response, $y$; a $p$-dimensional vector of covariates associated with the fixed effects, $\mathbf{x}$; and a scalar covariate associated with the random effect, $z$. We assume the observations are divided into $K$ independent clusters with $n_k$ observations per cluster, such that $\sum_{k = 1}^K n_k = N$.

We assume that the responses follow a negative binomial distribution with conditional means and variances given by:

\[\begin{aligned} \mathbb{E}[y_{k,i} \rvert \boldsymbol{\beta}] &= \mu_{k,i} = \exp(\eta_{k,i}) = \exp\left(\mathbf{x}_{k,i}^\top \boldsymbol{\alpha} + z_{k,i}\beta_k\right) \\ \text{Var}(y_{k,i} \rvert \boldsymbol{\beta}) &= \mu_{k,i} + \frac{1}{\gamma} \mu_{k,i}^2 \end{aligned}\]

Here, $\gamma > 0$ is an overdispersion parameter, $\boldsymbol{\alpha} \in \mathbb{R}^p$ is a vector of fixed effect coefficients, and $\beta_k \in \mathbb{R}$ is a scalar random effect coefficient satisfying:

\[\begin{aligned} \beta_k &\overset{iid}{\sim} \mathcal{N}(0, \tau^2); \hspace{4mm} k \in [K] \end{aligned}\]

where $\tau^2$ is a variance component. We assume that observations within a cluster are conditionally independent.

Omitting the covariates from our notation, the conditional log-likelihood for cluster $k$, $\mathbf{y}_k$, is given by:

\[\log(p(\mathbf{y}_k; \boldsymbol{\alpha}, \gamma \rvert \beta_k)) = \sum_{i = 1}^{n_k} \left[ \log\left( \frac{\Gamma(y_{k,i} + \gamma)}{y_{k,i}! \, \Gamma(\gamma)}\right) + y_{k,i} \left[ \log(\mu_{k,i}) - \log(\mu_{k,i} + \gamma) \right] + \gamma\left[ \log(\gamma) - \log(\mu_{k,i} + \gamma)\right] \right]\]

where $\Gamma(\cdot)$ denotes the Gamma function. We have placed a Gaussian prior on the random effects:

\[\log(p(\boldsymbol{\beta}; \tau^2)) = -\frac{1}{2} \sum_{k = 1}^K \left[\log(2 \pi \tau^2) + \frac{\beta_k^2}{\tau^2} \right]\]

Our goal will be to obtain maximum likelihood estimates for $\boldsymbol{\alpha}$, $\gamma$, and $\tau^2$. This requires maximizing:

\[p(\mathbf{y}; \boldsymbol{\alpha}, \gamma, \tau^2) = \int p(\mathbf{y}; \boldsymbol{\alpha}, \gamma \rvert \boldsymbol{\beta}) p(\boldsymbol{\beta}; \tau^2) d \boldsymbol{\beta}\]
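To make the approximations concrete, here is a small Python simulation of this set-up that later sketches can refer back to; the dimensions and parameter values (`K`, `n_k`, `alpha`, `gamma`, `tau2`) are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

K, n_k, p = 20, 15, 2              # clusters, observations per cluster, fixed effects
alpha = np.array([0.5, -0.25])     # fixed effect coefficients
gamma, tau2 = 2.0, 0.5             # overdispersion and variance component

# covariates: x for the fixed effects, z for the random effect
x = rng.normal(size=(K, n_k, p))
z = rng.normal(size=(K, n_k))

# cluster-level random effects and conditional means
beta = rng.normal(scale=np.sqrt(tau2), size=K)
mu = np.exp(x @ alpha + z * beta[:, None])

# negative binomial draws with mean mu and variance mu + mu^2 / gamma
# (numpy parameterizes by the shape gamma and success probability gamma / (gamma + mu))
y = rng.negative_binomial(gamma, gamma / (gamma + mu))
```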

Set-Up

The mixed effects model of Pinheiro and Bates has two levels. In the top-most level, we assume the data can be divided into $K$ groups, the $i$-th of which has $n_i$ observations. We also assume that the (conditional) mean of each observation is some non-linear function of a group-specific parameter vector, $\boldsymbol{\phi}$, and covariates, $\boldsymbol{\gamma}$. The $j$-th observation in the $i$-th group, denoted by $y_{i,j}$, is modeled as:

\[\begin{equation} \label{eq:model} \begin{aligned} y_{i,j} &= f(\boldsymbol{\phi}_{i,j}, \boldsymbol{\gamma}_{i,j}) + \epsilon_{i,j} \\ \boldsymbol{\phi}_{i,j} &= \boldsymbol{\phi}_{i,j}(\boldsymbol{\alpha}, \boldsymbol{\beta}_i) = \mathbf{X}_{i,j} \boldsymbol{\alpha} + \mathbf{Z}_{i,j} \boldsymbol{\beta}_i \\ \epsilon_{i,j} &\overset{iid}{\sim} \mathcal{N}(0, \sigma^2) \\ \boldsymbol{\beta}_i &\sim \mathcal{N}(\mathbf{0}_q, \sigma^2 \mathbf{D}) \end{aligned} \end{equation}\]

where \(\boldsymbol{\alpha}\) is the $p$-dimensional vector of fixed effects, \(\boldsymbol{\beta}_i\) is the $q$-dimensional vector of random effects for group $i$, and \(\mathbf{X}_{i,j}\) and \(\mathbf{Z}_{i,j}\) are the fixed effects and random effects design matrices, respectively, for the $j$-th observation in the $i$-th group. The model assumes that the observation errors, \(\epsilon_{i,j}\), are independent of the random effects, and that the random effects have a general covariance matrix, \(\sigma^2 \mathbf{D}\).

Fitting the model in Eq. \eqref{eq:model} can be done in a variety of ways. Here, we are concerned with maximum likelihood estimation, which requires finding the values of $\boldsymbol{\alpha}$, $\sigma^2$, and $\mathbf{D}$ that maximize the marginal likelihood given by:

\[\begin{equation} \label{eq:marg-lik} \mathcal{L}(\mathbf{y}; \boldsymbol{\alpha}, \sigma^2, \mathbf{D}) = \int p(\mathbf{y}; \boldsymbol{\alpha}, \sigma^2, \mathbf{D} \rvert \boldsymbol{\beta}) p(\boldsymbol{\beta}; \sigma^2, \mathbf{D}) d \boldsymbol{\beta} \end{equation}\]

These integrals are, in general, intractable: they may be multi-dimensional, and they involve the non-linear function $f(\cdot)$. Thus, approximations are needed for model fitting and inference.


LME Approximation

The linear mixed effects (LME) method was developed by Lindstrom and Bates in their work Newton-Raphson and EM Algorithms for Linear Mixed-Effects Models for Repeated Measures Data. The log-likelihood approximation that Pinheiro and Bates refer to as the LME approximation is one step of the two-step algorithm proposed by Lindstrom and Bates. We detail these steps below; they are alternated until convergence.

Penalized Non-Linear Least Squares Step

Denote the estimate of $\mathbf{D}$ at iteration $t$ with $\hat{\mathbf{D}}^{(t)}$. We obtain the conditional mode of $\boldsymbol{\beta}$ and the conditional estimate of $\boldsymbol{\alpha}$ as:

\[\begin{equation} \label{eq:pnls-optim} (\hat{\boldsymbol{\alpha}}^{(t)}, \hat{\boldsymbol{\beta}}^{(t)}) = \underset{\boldsymbol{\alpha}, \boldsymbol{\beta}}{\arg \min} \left\{ \sum_{i = 1}^K \left(\rvert \rvert \mathbf{y}_i - \mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i) \rvert \rvert_2^2 + \boldsymbol{\beta}_i^\top \left[\hat{\mathbf{D}}^{(t)}\right]^{-1} \boldsymbol{\beta}_i \right) \right\} \end{equation}\]

where \(\mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i)\) is the vector of \(f(\boldsymbol{\phi}_{i,j}, \boldsymbol{\gamma}_{i,j})\) values for group $i$.
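As a rough illustration of the PNLS step (using a generic numerical optimizer rather than any purpose-built PNLS routine), the sketch below minimizes the objective in Eq. \eqref{eq:pnls-optim} for a hypothetical model function $f(\phi, t) = 10\exp(-\phi t)$ with a single fixed effect and a scalar random effect per group; all data and parameter values are made up for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# toy data: K groups, n observations each, phi_{i,j} = alpha + beta_i
K, n = 5, 10
t = np.tile(np.linspace(0.1, 2.0, n), (K, 1))
beta_true = 0.3 * rng.normal(size=K)
y = 10 * np.exp(-(1.0 + beta_true)[:, None] * t) + 0.1 * rng.normal(size=(K, n))

D_hat = np.array([[0.25]])  # current estimate of D (1 x 1 here)

def pnls_objective(theta):
    """Penalized non-linear least squares criterion from Eq. (pnls-optim)."""
    alpha, beta = theta[0], theta[1:]
    phi = alpha + beta[:, None]                  # group-specific parameter
    resid = y - 10 * np.exp(-phi * t)
    penalty = np.sum(beta ** 2 / D_hat[0, 0])    # sum_i beta_i^T D^{-1} beta_i (scalar case)
    return np.sum(resid ** 2) + penalty

fit = minimize(pnls_objective, np.zeros(1 + K), method="BFGS")
alpha_hat, beta_hat = fit.x[0], fit.x[1:]
```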

Linear Mixed Effects Step

In this step, we find $\hat{\mathbf{D}}^{(t+1)}$ by maximizing an approximation of the log-likelihood based on a first-order Taylor expansion. The point of expansion is the current estimate of the fixed effects and the current conditional mode of the random effects, $(\hat{\boldsymbol{\alpha}}^{(t)}, \hat{\boldsymbol{\beta}}^{(t)})$.

Recall that $\boldsymbol{\phi} = \boldsymbol{\phi}(\boldsymbol{\alpha}, \boldsymbol{\beta})$. A first-order Taylor expansion of $\mathbf{f}(\cdot)$ about $\hat{\boldsymbol{\theta}}_i^{(t)} = (\hat{\boldsymbol{\alpha}}^{(t)}, \hat{\boldsymbol{\beta}}^{(t)}_i)$ is given by:

\[\begin{aligned} \hat{\mathbf{f}}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i) &= \mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i) \\ &\approx \left. \left[ \mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i) \right] \right\rvert_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i^{(t)}} + (\boldsymbol{\theta}_i - \hat{\boldsymbol{\theta}}_i^{(t)})^\top\left. \left[ \frac{\partial \mathbf{f}_i(\boldsymbol{\phi}, \boldsymbol{\gamma}_i)}{\partial \boldsymbol{\theta}_i} \right] \right\rvert_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i^{(t)}} \\ &= \left. \left[ \mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i) \right] \right\rvert_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i^{(t)}} + (\boldsymbol{\alpha} - \hat{\boldsymbol{\alpha}}^{(t)})^\top\left. \left[ \frac{\partial \mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i)}{\partial \boldsymbol{\alpha}} \right] \right\rvert_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i^{(t)}} + (\boldsymbol{\beta}_i - \hat{\boldsymbol{\beta}}_i^{(t)})^\top\left. \left[ \frac{\partial \mathbf{f}_i(\boldsymbol{\phi}, \boldsymbol{\gamma}_i)}{\partial \boldsymbol{\beta}_i} \right] \right\rvert_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i^{(t)}} \end{aligned}\]

Define the following quantities:

\[\begin{aligned} \tilde{\mathbf{f}}^{\boldsymbol{\alpha}}_i(\hat{\boldsymbol{\theta}}_i^{(t)}) &= \left. \left[ \frac{\partial \mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i)}{\partial \boldsymbol{\alpha}} \right] \right\rvert_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i^{(t)}} \\ \tilde{\mathbf{f}}^{\boldsymbol{\beta}_i}_i(\hat{\boldsymbol{\theta}}_i^{(t)}) &= \left. \left[ \frac{\partial \mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i)}{\partial \boldsymbol{\beta}_i} \right] \right\rvert_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i^{(t)}} \\ \hat{\mathbf{y}}_i^{(t)} &= \mathbf{y}_i - \left. \left[ \mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i) \right] \right\rvert_{\boldsymbol{\theta}_i = \hat{\boldsymbol{\theta}}_i^{(t)}} + (\hat{\boldsymbol{\alpha}}^{(t)})^\top \tilde{\mathbf{f}}_i^{\boldsymbol{\alpha}}(\hat{\boldsymbol{\theta}}_i^{(t)}) + (\hat{\boldsymbol{\beta}}_i^{(t)})^\top \tilde{\mathbf{f}}_i^{\boldsymbol{\beta}_i}(\hat{\boldsymbol{\theta}}_i^{(t)}) \end{aligned}\]

Notice that (for a fixed iteration $t$) this approximation is the same as the log-likelihood function for a linear mixed effects model where the response is $\hat{\mathbf{y}}^{(t)} = (\hat{\mathbf{y}}^{(t)}_1, \dots, \hat{\mathbf{y}}_K^{(t)})^\top$, and the fixed and random effects design matrices are given by:

\[\begin{aligned} \tilde{\mathbf{f}}^{\boldsymbol{\alpha}}(\hat{\boldsymbol{\theta}}^{(t)}) &= \begin{bmatrix} \rvert & & \rvert \\ \tilde{\mathbf{f}}_1^{\boldsymbol{\alpha}}(\hat{\boldsymbol{\theta}_1}^{(t)}) & \dots & \tilde{\mathbf{f}}_K^{\boldsymbol{\alpha}}(\hat{\boldsymbol{\theta}_K}^{(t)}) \\ \rvert & & \rvert \end{bmatrix} \\ \tilde{\mathbf{f}}^{\boldsymbol{\beta}}(\hat{\boldsymbol{\theta}}^{(t)}) &= \begin{bmatrix} \rvert & & \rvert \\ \tilde{\mathbf{f}}_1^{\boldsymbol{\beta}_1}(\hat{\boldsymbol{\theta}_1}^{(t)}) & \dots & \tilde{\mathbf{f}}_K^{\boldsymbol{\beta}_K}(\hat{\boldsymbol{\theta}_K}^{(t)}) \\ \rvert & & \rvert \end{bmatrix} \end{aligned}\]
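Written out (treating $\tilde{\mathbf{f}}_i^{\boldsymbol{\alpha}}(\hat{\boldsymbol{\theta}}_i^{(t)})$ and $\tilde{\mathbf{f}}_i^{\boldsymbol{\beta}_i}(\hat{\boldsymbol{\theta}}_i^{(t)})$ as the $n_i \times p$ and $n_i \times q$ design matrices of this working linear mixed effects model, as in the restricted version below), the LME approximation is the usual Gaussian marginal log-likelihood:

\[\ell_{LME}(\boldsymbol{y}; \boldsymbol{\alpha}, \sigma^2, \mathbf{D}) = -\frac{1}{2} \sum_{i = 1}^K \left[ n_i \log(2 \pi \sigma^2) + \log \left\rvert \tilde{\boldsymbol{\Sigma}}_i \right\rvert + \frac{1}{\sigma^2} \left(\hat{\mathbf{y}}_i^{(t)} - \tilde{\mathbf{f}}_i^{\boldsymbol{\alpha}}(\hat{\boldsymbol{\theta}}_i^{(t)}) \boldsymbol{\alpha}\right)^\top \tilde{\boldsymbol{\Sigma}}_i^{-1} \left(\hat{\mathbf{y}}_i^{(t)} - \tilde{\mathbf{f}}_i^{\boldsymbol{\alpha}}(\hat{\boldsymbol{\theta}}_i^{(t)}) \boldsymbol{\alpha}\right) \right]\]

where \(\tilde{\boldsymbol{\Sigma}}_i = \mathbb{I}_{n_i \times n_i} + \tilde{\mathbf{f}}_i^{\boldsymbol{\beta}_i}(\hat{\boldsymbol{\theta}}_i^{(t)}) \mathbf{D} (\tilde{\mathbf{f}}_i^{\boldsymbol{\beta}_i}(\hat{\boldsymbol{\theta}}_i^{(t)}))^\top\).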

One can also perform the estimation using a restricted log-likelihood approximation as in Lindstrom and Bates (1990):

\[\ell_{LME}^R(\boldsymbol{y}; \boldsymbol{\alpha}, \sigma^2, \mathbf{D}) = - \frac{1}{2} \sum_{i = 1}^K \log \left( \left\rvert \sigma^2 (\tilde{\mathbf{f}}_i^{\boldsymbol{\alpha}}(\hat{\boldsymbol{\theta}}_i^{(t)}))^\top \left[\mathbb{I}_{n_i \times n_i} + (\tilde{\mathbf{f}}_i^{\boldsymbol{\beta}_i}(\hat{\boldsymbol{\theta}}_i^{(t)}))\mathbf{D} (\tilde{\mathbf{f}}_i^{\boldsymbol{\beta}_i}(\hat{\boldsymbol{\theta}}_i^{(t)}))^\top \right]^{-1} \right\rvert \tilde{\mathbf{f}}_i^{\boldsymbol{\alpha}}(\hat{\boldsymbol{\theta}}_i^{(t)}) \right) + \ell_{LME}(\boldsymbol{y}; \boldsymbol{\alpha}, \sigma^2, \mathbf{D})\]

This is equivalent to a Laplace approximation of the integral in Eq. \eqref{eq:marg-lik} but with a flat prior on the fixed effects.


Laplace Approximation

A Laplace approximation performs a second-order Taylor expansion of the conditional log-likelihood about a maximum a posteriori (MAP) estimate of the random effects with respect to some prior. Since the point of expansion is chosen to be the mode, the first-order term in the expansion disappears, and the integrand resembles a Gaussian kernel.

Define:

\[\begin{aligned} g(\mathbf{y}_i; \boldsymbol{\alpha}, \mathbf{D}, \boldsymbol{\beta}_i) &= \frac{1}{2 \sigma^2} \left[ \rvert \rvert \mathbf{y}_i - \mathbf{f}_i(\boldsymbol{\phi}_i, \boldsymbol{\gamma}_i) \rvert \rvert_2^2 + \boldsymbol{\beta}_i^\top \mathbf{D}^{-1} \boldsymbol{\beta}_i \right] \\ \hat{\boldsymbol{\beta}}_i &= \underset{\boldsymbol{\beta}_i}{\arg \min} \left\{ g(\mathbf{y}_i; \boldsymbol{\alpha}, \mathbf{D}, \boldsymbol{\beta}_i) \right\} \\ g(\hat{\boldsymbol{\beta}}_i) &= \left. \left[ g(\mathbf{y}_i; \boldsymbol{\alpha}, \mathbf{D}, \boldsymbol{\beta}_i) \right] \right\rvert_{\boldsymbol{\beta}_i = \hat{\boldsymbol{\beta}}_i} \\ g'(\hat{\boldsymbol{\beta}_i}) &= \left. \left[ \frac{\partial g(\mathbf{y}_i; \boldsymbol{\alpha}, \mathbf{D}, \boldsymbol{\beta}_i)}{\partial \boldsymbol{\beta}_i} \right] \right\rvert_{\boldsymbol{\beta}_i = \hat{\boldsymbol{\beta}}_i} \\ g''(\hat{\boldsymbol{\beta}_i}) &= \left. \left[ \frac{\partial^2 g(\mathbf{y}_i; \boldsymbol{\alpha}, \mathbf{D}, \boldsymbol{\beta}_i)}{\partial \boldsymbol{\beta}_i \partial \boldsymbol{\beta}_i^\top} \right] \right\rvert_{\boldsymbol{\beta}_i = \hat{\boldsymbol{\beta}}_i} \end{aligned}\]
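Expanding $g(\cdot)$ to second order about $\hat{\boldsymbol{\beta}}_i$ (the first-order term vanishes at the mode), the integrand is approximately proportional to a Gaussian density in $\boldsymbol{\beta}_i$:

\[\begin{equation} \label{eq:approx-gauss-density} \exp\left(-g(\mathbf{y}_i; \boldsymbol{\alpha}, \mathbf{D}, \boldsymbol{\beta}_i)\right) \approx \exp\left(-g(\hat{\boldsymbol{\beta}}_i)\right) \exp\left(-\frac{1}{2} (\boldsymbol{\beta}_i - \hat{\boldsymbol{\beta}}_i)^\top g''(\hat{\boldsymbol{\beta}}_i) (\boldsymbol{\beta}_i - \hat{\boldsymbol{\beta}}_i)\right) \end{equation}\]

Integrating over $\boldsymbol{\beta}_i$ then gives:

\[\int \exp\left(-g(\mathbf{y}_i; \boldsymbol{\alpha}, \mathbf{D}, \boldsymbol{\beta}_i)\right) d\boldsymbol{\beta}_i \approx \exp\left(-g(\hat{\boldsymbol{\beta}}_i)\right) (2\pi)^{\frac{q}{2}} \left\rvert g''(\hat{\boldsymbol{\beta}}_i) \right\rvert^{-\frac{1}{2}}\]

which, combined with the normalizing constants omitted from $g(\cdot)$, yields the Laplace approximation to cluster $i$'s contribution to the marginal likelihood.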

Example

For each cluster, we define:

\[q(\beta_k) = \log(p(\mathbf{y}_k; \boldsymbol{\alpha}, \gamma \rvert \beta_k)) + \log(p(\beta_k; \tau^2))\]

Suppose we knew the value of $\beta_k$ that maximized $q(\beta_k)$ (denote this with $\hat{\beta}_k$). We could then construct a second-order Taylor approximation of $q(\beta_k)$ as:

\[q(\beta_k) \approx q(\hat{\beta}_k) + (\beta_k - \hat{\beta}_k) q'(\hat{\beta}_k) + \frac{1}{2}(\beta_k - \hat{\beta}_k)^2 q''(\hat{\beta}_k)\]

Because $\hat{\beta}_k$ maximizes $q(\beta_k)$, it satisfies $q'(\hat{\beta}_k) = 0$. Thus, the approximation can be simplified to:

\[q(\beta_k) \approx q(\hat{\beta}_k) + \frac{1}{2}(\beta_k - \hat{\beta}_k)^2 q''(\hat{\beta}_k)\]

If we have the posterior mode $\hat{\beta}_k$ for each cluster, we can plug this approximation into the integral; the integrand becomes an (unnormalized) Gaussian kernel in $\beta_k$, which integrates in closed form:

\[\int \exp(q(\beta_k)) \, d\beta_k \approx \exp(q(\hat{\beta}_k)) \int \exp\left( -\frac{1}{2} \left[-q''(\hat{\beta}_k)\right] (\beta_k - \hat{\beta}_k)^2 \right) d\beta_k = \exp(q(\hat{\beta}_k)) \sqrt{\frac{2\pi}{-q''(\hat{\beta}_k)}}\]

Summing over clusters gives the Laplace approximation to the log-likelihood:

\[\ell_{LA}(\boldsymbol{\alpha}, \gamma, \tau^2; \mathbf{y}) \approx \sum_{k = 1}^K \left[ q(\hat{\beta}_k) + \frac{1}{2}\log(2\pi) - \frac{1}{2}\log\left(-q''(\hat{\beta}_k)\right) \right]\]
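Here is a minimal sketch of this calculation in Python for a single cluster, assuming the simulated arrays (`y`, `x`, `z`) and parameter values (`alpha`, `gamma`, `tau2`) from the data set-up above; the mode and second derivative are found numerically, which is just one convenient choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

def log_nb(y_k, mu, gamma):
    """Conditional negative binomial log-likelihood for one cluster."""
    return np.sum(gammaln(y_k + gamma) - gammaln(y_k + 1) - gammaln(gamma)
                  + y_k * (np.log(mu) - np.log(mu + gamma))
                  + gamma * (np.log(gamma) - np.log(mu + gamma)))

def q_k(beta_k, y_k, x_k, z_k, alpha, gamma, tau2):
    """q(beta_k): conditional log-likelihood plus log prior."""
    mu = np.exp(x_k @ alpha + z_k * beta_k)
    return log_nb(y_k, mu, gamma) - 0.5 * (np.log(2 * np.pi * tau2) + beta_k ** 2 / tau2)

def laplace_cluster(y_k, x_k, z_k, alpha, gamma, tau2, h=1e-5):
    """Laplace approximation to log p(y_k) for a single cluster."""
    # posterior mode of beta_k
    b_hat = minimize_scalar(lambda b: -q_k(b, y_k, x_k, z_k, alpha, gamma, tau2)).x
    # q''(b_hat) via a central finite difference
    q2 = (q_k(b_hat + h, y_k, x_k, z_k, alpha, gamma, tau2)
          - 2 * q_k(b_hat, y_k, x_k, z_k, alpha, gamma, tau2)
          + q_k(b_hat - h, y_k, x_k, z_k, alpha, gamma, tau2)) / h ** 2
    return (q_k(b_hat, y_k, x_k, z_k, alpha, gamma, tau2)
            + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(-q2))

# Laplace-approximate log-likelihood for the whole (simulated) data set
ell_la = sum(laplace_cluster(y[k], x[k], z[k], alpha, gamma, tau2) for k in range(len(y)))
```

The mode could equally be found with a few Newton steps on $q$, and $q''$ is available in closed form; the numerical versions just keep the sketch short.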


Importance Sampling Approximation

An alternative to the previous approximations, which were analytical in flavor, is to take a sampling-based approach. One can view the integral in Eq. \eqref{eq:marg-lik} as the expectation of the conditional likelihood taken over the distribution of the random effects. In importance sampling, we draw repeated samples from a specially chosen proposal distribution and approximate this expectation with a weighted average of the integrand evaluated at those samples.

Let \(\mathbf{z}^* \sim q(\cdot)\) be a latent variable that we introduce. We pick \(\mathbf{z}^*\) such that \(\boldsymbol{\beta}_{i} = \boldsymbol{\beta}_i(\mathbf{z}^*)\) (some function, $\boldsymbol{\beta}_i(\cdot)$, of \(\mathbf{z}^*\)) and so that it has a simpler distribution than the random effects prior distribution. Importance sampling makes the following approximation:

\[\begin{aligned} p(\mathbf{y}_i; \boldsymbol{\alpha}, \sigma^2, \mathbf{D}) &= \int p(\mathbf{y}_i, \boldsymbol{\beta}_i; \boldsymbol{\alpha}, \sigma^2, \mathbf{D}) d \boldsymbol{\beta}_i \\ &= \int \frac{p(\mathbf{y}_i, \boldsymbol{\beta}_i(\mathbf{z}^*); \boldsymbol{\alpha}, \sigma^2, \mathbf{D})}{q(\mathbf{z}^*)} q(\mathbf{z}^*) d \mathbf{z}^* \\ &= \mathbb{E}_{\mathbf{z}^*} \left[ \frac{p(\mathbf{y}_i, \boldsymbol{\beta}_i(\mathbf{z}^*); \boldsymbol{\alpha}, \sigma^2, \mathbf{D})}{q(\mathbf{z}^*)} \right] \\ &\approx \frac{1}{n^{IS}} \sum_{j = 1}^{n^{IS}} \frac{p(\mathbf{y}_i, \boldsymbol{\beta}_i(\mathbf{z}^*_j); \boldsymbol{\alpha}, \sigma^2, \mathbf{D})}{q(\mathbf{z}^*_j)} \end{aligned}\]

where $n^{IS}$ is the number of importance samples taken. When working out the Laplace approximation (see Eq. \eqref{eq:approx-gauss-density}), we found that the integrand for cluster $i$ (i.e. the joint distribution barring constants) is approximately proportional to a Gaussian density with mean and covariance:

\[\begin{aligned} \boldsymbol{\mu}_i &= \hat{\boldsymbol{\beta}}_i \\ \boldsymbol{\Sigma}_i &= \left[ g''(\hat{\boldsymbol{\beta}}_i) \right]^{-1} \end{aligned}\]

This implies that the posterior distribution of the random effects is approximately proportional to this Gaussian density. Since the posterior is the (near-)optimal proposal distribution $q(\cdot)$, we will draw samples as:

\[\begin{aligned} \mathbf{z}^* &\sim \mathcal{N}(\mathbf{0}_q, \mathbb{I}_{q \times q}) \\ \boldsymbol{\beta}_i(\mathbf{z}^*) &= \boldsymbol{\mu}_i + \boldsymbol{\Lambda}^{\frac{1}{2}}\left[\boldsymbol{\Sigma}_i\right] \mathbf{z}^* \end{aligned}\]

where $\boldsymbol{\Lambda}^{\frac{1}{2}}[\cdot]$ denotes the Cholesky factor of its argument. Let a quantity with $\mathbf{z}^*$ as its argument denote that quantity evaluated at the importance samples. For example:

\[\boldsymbol{\phi}_{i,j}(\mathbf{z}^*) = \mathbf{X}_{i,j} \boldsymbol{\alpha} + \mathbf{Z}_{i,j} \boldsymbol{\beta}_i(\mathbf{z}^*)\]

The above can be simplified by using the approximation to $g''(\hat{\boldsymbol{\beta}}_i)$ that we found in the previous section.
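To make this concrete for the scalar random effect in the running example, here is a rough Python sketch of the importance sampling estimate for a single cluster; it assumes the `q_k` helper from the Laplace sketch above, a Laplace mean and variance (`mu_hat`, `sigma2_hat`) for that cluster, and an arbitrary number of samples.

```python
import numpy as np

def is_cluster(y_k, x_k, z_k, alpha, gamma, tau2, mu_hat, sigma2_hat,
               n_is=1000, rng=None):
    """Importance sampling estimate of log p(y_k) for one cluster."""
    rng = rng or np.random.default_rng(0)
    z_star = rng.standard_normal(n_is)
    beta = mu_hat + np.sqrt(sigma2_hat) * z_star     # beta_k(z*) = mu + Lambda^{1/2}[Sigma] z*
    # log joint p(y_k, beta_k) at each sample, via q_k from the Laplace sketch
    log_joint = np.array([q_k(b, y_k, x_k, z_k, alpha, gamma, tau2) for b in beta])
    # log proposal density: N(mu_hat, sigma2_hat) evaluated at each sample
    log_prop = -0.5 * (np.log(2 * np.pi * sigma2_hat) + (beta - mu_hat) ** 2 / sigma2_hat)
    # log of the average importance weight, computed stably (log-sum-exp)
    log_w = log_joint - log_prop
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))
```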

Because importance sampling relies on taking random samples, there can be numerical issues if we sample too far into the tails of the proposal distribution: the importance weights at such points can become very large, inflating the variance of the estimate. This can be mitigated by replacing $q(\cdot)$ with a fatter-tailed distribution (such as a Student’s $t$-distribution with the same location and a small number of degrees of freedom). Alternatively, we can forego sampling altogether and use quadrature.


Adaptive Gaussian Quadrature Approximation

Because we have a multi-dimensional integral, we have to repeatedly apply one-dimensional quadrature rules, but this can still work well in practice. First, let’s cover “standard” Gaussian quadrature where we do not concern ourselves with finding “good” quadrature nodes.

Gaussian Quadrature

Recall that Gauss-Hermite quadrature makes the following approximation:

\[\begin{aligned} \int f(x) \exp(-x^2) dx \approx \sum_{j = 1}^m w_j f(\nu_j) \end{aligned}\]

where $\nu_j$ and $w_j$ are the abscissae of the $m$-th order Hermite polynomial and the associated weights, respectively. This approximation is essentially the same as importance sampling where our proposal distribution is a standard Gaussian (and we transform the integrand suitably to match). However, instead of approximating the expectation (with respect to this proposal distribution) as the empirical mean over some random sample, we use a deterministic set of points (i.e. the roots of a Hermite polynomial) and also weight the summands.

To use quadrature, we will need to manipulate the integral into the correct form. As we did in the previous section, we can rewrite the random effects as:

\[\begin{aligned} \mathbf{z}^* &\sim \mathcal{N}(\mathbf{0}_q, \mathbb{I}_{q \times q}) \\ \boldsymbol{\beta}_i &= \boldsymbol{\Lambda}^{\frac{1}{2}} \left[ \sigma^2 \mathbf{D} \right] \mathbf{z}^* \end{aligned}\]

To make this one-dimensional, we can equivalently define \(\mathbf{z}^* = (z^*_1, \dots, z^*_q)^\top\) for \(z^*_j \sim \mathcal{N}(0, 1)\).

We can use quadrature to approximate each dimension of the multiple integral separately.
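For instance, NumPy exposes the Gauss-Hermite nodes and weights directly. The small sketch below approximates $\mathbb{E}[h(Z)]$ for $Z \sim \mathcal{N}(0, 1)$; the $\sqrt{2}$ and $1/\sqrt{\pi}$ factors come from matching the $\exp(-x^2)$ weight function to the standard Gaussian density.

```python
import numpy as np

# abscissae and weights of the m-th order Hermite polynomial
m = 20
nodes, weights = np.polynomial.hermite.hermgauss(m)

def expect_standard_normal(h):
    """E[h(Z)] for Z ~ N(0, 1) via the substitution z = sqrt(2) * x."""
    return np.sum(weights * h(np.sqrt(2) * nodes)) / np.sqrt(np.pi)

print(expect_standard_normal(lambda z: z ** 2))       # ~ 1.0, the variance of N(0, 1)
print(expect_standard_normal(lambda z: np.exp(z)))    # ~ exp(0.5), about 1.6487
```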

Adaptive Gaussian Quadrature

Adaptive Gaussian quadrature is very similar. Instead of directly using the weights and abscissae of a Hermite polynomial, we transform them so that the nodes lie in areas of high density (with respect to the posterior).

For a one-dimensional integral, adaptive Gaussian quadrature makes the following approximation:

\[\int p(x, y) dy = \int p(x \rvert y) p(y) dy \approx \sqrt{2 \hat{\sigma}^2} \sum_{i = 1}^{n^{AGQ}} w_i \exp(\nu_i^2) p(x, \hat{\mu} + \nu_i \sqrt{2 \hat{\sigma}^2})\]

where $\hat{\mu}$ and $\hat{\sigma}^2$ are the mean and variance of the Laplace approximation to the posterior, $p(y \rvert x)$, and $w_i$ and $\nu_i$ are the appropriate weights and abscissae. If, instead, $y$ is $q$-dimensional (we denote this version as $\mathbf{y}$), and we apply the approximation to each coordinate independently, we get:

\[\int p(x, \mathbf{y}) d\mathbf{y} \approx 2^{\frac{q}{2}} \rvert \hat{\boldsymbol{\Sigma}} \rvert^{\frac{1}{2}} \sum_{j_1 = 1}^{n^{AGQ}} \dots \sum_{j_q = 1}^{n^{AGQ}} p\left(x, \hat{\boldsymbol{\mu}} + \sqrt{2} \, \boldsymbol{\Lambda}^{\frac{1}{2}}\left[\hat{\boldsymbol{\Sigma}}\right] \boldsymbol{\nu}_{j_1, \dots, j_q}\right) \prod_{k = 1}^q w_{j_k} \exp(\nu_{j_k}^2)\]

where \(\boldsymbol{\nu}_{j_1, \dots, j_q} = (\nu_{j_1}, \dots, \nu_{j_q})^\top\).

Example

We now derive the adaptive Gaussian quadrature approximation to the integral. As in the Laplace approximation, we will define:

\[\begin{aligned} q(\beta_k) &= \log(p(\mathbf{y}_k; \boldsymbol{\alpha}, \gamma \rvert \beta_k)) + \log(p(\beta_k; \tau^2)) \\ \hat{\mu}_k &= \hat{\beta}_k \\ \hat{\sigma}_k^2 &= \left[ - q''(\hat{\beta}_k) \right]^{-1} \end{aligned}\]

The adaptive Gaussian quadrature approximation for one cluster is:

\[\begin{aligned} p(\mathbf{y}_k; \boldsymbol{\alpha}, \gamma, \tau^2) &= \int p(\mathbf{y}_k; \boldsymbol{\alpha}, \gamma \rvert \beta_k) p(\beta_k; \tau^2) d \beta_k \\ &= \int \exp(q(\beta_k)) d \beta_k \\ &\approx \sqrt{2 \hat{\sigma}_k^2} \sum_{i = 1}^{n^{AGQ}} w_i \exp(\nu_i^2) \exp(q(\hat{\mu}_k + \nu_i \sqrt{2 \hat{\sigma}_k^2})) \end{aligned}\]

where $w_i$ and $\nu_i$ are the standard weights and abscissae of the $n^{AGQ}$-th order Hermite polynomial. Thus, for the full data, we have:

\[\begin{aligned} \ell_{AGQ}(\boldsymbol{\alpha}, \gamma, \tau^2; \mathbf{y}) &= \log\left( \prod_{k = 1}^K \sqrt{2 \hat{\sigma}_k^2} \sum_{i = 1}^{n^{AGQ}} w_i \exp(\nu_i^2) \exp\left(q\left(\hat{\mu}_k + \nu_i \sqrt{2 \hat{\sigma}_k^2}\right)\right) \right) \\ &= \sum_{k = 1}^K \left[\frac{1}{2} \log\left( 2 \hat{\sigma}_k^2\right) + \log\left( \sum_{i = 1}^{n^{AGQ}} w_i \exp(\nu_i^2) \exp\left(q\left(\hat{\mu}_k + \nu_i \sqrt{2 \hat{\sigma}_k^2}\right)\right) \right) \right] \end{aligned}\]
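To close the loop on the running example, here is a minimal Python sketch of $\ell_{AGQ}$; it assumes the `q_k` helper from the Laplace sketch and per-cluster Laplace means and variances (`mu_hat`, `sigma2_hat`), with an arbitrary number of quadrature nodes.

```python
import numpy as np
from scipy.special import logsumexp

def agq_loglik(y, x, z, alpha, gamma, tau2, mu_hat, sigma2_hat, n_agq=10):
    """Adaptive Gauss-Hermite approximation to the marginal log-likelihood."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_agq)
    total = 0.0
    for k in range(len(y)):
        scale = np.sqrt(2 * sigma2_hat[k])
        # log of each summand: log(w_i) + nu_i^2 + q(mu_k + nu_i * sqrt(2 * sigma_k^2))
        log_terms = [np.log(weights[i]) + nodes[i] ** 2
                     + q_k(mu_hat[k] + nodes[i] * scale, y[k], x[k], z[k], alpha, gamma, tau2)
                     for i in range(n_agq)]
        total += 0.5 * np.log(2 * sigma2_hat[k]) + logsumexp(log_terms)
    return total
```

With a single quadrature node the sum collapses to the Laplace approximation, and increasing `n_agq` trades computation for accuracy.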