This post is a primer on generalized linear mixed models and their associated estimation procedures. Throughout this post, we take derivatives of matrices, vectors, and scalars. We denote all of them with $\partial$ even though they may not technically be partial derivatives. The meaning should be clear from the context if you keep track of the dimensions of the variables.
We’ll use the notation $\text{vec}(\mathbf{M})$ to denote the vectorization of matrix $\mathbf{M}$, where we flatten the matrix row-by-row. Also, any function $f: \mathbb{R} \rightarrow \mathbb{R}$ will be applied element-wise to vectors and matrices.
Set-Up
We have $n$ observations coming from $k$ different clusters, each of size $n_t$ for $t \in [k]$. The full data will be denoted by $\mathbf{y}$. We will have $p$ fixed effect covariates arranged in a $p$-dimensional vector, \(\mathbf{x}_{i}\), and $q$ random effects covariates in a $q$-dimensional vector, \(\mathbf{z}_{i}\).
In general, we’ll use a subscript to denote the $i$-th observation out of some total. With no superscript, the total will be $n$. With a superscript $t$, the total will be the $t$-th cluster size, $n_t$.
We’ll also let $\alpha$ denote a $p$-dimensional fixed effect coefficient vector, and $\beta^t$ will denote a $q$-dimensional random effect coefficient vector corresponding to group $t$.
In addition, we assume $\beta^t \overset{iid}{\sim} \mathcal{F}$ for all $t \in [k]$, where $\mathcal{F}$ is an exponential family distribution with mean $\mathbf{0}$ and $q \times q$ covariance matrix $D(\theta)$ that depends on an $m$-dimensional variance component vector, $\theta$.
Additional Assumptions.
A few auxiliary assumptions must also be made for the analysis later, which we list here:
- The third and higher moments of $\beta$ are of order $o(\rvert \rvert \theta \rvert \rvert)$.
- The entries of $D(\theta)$ are linear in $\theta$.
- $D(\theta) = \mathbf{0}$ (the $q \times q$ zero matrix) if $\theta = \mathbf{0}$
Our model comes in the form of a specification of the conditional mean, $\mu^t_i = \mathbb{E}[\mathbf{y}^t_i \rvert \beta^t]$. For a monotonic and differentiable link function $g(\cdot)$ (e.g. $\log(\cdot)$ or $\text{logit}(\cdot)$), the conditional mean of the $i$-th observation in group $t$ is assumed to be given by:
\[\mu^t_{i} = g^{-1}\left(\mathbf{x}_i^\top \alpha + \mathbf{z}_i^\top \beta^t \right) \label{eq:glmm}\]To write Eq. \eqref{eq:glmm} in matrix notation, we assume that the observations are ordered by group, so the first $n_1$ observations are all from group $1$. We can then define the following vectors/matrices:
\[\mu = \begin{bmatrix} \mu^{1} \\ \vdots \\ \mu^{k} \end{bmatrix}, \hspace{2mm} \mathbf{X} = \begin{bmatrix} — & \mathbf{x}_1 & — \\ & \dots & \\ — & \mathbf{x}_{n} & — \\ \end{bmatrix}, \hspace{2mm} \tilde{\mathbf{Z}}^t = \begin{bmatrix} — & \mathbf{z}^t_1 & — \\ & \dots & \\ — & \mathbf{z}^t_{n_t} & — \\ \end{bmatrix}, \hspace{2mm} \mathbf{Z} = \begin{bmatrix} \tilde{\mathbf{Z}}^1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \tilde{\mathbf{Z}}^{k} \end{bmatrix}, \hspace{2mm} \beta = \begin{bmatrix} \beta^1 \\ \vdots \\ \beta^k \end{bmatrix} \nonumber\]where $\mu^t$ is the vector of conditional means for group $t$, $\tilde{\mathbf{Z}}^t$ is constructed by stacking the $\mathbf{z}_i$ vectors for group $t$ into a matrix, and $\beta$ is the vector created by stacking together all of the $\beta^t$ vectors. As a reminder, $\mu \in \mathbb{R}^{n \times 1}$, $\mathbf{X} \in \mathbb{R}^{n \times p}$, $\tilde{\mathbf{Z}}^t \in \mathbb{R}^{n_t \times q}$, $\mathbf{Z} \in \mathbb{R}^{n \times qk}$, and $\beta \in \mathbb{R}^{qk \times 1}$.
The model is then:
\[\mu = g^{-1}(\mathbf{X} \alpha + \mathbf{Z} \beta) \label{eq:glmm-matrix}\]Our data come as an $n \times 1$ vector of counts, $\mathbf{y}$, coming from $k$ different clusters. Cluster $t$ will have $n_t$ observations such that $\sum_{t = 1}^k n_t = n$. To keep things simple, we'll assume $p = q = 1$, so $\alpha \in \mathbb{R}$ and $\beta \in \mathbb{R}^{k \times 1}$. Thus, $\mathbf{x}_i, \mathbf{z}_i \in \mathbb{R}$ as well. Our model is then: $$ \mathbf{y}^t_i \rvert \beta^t \sim \text{Poi}\left(\mu_i^t\right), \hspace{5mm} \mu_i^t = \exp\left(\mathbf{x}_i \alpha + \mathbf{z}_i \beta^t \right) $$ We'll have a scalar-valued variance component that we call $\theta$. Let $\mathbf{0}$ be a $k \times 1$ vector of zeros. The random effects will have distribution: $$ \beta \sim \mathcal{N}(\mathbf{0}, \theta \mathbb{I}_{k \times k}) $$ which is equivalent to: $$ \beta^t \overset{iid}{\sim} \mathcal{N}(0, \theta), \hspace{5mm} \forall t \in [k] $$
If we further assume that we only have a fixed intercept and a random intercept, then $\mathbf{x}_i = \mathbf{z}_i = 1$. The model then simplifies to: $$ \mathbf{y}^t_i \rvert \beta^t \sim \text{Poi}\left(\mu_i^t\right), \hspace{5mm} \mu_i^t = \exp\left(\alpha + \beta^t \right) $$ We'll also assume that $n_t = m$ for all $t \in [k]$ to make this case really simple, so $n = mk$. Thus: $$ \mathbf{X} = \mathbf{1}_{n}, \hspace{5mm} \tilde{\mathbf{Z}}^t = \mathbf{1}_m, \hspace{5mm} \mathbf{Z} = \begin{bmatrix} \mathbf{1}_m & 0 & \dots & 0 \\ 0 & \mathbf{1}_m & \dots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \dots & \mathbf{1}_m \end{bmatrix} $$
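To make the running example concrete, here is a minimal simulation sketch of this fixed-intercept plus random-intercept Poisson model. The specific values of `k`, `m`, `alpha_true`, and `theta_true` are illustrative assumptions, not quantities from the text.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

k, m = 10, 25            # number of clusters, observations per cluster
n = k * m                # total sample size
alpha_true = 1.0         # fixed intercept (illustrative value)
theta_true = 0.5         # variance component (illustrative value)

# random intercepts: beta^t ~ iid N(0, theta)
beta = rng.normal(0.0, np.sqrt(theta_true), size=k)

# design matrices for the intercept-only case: X = 1_n, Z block-diagonal with 1_m blocks
X = np.ones((n, 1))
Z = np.kron(np.eye(k), np.ones((m, 1)))          # n x k

# conditional means and Poisson responses
eta = X @ np.array([alpha_true]) + Z @ beta      # linear predictor, length n
mu = np.exp(eta)
y = rng.poisson(mu)
```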
We’ll use $[\cdot] \rvert_{\beta = \beta_0}$ to denote evaluation of the function in brackets when setting $\beta$ equal to $\beta_0$. We’ll also use a superscript $0$ (e.g. $\mu^0$, $\eta^0$, etc.) to denote the quantity under the null hypothesis (i.e. $\theta = \mathbf{0} \implies \beta = \mathbf{0}$).
Likelihood
We can write the conditional log-likelihood using the exponential family form (see my generalized linear models post).
\[\ell(\mathbf{y}; \alpha, \theta \rvert \beta) = \sum_{i = 1}^n \left[ \frac{\zeta_i \mathbf{y}_i - A(\zeta_i)}{d(\phi, \omega_i)} + \log(h(\mathbf{y}_i, \phi, \omega_i))\right], \hspace{10mm} \mathcal{L}(\mathbf{y}; \alpha, \theta \rvert \beta) = \exp \left( \sum_{i = 1}^n \left[ \frac{\zeta_i \mathbf{y}_i - A(\zeta_i)}{d(\phi, \omega_i)} + \log(h(\mathbf{y}_i, \phi, \omega_i))\right] \right) \label{eq:condition-log-lik}\]where $\phi > 0$ is a dispersion/scale parameter; $\omega_i$ is a (prior) dispersion weight; $\zeta_i$ is the natural parameter; and $h(\cdot)$ and $A(\cdot)$ are known functions. We assume that $d(\phi, \omega_i) = \phi \omega_i^{-1}$.
As we found in my generalized linear models post, the conditional variance of the responses is given by:
\[\text{Cov}(\mathbf{y} \rvert \beta) = \text{diag}(d(\phi, \omega)) \underbrace{\frac{\partial^2 A(\zeta)}{\partial \zeta \partial \zeta^\top}}_{=V(\mu)} = \text{diag}^{-1}\left( \frac{\omega}{\phi}\right) V(\mu)\]
Quasi-Likelihood
We don’t actually need to specify the true likelihood, even though (in practice) we will usually be able to do so. Quasi-likelihood methods can often be less computationally expensive than full maximum likelihood for count data.
In the quasi-likelihood scheme, we keep the conditional mean assumption that we discussed above and also assume that the conditional variance can be expressed as a function of the mean in the form $\text{Cov}(\mathbf{y} \rvert \beta) = \text{diag}\left(d(\phi, \omega)\right)V(\mu)$ for some function $V(\cdot)$. Later, we’ll denote the diagonal elements of $\text{Cov}(\mathbf{y} \rvert \beta)$ with $v(\mu_i)$.
The conditional log quasi-likelihood and quasi-likelihood are given by:
\[\ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) = \int_{\mathbf{y}_i}^{\mu_i} \frac{\omega_i(\mathbf{y}_i - u)}{\phi V(u)} du, \hspace{10mm} \mathcal{L}_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) = \exp\left(\int_{\mathbf{y}_i}^{\mu_i} \frac{\omega_i(\mathbf{y}_i - u)}{\phi V(u)} du \right) \label{eq:quasi-lik}\]When the conditional distribution of $\mathbf{y}$ given $\beta$ is an exponential family distribution, the conditional log-likelihood and log quasi-likelihood agree (up to additive terms that do not depend on $\mu$). See my post on quasi-likelihood for a more in-depth discussion.
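As a quick numerical sanity check of Eq. \eqref{eq:quasi-lik}, the sketch below evaluates the defining integral by quadrature for a single observation, assuming the Poisson-type variance function $V(u) = u$ with $\phi = \omega_i = 1$; the closed form it compares against is derived in the example that follows.

```python
import numpy as np
from scipy.integrate import quad

def log_quasi_lik_integral(y_i, mu_i):
    # integral_{y_i}^{mu_i} (y_i - u) / V(u) du, with V(u) = u and phi = omega_i = 1
    value, _ = quad(lambda u: (y_i - u) / u, y_i, mu_i)
    return value

def log_quasi_lik_closed_form(y_i, mu_i):
    # Poisson form derived next: y*log(mu) - y*log(y) - mu + y
    return y_i * np.log(mu_i) - y_i * np.log(y_i) - mu_i + y_i

# both prints should agree (up to quadrature error)
print(log_quasi_lik_integral(3.0, 2.2), log_quasi_lik_closed_form(3.0, 2.2))
```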
The variance is equal to the mean for Poisson data, and, since we have no overdispersion, we have $\phi = 1$ and $\omega_{i} = 1$ for all $i \in [n]$. Thus, $\text{Cov}(\mathbf{y} \rvert \beta) = \text{diag}(V(\mu)) = \text{diag}(\mu)$. We can write the conditional log quasi-likelihood as: $$ \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) = \sum_{i = 1}^{n} \left[ \mathbf{y}_{i}\log(\mu_{i}) - \mathbf{y}_{i} \log(\mathbf{y}_{i})- \mu_{i} + \mathbf{y}_{i} \right] \label{eq:conditional-log-quasi-likelihood} $$
Proof.
Assuming that all samples and targets are independent, the log quasi-likelihood is: $$ \begin{aligned} \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) &= \sum_{i = 1}^{n} \ell_q(\mathbf{y}_{i}; \alpha, \theta \rvert \beta) \\ &= \sum_{i = 1}^{n} \int_{\mathbf{y}_{i}}^{\mu_{i}} \frac{\mathbf{y}_{i} - u}{u} du \\ &= \sum_{i = 1}^{n} \int_{\mathbf{y}_{i}}^{\mu_{i}} \left( \frac{\mathbf{y}_{i}}{u} - 1\right) du \\ &= \sum_{i = 1}^{n} \left[ \mathbf{y}_{i} \int_{\mathbf{y}_{i}}^{\mu_{i}} \frac{1}{u} du - (\mu_{i} - \mathbf{y}_{i}) \right]\\ &= \sum_{i = 1}^{n} \left[ \mathbf{y}_{i} \left( \log(\mu_{i}) - \log(\mathbf{y}_{i}) \right)- (\mu_{i} - \mathbf{y}_{i}) \right] \\ &= \sum_{i = 1}^{n} \left[ \mathbf{y}_{i}\log(\mu_{i}) - \mathbf{y}_{i} \log(\mathbf{y}_{i})- \mu_{i} + \mathbf{y}_{i} \right] \end{aligned} \nonumber $$Let $f$ denote the density associated with $\mathcal{F}$, the distribution of the random effects. The unconditional likelihood is then given by integrating out the random effects from the joint likelihood:
\[\begin{aligned} \mathcal{L}(\mathbf{y}; \alpha, \theta) &= \int \left( \prod_{i = 1}^n \mathcal{L}(\mathbf{y}_i; \alpha, \theta \rvert \beta) \right) f(\beta) d\beta \end{aligned} \label{eq:uncondition-log-lik}\]Remember that the above integral is multi-dimensional if $\beta$ has dimension greater than $1$!
We usually do maximum likelihood estimation for the parameters $(\alpha, \theta)$ by taking the derivative of the log of the above, setting it equal to zero, and solving. However, depending upon the form of $f(\cdot)$ and the number of random effects we have, Eq. \eqref{eq:uncondition-log-lik} may be a multi-dimensional integral that is difficult to evaluate, let alone differentiate.
Even if we use the quasi-likelihood, we have the same problem as the (potentially) multi-dimensional integral remains:
\[\mathcal{L}_q(\mathbf{y}; \alpha, \theta) = \int \left( \prod_{i = 1}^n \mathcal{L}_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) \right) f(\beta) d\beta \label{eq:uncondition-quasi-lik}\]We need some other work-around to do inference. For this reason, we use a linear approximation of the quasi-likelihood.
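Before moving on, here is a sense of what the brute-force alternative looks like: a sketch that evaluates the marginal likelihood of a single cluster in the Poisson random-intercept example by Gauss-Hermite quadrature. The function name and quadrature size are my own choices for illustration.

```python
import numpy as np
from scipy.stats import poisson

def cluster_marginal_lik(y_t, alpha, theta, n_quad=40):
    # integrate prod_i Poi(y_i | exp(alpha + b)) against b ~ N(0, theta)
    # via Gauss-Hermite quadrature, substituting b = sqrt(2 * theta) * node
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    b_vals = np.sqrt(2.0 * theta) * nodes
    cond_liks = np.array([poisson.pmf(y_t, np.exp(alpha + b)).prod() for b in b_vals])
    return float((weights * cond_liks).sum() / np.sqrt(np.pi))

y_t = np.array([2, 4, 3, 1])   # one cluster's counts (made up)
print(cluster_marginal_lik(y_t, alpha=1.0, theta=0.5))
```

This is manageable for one cluster and a scalar random effect, but every evaluation of the marginal (quasi-)likelihood requires one such integral per cluster, and multivariate random effects turn it into a multi-dimensional quadrature, which is exactly the bottleneck described above.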
Taylor Approximation
One approximation we can use is through the Taylor expansion. First, let’s rewrite the conditional quasi-likelihood of $(\alpha, \theta)$ given $\beta$ using a Taylor expansion. We’ll use a second-order expansion since the remainder will be relatively simple due to the assumptions made on the higher order moments of the random effects.
Denote the first- and second-order partial derivatives of $\mathcal{L}_{q}(\mathbf{y}; \alpha, \theta \rvert \beta) = \exp(\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta))$ with respect to $\beta$ by:
\[\begin{aligned} \frac{\partial \mathcal{L}_q(\mathbf{y}; \alpha, \theta\rvert \beta)}{\partial \beta} &= \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta) \mathbf{Z}^\top \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\\ \frac{\partial^2 \mathcal{L}_q(\mathbf{y}; \alpha, \theta\rvert \beta)}{\partial \beta \partial \beta^\top} &= \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\mathbf{Z}^\top \left(\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top} \right) \mathbf{Z} \end{aligned} \label{eq:likelihood-derivs}\]First Order Derivations.
$$ \begin{aligned} \frac{\partial \mathcal{L}_q(\mathbf{y}; \alpha, \theta\rvert \beta)}{\partial \beta} &= \frac{\partial}{\partial \beta} \left[ \exp\left(\ell_{q}(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \right] \\ &= \exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \frac{\partial}{\partial \beta} \left[ \sum_{i = 1}^n \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) \right] \\ &= \exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \frac{\partial \eta}{\partial \beta} \frac{\partial}{\partial \eta} \left[ \sum_{i = 1}^n \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) \right] & \left( \eta \in \mathbb{R}^{n \times 1}, \beta \in \mathbb{R}^{q \times 1} \implies \frac{\partial \eta}{\partial \beta} \in \mathbb{R}^{n \times q} \right) \\ &= \exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right)\mathbf{Z}^\top \frac{\partial}{\partial \eta} \left[ \sum_{i = 1}^n\ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) \right] & \left(\eta = \mathbf{X} \alpha + \mathbf{Z} \beta \right) \\ &= \exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \sum_{i = 1}^n\frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} (\mathbf{Z}_{\cdot, i})^\top & \left( \frac{\partial \ell_q}{\partial \eta} \in \mathbb{R}^{n \times 1}, \mathbf{Z}^\top \in \mathbb{R}^{q \times n} \right) \end{aligned} \nonumber $$ In matrix notation: $$ \frac{\partial \mathcal{L}_q(\mathbf{y}; \alpha, \theta\rvert \beta)}{\partial \beta} = \exp(\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)) \mathbf{Z}^\top \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} = \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta) \mathbf{Z}^\top \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} \nonumber $$Second Order Derivations.
$$ \begin{aligned} \frac{\partial^2 \mathcal{L}_q(\mathbf{y}; \alpha, \theta\rvert \beta)}{\partial \beta \partial \beta^\top} &= \frac{\partial}{\partial \beta} \left[\exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \sum_{i = 1}^n\frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} (\mathbf{Z}_{\cdot, i})^\top \right] \\ &= \underbrace{\frac{\partial}{\partial \beta}[\exp(\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta))] \left( \sum_{i = 1}^n \frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} \mathbf{z}_i^\top\right)}_{(a)} + \underbrace{\exp(\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)) \frac{\partial}{\partial \beta} \left[ \sum_{i = 1}^n \frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} (\mathbf{Z}_{i, \cdot})^\top\right]}_{(b)} & \left( \text{product rule}\right) \\ &= \exp(\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)) \left[\left( \sum_{i = 1}^n \frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} (\mathbf{Z}_{i, \cdot})^\top \right) \left( \sum_{i = 1}^n \frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} (\mathbf{Z}_{i, \cdot}) \right) + \sum_{i = 1}^n \frac{\partial^2 \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i^2} (\mathbf{Z}_{i, \cdot})^\top \mathbf{Z}_{i, \cdot} \right] \end{aligned} \nonumber $$ We show $(a)$ and $(b)$ separately: $$ \begin{aligned} (a) &=\frac{\partial}{\partial \beta} \left[ \exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \right] \left[ \sum_{i = 1}^n \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) (\mathbf{Z}_{\cdot, i})^\top \right]^\top \\ &= \exp\left(\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \left(\sum_{i = 1}^n\frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} (\mathbf{Z}_{i, \cdot})^\top\right)\left(\sum_{i = 1}^n\frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} (\mathbf{Z}_{i, \cdot})^\top\right)^\top\\ &= \exp\left(\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \left(\sum_{i = 1}^n\frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} (\mathbf{Z}_{i, \cdot})^\top\right)\left(\sum_{i = 1}^n\frac{\partial \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i} \mathbf{Z}_{i, \cdot} \right) \\ (b) &= \exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right)\frac{\partial}{\partial \beta} \left[ \sum_{i = 1}^n \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) (\mathbf{Z}_{i, \cdot})^\top \right] \\ &= \exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \frac{\partial}{\partial \eta} \left[ \sum_{i = 1}^n \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) (\mathbf{Z}_{i, \cdot})^\top \right]\frac{\partial \eta}{\partial \beta} \\ &= \exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right) \frac{\partial}{\partial \eta} \left[ \sum_{i = 1}^n \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) (\mathbf{Z}_{i, \cdot})^\top \right] \mathbf{Z} & \left(\eta = \mathbf{X} \alpha + \mathbf{Z} \beta \right) \\ &= \exp\left( \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right)\left( \sum_{i = 1}^n \frac{\partial^2 \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta)}{\partial \eta_i^2} (\mathbf{Z}_{i, \cdot})^\top \mathbf{Z}_{i, \cdot} \right) & \left( \frac{\partial \ell_q}{\partial \eta} \in \mathbb{R}^{n \times 1}, \mathbf{Z}^\top \in \mathbb{R}^{q \times n}, (\mathbf{Z}_{i, \cdot})^\top 
\in \mathbb{R}^{q \times 1} \right) \\ \end{aligned} \nonumber $$ In matrix notation: $$ \begin{aligned} \frac{\partial^2 \mathcal{L}_q(\mathbf{y}; \alpha, \theta\rvert \beta)}{\partial \beta \partial \beta^\top} &= \exp(\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)) \left( \mathbf{Z}^\top \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} \mathbf{Z} + \mathbf{Z}^\top \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top} \mathbf{Z} \right)\\ &= \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta) \mathbf{Z}^\top \left( \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top}+ \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top}\right) \mathbf{Z} \end{aligned} \nonumber $$Let \(\mathcal{L}'_q(\beta_0)\) and \(\mathcal{L}''_q(\beta_0)\) denote the partial derivatives evaluated at $\beta = \beta_0$. The Taylor expansion of $\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)$ at $\beta_0$ is:
\[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta) = \left[ \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \beta_0} + (\beta - \beta_0)^\top \left[\frac{\partial \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \beta} \right]\bigg\rvert_{\beta = \beta_0} + \frac{1}{2}(\beta - \beta_0)^\top \left[ \frac{\partial^2 \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \beta \partial \beta^\top}\right] \bigg\rvert_{\beta = \beta_0} (\beta - \beta_0) + [R]\rvert_{\beta = \beta_0} \label{eq:taylor-expansion}\]where $R$ is a remainder term containing the higher order partial derivatives. Let $\bar{R}$ denote the remainder terms divided by \(\left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right]\rvert_{\beta = \beta_0}\). Factoring out $\left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\rvert_{\beta = \mathbf{0}}$ and letting $\beta_0 = \mathbf{0}$ (the random effects mean) we get:
\[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta) = \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \beta^\top \mathbf{Z}^\top \left[ \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert\beta)}{\partial \eta} \right]\bigg\rvert_{\beta = \mathbf{0}} + \frac{1}{2}\beta^\top \left(\mathbf{Z}^\top \left[ \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert\beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top}\right]\bigg\rvert_{\beta = \mathbf{0}} \mathbf{Z} \right)\beta + [\bar{R}]\rvert_{\beta = \mathbf{0}} \right) \label{eq:taylor-exp-0}\]We first find the first and second order derivatives of the log quasi-likelihood with respect to the linear predictor, $\eta$: $$ \frac{\partial \ell(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} = \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} = \mathbf{y} - \mu \in \mathbb{R}^{n \times 1}, \hspace{5mm} \frac{\partial^2 \ell(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top} = \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top} = \text{diag}(-\mu) \in \mathbb{R}^{n \times n} \label{eq:deriv-lin-pred} $$
Proof.
The first order derivative is straightforward: $$ \begin{aligned} \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta_{j}} &= \frac{\partial}{\partial \eta_{j}} \left[ \sum_{i = 1}^{n} \left[ \mathbf{y}_{i} \log(\mu_{i}) - \mathbf{y}_{i} \log(\mathbf{y}_i) - \mu_{i} + \mathbf{y}_i\right] \right] \\ &= \frac{\partial}{\partial \eta_{j}} \left[ \sum_{i = 1}^{n} \left[ \mathbf{y}_{i} \eta_i - \mathbf{y}_{i} \log(\mathbf{y}_i) - \exp(\eta_i) + \mathbf{y}_i\right] \right] & \left(\log(\mu_{i}) = \eta_{i} \right) \\ &= \mathbf{y}_j - \exp(\eta_j) \\ \implies \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} &= \mathbf{y} - \exp(\eta) \\ &= \mathbf{y} - \mu \end{aligned} \nonumber $$ The second order derivative is a bit more complicated because we are going to get an $n \times n$ matrix. $$ \begin{aligned} \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta_{j}^2} &= \frac{\partial }{\partial \eta_{j}} \left[ \mathbf{y}_{j} - \exp(\eta_{j}) \right] \\ &= - \exp(\eta_{j}) \\ &= -\mu_j \\ \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta_{j'} \partial \eta_{j}} &= \frac{\partial }{\partial \eta_{j'}} \left[ \mathbf{y}_{j} - \exp(\eta_{j})\right] \\ &= 0 \\ \implies \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top} &= \text{diag}\left(-\exp(\eta)\right) \\ &= \text{diag}\left(- \mu \right) \end{aligned} \nonumber $$ The result is the same had we differentiated the Poisson likelihood (this is easy to show).Proof.
$$ \begin{aligned} \left[ \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} &= \left[\exp\left(\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right)\right] \bigg\rvert_{\beta = \mathbf{0}}\\ &= \exp\left(\sum_{i = 1}^{n} \left[ \mathbf{y}_{i} \log(\mu^0_{i}) - \mathbf{y}_{i} \log(\mathbf{y}_{i}) - \mu^0_{i} + \mathbf{y}_{i} \right]\right) \\ &= \exp\left( \mathbf{y}^\top (\log(\mu^0) - \log(\mathbf{y})) + (\mathbf{y} - \mu^0)^\top \mathbf{1}_n \right) \end{aligned} $$
Proof.
First, let's evaluate the quadratic term. $$ \begin{aligned} \mathbf{Z}^\top \left[ \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} \right] \mathbf{Z} &= \mathbf{Z}^\top (\mathbf{y} - \mu)(\mathbf{y} - \mu)^\top \mathbf{Z} \\ &= \mathbf{Z}^\top \begin{bmatrix} (\mathbf{y}_1 - \mu_1)^2 & \dots & (\mathbf{y}_1 - \mu_1)(\mathbf{y}_n - \mu_n) \\ \vdots & \ddots & \vdots \\ (\mathbf{y}_n - \mu_n)(\mathbf{y}_1 - \mu_1) & \dots & (\mathbf{y}_n - \mu_n)^2 \end{bmatrix} \mathbf{Z} \\ &= \begin{bmatrix} (\mathbf{y}_1 - \mu_1) \sum_{j = 1}^m (\mathbf{y}_j^1 - \mu^1_j) & \dots & (\mathbf{y}_n - \mu_n) \sum_{j = 1}^m (\mathbf{y}_j^1 - \mu^1_j)\\ \vdots & \ddots & \vdots \\ (\mathbf{y}_1 - \mu_1) \sum_{j = 1}^m (\mathbf{y}_j^k - \mu^k_j)& \dots & (\mathbf{y}_n - \mu_n) \sum_{j = 1}^m (\mathbf{y}_j^k - \mu_j^k) \end{bmatrix} \mathbf{Z} \\ &= \begin{bmatrix} \left( \sum_{j = 1}^m (\mathbf{y}_j^1 - \mu^1_j) \right)^2& \dots & \left(\sum_{j = 1}^m (\mathbf{y}^1_j - \mu^1_j) \right)\left( \sum_{j = 1}^m (\mathbf{y}_j^k - \mu^k_j) \right) \\ \vdots & \ddots & \vdots \\ \left( \sum_{j = 1}^m (\mathbf{y}_j^k - \mu^k_j) \right) \left(\sum_{j = 1}^m (\mathbf{y}^1_j - \mu^1_j) \right) & \dots & \left( \sum_{j = 1}^m (\mathbf{y}_j^k - \mu^k_j) \right)^2 \end{bmatrix} \end{aligned} \nonumber $$ We also have: $$ \begin{aligned} \mathbf{Z}^\top\frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top} \mathbf{Z} &= \mathbf{Z}^\top \begin{bmatrix} -\mu_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & - \mu_n \end{bmatrix} \mathbf{Z} \\ &= \begin{bmatrix} - \sum_{j = 1}^m \mu_j^1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & -\sum_{j = 1}^m \mu_j^k \end{bmatrix} \end{aligned} \nonumber $$ Thus: $$ \mathbf{Z}^\top \left[ \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top}\right] \mathbf{Z} = \begin{bmatrix} \left( \sum_{j = 1}^m (\mathbf{y}_j^1 - \mu^1_j) \right)^2 - \sum_{j = 1}^m \mu_j^1 & \dots & \left(\sum_{j = 1}^m (\mathbf{y}^1_j - \mu^1_j) \right)\left( \sum_{j = 1}^m (\mathbf{y}_j^k - \mu^k_j) \right) \\ \vdots & \ddots & \vdots \\ \left( \sum_{j = 1}^m (\mathbf{y}_k^1 - \mu^k_j) \right) \left(\sum_{j = 1}^m (\mathbf{y}^1_j - \mu^1_j) \right) & \dots & \left( \sum_{j = 1}^m (\mathbf{y}_j^k - \mu^k_j) \right)^2 - \sum_{j = 1}^m \mu_j^k \end{bmatrix} \nonumber $$ Plugging this into Eq. 
\eqref{eq:taylor-exp-0}, we get: $$ \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta) = \left[ \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left(1 + \beta^\top \mathbf{Z}^\top (\mathbf{y} - \mu^0) + \frac{1}{2} \beta^\top \begin{bmatrix} \left( \sum_{j = 1}^m (\mathbf{y}_j^1 - (\mu^0)^1_j) \right)^2 - \sum_{j = 1}^m (\mu^0)_j^1 & \dots & \left(\sum_{j = 1}^m (\mathbf{y}^1_j - (\mu^0)^1_j) \right)\left( \sum_{j = 1}^m (\mathbf{y}_j^k - (\mu^0)^k_j) \right) \\ \vdots & \ddots & \vdots \\ \left( \sum_{j = 1}^m (\mathbf{y}_j^k - (\mu^0)^k_j) \right) \left(\sum_{j = 1}^m (\mathbf{y}^1_j - (\mu^0)^1_j) \right) & \dots & \left( \sum_{j = 1}^m (\mathbf{y}_j^k - (\mu^0)^k_j) \right)^2 - \sum_{j = 1}^m (\mu^0)_j^k \end{bmatrix} \beta + \left[ \bar{R} \right] \rvert_{\beta = \mathbf{0}} \right) $$
Marginal (Quasi-)Likelihood
Notice that Eq. \eqref{eq:uncondition-quasi-lik} (and also \eqref{eq:uncondition-log-lik}) is just the expectation of $\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)$ ($\mathcal{L}(\mathbf{y}; \alpha, \theta \rvert \beta)$, respectively) with respect to the probability measure specified by $\mathcal{F}$. We can therefore write the unconditional quasi-likelihood (or likelihood) with the expectation of our Taylor expansion:
\[\mathcal{L}_q(\mathbf{y}; \alpha, \theta) = \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \frac{1}{2}\text{tr} \left[\mathbf{Z}^\top \left[\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top}\right]\bigg\rvert_{\beta = \mathbf{0}} \mathbf{Z} D(\theta) \right] + o(\rvert \rvert \theta \rvert \rvert) \right) \label{eq:taylor-expansion-expectation}\]
Proof.
Let: $$ \bar{\mathcal{L}}''_q(\mathbf{0}) = \left(\mathbf{Z}^\top\left[ \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top} \right]\bigg\rvert_{\beta = \mathbf{0}}\mathbf{Z} \right) \nonumber $$ Then: $$ \begin{aligned} \mathcal{L}_q(\mathbf{y}; \alpha, \theta) &= \int \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta) f(\beta) d\beta \\ &= \mathbb{E}_{\beta} \left[ \mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right] \\ &= \mathbb{E}_{\beta} \left[ \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \beta^\top \left[\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\right]\bigg\rvert_{\beta = \mathbf{0}} + \frac{1}{2}\beta^\top \bar{\mathcal{L}}''_q(\mathbf{0}) \beta + \bar{R} \right)\right] \\ &= \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \mathbb{E}_{\beta} \left[ \beta^\top \left[\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\right]\bigg\rvert_{\beta = \mathbf{0}}\right] + \frac{1}{2}\mathbb{E}_{\beta} \left[ \beta^\top \bar{\mathcal{L}}''_q(\mathbf{0}) \beta \right] + \mathbb{E}_{\beta} \left[ \bar{R} \right]\right) \\ &= \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \frac{1}{2}\mathbb{E}_{\beta} \left[ \beta^\top \bar{\mathcal{L}}''_q(\mathbf{0}) \beta \right] + \mathbb{E}_{\beta} \left[ \bar{R} \right]\right) & \left(\mathbb{E}_\beta \left[ \beta^\top \right] = \mathbf{0} \right) \\ &= \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \frac{1}{2}\mathbb{E}_{\beta} \left[ \text{tr}\left[ \beta^\top \bar{\mathcal{L}}''_q(\mathbf{0}) \beta \right] \right] + \mathbb{E}_{\beta} \left[ \bar{R} \right]\right) & \left( \beta^\top \bar{\mathcal{L}}''_q(\mathbf{0}) \beta \in \mathbb{R} \right) \\ &= \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \frac{1}{2}\mathbb{E}_{\beta} \left[ \text{tr}\left[ \bar{\mathcal{L}}''_q(\mathbf{0}) \beta \beta^\top \right] \right] + \mathbb{E}_{\beta} \left[ \bar{R} \right]\right) & \left( \text{tr}\left[ \cdot \right] \text{ commutes}\right) \\ &= \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \frac{1}{2}\text{tr}\left[ \mathbb{E}_{\beta} \left[ \bar{\mathcal{L}}''_q(\mathbf{0}) \beta \beta^\top \right] \right] + \mathbb{E}_{\beta} \left[ \bar{R} \right]\right) & \left( \text{tr}\left[ \cdot \right] \text{ is linear} \right) \\ &= \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \frac{1}{2} \text{tr}\left[ \bar{\mathcal{L}}''_q(\mathbf{0}) \text{Cov}(\beta) \right] + \mathbb{E}_{\beta} \left[ \bar{R} \right]\right) & \left( \mathbb{E}_\beta \left[ \beta \right] = \mathbf{0} \implies \mathbb{E}_\beta \left[ \beta \beta^\top \right] = \text{Cov}(\beta)\right) \\ &= \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}}\left( 1 + \frac{1}{2}\text{tr}\left[ \bar{\mathcal{L}}''_q(\mathbf{0}) D(\theta) \right] + \mathbb{E}_{\beta} \left[ \bar{R} \right]\right) & \left( \text{Cov}(\beta) = D(\theta) \right) 
\\ &= \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \frac{1}{2} \text{tr}\left[ \bar{\mathcal{L}}''_q(\mathbf{0}) D(\theta) \right] + o(\rvert \rvert \theta \rvert\rvert)\right) & \left( \text{higher order assumption} \right) \end{aligned} \nonumber $$ Plugging in for $\mathcal{L}''_q(\mathbf{0})$ we found in Eq. \eqref{eq:likelihood-derivs}: $$ \begin{aligned} \mathcal{L}_q(\mathbf{y}; \alpha, \theta) &= \left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \frac{1}{2} \text{tr}\left[ \bar{\mathcal{L}}''_q(\mathbf{0}) D(\theta) \right] + o(\rvert \rvert \theta \rvert\rvert)\right) \\ &=\left[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} \left( 1 + \frac{1}{2}\text{tr} \left[\mathbf{Z}^\top \left[\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top}\right]\bigg\rvert_{\beta = \mathbf{0}} \mathbf{Z} D(\theta) \right] + o(\rvert \rvert \theta \rvert \rvert) \right) \end{aligned} \nonumber $$Taking the natural logarithm of both sides gives us the marginal log quasi-likelihood:
\[\begin{aligned} \ell_q(\mathbf{y}; \alpha, \theta) &= \log(\mathcal{L}_q(\mathbf{y}; \alpha, \theta)) \\ &= \left[\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} + \log\left(1 + \frac{1}{2} \text{tr} \left[ \mathbf{Z}^\top \left[\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top}\right]\bigg\rvert_{\beta = \mathbf{0}} \mathbf{Z} D(\theta) \right] + o(\rvert \rvert \theta \rvert \rvert)\right) \\ &\approx \left[\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}}+ \frac{1}{2} \text{tr} \left[ \mathbf{Z}^\top \left[\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top}\right]\bigg\rvert_{\beta = \mathbf{0}} \mathbf{Z} D(\theta) \right] \\ \end{aligned} \label{eq:log-quasi-lik}\]The approximation follows from the fact that $\log(1 + x) \approx x$ for small $x$, and we drop the remainder term since we assume it is small.
Taking the natural logarithm of $[\mathcal{L}_q(\mathbf{y}; \alpha, \theta \rvert \beta)]\rvert_{\beta = \mathbf{0}}$ (equivalently, evaluating Eq. \eqref{eq:conditional-log-quasi-likelihood} at $\beta = \mathbf{0}$) gives us: $$ \left[\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]\bigg\rvert_{\beta = \mathbf{0}} = \mathbf{y}^\top(\log(\mu^0) - \log(\mathbf{y})) + (\mathbf{y} - \mu^0)^\top \mathbf{1} $$ We also already specified $D(\theta) = \theta \mathbb{I}_{k \times k}$, so we can write $D(\theta) = \theta\Sigma$ with $\Sigma = \mathbb{I}_{k \times k}$. Using the equalities we found in Eq. \eqref{eq:deriv-lin-pred}, the approximate marginal log quasi-likelihood is given by: $$ \begin{aligned} \ell_q(\mathbf{y}; \alpha, \theta) &\approx \mathbf{y}^\top(\log(\mu^0) - \log(\mathbf{y})) + (\mathbf{y} - \mu^0)^\top \mathbf{1} + \frac{1}{2} \text{tr} \left[ \mathbf{Z}^\top \left((\mathbf{y} - \mu^0)(\mathbf{y} - \mu^0)^\top + \text{diag}(- \mu^0)\right)\mathbf{Z} \left( \theta \Sigma \right)\right] \\ &= \mathbf{y}^\top(\log(\mu^0) - \log(\mathbf{y})) + (\mathbf{y} - \mu^0)^\top \mathbf{1} + \frac{\theta}{2} \text{tr} \left[ \mathbf{Z}^\top \left((\mathbf{y} - \mu^0)(\mathbf{y} - \mu^0)^\top + \text{diag}(- \mu^0)\right)\mathbf{Z}\right] \end{aligned} \label{eq:log-quasi-lik-ex} $$
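Continuing the running example, here is a small sketch that evaluates the approximate marginal log quasi-likelihood in Eq. \eqref{eq:log-quasi-lik-ex} for given data and parameter values. The function name is mine, and `xlogy` is only there to handle zero counts gracefully.

```python
import numpy as np
from scipy.special import xlogy

def approx_marginal_log_quasi_lik(y, X, Z, alpha, theta):
    # conditional mean under beta = 0
    mu0 = np.exp(X @ np.atleast_1d(alpha))
    resid = y - mu0
    # [l_q]|_{beta = 0} = sum_i [ y_i log(mu0_i) - y_i log(y_i) - mu0_i + y_i ]
    # (xlogy(0, 0) = 0, so zero counts contribute correctly)
    base = np.sum(xlogy(y, mu0) - xlogy(y, y) - mu0 + y)
    # correction: (theta / 2) * tr[ Z' ((y - mu0)(y - mu0)' - diag(mu0)) Z ]
    middle = Z.T @ (np.outer(resid, resid) - np.diag(mu0)) @ Z
    return base + 0.5 * theta * np.trace(middle)

# with y, X, Z from the simulation sketch earlier:
# print(approx_marginal_log_quasi_lik(y, X, Z, alpha=1.0, theta=0.1))
```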
Inference
Okay, so now we have two different approximations of the likelihood or quasi-likelihood, which we can use for inference. We’ll continue with the Taylor approximation. We should note that we further assume that:
\[\frac{\partial g^{-1}(\eta_i)}{\partial \eta_i} = \frac{\partial \mu_i}{\partial \eta_i} = \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-1} \label{eq:deriv-assumption}\]which holds if $g(\cdot)$ is strictly monotone and differentiable with a derivative that is never $0$. This makes $g(\cdot)$ invertible and lets us apply the inverse function theorem. These conditions hold for many standard choices of link function (e.g. $\log$, $\text{logit}$, the identity).
Since the score function involves a derivative, there is a separate one for each parameter. Because we assume that the random effects are Gaussian with mean zero and covariance matrix parametrized by $\theta$, our main parameters are just $\alpha$, $\theta$, and $\phi$ (the $\omega_i$ are prior weights).
Before we derive the scores, let’s first define the following terms:
\[\delta_i = \frac{1}{\frac{\partial \eta_i}{\partial \mu_i}}, \hspace{5mm} \rho_i = \frac{\frac{\partial v(\mu_i)}{\partial \mu_i} \frac{\partial \eta_i}{\partial \mu_i} + v(\mu_i) \frac{\partial^2 \eta_i}{\partial \mu_i^2}}{v^2(\mu_i) \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^3}, \hspace{5mm} \gamma_i = \frac{1}{v(\mu_i) \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^2}, \hspace{5mm} \gamma_{0, i} = \gamma_i + \rho_i(\mathbf{y}_i - \mu_i) \label{eq:intermed-terms}\]where we recall that we let $v(\mu_i) = d(\phi, \omega_i) V(\mu_i) = \frac{\phi}{\omega_i}V(\mu_i)$. We’ll also define the following matrices:
\[\Delta = \begin{bmatrix} \delta_1 & \dots & 0 \\ \vdots & \ddots &\vdots \\ 0 & \dots & \delta_n \end{bmatrix}, \hspace{5mm} \Gamma = \begin{bmatrix} \gamma_1 & \dots & 0 \\ \vdots & \ddots &\vdots \\ 0 & \dots & \gamma_n \end{bmatrix}, \hspace{5mm} \Gamma_0 = \begin{bmatrix} \gamma_{0, 1} & \dots & 0 \\ \vdots & \ddots &\vdots \\ 0 & \dots & \gamma_{0, n} \end{bmatrix} \label{eq:intermed-mats}\]To save space, we’ll also define:
\[\mathbf{C} = \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top}\]and we’ll let $\mathbf{C}^0 = \mathbf{C} \rvert_{\beta = \mathbf{0}}$ as before.
To streamline our analysis later, let’s do some initial derivations. We have:
\[\begin{aligned} \frac{\partial}{\partial \eta_i} \left[ \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-1} \right] &= -\frac{\partial^2 \eta_i}{\partial \mu_i^2}\left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-3} \\ \frac{\partial}{\partial \mu_i} \left[ \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right] &= \frac{\mathbf{y}_i - \mu_i}{v(\mu_i)} \\ \frac{\partial}{\partial \eta_i}\left[ \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right] &= \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-1} \frac{\mathbf{y}_i - \mu_i}{v(\mu_i)} = \gamma_i \delta^{-1}_i (\mathbf{y}_i - \mu_i) \\ -\frac{\partial^2}{\partial \eta_i^2}[ \ell_q(\mathbf{y}; \alpha, \theta\rvert \beta)] &= - \frac{\partial}{\partial \eta_i} \left[ \frac{\mathbf{y}_i - \mu_i}{v(\mu_i) \frac{\partial \eta_i}{\partial \mu_i}} \right] = \gamma_{0, i} \end{aligned} \label{eq:intermed-derivs}\]
Proof.
$$ \begin{aligned} \frac{\partial}{\partial \eta_i} \left[ \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-1}\right] &= - \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-2} \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial}{\partial \mu_i}\left[ \frac{\partial \eta_i}{\partial \mu_i} \right] = -\frac{\partial^2 \eta_i}{\partial \mu_i^2}\left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-2} \\ \frac{\partial}{\partial \mu_i} \left[ \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right] &= \frac{\partial}{\partial \mu_i} \left[ \int_{y_i}^{\mu_i} \frac{\omega_i(\mathbf{y}_i - u)}{\phi V(u)} du \right] = \frac{\omega_i(\mathbf{y}_i - \mu_i)}{\phi V(\mu_i)} = \frac{\mathbf{y}_i - \mu_i}{v(\mu_i)} \\ \frac{\partial}{\partial \eta_i}\left[ \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right] &= \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial}{\partial \mu_i} \left[\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right] = \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-1} \frac{\mathbf{y}_i - \mu_i}{v(\mu_i)} = \frac{\frac{\partial \eta_i}{\partial \mu_i}}{v(\mu_i) \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^2} (\mathbf{y}_i - \mu_i) = \gamma_i \delta^{-1}_i (\mathbf{y}_i - \mu_i) \nonumber \end{aligned} $$ To show the last equality, we note that: $$ \begin{aligned} \frac{\partial^2 \mu_i}{\partial \eta_i^2} &= \frac{\partial \mu_i}{\partial \eta_i} \frac{\partial}{\partial \mu_i}\left[ \frac{\partial \mu_i}{\partial \eta_i}\right] = - \frac{\partial \mu_i}{\partial \eta_i} \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^2 \frac{\partial^2 \eta_i}{\partial \mu_i^2} = - \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-3} \frac{\partial^2 \eta_i}{\partial \mu_i^2} \hspace{5mm} \implies \hspace{5mm} \frac{\partial^2 \eta_i}{\partial \mu_i^2} = -\frac{\partial^2 \mu_i}{\partial \eta_i^2} \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{3} \end{aligned} \label{eq:temp} $$ Then: $$ \begin{aligned} \frac{\partial^2}{\partial \eta_i^2}[ \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)] &= \frac{\partial}{\partial \eta_i} \left[ \frac{\partial \mu_i}{\partial \eta_i} \right] \left(\frac{\mathbf{y}_i - \mu_i}{v(\mu_i)} \right)+ \left(\frac{\partial \mu_i}{\partial \eta_i} \right)\frac{\partial \mu_i}{\partial \eta_i} \frac{\partial}{\partial \mu_i} \left[ \frac{\mathbf{y}_i - \mu_i}{v(\mu_i)} \right] & \left(\text{chain rule}\right) \\ &= \frac{\partial^2 \mu_i}{\partial \eta_i^2} \left(\frac{\mathbf{y}_i - \mu_i}{v(\mu_i)} \right) + \left( \frac{\partial \mu_i}{\partial \eta_i}\right)^2 \left(\frac{-v(\mu_i) - \frac{\partial v(\mu_i)}{\partial \mu_i}(\mathbf{y}_i - \mu_i)}{v^2(\mu_i)}\right) \\ &= - \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{-3} \frac{\partial^2 \eta_i}{\partial \mu_i^2} \left( \frac{\mathbf{y}_i - \mu_i}{v(\mu_i)}\right) + \left( \frac{\partial \mu_i}{\partial \eta_i}\right)^2 \left(\frac{-v(\mu_i) - \frac{\partial v(\mu_i)}{\partial \mu_i}(\mathbf{y}_i - \mu_i)}{v^2(\mu_i)}\right) & \left(\text{Eq. } \eqref{eq:temp}\right) \\ &= - \frac{\frac{\partial^2 \eta_i}{\partial \mu_i^2} \left( \frac{\mathbf{y}_i - \mu_i}{v(\mu_i)}\right)}{\left(\frac{\partial \eta_i}{\partial \mu_i}\right)^{3}} + \frac{-v(\mu_i) - \frac{\partial v(\mu_i)}{\partial \mu_i}(\mathbf{y}_i - \mu_i)}{v^2(\mu_i) \left( \frac{\partial \eta_i}{\partial \mu_i}\right)^2 } & \left( \frac{\partial \mu_i}{\partial \eta_i} = \left(\frac{\partial \eta_i}{\partial \mu_i} \right)^{-1} \text{ by Eq. 
\eqref{eq:deriv-assumption}}\right) \\ &= - \frac{1}{v(\mu_i) \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^2} - (\mathbf{y}_i - \mu_i)\left(\frac{\frac{\partial^2 \eta_i}{\partial \mu_i^2}}{v(\mu_i) \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^3} + \frac{\frac{\partial v(\mu_i)}{\partial \mu_i}}{v^2(\mu_i)\left(\frac{\partial \eta_i}{\partial \mu_i}\right)^2}\right) \\ &= - \gamma_i - (\mathbf{y}_i - \mu_i)\left(\frac{v(\mu_i) \frac{\partial^2 \eta_i}{\partial \mu_i^2}}{v^2(\mu_i) \left(\frac{\partial \eta_i}{\partial \mu_i}\right)^3} + \frac{\frac{\partial v(\mu_i)}{\partial \mu_i} \frac{\partial \eta_i}{\partial \mu_i}}{v^2(\mu_i)\left(\frac{\partial \eta_i}{\partial \mu_i}\right)^3}\right) \\ &= - \gamma_i - \rho_i (\mathbf{y}_i - \mu_i) \\ &= - \gamma_{0, i} \end{aligned} \nonumber $$If we consider all components of $\eta$ at the same time, the last two equalities become matrices:
\[\frac{\partial}{\partial \eta}[\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)] = \Gamma \Delta^{-1}(\mathbf{y} - \mu), \hspace{10mm} \frac{\partial^2}{\partial \eta \partial \eta^\top}[\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)] = -\Gamma_0\]For our example, we have: $$ \delta_i = \mu_i, \hspace{5mm} \rho_i = 0, \hspace{5mm} \gamma_i = \mu_i, \hspace{5mm} \gamma_{0,i} = \mu_i $$
Proof.
For our example, we use $\eta_i = \log(\mu_i)$ and $v(\mu_i) = \mu_i$. Thus: $$ \frac{\partial \eta_i}{\partial \mu_i} = \frac{1}{\mu_i}, \hspace{5mm} \frac{\partial^2 \eta_i}{\partial \mu_i^2} = -\frac{1}{\mu_i^2}, \hspace{5mm} \frac{\partial v(\mu_i)}{\partial \mu_i} = 1 \nonumber $$ Plugging these into Eq. \eqref{eq:intermed-terms} gives the desired result.
Scores
Common choices for parameter estimation and hypothesis testing will require us to take the gradient of the likelihood with respect to the parameters of interest. That is, we need to find the score function (see my likelihood post).
Score For $\alpha$
The score for $\alpha$ has elements:
\[U_{\alpha_j} = \frac{\partial}{\partial \alpha_j} [\ell_q(\mathbf{y}; \alpha, \theta) ] \approx \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) + \frac{1}{2} \text{tr}\left[ \frac{\partial \mathbf{C}^0}{\partial \alpha_j} \mathbf{Z} D(\theta) \mathbf{Z}^\top \right] \label{eq:alpha-score}\]Proof.
$$ \begin{aligned} U_{\alpha_j} &= \frac{\partial}{\partial \alpha_j} \left[ \ell_q(\mathbf{y}; \alpha, \theta) \right] \\ &\approx \frac{\partial}{\partial \alpha_j} \left[ \left[\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]_{\beta = \mathbf{0}} + \frac{1}{2} \text{tr} \left[ \mathbf{Z}^\top \left[\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta} \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top}\right]_{\beta = \mathbf{0}} \mathbf{Z} D(\theta) \right] \right] & \left(\text{Eq. } \eqref{eq:log-quasi-lik} \right) \\ &= \frac{\partial}{\partial \alpha_j} \left[ \left[ \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta) \right]\bigg\rvert_{\beta = \mathbf{0}} \right] + \frac{1}{2} \frac{\partial}{\partial \alpha_j} \left[\text{tr}\left[ \mathbf{Z}^\top \mathbf{C}^0 \mathbf{Z} D(\theta) \right] \right] \\ &= \frac{\partial}{\partial \alpha_j} \left[ \left[ \sum_{i = 1}^n \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) \right]\bigg\rvert_{\beta= \mathbf{0}}\right] + \frac{1}{2} \frac{\partial}{\partial \alpha_j} \text{tr} \left[ \mathbf{C}^0 \mathbf{Z} D(\theta) \mathbf{Z}^\top \right] & \left(\text{tr}\left[ \cdot \right] \text{ is linear} \right) \\ &= \frac{\partial}{\partial \alpha_j} \left[ \left[\sum_{i = 1}^n \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) \right]\bigg\rvert_{\beta= \mathbf{0}} \right] + \frac{1}{2} \text{tr}\left[ \frac{\partial \mathbf{C}^0}{\partial \alpha_j} \mathbf{Z} D(\theta) \mathbf{Z}^\top \right] & \left(\text{by } (a)\right) \\ &= \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) + \frac{1}{2} \text{tr}\left[ \frac{\partial \mathbf{C}^0}{\partial \alpha_j} \mathbf{Z} D(\theta) \mathbf{Z}^\top \right] & \left(\text{by } (b)\right) \end{aligned} \nonumber $$ $$ \begin{aligned} (a): \frac{\partial}{\partial \alpha_j} \left[ \text{tr} \left[ \mathbf{C}^0 \mathbf{Z} D(\theta) \mathbf{Z}^\top \right] \right] &= \sum_{i = 1}^n \sum_{i' = 1}^n \frac{\partial}{\partial \alpha_j} \left[ \mathbf{C}^0_{i, i'} (\mathbf{Z} D(\theta) \mathbf{Z}^\top)_{i, i'} \right] = \text{tr}\left[ \frac{\partial \mathbf{C}^0}{\partial \alpha_j} \mathbf{Z} D(\theta) \mathbf{Z}^\top \right] \\ (b):\frac{\partial}{\partial \alpha_j} \left[ \left[\ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right] \bigg\rvert_{\beta= \mathbf{0}} \right] &= \sum_{i = 1}^n \frac{\partial}{\partial \alpha_j} \left[ \left[\ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) \right] \bigg\rvert_{\beta= \mathbf{0}} \right]\\ &= \sum_{i = 1}^n \frac{\partial \eta_i}{\partial \alpha_j} \frac{\partial}{\partial \eta_i} \left[ \left[ \ell_q(\mathbf{y}_i; \alpha, \theta \rvert \beta) \right]\bigg\rvert_{\beta= \mathbf{0}} \right] \\ &= \sum_{i = 1}^n \mathbf{x}_{i,j} \gamma^0_i (\delta^0_i)^{-1}\left(\mathbf{y}_i - \mu^0_i\right) & \left( \eta_{i,t} = \mathbf{x}_i^\top \alpha + \mathbf{z}_i^\top \beta_t \right) \\ &= \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) \end{aligned} \nonumber $$We have the following identities: $$ \frac{\partial \mu^0}{\partial \alpha} = \mu^0, \hspace{5mm} \frac{\partial }{\partial \alpha} \left[ (\mu^0) (\mu^0)^\top \right] = 2(\mu^0) (\mu^0)^\top $$
Proof.
I've found that the way to avoid dimension issues is to just write out the matrix. $$ \begin{aligned} \frac{\partial \mu^0}{\partial \alpha} = \frac{\partial}{\partial \alpha} \left[ \exp\left(\mathbf{X} \alpha \right) \right] = \frac{\partial}{\partial \alpha} \left[ \begin{bmatrix} \exp\left( \mathbf{X}_1 \alpha \right) \\ \vdots \\ \exp\left( \mathbf{X}_n \alpha \right) \\ \end{bmatrix} \right] = \begin{bmatrix} \mathbf{X}_1 \exp\left( \mathbf{X}_1 \alpha \right) \\ \vdots \\ \mathbf{X}_n \exp\left( \mathbf{X}_n \alpha \right) \\ \end{bmatrix} = \text{diag}(\mathbf{X}) \exp(\mathbf{X} \alpha) = \text{diag}(\mathbf{X}) \mu^0 \end{aligned} \nonumber $$ Since $\mathbf{X} = \mathbf{1}_n$, we have $\frac{\partial \mu^0}{\partial \alpha} = \mu^0$.We do the same thing for $\mu \mu^\top$: $$ \begin{aligned} \frac{\partial }{\partial \alpha} \left[ \mu \mu^\top \right] &= \frac{\partial}{\partial \alpha} \left[ \begin{bmatrix} \exp\left(\mathbf{X}_{1} \alpha\right) \\ \vdots \\ \exp\left(\mathbf{X}_{n} \alpha\right) \end{bmatrix} \begin{bmatrix} \exp\left(\mathbf{X}_{1} \alpha\right) \\ \vdots \\ \exp\left(\mathbf{X}_{n} \alpha\right) \end{bmatrix}^\top \right] \\ &= \frac{\partial}{\partial \alpha} \left[ \begin{bmatrix} \left[\exp\left(\mathbf{X}_{1} \alpha\right) \right]^2 & \dots & \exp\left( \mathbf{X}_{1} \alpha\right) \exp\left( \mathbf{X}_{n} \alpha \right)\\ \vdots & \ddots & \vdots \\ \exp\left( \mathbf{X}_{n} \alpha\right) \exp\left( \mathbf{X}_{1} \alpha \right) & \dots & \left[\exp\left(\mathbf{X}_{n} \alpha\right)\right]^2 \\ \end{bmatrix} \right] \\ &= \begin{bmatrix} 2\mathbf{X}_1 \left[ \exp\left(\mathbf{X}_{1} \alpha\right)\right]^2 & \dots & \exp\left( \mathbf{X}_{1} \alpha\right) \exp\left( \mathbf{X}_{n} \alpha \right) (\mathbf{X}_1 + \mathbf{X}_n )\\ \vdots & \ddots & \vdots \\ \exp\left( \mathbf{X}_{n} \alpha\right) \exp\left( \mathbf{X}_{1} \alpha \right) (\mathbf{X}_n + \mathbf{X}_1 )& \dots & 2 \mathbf{X}_n \left[ \exp\left(\mathbf{X}_{n} \alpha\right)\right]^2 \\ \end{bmatrix} \\ &= \exp(\mathbf{X}\alpha) \left[ \exp(\mathbf{X} \alpha) \right]^\top \otimes \left( \mathbf{X} \mathbf{1}_n^\top + \mathbf{1}_n \mathbf{X}^\top \right) \\ &= (\mu^0) (\mu^0)^\top \otimes \left( \mathbf{X} \mathbf{1}_n^\top + \mathbf{1}_n \mathbf{X}^\top \right) \end{aligned} \nonumber $$ And letting $\mathbf{X} = \mathbf{1}_n$, we have $\frac{\partial }{\partial \alpha} \left[ \mu \mu^\top \right] = 2 (\mu^0)(\mu^0)^\top$.
Proof.
$$ \begin{aligned} \frac{\partial \mathbf{C}^0}{\partial \alpha} &= \frac{\partial}{\partial \alpha} \left[ (\mathbf{y} - \mu^0)(\mathbf{y} - \mu^0)^\top + \text{diag}(-\mu^0) \right] \\ &= \frac{\partial}{\partial \alpha} \left[ \mathbf{y}\mathbf{y}^\top - \mathbf{y} (\mu^0)^\top -\mu^0 \mathbf{y}^\top + \mu^0(\mu^0)^\top - \text{diag}(\mu^0) \right] \\ &= -\mathbf{y} \left(\mu^0 \right)^\top - \mu^0 \mathbf{y}^\top + (\mu^0) (\mu^0)^\top - \text{diag}(\mu^0) \end{aligned} \nonumber $$Score For $\theta$
The score for $\theta$ has elements:
\[U_{\theta_j} = \frac{\partial}{\partial \theta_j} [\ell_q(\mathbf{y}; \alpha, \theta) ] \approx \frac{1}{2} \text{tr} \left[ \mathbf{C}^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right] = \frac{1}{2} \left( (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0) - \text{tr} \left[ \Gamma^0_0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right] \right)\]Proof.
$$ \begin{aligned} U_{\theta_j} &= \frac{\partial}{\partial \theta_j} [\ell_q(\mathbf{y}; \alpha, \theta) ] \\ &\approx \frac{\partial}{\partial \theta_j} \left[ \left[ \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)\right]_{\beta = \mathbf{0}} + \frac{1}{2} \text{tr} \left[ \mathbf{Z}^\top \mathbf{C}^0 \mathbf{Z} D(\theta) \right] \right] & \left(\text{Eq. } \eqref{eq:log-quasi-lik} \right) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_j} \left[ \text{tr} \left[ \mathbf{Z}^\top \mathbf{C}^0 \mathbf{Z} D(\theta) \right] \right] & \left(\text{no } \theta \text{ in first term} \right) \\ &= \frac{1}{2} \frac{\partial}{\partial \theta_j} \left[ \text{tr} \left[ \mathbf{C}^0 \mathbf{Z} D(\theta) \mathbf{Z}^\top \right] \right] \\ &= \frac{1}{2} \sum_{i = 1}^n \sum_{i' = 1}^n \frac{\partial}{\partial \theta_j} \left[ \mathbf{C}^0_{i, i'} \left(\mathbf{Z} D(\theta) \mathbf{Z}^\top \right)_{i, i'}\right] \\ &= \frac{1}{2} \sum_{i = 1}^n \sum_{i' = 1}^n \mathbf{C}^0_{i, i'} \frac{\partial}{\partial \theta_j} \left[ \left(\mathbf{Z} D(\theta) \mathbf{Z}^\top \right)_{i, i'}\right] & \left(\text{no } \theta \text{ in } \mathbf{C}^0 \right) \\ &= \frac{1}{2} \sum_{i = 1}^n \sum_{i' = 1}^n \mathbf{C}^0_{i, i'} \left(\mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top \right)_{i, i'} \\ &= \frac{1}{2} \text{tr} \left[ \mathbf{C}^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right] \end{aligned} \nonumber $$ We can rewrite the score in terms of the quantities we defined in Eqs. \eqref{eq:intermed-terms} and \eqref{eq:intermed-mats}: $$ \begin{aligned} U_{\theta_j} &= \frac{1}{2} \text{tr} \left[ \mathbf{C}^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right] \\ &= \frac{1}{2} \text{tr} \left[ \left[ \frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta}\frac{\partial \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta^\top} + \frac{\partial^2 \ell_q(\mathbf{y}; \alpha, \theta \rvert \beta)}{\partial \eta \partial \eta^\top} \right]\bigg\rvert_{\beta = \mathbf{0}} \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right] \\ &= \frac{1}{2} \text{tr} \left[ \left(\Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0)(\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 - \Gamma^0_0\right)\mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right] \\ &= \frac{1}{2} \left( \text{tr} \left[ \Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0)(\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top \right] - \text{tr} \left[ \Gamma_0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right] \right)\\ &= \frac{1}{2} \left( \text{tr} \left[(\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0)\right] - \text{tr} \left[ \Gamma^0_0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right] \right)\\ &= \frac{1}{2} \left( (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0) - \text{tr} \left[ \Gamma^0_0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right] \right) & \left(\text{first term is a scalar}\right)\\ \end{aligned} \nonumber $$ where the first term in the parentheses and $\Gamma_0$ are all evaluated at $\beta = 
\mathbf{0}$.We have the following: $$ \begin{aligned} \frac{\partial D(\theta)}{\partial \theta} = \frac{\partial}{\partial \theta} \left[ \theta \Sigma\right] = \Sigma \end{aligned} $$ The score is thus: $$ \begin{aligned} U_{\theta} &\approx \frac{1}{2}\left( (\mathbf{y}- \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0) - \text{tr}\left[ \Gamma^0_0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta} \mathbf{Z}^\top\right]\right) \\ &= \frac{1}{2} \left( (\mathbf{y} - \mu^0)^\top \mathbf{Z} \mathbf{Z}^\top (\mathbf{y} - \mu^0) - \text{tr}\left[ \text{diag}(\mu^0) \mathbf{Z} \mathbf{Z}^\top \right] \right) \end{aligned} \label{eq:theta-score-ex} $$
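A minimal sketch of this score for the running example, evaluated under the null with the intercept-only Poisson MLE $\hat{\alpha} = \log \bar{\mathbf{y}}$ plugged in (the function name is mine; `y` and `Z` are assumed to come from a simulation like the earlier sketch):

```python
import numpy as np

def score_theta_null(y, Z):
    # intercept-only Poisson null MLE: alpha_hat = log(ybar), so mu0_i = ybar for all i
    mu0 = np.full(y.shape, y.mean(), dtype=float)
    resid = y - mu0
    quad_term = resid @ Z @ Z.T @ resid               # (y - mu0)' Z Z' (y - mu0)
    trace_term = np.trace(np.diag(mu0) @ Z @ Z.T)     # tr[ diag(mu0) Z Z' ]
    return 0.5 * (quad_term - trace_term)

# with y and Z from the simulation sketch above:
# print(score_theta_null(y, Z))
```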
Score For $\phi$
Technically there is also the scale parameter, but we will just assume it is known for now 🤫.
Information
A Bit Of Discussion
We can think of $U_\theta$ and $U_\alpha$ as components of one big score:
\[U = \begin{bmatrix} U_\theta \\ U_\alpha \end{bmatrix} \label{eq:full-score}\]The variance of the score evaluated at the true value of the parameter is the Fisher information:
\[\mathcal{I} = \mathbb{E}\left[ (U - \mathbb{E}[U])(U - \mathbb{E}[U])^\top \right] = \mathbb{E}[UU^\top] \label{eq:full-info}\]where the second equality follows from the fact that the score has expectation zero when the null hypothesis is true (i.e. the null hypothesis value for the parameter is correct).
To perform a score test for just $\theta$, we need the variance of just $U_\theta$ (the Fisher information). We can partition $\mathcal{I}$ in the following way:
\[\mathcal{I} = \begin{bmatrix} \mathcal{I}_{\theta, \theta} & \mathcal{I}_{\theta, \alpha} \\ \mathcal{I}_{\alpha, \theta} & \mathcal{I}_{\alpha, \alpha} \end{bmatrix} \label{eq:partition-info}\]where:
\[\mathcal{I}_{\theta, \theta} = \mathbb{E}\left[ \frac{\partial \ell(\mathbf{y}; \alpha, \theta)}{\partial \theta} \frac{\partial \ell(\mathbf{y}; \alpha, \theta)}{\partial \theta^\top}\right], \hspace{5mm} \mathcal{I}_{\alpha, \theta} = \mathbb{E}\left[ \frac{\partial \ell(\mathbf{y}; \alpha, \theta)}{\partial \alpha} \frac{\partial \ell(\mathbf{y}; \alpha, \theta)}{\partial \theta^\top}\right], \hspace{5mm} \mathcal{I}_{\alpha, \alpha} = \mathbb{E}\left[ \frac{\partial \ell(\mathbf{y}; \alpha, \theta)}{\partial \alpha} \frac{\partial \ell(\mathbf{y}; \alpha, \theta)}{\partial \alpha^\top}\right] \label{eq:info-blocks}\]The parameter of interest for our test is $\theta$, so we only really care about the block of \(\mathcal{I}^{-1}\) corresponding to $\theta$. However, this block is not equal to the inverse of the corresponding block in $\mathcal{I}$ due to the presence and estimation of $\alpha$. That is:
\[\left(\mathcal{I}^{-1}\right)_{\theta, \theta} \neq \left(\mathcal{I}_{\theta, \theta} \right)^{-1} \label{eq:tricky-neq}\]To overcome this, we can use the Schur complement formula for inverting a partitioned matrix. For a general block matrix, $M$, we have:
\[M = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \implies M^{-1} = \begin{bmatrix} (A - BD^{-1} C)^{-1} & -(A - B D^{-1} C)^{-1} BD^{-1} \\ -D^{-1} C (A - BD^{-1}C)^{-1} & D^{-1} + D^{-1}C(A - B D^{-1} C)^{-1} B D^{-1} \end{bmatrix} \nonumber\]Thus, the (relevant parts of the) Fisher information matrix can be computed as:
\[\left(\mathcal{I}^{-1}\right)_{\theta, \theta} = \left( \mathcal{I}_{\theta, \theta} - \mathcal{I}_{\theta, \alpha} \mathcal{I}^{-1}_{\alpha, \alpha} \mathcal{I}_{\alpha, \theta} \right)^{-1} = \tilde{\mathcal{I}}_{\theta, \theta}^{-1} \label{eq:relevant-fisher-info}\]where we define the efficient information for $\theta$ as $\tilde{\mathcal{I}}_{\theta, \theta} = \mathcal{I}_{\theta, \theta} - \mathcal{I}_{\theta, \alpha} \mathcal{I}^{-1}_{\alpha, \alpha} \mathcal{I}_{\alpha, \theta}$; this is the matrix that will appear in the test statistic. A more detailed discussion of this partitioning idea can be found in my post on estimation theory.
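Here is a small numpy check of Eq. \eqref{eq:relevant-fisher-info} on a randomly generated positive-definite "information" matrix; it just confirms that the $(\theta, \theta)$ block of the inverse equals the inverse of the Schur complement, not the inverse of $\mathcal{I}_{\theta, \theta}$ alone. The matrix here is arbitrary and only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 2, 3                                     # dim(theta), dim(alpha)

# A random symmetric positive-definite matrix standing in for the information.
A = rng.normal(size=(m + p, m + p))
I_full = A @ A.T + (m + p) * np.eye(m + p)
I_tt, I_ta = I_full[:m, :m], I_full[:m, m:]
I_at, I_aa = I_full[m:, :m], I_full[m:, m:]

block_of_inverse = np.linalg.inv(I_full)[:m, :m]
schur = I_tt - I_ta @ np.linalg.solve(I_aa, I_at)

print(np.allclose(block_of_inverse, np.linalg.inv(schur)))   # True
print(np.allclose(block_of_inverse, np.linalg.inv(I_tt)))    # False (in general)
```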
In the remaining sections, we’ll slightly modify our notation. We’ll use a superscript of $0$ to denote a quantity (e.g. $\mu$, $\eta$, $\delta$, etc.) evaluated by setting $\beta = \mathbf{0}$ and $\alpha = \hat{\alpha}_{MLE}$, the maximum likelihood estimate of $\alpha$ under the null hypothesis.
Recall that we assumed that $D(\theta) = \mathbf{0}$ if $\theta = \mathbf{0}$. This simplifies the scores considerably when we evaluate them under the null hypothesis, $\theta = \mathbf{0}$:
\[\begin{aligned} U_{\alpha_j}(\mathbf{0}) &\approx \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) \\ U_{\theta_j}(\mathbf{0}) &\approx \frac{1}{2}\left((\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1}\Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0) - \text{tr}\left[ \Gamma^0_0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_j} \mathbf{Z}^\top\right]\right) \end{aligned}\]
Information For $(\alpha, \alpha)$
\[\begin{aligned} \mathcal{I}_{\alpha_i, \alpha_j} &\approx \mathbb{E}\left[ U_{\alpha_i}(\mathbf{0}) U_{\alpha_j}(\mathbf{0})\right] = (\mathbf{X}_{\cdot, i})^\top \Gamma^0 \mathbf{X}_{\cdot, j} \end{aligned} \label{eq:info-alpha-alpha}\]Proof.
$$ \begin{aligned} \mathcal{I}_{\alpha_i, \alpha_j} &= \mathbb{E}\left[ U_{\alpha_i}(0) U_{\alpha_j}(0)\right] \\ &\approx \mathbb{E}\left[ \left(\mathbf{X}_{\cdot, i}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) \right] \\ &= \mathbb{E}\left[ \left(\mathbf{X}_{\cdot, i}\right)^\top (\mathbf{y} - \mu^0) \left(\mathbf{X}_{\cdot, j}\right)^\top (\mathbf{y} - \mu^0) \right] \\ &= \mathbb{E}\left[ \left(\sum_{h = 1}^{n} \frac{\gamma^0_h}{\delta^0_h} \mathbf{X}_{h, i}(\mathbf{y}_h - \mu^0_h) \right) \left(\sum_{h = 1}^{n} \frac{\gamma^0_h}{\delta^0_h} \mathbf{X}_{h, j}(\mathbf{y}_h - \mu^0_h) \right) \right] \\ &= \sum_{h = 1}^{n} \left( \mathbb{E}\left[ \frac{(\gamma^0_h)^2}{(\delta^0_h)^2}\mathbf{X}_{h, i}\mathbf{X}_{h, j} (\mathbf{y}_h - \mu^0_h)^2 \right] + \sum_{l \neq h} \mathbb{E}\left[ \frac{\gamma^0_h \gamma^0_l}{\delta^0_h \delta^0_l} \mathbf{X}_{h, i}\mathbf{X}_{l, j}(\mathbf{y}_h - \mu^0_h)(\mathbf{y}_l - \mu^0_l) \right] \right) \\ &= \sum_{h = 1}^{n} \left( \frac{(\gamma^0_h)^2}{(\delta^0_h)^2}\mathbf{X}_{h, i}\mathbf{X}_{h, j} \mathbb{E}\left[ (\mathbf{y}_h - \mu^0_h)^2 \right] + \sum_{l \neq h} \frac{\gamma^0_h \gamma^0_l}{\delta^0_h \delta^0_l} \mathbf{X}_{h, i}\mathbf{X}_{l, j} \mathbb{E}\left[ \mathbf{y}_h - \mu^0_h \right]\mathbb{E}\left[ \mathbf{y}_l - \mu^0_l \right] \right) & \left(\mathbf{y}_i\text{'s independent}\right) \\ &= \sum_{h = 1}^{n} \frac{(\gamma^0_h)^2}{(\delta^0_h)^2} \mathbf{X}_{h, i}\mathbf{X}_{h, j} \text{Var}\left( \mathbf{y}_h \right) & \left(\mathbb{E}\left[ \mathbf{y}_h - \mu^0_h \right] = 0\right) \\ &= \sum_{h = 1}^{n} \frac{\left(\frac{\partial \eta^0_h}{\partial \mu_h}\right)^2}{v^2(\mu^0_h) \left(\frac{\partial \eta^0_h}{\partial \mu_h}\right)^4}\mathbf{X}_{h, i}\mathbf{X}_{h, j} \text{Var}\left( \mathbf{y}_h \right) & \left(\text{Eq. } \eqref{eq:intermed-terms}\right)\\ &= \sum_{h = 1}^{n} \frac{1}{v^2(\mu^0_h) \left(\frac{\partial \eta^0_h}{\partial \mu_h}\right)^2}\mathbf{X}_{h, i}\mathbf{X}_{h, j} v(\mu^0_h) & \left(\text{Eq. } \eqref{eq:quasi-lik} \right)\\ &= \sum_{h = 1}^n \gamma^0_h \mathbf{X}_{h, i}\mathbf{X}_{h, j} \\ &= (\mathbf{X}_{\cdot, i})^\top \Gamma^0 \mathbf{X}_{\cdot, j} \end{aligned} \nonumber $$We just need to plug-in for $\Gamma^0$ and use the fact that $\mathbf{X} = \mathbf{1}_n$: $$ \mathcal{I}_{\alpha, \alpha} \approx \mathbf{X}^\top \text{diag}(\mu^0) \mathbf{X} = \sum_{i = 1}^n \mu_i^0 $$
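In code, Eq. \eqref{eq:info-alpha-alpha} is just a weighted Gram matrix of the design. A minimal sketch, assuming $\Gamma^0$ is supplied as the vector of its diagonal entries (the names below are mine):

```python
import numpy as np

def info_alpha_alpha(X, gamma0):
    """I_{alpha,alpha} ~= X' Gamma^0 X, with Gamma^0 = diag(gamma0)."""
    return X.T @ (gamma0[:, None] * X)

# Poisson example with an intercept-only design: gamma0 = mu0 and X = 1_n,
# so the information collapses to sum(mu0).
mu0 = np.array([1.5, 2.0, 0.7, 3.2])
X = np.ones((4, 1))
print(info_alpha_alpha(X, mu0))   # [[7.4]] == mu0.sum()
```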
Information For $(\alpha, \theta)$
\[\begin{aligned} \mathcal{I}_{\alpha_j, \theta_l} &= \frac{1}{2} \left[ \sum_{i = 1}^n \mathbf{X}_{i, j} \mathbb{E}\left[ (\mathbf{y}_{i} - \mu_{i}^0)^3 \right] \left( \gamma_{i}^0 (\delta_{i}^0)^{-1} \right)^3 \left[ \mathbf{Z}_{i, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i, \cdot})^\top \right] - \sum_{i = 1}^n \mathbf{X}_{i,j} \gamma_i^0 (\delta_i^0)^{-1} \rho^0_i \mathbb{E}\left[ (\mathbf{y}_{i} - \mu_{i}^0)^2 \right] \left[ \mathbf{Z}_{i, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i, \cdot})^\top \right] \right] \\ &= \frac{1}{2} \left[ \sum_{i = 1}^n \mathbf{X}_{i, j} \kappa_{i}^3 \left( \gamma_{i}^0 (\delta_{i}^0)^{-1} \right)^3 \left[ \mathbf{Z}_{i, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i, \cdot})^\top \right] - \sum_{i = 1}^n \mathbf{X}_{i,j} \gamma_i^0 (\delta_i^0)^{-1} \rho^0_i \kappa_{i}^2 \left[ \mathbf{Z}_{i, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i, \cdot})^\top \right] \right] \\ &= \frac{1}{2} \sum_{i = 1}^n \mathbf{X}_{i, j} \left[ \mathbf{Z}_{i, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i, \cdot})^\top \right] \left(\kappa_{i}^3 \left( \gamma_{i}^0 (\delta_{i}^0)^{-1} \right)^3 - \gamma_i^0 (\delta_i^0)^{-1} \rho^0_i \kappa_{i}^2 \right) \end{aligned}\]where \(\kappa^2_{i} = \mathbb{E}\left[(\mathbf{y}_{i} - \mu_{i}^0)^2 \right]\) and \(\kappa^3_{i} = \mathbb{E}\left[(\mathbf{y}_{i} - \mu_{i}^0)^3 \right]\) are the second and third cumulants of $\mathbf{y}_{i}$ under the null (see this page on the equivalence between central moments and cumulants).
Proof.
$$ \begin{aligned} \mathcal{I}_{\alpha_j, \theta_l} &= \mathbb{E}\left[ U_{\alpha_j}(0) U_{\theta_l}(0)\right] \\ &\approx \mathbb{E}\left[ \left( \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) \right) \left( \frac{1}{2}\left( (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1}\Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0) - \text{tr}\left[ \Gamma^0_0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top\right]\right) \right) \right] \\ &= \frac{1}{2} \underbrace{\mathbb{E}\left[ \left( \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) \right) \left( (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1}\Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0) \right) \right]}_{(\star)} - \underbrace{\mathbb{E}\left[ \left( \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0)\right) \text{tr}\left[ \Gamma_0^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right] \right]}_{(\star \star)} \\ &= \frac{1}{2} \left[ \sum_{i' = 1}^n \mathbf{X}_{i', j} \mathbb{E}\left[ (\mathbf{y}_{i'} - \mu_{i'}^0)^3 \right] \left( \gamma_{i'}^0 (\delta_{i'}^0)^{-1} \right)^3 \left[ \mathbf{Z}_{i', \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i', \cdot})^\top \right] - \sum_{i = 1}^n \mathbf{X}_{i,j} \gamma_i^0 (\delta_i^0)^{-1} \rho^0_i \mathbb{E}\left[ (\mathbf{y}_i - \mu_i^0)^2 \right] \left[ \mathbf{Z}_{i, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i, \cdot})^\top \right] \right] \end{aligned} \nonumber $$Proof of $(\star)$.
We can rewrite the first term in $(\star)$ as the summation: $$ \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) = \sum_{i = 1}^n \mathbf{X}_{i,j} \gamma_i^0 (\delta_i^0)^{-1} (\mathbf{y}_i - \mu_i^0) \nonumber $$ We can also write the second term in $(\star)$ as a summation: $$ (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1}\Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1}(\mathbf{y} - \mu^0) = \sum_{i' = 1}^n \left( \sum_{i = 1}^n \left[(\mathbf{y}_{i} - \mu_i^0)\gamma_i^0 (\delta_i^0)^{-1}\right]\left[ \mathbf{Z}_{i, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i', \cdot})^\top \right] \right) (\mathbf{y}_{i'} - \mu_{i'}^0)\gamma_i^0 (\delta_i^0)^{-1} \nonumber $$Proof.
The large vector/matrix product can be broken down into the following pieces: $$ \begin{aligned} (\mathbf{y}- \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 &= \begin{bmatrix} (\mathbf{y}_1 - \mu_1^0)(\gamma_1^0 / \delta_1^0) \\ \vdots \\ (\mathbf{y}_n - \mu_n^0)(\gamma_n^0 / \delta_n^0) \end{bmatrix}^\top \\ \Gamma^0 (\Delta^0)^{-1}(\mathbf{y}- \mu^0) &= \begin{bmatrix} (\mathbf{y}_1 - \mu_1^0)(\gamma_1^0 / \delta_1^0) \\ \vdots \\ (\mathbf{y}_n - \mu_n^0)(\gamma_n^0 / \delta_n^0) \end{bmatrix} \\ \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top &= \begin{bmatrix} \mathbf{Z}_{1, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} \left( \mathbf{Z}_{1, \cdot} \right)^\top & \dots & \mathbf{Z}_{1, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} \left( \mathbf{Z}_{n, \cdot} \right)^\top \\ \vdots & \ddots & \vdots \\ \mathbf{Z}_{n, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} \left( \mathbf{Z}_{1, \cdot} \right)^\top & \dots & \mathbf{Z}_{n, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} \left( \mathbf{Z}_{n, \cdot} \right)^\top \end{bmatrix} \end{aligned} \nonumber $$Proof.
$$ \begin{aligned} (\mathbf{y}- \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 &= \begin{bmatrix} \mathbf{y}_1 - \mu_1^0 \\ \vdots \\ \mathbf{y}_n - \mu_n^0 \end{bmatrix}^\top \begin{bmatrix} 1/ \delta^0_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \vdots & 1/ \delta^0_n \end{bmatrix} \begin{bmatrix} \gamma_1^0 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \vdots & \gamma_n^0 \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{y}_1 - \mu_1^0 \\ \vdots \\ \mathbf{y}_n - \mu_n^0 \end{bmatrix}^\top \begin{bmatrix} \gamma_1^0 / \delta^0_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \vdots & \gamma_n^0 / \delta^0_n \end{bmatrix} \\ &= \begin{bmatrix} (\mathbf{y}_1 - \mu_1^0)(\gamma_1^0 / \delta_1^0) \\ \vdots \\ (\mathbf{y}_n - \mu_n^0)(\gamma_n^0 / \delta_n^0) \end{bmatrix}^\top \end{aligned} \nonumber $$ $$ \begin{aligned} \Gamma^0 (\Delta^0)^{-1} (\mathbf{y}- \mu^0) &= \begin{bmatrix} \gamma_1^0 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \vdots & \gamma_n^0 \end{bmatrix} \begin{bmatrix} 1/ \delta^0_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \vdots & 1/ \delta^0_n \end{bmatrix} \begin{bmatrix} \mathbf{y}_1 - \mu_1^0 \\ \vdots \\ \mathbf{y}_n - \mu_n^0 \end{bmatrix} \\ &= \begin{bmatrix} \gamma_1^0 / \delta^0_1 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \vdots & \gamma_n^0 / \delta^0_n \end{bmatrix} \begin{bmatrix} \mathbf{y}_1 - \mu_1^0 \\ \vdots \\ \mathbf{y}_n - \mu_n^0 \end{bmatrix} \\ &= \begin{bmatrix} (\mathbf{y}_1 - \mu_1^0)(\gamma_1^0 / \delta_1^0) \\ \vdots \\ (\mathbf{y}_n - \mu_n^0)(\gamma_n^0 / \delta_n^0) \end{bmatrix} \end{aligned} \nonumber $$ $$ \begin{aligned} \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top &= \begin{bmatrix} \mathbf{Z}_{1, 1} & \dots & \mathbf{Z}_{1, (q \times k)} \\ \vdots & \ddots & \vdots \\ \mathbf{Z}_{n, 1} & \dots & \mathbf{Z}_{n, (q \times k)} \end{bmatrix} \begin{bmatrix} \frac{\partial D(\theta)_{1,1}}{\partial \theta_l} & \dots & \frac{\partial D(\theta)_{1,(q \times k)}}{\partial \theta_l} \\ \vdots & \ddots & \vdots \\ \frac{\partial D(\theta)_{(q \times k),1}}{\partial \theta_l} & \dots & \frac{\partial D(\theta)_{(q \times k), (q \times k)}}{\partial \theta_l} \end{bmatrix} \begin{bmatrix} \mathbf{Z}_{1, 1} & \dots & \mathbf{Z}_{1, n} \\ \vdots & \ddots & \vdots \\ \mathbf{Z}_{(q \times k), 1} & \dots & \mathbf{Z}_{(q \times k), n} \end{bmatrix} \\ &= \begin{bmatrix} \sum_{i = 1}^{q \times k} \mathbf{Z}_{1, i} \frac{\partial D(\theta)_{i, 1}}{\partial \theta_l} & \dots & \sum_{i = 1}^{q \times k} \mathbf{Z}_{1, i} \frac{\partial D(\theta)_{i, (q \times k)}}{\partial \theta_l} \\ \vdots & \ddots & \vdots \\ \sum_{i = 1}^{q \times k} \mathbf{Z}_{n, i} \frac{\partial D(\theta)_{i, 1}}{\partial \theta_l} & \dots & \sum_{i = 1}^{q \times k} \mathbf{Z}_{n, i} \frac{\partial D(\theta)_{i, (q \times k)}}{\partial \theta_l} \\ \end{bmatrix} \begin{bmatrix} \mathbf{Z}_{1, 1} & \dots & \mathbf{Z}_{1, n} \\ \vdots & \ddots & \vdots \\ \mathbf{Z}_{(q \times k), 1} & \dots & \mathbf{Z}_{(q \times k), n} \end{bmatrix} \\ &= \begin{bmatrix} \sum_{i' = 1}^{q \times k} \left( \sum_{i = 1}^{q \times k} \mathbf{Z}_{1, i} \frac{\partial D(\theta)_{i, i'}}{\partial \theta_l}\right) \mathbf{Z}_{i', 1} & \dots & \sum_{i' = 1}^{q \times k} \left( \sum_{i = 1}^{q \times k} \mathbf{Z}_{1, i} \frac{\partial D(\theta)_{i, i'}}{\partial \theta_l}\right) \mathbf{Z}_{i', n} \\ \vdots & \ddots & \vdots \\ \sum_{i' = 1}^{q \times k} \left( \sum_{i = 1}^{q \times k} \mathbf{Z}_{n, i} \frac{\partial D(\theta)_{i, i'}}{\partial \theta_l}\right) 
\mathbf{Z}_{i', 1} & \dots & \sum_{i' = 1}^{q \times k} \left( \sum_{i = 1}^{q \times k} \mathbf{Z}_{n, i} \frac{\partial D(\theta)_{i, i'}}{\partial \theta_l}\right) \mathbf{Z}_{i', n} \end{bmatrix}\\ &= \begin{bmatrix} \sum_{i' = 1}^{q \times k} \mathbf{Z}_{1, \cdot} \frac{\partial D(\theta)_{\cdot, i'}}{\partial \theta_l} \mathbf{Z}_{i', 1} & \dots & \sum_{i' = 1}^{q \times k} \mathbf{Z}_{1, \cdot} \frac{\partial D(\theta)_{\cdot, i'}}{\partial \theta_l} \mathbf{Z}_{i', n} \\ \vdots & \ddots & \vdots \\ \sum_{i' = 1}^{q \times k} \mathbf{Z}_{n, \cdot} \frac{\partial D(\theta)_{\cdot, i'}}{\partial \theta_l} \mathbf{Z}_{i', 1} & \dots & \sum_{i' = 1}^{q \times k} \mathbf{Z}_{n, \cdot} \frac{\partial D(\theta)_{\cdot, i'}}{\partial \theta_l} \mathbf{Z}_{i', n} \end{bmatrix} \\ &= \begin{bmatrix} \mathbf{Z}_{1, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} \left( \mathbf{Z}_{1, \cdot} \right)^\top & \dots & \mathbf{Z}_{1, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} \left( \mathbf{Z}_{n, \cdot} \right)^\top \\ \vdots & \ddots & \vdots \\ \mathbf{Z}_{n, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} \left( \mathbf{Z}_{1, \cdot} \right)^\top & \dots & \mathbf{Z}_{n, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} \left( \mathbf{Z}_{n, \cdot} \right)^\top \end{bmatrix} \end{aligned} \nonumber $$Proof of $(\star \star)$.
We can rewrite the trace term as a double summation: $$ \begin{aligned} \text{tr}\left[ \Gamma_0^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right] &= \sum_{i = 1}^n \sum_{j = 1}^n \left(\Gamma_0^0 \right)_{i,j} \left[ \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right]_{j, i} \\ &= \sum_{i = 1}^n \gamma_{0, i}^0 \mathbf{Z}_{i, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i, \cdot})^\top \end{aligned} \nonumber $$ Then $(\star \star)$ becomes: $$ \begin{aligned} \mathbb{E}\left[ \left( \left(\mathbf{X}_{\cdot, j}\right)^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0)\right) \text{tr}\left[ \Gamma_0^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right] \right] &= \mathbb{E}\left[ \left(\sum_{i = 1}^n \mathbf{X}_{i, j} \gamma_i^0 (\delta_i^0)^{-1} (\mathbf{y}_i - \mu_i^0) \right) \left( \sum_{i = 1}^n \gamma_{0, i}^0 \mathbf{Z}_{i, \cdot} \frac{\partial D(\theta)}{\partial \theta_l} (\mathbf{Z}_{i, \cdot})^\top\right) \right] \\ &= \sum_{i = 1}^n \mathbf{X}_{i,j} \mathbb{E}\left[ \xi_i \gamma_{0, i}^0 \right] H_{i, i, l} + \sum_{i = 1}^n \sum_{i' \neq i} \mathbf{X}_{i,j} \underbrace{\mathbb{E}[\xi_i]}_{=0} \mathbb{E}[\gamma_{0, i'}^0] H_{i', i', l} \\ &= \sum_{i = 1}^n \mathbf{X}_{i,j} \gamma_i^0 (\delta_i^0)^{-1} \mathbb{E}\left[ (\mathbf{y}_i - \mu_i^0) (\gamma_i^0 + \rho^0_i (\mathbf{y}_i - \mu_i^0)) \right] H_{i, i, l} \\ &= \sum_{i = 1}^n \mathbf{X}_{i,j} \gamma_i^0 (\delta_i^0)^{-1} \left (\gamma_i^0 \underbrace{\mathbb{E}\left[\mathbf{y}_i - \mu_i^0\right]}_{=0} + \rho^0_i \mathbb{E}\left[ (\mathbf{y}_i - \mu_i^0)^2 \right]\right) H_{i, i, l} \\ &= \sum_{i = 1}^n \mathbf{X}_{i,j} \gamma_i^0 (\delta_i^0)^{-1} \rho^0_i \mathbb{E}\left[ (\mathbf{y}_i - \mu_i^0)^2 \right] H_{i, i, l} \end{aligned} \nonumber $$All of the central moments for the Poisson distribution are equal to the mean, and $\frac{\partial D(\theta)}{\partial \theta} = \mathbb{I}_{k \times k}$, so we can just omit that part. Since $\gamma_i^0 = \delta_i^0$, a lot also simplifies. $$ \mathcal{I}_{\alpha, \theta} \approx \frac{1}{2} \sum_{i = 1}^n \mathbf{X}_{i} \mu_i^0 \mathbf{Z}_{i, \cdot} (\mathbf{Z}_{i, \cdot})^\top = \frac{1}{2} \sum_{i = 1}^n \mu_i^0 $$
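The Poisson simplification above is easy to compute directly. A minimal sketch, assuming the intercept-only design ($\mathbf{X} = \mathbf{1}_n$) and the group-indicator $\mathbf{Z}$ from the running example; the helper name is mine:

```python
import numpy as np

def info_alpha_theta_poisson(X, mu0, Z):
    """I_{alpha,theta} ~= 0.5 * sum_i X_{i,.} * mu0_i * (Z_{i,.} Z_{i,.}')
    for the Poisson example with dD/dtheta = I (third cumulant = mu0_i)."""
    zz_diag = np.sum(Z**2, axis=1)                # Z_{i,.} Z_{i,.}'
    return 0.5 * (X.T @ (mu0 * zz_diag))

groups = np.array([0, 0, 1, 1, 2, 2])
Z = np.eye(3)[groups]                             # one-hot group indicators
mu0 = np.full(6, 2.0)
X = np.ones((6, 1))
print(info_alpha_theta_poisson(X, mu0, Z))        # [6.] == 0.5 * mu0.sum()
```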
Information For $(\theta, \theta)$
First, let’s rewrite the score for $\theta$:
\[\begin{aligned} U_{\theta_l} &= \frac{1}{2} \sum_{i = 1}^n \left( \left[ (\gamma^0_i)^2 (\delta_i^0)^{-2} (\mathbf{y}_i - \mu^0_i)^2 - \gamma_{0, i}^0\right]\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right)_{i,i} + 2 \sum_{i' < i} (\mathbf{y}_i - \mu_i^0) (\mathbf{y}_{i'} - \mu_{i'}^0) \gamma_i^0 \gamma_{i'}^0 (\delta_i^0)^{-1} (\delta_{i'}^0)^{-1} \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right)_{i, i'} \right) \end{aligned} \label{eq:rewrite-u-theta-j}\]Rewriting $U_{\theta_l}$.
$$ \begin{aligned} U_{\theta_l} &= \frac{1}{2}\left( (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) - \text{tr}\left[ \Gamma_0^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right] \right) \\ &= \frac{1}{2} \left( \text{tr}\left[ (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) \right] - \text{tr} \left[ \Gamma_0^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top\right] \right) &\left(\text{first term is a scalar}\right) \\ &= \frac{1}{2} \left( \text{tr}\left[ \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top\right] - \text{tr} \left[ \Gamma_0^0 \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top\right] \right) &\left(\text{cyclic property}\right) \\ &= \frac{1}{2} \text{tr}\left[ \left( \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 - \Gamma_0^0 \right) \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top\right] \\ &= \frac{1}{2} \sum_{i = 1}^n \sum_{j = 1}^n \left( \Gamma^0 (\Delta^0)^{-1} (\mathbf{y} - \mu^0) (\mathbf{y} - \mu^0)^\top (\Delta^0)^{-1} \Gamma^0 - \Gamma_0^0 \right)_{i,j} \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right)_{j,i} \\ &= \frac{1}{2} \sum_{i = 1}^n \left( \left[ (\gamma^0_i)^2 (\delta_i^0)^{-2} (\mathbf{y}_i - \mu^0_i)^2 - \gamma_{0, i}^0\right]\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right)_{i,i} + 2 \sum_{i' < i} (\mathbf{y}_i - \mu_i^0) (\mathbf{y}_{i'} - \mu_{i'}^0) (\gamma_i^0)^2 (\gamma_{i'}^0)^2 (\delta_i^0)^{-1} (\delta_{i'}^0)^{-1} \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_l} \mathbf{Z}^\top \right)_{i, i'} \right) \end{aligned} $$Proof.
$$ \begin{aligned} \mathcal{I}_{\theta_l, \theta_{l'}} &= \mathbb{E}\left[U_{\theta_l} U_{\theta_{l'}}\right] \\ &= \frac{1}{4}\mathbb{E}\left[ \left( \sum_{i = 1}^n \left[(\mathbf{y}_i - \mu^0_i)^2 (\delta_i^0)^{-2} (\gamma_{i}^0)^2 - \gamma_{0,i}^0 \right]\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i} + 2\sum_{i' < i} (\mathbf{y}_i - \mu^0_i) (\mathbf{y}_{i'} - \mu^0_{i'}) \delta^{-1}_{i}\delta^{-1}_{i'} \gamma_{i}^0 \gamma^0_{i'} \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i'}\right) \right. \\ &\hspace{15mm}\left. \times \left( \sum_{i = 1}^n \left[(\mathbf{y}_i - \mu^0_i)^2 (\delta_i^0)^{-2} (\gamma_{i}^0)^2 - \gamma_{0,i}^0 \right]\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i} + 2\sum_{i' < i} (\mathbf{y}_i - \mu^0_i) (\mathbf{y}_{i'} - \mu^0_{i'}) \delta^{-1}_{i}\delta^{-1}_{i'} \gamma_{i}^0 \gamma^0_{i'} \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i'}\right) \right] \\ &= \frac{1}{4} \mathbb{E}\left[ \left(\sum_{i = 1}^n \left[(\mathbf{y}_i - \mu^0_i)^2 (\delta_i^0)^{-2} (\gamma_{i}^0)^2 - \gamma_{0,i}^0 \right]\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i} \right)\left(\sum_{i = 1}^n \left[(\mathbf{y}_i - \mu^0_i)^2 (\delta_i^0)^{-2} (\gamma_{i}^0)^2 - \gamma_{0,i}^0 \right]\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i} \right) \right. \\ &\hspace{15mm} + \left. \left(\sum_{i = 1}^n \left[(\mathbf{y}_i - \mu^0_i)^2 (\delta_i^0)^{-2} (\gamma_{i}^0)^2 - \gamma_{0,i}^0 \right]\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i} \right) \left(2 \sum_{i = 1}^n \sum_{i' < i} (\mathbf{y}_i - \mu^0_i) (\mathbf{y}_{i'} - \mu^0_{i'}) \delta^{-1}_{i}\delta^{-1}_{i'} \gamma_{i}^0 \gamma_{i'}^0 \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i'} \right) \right. \\ &\hspace{15mm} + \left. \left(\sum_{i = 1}^n \left[(\mathbf{y}_i - \mu^0_i)^2 (\delta_i^0)^{-2} (\gamma_{i}^0)^2 - \gamma_{0,i}^0 \right]\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i} \right) \left(2 \sum_{i = 1}^n \sum_{i' < i} (\mathbf{y}_i - \mu^0_i) (\mathbf{y}_{i'} - \mu^0_{i'}) \delta^{-1}_{i}\delta^{-1}_{i'} \gamma_{i}^0 \gamma_{i'}^0 \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i'} \right) \right. \\ &\hspace{15mm} + \left. 
\left(2 \sum_{i = 1}^n \sum_{i' < i} (\mathbf{y}_i - \mu^0_i) (\mathbf{y}_{i'} - \mu^0_{i'}) \delta^{-1}_{i}\delta^{-1}_{i'} \gamma_{i}^0 \gamma_{i'}^0 \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i'} \right)\left(2 \sum_{i = 1}^n \sum_{i' < i} (\mathbf{y}_i - \mu^0_i) (\mathbf{y}_{i'} - \mu^0_{i'}) \delta^{-1}_{i}\delta^{-1}_{i'} \gamma_{i}^0 \gamma_{i'}^0 \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i'} \right) \right] \\ &\overset{(i)}{=} \frac{1}{4} \mathbb{E}\left[ \left(\sum_{i = 1}^n \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i}\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i}\left[(\mathbf{y}_i - \mu^0_i)^2 (\delta_i^0)^{-2} (\gamma_{i}^0)^2 - \gamma_{0,i}^0 \right]^2 \right) + 4\left(\sum_{i = 1}^n \sum_{i' < i} \left( (\mathbf{y}_i - \mu^0_i) (\mathbf{y}_{i'} - \mu^0_{i'}) (\delta_i^0)^{-1} (\delta^0_{i'})^{-1} \gamma_{i}^0 \gamma_{i'}^0 \right)^2 \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i'} \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i'} \right) \right] \\ &= \frac{1}{4}\left(\sum_{i = 1}^n \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i}\left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i}\mathbb{E}\left[((\mathbf{y}_i - \mu^0_i)^2 (\delta_i^0)^{-2} (\gamma_{i}^0)^2 - \gamma_{0,i}^0)^2\right] + 4\left(\sum_{i = 1}^n \sum_{i' < i} \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l}} \mathbf{Z}^\top \right)_{i,i'} \left( \mathbf{Z} \frac{\partial D(\theta)}{\partial \theta_{l'}} \mathbf{Z}^\top \right)_{i,i'}\mathbb{E}\left[((\mathbf{y}_i - \mu^0_i) (\mathbf{y}_{i'} - \mu^0_{i'}) \delta^{-1}_{i}\delta^{-1}_{i'} \gamma_{i}^0 \gamma_{i'}^0)^2 \right] \right) \right) \end{aligned} \nonumber $$Proof of $(i)$.
In $(i)$, we use the fact that the $\mathbf{y}_i$'s are independent and have mean $\mu^0_i$ under the null. Thus, any term containing $(\mathbf{y}_i - \mu^0_i)$ raised only to the first power has expectation $0$ and drops out of the sum.
Proof of $(ii)$.
To simplify this further, notice the following: $$ \begin{aligned} \mathbb{E}\left[((\mathbf{y}_i - \mu^0_i)^2 (\delta_i^{-1})^2 (\gamma_i^0)^2 - \gamma_{0,i}^0)^2 \right] &= \mathbb{E}\left[((\mathbf{y}_i - \mu^0_i)^2 (\delta_i^{-1})^2 (\gamma_i^0)^2)^2 - 2\gamma_{0,i}^0(\mathbf{y}_i - \mu^0_i)^2 (\delta_i^{-1})^2 (\gamma_i^0)^2 + (\gamma_{0,i}^0)^2\right] \\ &= (\delta_i^{-1})^4 (\gamma_i^0)^4 \mathbb{E}\left[(\mathbf{y}_i - \mu^0_i)^4\right] - 2 \mathbb{E}\left[\gamma_{0,i}^0(\mathbf{y}_i - \mu^0_i)^2 (\delta_i^{-1})^2 (\gamma_i^0)^2\right] + \mathbb{E}\left[(\gamma_{0,i}^0)^2\right] \\ &= (\delta_i^{-1})^4 (\gamma_i^0)^4 (\kappa_{i}^4 + 3(\kappa_{i}^2)^2) - 2(\delta_i^{-1})^2(\gamma_i^0)^2 \mathbb{E}\left[ (\gamma_i^0 + \rho^0_i(\mathbf{y}_i - \mu^0_i))(\mathbf{y}_i-\mu^0_i)^2 \right] + \mathbb{E}\left[ (\gamma_i^0 + \rho^0_i(\mathbf{y}_i - \mu^0_i))^2 \right] \\ &= (\delta_i^{-1})^4 (\gamma_i^0)^4 (\kappa_{i}^4 + 3(\kappa_{i}^2)^2) - 2(\delta_i^{-1})^2(\gamma_i^0)^2 \left( \gamma_i^0\mathbb{E}\left[ (\mathbf{y}_i - \mu^0_i)^2 \right] + \rho^0_i \mathbb{E}\left[(\mathbf{y}_i - \mu^0_i)^3 \right]\right) + (\gamma_i^0)^2 + 2 \gamma_i^0 \rho^0_i \mathbb{E}\left[ \mathbf{y}_i - \mu^0_i \right] + (\rho_i^0)^2 \mathbb{E}\left[ (\mathbf{y}_i - \mu^0_i)^2 \right] \\ &= (\delta_i^{-1})^4 (\gamma_i^0)^4 \kappa_{i}^4 + 3(\delta_i^{-1})^4 (\gamma_i^0)^4 (\kappa_{i}^2)^2 - 2(\delta_i^{-1})^2(\gamma_i^0)^3 \kappa_{i}^2 - 2(\delta_i^{-1})^2(\gamma_i^0)^2 \rho^0_i \mathbb{E}\left[ (\mathbf{y}_i - \mu^0_i)^3 \right] + (\gamma_i^0)^2 + (\rho^0_i)^2 \kappa_{i}^2 \\ &\overset{(a)}{=} (\delta_i^{-1})^4 (\gamma_i^0)^4 \kappa_{i}^4 + 3(\delta_i^{-1})^2 (\gamma_i^0)^2 (\delta_i)^2 - 2\delta_i^{-1}(\gamma_i^0)^2 \delta_i - 2(\delta_i^{-1})^2 (\gamma_i^0)^2 \rho^0_i \mathbb{E}\left[ (\mathbf{y}_i - \mu^0_i)^3 \right] + (\gamma_i^0)^2 + (\rho_i^0)^2 \kappa_{i}^2 \\ &= (\delta_i^{-1})^4 (\gamma_i^0)^4 \kappa_{i}^4 + 3(\gamma_i^0)^2 - 2(\gamma_i^0)^2 + (\gamma_i^0)^2 - 2(\delta_i^{-1})^2 (\gamma_i^0)^2 \rho^0_i \mathbb{E}\left[ (\mathbf{y}_i - \mu^0_i)^3 \right] + (\rho_i^0)^2 \kappa_{2,i} \\ &\overset{(b)}{=} (\delta_i^{-1})^4 (\gamma_i^0)^4 \kappa_{i}^4 + 2(\gamma_i^0)^2 + (\rho_i^0)^2 \kappa_{i}^2 - 2 (\gamma_i^0)^2 (\delta_i^{-1})^2 \rho^0_i \kappa_{i}^3 \end{aligned} \nonumber $$ In $(a)$, we have that: $$ (\delta_i^0)^{-1} \gamma_i^0 = \left( \frac{1}{\frac{\partial \eta^0_i}{\partial \mu_i}} \right)^{-1} \frac{1}{v(\mu^0_i) (\frac{\partial \eta^0_i}{\partial \mu_i})^2} = \frac{1}{v(\mu^0_i) \frac{\partial \eta^0_i}{\partial \mu_i}} \implies (\delta^0_i)^{-1} \gamma^0_i \kappa_{i}^2 = \frac{1}{v(\mu^0_i) \frac{\partial \eta^0_i}{\partial \mu_i}} v(\mu^0_i) = \delta^0_i \nonumber $$ In $(b)$, we use the fact that the third cumulant is equal to the third central moment.Proof of $(iii)$.
In $(iii)$, we see that : $$ \begin{aligned} \mathbb{E}\left[((\mathbf{y}_i - \mu^0_i) (\mathbf{y}_{i'} - \mu^0_{i'}) \delta^{-1}_{i}\delta^{-1}_{i'} \gamma_{i}^0 \gamma^0_{i'})^2 \right] &= (\delta_i^{-1}\delta_{i'}^{-1})^2 (\gamma_i^0 \gamma^0_{i'})^2 \mathbb{E}\left[ (\mathbf{y}_i - \mu^0_i)^2(\mathbf{y}_{i'} - \mu^0_{i'})^2\right] \\ &= (\delta_i^{-1}\delta_{i'}^{-1})^2 (\gamma_i^0 \gamma^0_{i'})^2 \mathbb{E}\left[ (\mathbf{y}_i - \mu^0_i)^2\right] \mathbb{E}\left[(\mathbf{y}_{i'} - \mu^0_{i'})^2\right] \\ &= (\delta_i^{-1})^2 (\gamma^0_i)^2 v(\mu^0_i) (\delta_{i'}^{-1})^2 (\gamma^0_{i'})^2 v(\mu^0_{i'}) \\ &\overset{(a')}{=} \gamma^0_i v(\mu^0_i) \gamma^0_{i'} v(\mu^0_{i'}) \end{aligned} \nonumber $$ In $(a')$, we use the fact that: $$ ((\delta_i^0)^{-1})^2 v(\mu^0_i) = \left(\frac{\partial \eta^0_i}{\partial \mu_i} \right)^2 v(\mu^0_i) = (\gamma_i^0)^{-1} \nonumber $$This matches the form in Lin (1997) if we let \(r_{i,i} = (\delta_i^{-1})^4 (\gamma_i^0)^4 \kappa_{i}^4 + 2(\gamma_i^0)^2 + (\rho_i^0)^2 \kappa_{i}^2 - 2 (\gamma_i^0)^2 (\delta_i^{-1})^2 \rho^0_i \kappa_{i}^3\) and \(r_{i, i'} = 2 \gamma^0_i\gamma^0_{i'}\).
Again, all of the cumulants of the Poisson distribution are equal to the mean (so $\kappa_{i}^2 = \kappa_{i}^3 = \kappa_{i}^4 = \mu_i^0$), $\gamma_i^0 = \delta_i^0 = \mu_i^0$, $\rho^0_i = 0$, and $\frac{\partial D(\theta)}{\partial \theta} = \mathbb{I}_{k \times k}$. The diagonal expectations therefore reduce to $\mu_i^0 + 2(\mu_i^0)^2$ and the off-diagonal ones to $(\mu_i^0 \mu_{i'}^0)^2$. $$ \begin{aligned} \mathcal{I}_{\theta, \theta} &\approx \frac{1}{4} \left(\sum_{i = 1}^n \left(\mathbf{Z}_{i, \cdot} (\mathbf{Z}_{i, \cdot})^\top\right)^2 \left(\mu_i^0 + 2(\mu_i^0)^2\right) + 4 \sum_{i = 1}^n \sum_{i' < i} \left(\mathbf{Z}_{i, \cdot} (\mathbf{Z}_{i', \cdot})^\top \right)^2 (\mu_i^0 \mu_{i'}^0)^2 \right) \end{aligned} $$
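And the corresponding computation for the example's $\mathcal{I}_{\theta, \theta}$, again as a rough numpy sketch with my own naming, using $(\mathbf{Z}\mathbf{Z}^\top)_{i, i'} = \mathbf{Z}_{i, \cdot} (\mathbf{Z}_{i', \cdot})^\top$:

```python
import numpy as np

def info_theta_theta_poisson(mu0, Z):
    """I_{theta,theta} ~= (1/4) * ( sum_i (Z_i Z_i')^2 (mu_i + 2 mu_i^2)
    + 4 * sum_{i' < i} (Z_i Z_{i'}')^2 (mu_i mu_{i'})^2 ) for the Poisson example."""
    S = Z @ Z.T                                    # S[i, i'] = Z_{i,.} Z_{i',.}'
    diag_part = np.sum(np.diag(S)**2 * (mu0 + 2.0 * mu0**2))
    M = (S**2) * np.outer(mu0**2, mu0**2)          # (S_{i,i'})^2 (mu_i mu_{i'})^2
    offdiag_part = (M.sum() - np.trace(M)) / 2.0   # sum over pairs i' < i
    return 0.25 * (diag_part + 4.0 * offdiag_part)
```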
Test
The test statistic for the global test is given by:
\[\chi_G^2 = U_\theta^\top(\hat{\alpha}_0) \tilde{\mathcal{I}}_{\theta, \theta}^{-1}(\hat{\alpha}_0) U_\theta(\hat{\alpha}_0)\]where $\tilde{\mathcal{I}}_{\theta, \theta}$ is the efficient information from Eq. \eqref{eq:relevant-fisher-info} and $\hat{\alpha}_0$ is the maximum likelihood estimate of $\alpha$ under the null. Under the null hypothesis, $\chi_G^2$ asymptotically follows a $\chi^2_m$ distribution, with one degree of freedom for each of the $m$ variance components in $\theta$.
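Putting the pieces together, the test itself is a one-liner once $U_\theta(\hat{\alpha}_0)$ and $\tilde{\mathcal{I}}_{\theta, \theta}(\hat{\alpha}_0)$ have been computed from the formulas above. A minimal sketch (function name mine), using `scipy.stats.chi2` for the reference distribution:

```python
import numpy as np
from scipy import stats

def global_score_test(U_theta, I_eff):
    """chi2_G = U' I_eff^{-1} U, compared to a chi-square with df = dim(theta)."""
    U_theta = np.atleast_1d(np.asarray(U_theta, dtype=float))
    I_eff = np.atleast_2d(np.asarray(I_eff, dtype=float))
    chi2_stat = float(U_theta @ np.linalg.solve(I_eff, U_theta))
    p_value = stats.chi2.sf(chi2_stat, df=len(U_theta))
    return chi2_stat, p_value
```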