A Primer
The score test in non-standard conditions has been the motivation for much of my reading these past few months. However, it has led me to wonder about the small details of the test in standard conditions. What exactly are the regularity conditions, and when are they usually satisfied? When can we appeal to large-sample theory for the score test? Getting a straight answer to these questions is slightly more challenging than anticipated, and answering them is the purpose of this post.
This post relies on some measure theory, which I’ve covered in another post. Most of the content comes from Moran (1971).
Suppose we have some probability space $(S, \mathcal{A}, \mu)$. A random variable is some function $X: S \rightarrow \mathcal{X}$ where $\mathcal{X}$ is the sample space (which we only require to be a Borel space but is usually a subset of Euclidean space) with $\sigma$-field $\mathcal{B}$. Individual elements of $\mathcal{X}$ are denoted with $x$.
Let \(\mathcal{P}_\Theta\) be a family of distributions for $X$ parametrized by $\Theta: S \rightarrow \Omega$ where $\Omega$ is the parameter space with $\sigma$-field $\tau$. Individual elements of $\Omega$ are denoted with $\theta$. Denote the conditional distribution of $X$ given $\Theta = \theta$ with $P_\theta$ (which is a distribution on $(\mathcal{X},\mathcal{B})$).
To make our notation match less measure-theoretic texts, we’ll use $X$ to denote a random variable with realizations denoted with the lowercase $x$. A parameter will be denoted with $\Theta$, with particular values denoted by $\theta$ and its true value by $\theta^*$. The density of $X$ given parameter $\Theta$ evaluated at a particular $x$ and $\theta$ will be denoted by $f_{X \rvert \Theta}(x; \theta)$ or, more compactly, $f(x; \theta)$.
We’ll first define a very important quantity in likelihood theory: the score function.
The score function is the gradient of the log density of the data with respect to the parameter. It describes the slope of the log density, i.e. its sensitivity to changes in the parameter, at a particular value of $\Theta$ (curvature enters later through the Fisher information).
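In symbols, writing the score as a column vector:

\[s(\theta; x) = \frac{\partial \log f_{X \rvert \Theta}(x; \theta)}{\partial \theta}\]

For example, if $X \sim N(\theta, 1)$, the log density is $-\tfrac{1}{2}(x - \theta)^2$ up to a constant, so the score is $s(\theta; x) = x - \theta$.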
The Fisher information describes the amount of information about $\Theta$ held by $X$. Under conditions ensuring the score has expectation zero, it is also the variance of the score function.
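In symbols, when the score has expectation zero:

\[\mathcal{I}_X(\theta) = \mathrm{Var}\left( s(\theta; X) \big\rvert \theta \right) = \mathbb{E}_\Theta \left[ s(\theta; X) \, s(\theta; X)^\top \bigg\rvert \theta \right]\]

Continuing the $N(\theta, 1)$ example, $\mathcal{I}_X(\theta) = \mathrm{Var}(X - \theta \rvert \theta) = 1$.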
Schervish outlines several regularity conditions, which he terms the Fisher Information (FI) conditions, that are needed for the definition of the Fisher information and some nice results about the properties of the score.
There exists a subset of the sample space, $B$, with measure $0$ (i.e. $\mu(B) = 0$) such that $\frac{\partial f_{X \rvert \Theta}(x; \theta)}{\partial \theta_i}$ exists for any $x \notin B$ and all values (and coordinates) of $\theta$.
Condition 1 requires that the partial derivatives with respect to all coordinates of $\theta$ (for all values of $\theta$) exist almost everywhere (with respect to $\mu$).
Intuitively (and a bit hand-wavily), this means that the derivatives must exist at all possible values of $\theta$ for pretty much any sample. In particular, log-likelihood functions with cusps or kinks are not differentiable at the value of $\theta$ where the feature occurs, which violates this condition.
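A classic example is the Laplace location family, with density $f_{X \rvert \Theta}(x; \theta) = \tfrac{1}{2} e^{-\lvert x - \theta \rvert}$: the log density has a cusp at $\theta = x$. For every sample point $x$ there is some value of $\theta$ (namely $\theta = x$) where the derivative fails to exist, so no measure-zero set $B$ works for all values of $\theta$, and Condition 1 fails.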
The order of integration and differentiation can be exchanged for all coordinates of $\theta$. That is: \(\frac{\partial}{\partial \theta_i} \int f_{X \rvert \Theta}(x; \theta) d\mu(x) = \int \frac{\partial f_{X \rvert \Theta}(x; \theta)}{\partial \theta_i} d \mu(x)\)
Condition 2 states that the order of integration and differentiation can be exchanged. Since differentiation is basically just a particular limit, we can use results about interchanging the integral and a limit to get results about interchanging the integral with differentiation.
The Dominated Convergence Theorem states that we can interchange the order of limits and integrals for sequences of functions that are dominated (in absolute value) by some function with finite integral. If we define a function that mimics the form of the derivative as a limit (something along the lines of $h(x) = \frac{f(x + \delta) - f(x)}{\delta}$), then we can use this theorem to get results for derivatives and integrals.
This is the basic idea of the Leibniz integral rule:
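In one common measure-theoretic form (the exact hypotheses vary by text): if $f_{X \rvert \Theta}(x; \theta)$ is integrable in $x$ for each $\theta$, the partial derivative $\frac{\partial f_{X \rvert \Theta}(x; \theta)}{\partial \theta_i}$ exists for almost all $x$, and there is an integrable function $g$ with $\left\lvert \frac{\partial f_{X \rvert \Theta}(x; \theta)}{\partial \theta_i} \right\rvert \leq g(x)$ for all $\theta$ in a neighborhood, then

\[\frac{\partial}{\partial \theta_i} \int f_{X \rvert \Theta}(x; \theta) \, d\mu(x) = \int \frac{\partial f_{X \rvert \Theta}(x; \theta)}{\partial \theta_i} \, d\mu(x)\]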
In summary, if our log-likelihood/density satisfies the (Lebesgue version of the) Leibniz Rule conditions, then it will satisfy Condition 2.
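As a quick numerical sanity check of Condition 2, here is a minimal sketch (assuming NumPy/SciPy and, purely as an illustration, the $N(\theta, 1)$ family integrated over a finite window):

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Compare d/dtheta of the integral of f(x; theta) with the integral of
# df/dtheta, for the N(theta, 1) density over a fixed window [a, b].
theta, eps = 0.5, 1e-4
a, b = -2.0, 2.0

# Left-hand side: differentiate the integral (central difference in theta).
lhs = (integrate.quad(lambda x: norm.pdf(x, loc=theta + eps), a, b)[0]
       - integrate.quad(lambda x: norm.pdf(x, loc=theta - eps), a, b)[0]) / (2 * eps)

# Right-hand side: integrate the theta-derivative of the density,
# which is (x - theta) * f(x; theta) for this family.
rhs = integrate.quad(lambda x: (x - theta) * norm.pdf(x, loc=theta), a, b)[0]

print(lhs, rhs)  # the two sides agree to numerical precision
```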
The set \(C = \{ x: f_{X \rvert \Theta}(x; \theta) > 0 \}\) is the same $\forall \theta$.
Condition 3 states that the support of $f_{X \rvert \Theta}(x; \theta)$ should not depend on $\theta$. This is fairly easy to verify because we usually assume we know the family of distributions our data are drawn from; a standard violation is the $\text{Uniform}(0, \theta)$ family, whose support $(0, \theta)$ changes with $\theta$. I won’t go into any more details than this.
The FI conditions allow us to obtain the following results that are pretty fundamental for later likelihood theory.
When these conditions hold, the score has expectation $0$.
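To see why, differentiate both sides of $\int f_{X \rvert \Theta}(x; \theta) \, d\mu(x) = 1$ and apply Condition 2:

\[\mathbb{E}_\Theta \left[ \frac{\partial \log f_{X \rvert \Theta}(x; \theta)}{\partial \theta} \bigg\rvert \theta \right] = \int \frac{\frac{\partial}{\partial \theta}[f_{X \rvert \Theta}(x; \theta)]}{f_{X \rvert \Theta}(x; \theta)} f_{X \rvert \Theta}(x; \theta) \, d\mu(x) = \frac{\partial}{\partial \theta} \int f_{X \rvert \Theta}(x; \theta) \, d\mu(x) = \frac{\partial}{\partial \theta} [1] = 0\]

(Condition 3 ensures the division by the density is harmless on the common support.)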
Furthermore, we have the following result under additional constraints.
If the log-likelihood is twice differentiable with respect to $\theta$, then the Fisher information is equal to:
\[\mathcal{I}_X(\theta) = - \mathbb{E}_\Theta \left[ \frac{\partial^2 \log f_{X \rvert \Theta}(x; \theta)}{\partial \theta \, \partial \theta^\top} \bigg\rvert \theta \right]\]To see this, first expand the Hessian of the log density:

\[\frac{\partial^2}{\partial \theta \, \partial \theta^\top} \left[ \log f_{X \rvert \Theta}(x; \theta) \right] = \frac{\frac{\partial^2}{\partial \theta \, \partial \theta^\top}[ f_{X \rvert \Theta}(x; \theta)]}{f_{X \rvert \Theta}(x; \theta)} - \left(\frac{\partial}{\partial \theta} [ \log f_{X \rvert \Theta}(x; \theta)]\right) \left(\frac{\partial}{\partial \theta} [ \log f_{X \rvert \Theta}(x; \theta)]\right)^\top\]

Taking the expected value:
\[\begin{aligned} \mathbb{E}_\Theta \left[ \frac{\partial^2}{\partial \theta \, \partial \theta^\top} \left[ \log f_{X \rvert \Theta}(x; \theta)\right] \bigg\rvert \theta \right] &= \mathbb{E}_\Theta \left[ \frac{\frac{\partial^2}{\partial \theta \, \partial \theta^\top}[ f_{X \rvert \Theta}(x; \theta)]}{f_{X \rvert \Theta}(x; \theta)} \bigg\rvert \theta \right] - \mathbb{E}_\Theta \left[ \left(\frac{\partial}{\partial \theta} [ \log f_{X \rvert \Theta}(x; \theta)]\right) \left(\frac{\partial}{\partial \theta} [ \log f_{X \rvert \Theta}(x; \theta)]\right)^\top \bigg\rvert \theta \right] \\ &= \mathbb{E}_\Theta \left[ \frac{\frac{\partial^2}{\partial \theta \, \partial \theta^\top}[ f_{X \rvert \Theta}(x; \theta)]}{f_{X \rvert \Theta}(x; \theta)} \bigg\rvert \theta \right] - \mathcal{I}_X(\theta) \\ &= \int_\mathcal{X} \left( \frac{\frac{\partial^2}{\partial \theta \, \partial \theta^\top}[ f_{X \rvert \Theta}(x; \theta)]}{f_{X \rvert \Theta}(x; \theta)} \right) f_{X \rvert \Theta}(x; \theta) \, d\mu(x) - \mathcal{I}_X(\theta) \\ &= \int_\mathcal{X} \frac{\partial^2}{\partial \theta \, \partial \theta^\top}[ f_{X \rvert \Theta}(x; \theta)] \, d\mu(x) - \mathcal{I}_X(\theta) \\ &\overset{(i)}{=} \frac{\partial^2}{\partial \theta \, \partial \theta^\top} \left[ \underbrace{\int_\mathcal{X} f_{X \rvert \Theta}(x; \theta) \, d\mu(x)}_{= 1} \right] - \mathcal{I}_X(\theta) \\ &= - \mathcal{I}_X(\theta) \end{aligned}\]The second equality holds because the score has expectation zero, so the expected outer product of the score is exactly its variance, the Fisher information. In $(i)$, we rely on the regularity conditions (specifically a second-derivative analogue of Condition 2) so we can interchange the order of differentiation and integration.
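As a quick numerical check of this identity, here is a minimal sketch (assuming NumPy and, as an illustration, a $\text{Poisson}(\theta)$ model, for which the score is $x/\theta - 1$ and the second derivative of the log density is $-x/\theta^2$):

```python
import numpy as np
rng = np.random.default_rng(0)

# Check the information identity for X ~ Poisson(theta):
# log f(x; theta) = x*log(theta) - theta - log(x!), so the score is
# x/theta - 1 and the second derivative is -x/theta**2.
theta = 3.0
x = rng.poisson(theta, size=1_000_000)

score = x / theta - 1.0
hessian = -x / theta**2

print(np.mean(score))     # score has mean ~= 0
print(np.var(score))      # variance of the score  ~= 1/theta
print(-np.mean(hessian))  # minus expected Hessian ~= 1/theta
print(1 / theta)          # true Fisher information
```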
If we consider the observations $x$ as fixed, then we can define the likelihood function as a function of $\Theta$:
\[\mathcal{L}(\theta; x) = f_{X \rvert \Theta}(x; \theta) \label{eq:lik-func}\]One of the most common settings in which the likelihood function is useful is statistical inference, and a good starting point is point estimation. Intuitively, it seems reasonable to judge the quality of a parameter estimate by how probable the observed sample would be if the estimate were the true parameter value. Put another way, the best estimate we could come up with is the one most likely to have produced the observations we have. Thus, maximum likelihood estimation is born.
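For an i.i.d. sample $x = (x_1, \ldots, x_n)$, which is the typical setting in what follows, the likelihood factors, and we write $\ell$ for the log-likelihood:

\[\mathcal{L}(\theta; x) = \prod_{i=1}^n f_{X \rvert \Theta}(x_i; \theta), \qquad \ell(\theta; x) = \sum_{i=1}^n \log f_{X \rvert \Theta}(x_i; \theta)\]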
If the parameter space $\Omega$ is compact and the likelihood function is continuous over $\Omega$, then a maximum likelihood estimate will exist for a given sample (i.e. the supremum of the likelihood function will be achieved in $\Omega$). If the parameter space is open, the likelihood may keep increasing and never attain its supremum.
MLEs exhibit the invariance property, which is, in words, that a function of an MLE is an MLE of that function of the parameter.
Let $\hat{\theta}$ be an MLE of $\Theta$, and let $g$ be some function of $\theta$. Then $g(\hat{\theta})$ is an MLE of $g(\Theta)$.
Define the induced likelihood function:
\[\mathcal{L}^*(\eta; x) = \underset{\theta: g(\theta) = \eta}{\sup} \left\{ \mathcal{L}(\theta; x) \right\}\]which is a function of $\eta$ equal to the maximum value of the likelihood function over all values of $\theta$ such that $g(\theta) = \eta$. Let:
\[\hat{\eta} = \underset{\eta}{\arg\sup}\left\{ \mathcal{L}^*(\eta; x) \right\}; \hspace{5mm} \hat{\theta} = \underset{\theta}{\arg\sup}\left\{ \mathcal{L}(\theta; x) \right\}\]We have:
\[\begin{aligned} \mathcal{L}^*(\hat{\eta}; x) &= \underset{\eta}{\sup}\left\{ \mathcal{L}^*(\eta; x) \right\} \\ &= \underset{\eta}{\sup}\left\{ \underset{\theta: g(\theta) = \eta}{\sup} \left\{ \mathcal{L}(\theta; x) \right\} \right\} \\ &= \underset{\theta}{\sup}\left\{ \mathcal{L}(\theta; x) \right\} \\ &= \mathcal{L}(\hat{\theta}; x) \\ &= \underset{\theta: g(\theta) = g(\hat{\theta})}{\sup} \left\{ \mathcal{L}(\theta; x) \right\} \\ &= \mathcal{L}^*(g(\hat{\theta}); x) \end{aligned}\]so $g(\hat{\theta})$ attains the supremum of the induced likelihood and is therefore an MLE of $g(\Theta)$. The easiest way to find an MLE is to set the gradient of the log-likelihood equal to zero (taking the log is a monotone transformation, so it does not change the $\arg \max$).
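As a quick worked example (using the $N(\theta, 1)$ model from earlier), the log-likelihood of an i.i.d. sample is $\ell(\theta; x) = -\tfrac{1}{2} \sum_{i=1}^n (x_i - \theta)^2$ up to a constant, so setting the score to zero gives:

\[\frac{\partial \ell(\theta; x)}{\partial \theta} = \sum_{i=1}^n (x_i - \theta) = 0 \quad \Longrightarrow \quad \hat{\theta} = \bar{x}\]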
For Fermat’s Theorem to apply, the log-likelihood must be differentiable on $\Omega$: the theorem says that if a function $f: A \rightarrow \mathbb{R}$ has a local extremum at a point $x_0$ in the interior of $A$ and is differentiable there, then $f'(x_0) = 0$. The Laplace example from Condition 1 is one where this is not the case. The theorem also only speaks to interior points; if $A$ were closed, we would have to check the boundary points as well.
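When no closed form is available (or just as a check), the maximization can be done numerically. A minimal sketch, assuming NumPy/SciPy and, as an illustration, a normal model with both parameters unknown; it also demonstrates the invariance property for $g(\sigma) = \sigma^2$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=500)  # simulated sample (illustrative)

# Negative log-likelihood of N(mu, sigma^2), up to an additive constant.
# We optimize over (mu, log sigma) so the parameter space is unrestricted.
def negloglik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return 0.5 * np.sum((x - mu) ** 2) / sigma**2 + x.size * np.log(sigma)

fit = minimize(negloglik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])

# Invariance: the MLE of sigma^2 is simply sigma_hat**2, which matches the
# closed-form MLE (the mean squared deviation, not the n-1 version).
print(sigma_hat**2, np.mean((x - mu_hat) ** 2))
```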
The MLE is asymptotically normal under suitable conditions.
The requirements are that the true parameter value is in the interior of the parameter space (if it is restricted); the MLE is consistent; the density is nice enough (continuous second derivatives); the order of integration and differentiation can be exchanged; and there is a function with finite mean that bounds the difference between the second derivatives of the log-likelihood at two values of $\Theta$. This last condition delivers a uniform law of large numbers for the second derivatives.
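Under these conditions, and writing $\mathcal{I}_1(\theta^*)$ for the Fisher information of a single observation, the result takes the familiar form:

\[\sqrt{n} \left( \hat{\theta}_n - \theta^* \right) \overset{d}{\longrightarrow} N\left( 0, \, \mathcal{I}_1(\theta^*)^{-1} \right)\]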
We often substitute \(\hat{\theta}_n\) for the (usually unknown) \(\theta^*\) in the Fisher information, which yields the expected Fisher information evaluated at the MLE. By consistency of \(\hat{\theta}_n\) (and continuity of \(\mathcal{I}\)), it converges in probability to the Fisher information given \(\Theta = \theta^*\).
We could instead use the scaled matrix of second partial derivatives of the log-likelihood function evaluated at $\hat{\theta}_n$:
\[- \frac{1}{n} \left\{ \frac{\partial^2 \ell(\Theta; x)}{\partial \Theta_i \, \partial \Theta_j} \bigg\rvert_{\Theta = \hat{\theta}_n} \right\}\]which is called the observed Fisher information. It’s basically the sample analog of the expected information.
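To make the comparison concrete, here is a minimal sketch (assuming NumPy and, as an illustration, the Poisson model again, whose MLE is the sample mean and whose per-observation Fisher information is $1/\theta$):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.poisson(3.0, size=200)  # simulated sample (illustrative)
n = x.size

# Log-likelihood of the Poisson(theta) sample, up to the constant sum(log x!).
loglik = lambda t: np.sum(x * np.log(t) - t)

theta_hat = x.mean()  # the Poisson MLE in closed form

# Observed information: minus the scaled second derivative of the
# log-likelihood at theta_hat, here computed by central differences.
h = 1e-4
obs_info = -(loglik(theta_hat + h) - 2 * loglik(theta_hat)
             + loglik(theta_hat - h)) / (n * h**2)

# Expected Fisher information of one observation, evaluated at the MLE.
exp_info = 1.0 / theta_hat

print(obs_info, exp_info)  # the two coincide for the Poisson family
```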