A Primer
Recently, I’ve run into some confusion about parameter estimation. In this post, I’ll cover the relevant sections of Tsiatis’ Semiparametric Theory and Missing Data (Tsiatis, 2006), along with closely related results from Newey and McFadden (1994).
We assume we have independent and identically distributed random variables, $X_1, \dots, X_n$, arranged into a vector $\mathbf{X}$ defined on the probability space $(\mathcal{X}, \mathcal{A}, P)$. We assume $X_i \sim P$ where $P \in \mathcal{P}$ for some family of distributions, $\mathcal{P}$. We also assume that $P$ is defined by the values of parameters, $\theta$, where some coordinates are of interest and some are a nuisance (i.e. we can partition the parameter space). We use $\beta$ and $\eta$ to denote the parameters of interest and the nuisance parameters, respectively. We have:
\[\begin{equation} \label{eq:parameter-vector} \theta = (\beta^\top, \eta^\top)^\top \in \mathbb{R}^{p \times 1} \hspace{2mm} (p = q + r), \hspace{8mm} \beta \in \mathbb{R}^{q \times 1}, \hspace{8mm} \eta \in \mathbb{R}^{r \times 1} \end{equation}\]As in Tsiatis’ book, we’ll restrict the parameter space, $\Omega$, to subspaces of linear vector spaces, which themselves could be finite- or infinite-dimensional (depending on whether there exists a finite set of elements that spans the space). In most applications/examples, this assumption will not be too restrictive since we will usually be working in Euclidean spaces.
Uppercase letters denote random variables, and lowercase letters denote realizations of these variables. Boldface indicates a vector or matrix, exactly which will be clear from context.
We will indicate parameters that are functions with parentheses (e.g. $\gamma(\cdot)$), and we’ll use $\hat{\gamma}_n$ to denote an estimator of parameter $\gamma$. We’ll use a subscript of $0$ to denote the true parameter values (i.e. $\theta_0 = (\beta_0^\top, \eta_0^\top)^\top$).
$\nu_X$ denotes the dominating measure with respect to which densities for $\mathbf{X}$ are defined. That is, the dominating measure is any measure $\nu_X(\cdot)$ such that $P$ is absolutely continuous with respect to $\nu_X$ (i.e. $\nu_X(A) = 0$ implies $P(A) = 0$), so that the density $f_\mathbf{X}(\mathbf{x}) = dP/d\nu_X$ exists by the Radon-Nikodym theorem. For continuous $X$, we usually use the Lebesgue measure, and for discrete $X$, we use the counting measure.
I am pretty sure all of the expectations are taken with respect to the true data-generating distribution, $p_X(x; \theta_0)$. I’ve denoted this with a subscript of $\theta_0$ on the expectation, but I may have missed a few.
Convergence in probability and convergence in distribution when the density of the random variable $X$ is equal to $p_X(x; \beta, \eta)$ are denoted by, respectively:
\[\xrightarrow{P\{ \beta, \eta \}}, \hspace{10mm} \xrightarrow{D\{ \beta, \eta \}}\]General convergence in probability and distribution are denoted by:
\[\overset{p}{\rightarrow}, \hspace{10mm} \rightsquigarrow\]Before we dive into our discussion of semiparametric inference, we need to first look at some geometric definitions.
Consider the space of functions of the form $h(X)$, where $h: \mathcal{X} \rightarrow \mathbb{R}^q$ is measurable (with respect to the probability space described earlier) and satisfies:
\[\mathbb{E}[h(X)] = 0, \hspace{10mm} \mathbb{E}\left[ h^\top(X) h(X)\right] < \infty\]That is, we consider the space of all measurable, $q$-dimensional random functions with mean zero and finite variance. Notice that this space is a linear space. Let $\mathbf{0}$ (the constant function outputting a $q$-dimensional vector of zeros) denote the origin. Though these are random functions, we will sometimes drop the $(X)$ from our notation to just write $h$.
Later we will concern ourselves with Hilbert spaces of these types of functions. We define these formally for completeness. Recall that a Hilbert space is a linear vector space equipped with an inner product that is complete under the norm induced by that inner product.
Our Hilbert spaces of interest (defined above) use the inner product $\langle h_1, h_2 \rangle = \mathbb{E}_{\theta_0}\left[ h_1^\top(X) h_2(X) \right]$, so they are dependent upon the true value $\theta_0$: this is the value with respect to which we take the expectation in the inner product. Thus, the space will change if $\theta_0$ changes.
We come to a theorem that will be helpful later in some definitions and results.
Let $\mathcal{H}$ denote the Hilbert space defined above, and let $\mathcal{U}$ denote a closed linear subspace. For any $h \in \mathcal{H}$, there exists a unique $u_0 \in \mathcal{U}$ such that:
\[\rvert \rvert h - u_0 \rvert \rvert \leq \rvert \rvert h - u \rvert \rvert, \hspace{10mm} \forall u \in \mathcal{U}\]We call $u_0$ the projection of $h$ onto $\mathcal{U}$, and we usually use $\Pi (h \rvert \mathcal{U})$ to denote it. $\Pi(h \rvert \mathcal{U})$ satisfies:
\[\langle h - \Pi(h \rvert \mathcal{U}), u \rangle = 0, \hspace{10mm} \forall u \in \mathcal{U}\]Proof to be completed.
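As a concrete (toy) illustration of the Projection theorem, take $X \sim \text{Exponential}(1)$, $h(X) = X^2 - 2$, and the one-dimensional closed subspace $\mathcal{U} = \{ b(X - 1): b \in \mathbb{R} \}$; both functions are mean zero with finite variance, so they live in $\mathcal{H}$ (with $q = 1$). The orthogonality condition pins down the projection coefficient as $\langle h, u \rangle / \langle u, u \rangle$. Here is a minimal Monte Carlo sketch (the model and variable names are my own, not from Tsiatis):

```python
import numpy as np

# Toy illustration of the Projection theorem in the Hilbert space of mean-zero,
# finite-variance functions of X, with inner product <h1, h2> = E[h1(X) h2(X)].
# Here X ~ Exponential(1), h(X) = X^2 - 2, and U = {b * (X - 1) : b in R}.
rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)

h = x**2 - 2.0          # mean-zero function to project
u = x - 1.0             # spans the (closed, one-dimensional) subspace U

# Projection coefficient b0 = <h, u> / <u, u>, estimated by sample averages.
b0 = np.mean(h * u) / np.mean(u * u)
proj = b0 * u           # Pi(h | U)
resid = h - proj        # h - Pi(h | U)

print(b0)                    # ~4 (the exact value for this toy model)
print(np.mean(resid * u))    # ~0: the residual is orthogonal to U
# ||h - proj||^2 <= ||h - b*u||^2 for any other b, e.g. b = b0 + 1:
print(np.mean(resid**2) <= np.mean((h - (b0 + 1.0) * u) ** 2))
```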
A version of the Pythagorean theorem and the Cauchy-Schwarz inequality can be derived as a result of the Projection theorem.
Before we dive into more specifics, let’s define a few different classes of estimators.
The name comes from the fact that we are maximizing the objective (i.e. finding an extreme value). An example is the maximum likelihood estimator (MLE), which has the (normalized) log-likelihood as its objective function:
\[\hat{Q}_n(X, \theta) = \frac{1}{n} \sum_{i = 1}^n \log (p_X(x_i; \theta))\]In certain cases, the maximization can be done by taking the derivative and setting this equal to zero. However, we should be cautious as there are cases when there are many such solutions. In this case, a solution to this first-order condition may not be the global maximum of $\hat{Q}_n(X, \theta)$ (i.e. a solution may be a local maximum). Luckily, if the extremum estimator is consistent and $\theta_0$ is on the interior of $\Omega$, then the extremum estimator will be included in the set of solutions to setting the derivative equal to zero.
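To make the extremum-estimator idea concrete, here is a minimal sketch (a toy exponential model of my own choosing, not an example from the references) that computes the MLE of an exponential rate by numerically maximizing the normalized log-likelihood, then compares it to the closed-form solution of the first-order condition:

```python
import numpy as np
from scipy.optimize import minimize

# Toy extremum estimator: MLE for the rate theta of an Exponential(theta) model.
# We maximize Q_n(theta) = (1/n) * sum_i log p(x_i; theta) by minimizing -Q_n.
rng = np.random.default_rng(1)
theta_true = 2.0
x = rng.exponential(scale=1.0 / theta_true, size=5_000)

def neg_avg_loglik(theta):
    t = theta[0]
    if t <= 0:
        return np.inf                       # keep the search inside Omega = (0, inf)
    return -np.mean(np.log(t) - t * x)      # -(1/n) sum_i log p(x_i; t)

res = minimize(neg_avg_loglik, x0=[1.0], method="Nelder-Mead")
print(res.x[0])         # numerical maximizer of the objective
print(1.0 / x.mean())   # closed-form MLE, i.e. the solution of the score equation
```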
The extremum estimator class is quite broad. A subclass of extremum estimator is the $m$-estimator.
Put simply, an extremum estimator that maximizes a sample average is an $m$-estimator[^fn-newey]. Before we look at an example, let’s define the score.
The MLE is also an example of an $m$-estimator. Consider the function $m(X, \theta) = \log (p_X(x; \theta))$, the log-likelihood. We can maximize the log-likelihood by taking the derivative with respect to $\theta$ and setting this equal to zero. This is called the score equation in $\theta$:
\[\sum_{i = 1}^n U_\theta(X_i, \theta) = 0\]Some estimators (asymptotically linear ones…to be defined later) can be analyzed with respect to a particular defining function, called the influence function.
Note that the influence function is always defined with respect to the true data-generating distribution; it is evaluated at $\theta_0$, and the expectations are always taken with respect to the parameter being $\theta_0$. Thus, we don’t really need to write $\psi(X, \theta_0)$, since the influence function is not a function of $\theta$ (it is just dependent on $\theta_0$). However, I keep the full notation to avoid confusion.
We can also define special properties of estimators to help us describe the ones that are desirable. We say that $\hat{\theta}_n$ is asymptotically linear if it has an influence function(s). Thus, the maximum likelihood estimator is asymptotically linear.
The influence function(s) provides us a clear way to analyze the behavior of our (asymptotically linear) estimator as $n \rightarrow \infty$:
\[\begin{equation} \label{eq:al-dist} \begin{aligned} \frac{1}{\sqrt{n}}\sum_{i = 1}^n \psi(X_i, \theta_0) &\rightsquigarrow \mathcal{N}(\mathbf{0}_{q \times 1}, \mathbb{E}_{\theta_0}\left[ \psi(X, \theta_0) \psi(X, \theta_0)^\top \right]) & \left(\text{CLT}\right) \\ \sqrt{n}(\hat{\theta}_n - \theta_0) &\rightsquigarrow \mathcal{N}(\mathbf{0}_{q \times 1}, \mathbb{E}_{\theta_0}\left[ \psi(X, \theta_0) \psi(X, \theta_0)^\top \right]) & \left(\text{Slutsky's theorem}\right) \end{aligned} \end{equation}\]Eq. \eqref{eq:al-dist} implies that the asymptotic variance of the estimator is the variance of its influence function. Furthermore, an asymptotically linear estimator is (effectively) uniquely identified by its influence function.
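As a quick sanity check of Eq. \eqref{eq:al-dist}, consider the sample mean $\hat{\theta}_n = \bar{X}_n$, which is asymptotically linear with influence function $\psi(X, \theta_0) = X - \theta_0$. The Monte Carlo sketch below (my own toy setup) compares the sampling variance of $\sqrt{n}(\hat{\theta}_n - \theta_0)$ to the variance of the influence function:

```python
import numpy as np

# The sample mean is asymptotically linear with influence function psi(X) = X - theta_0,
# so Var(sqrt(n) * (theta_hat - theta_0)) should approach Var(psi(X)) = Var(X).
rng = np.random.default_rng(2)
theta_0 = 1.0                 # true mean of an Exponential(1) (toy choice)
n, n_reps = 2_000, 5_000

samples = rng.exponential(scale=theta_0, size=(n_reps, n))
theta_hat = samples.mean(axis=1)

print(np.var(np.sqrt(n) * (theta_hat - theta_0)))   # ~1.0, the sampling variance
print(np.var(samples[0] - theta_0))                 # ~1.0, the variance of psi(X)
```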
An asymptotically linear estimator has an almost surely unique influence function.
Tsiatis proceeds by contradiction. Let our estimator be denoted by $\hat{\theta}_n$ with true value $\theta_0$, and let $n$ be our sample size. Let $\psi(X, \theta_0)$ be an influence function, and assume that there exists another influence function $\psi^*(X, \theta_0)$. Thus:
\[\mathbb{E}_{\theta_0}[\psi(X, \theta_0)] = \mathbb{E}_{\theta_0}[\psi^*(X, \theta_0)] = 0, \hspace{10mm} \sqrt{n}(\hat{\theta}_n - \theta_0) = \frac{1}{\sqrt{n}}\sum_{i = 1}^n \psi(X_i, \theta_0) + o_p(1) = \frac{1}{\sqrt{n}}\sum_{i = 1}^n \psi^*(X_i, \theta_0) + o_p(1)\]Recall that $X_1, \dots, X_n$ are i.i.d., so, by the central limit theorem:
\[\frac{1}{\sqrt{n}}\sum_{i = 1}^n (\psi(X_i, \theta_0) - \psi^*(X_i, \theta_0)) \rightsquigarrow \mathcal{N}\left(\mathbf{0}, \mathbb{E}_{\theta_0}\left[ (\psi(X, \theta_0) - \psi^*(X, \theta_0))(\psi(X, \theta_0) - \psi^*(X, \theta_0))^\top\right] \right) \nonumber\]However, subtracting the two asymptotically linear representations above shows that this same quantity is $o_p(1)$:
\[\begin{aligned} &\frac{1}{\sqrt{n}}\sum_{i = 1}^n (\psi(X_i, \theta_0) - \psi^*(X_i, \theta_0)) = o_p(1) \\ \implies &\underset{n \rightarrow \infty}{\lim} \mathbb{P}\left( \bigg\rvert \frac{1}{\sqrt{n}}\sum_{i = 1}^n (\psi(X_i, \theta_0) - \psi^*(X_i, \theta_0)) \bigg\rvert \geq \epsilon \right) = 0, \hspace{5mm} \forall \epsilon > 0 \end{aligned}\]For both of the above to be true, we need:
\[\mathbb{E}_{\theta_0}\left[ (\psi(X, \theta_0) - \psi^*(X, \theta_0))(\psi(X, \theta_0) - \psi^*(X, \theta_0))^\top\right] = \mathbf{0}_{q \times q}\]which implies $\psi(X, \theta_0) = \psi^*(X, \theta_0)$ almost surely.
The score vector also satisfies nice properties, which we summarize in the following theorem. First, let’s define what it means for an estimator to be “regular”.
In many cases, the maximum likelihood estimator is regular. When an estimator is both asymptotically linear and regular, we say it is RAL.
Let $\beta(\theta)$ be a $q$-dimensional function of $p$-dimensional parameter $\theta$ such that $q < p$. Assume the following exists:
\[\begin{equation} \label{eq:gamma} \Gamma(\theta) = \frac{\partial \beta(\theta)}{\partial \theta^\top} \end{equation}\]which is the $q \times p$ matrix of first order partial derivatives of vector $\beta(\theta)$ with respect to $\theta$, and assume it is continuous in $\theta$ in a neighborhood of $\theta_0$, the true parameter value. Let \(\hat{\beta}_n\) denote an asymptotically linear estimator with influence function $\psi(X, \theta_0)$ such that \(\mathbb{E}_\theta[\psi^\top(X, \theta_0) \psi(X, \theta_0)]\) exists and is also continuous in $\theta$ in a neighborhood of $\theta_0$. If $\hat{\beta}_n$ is regular, then:
\[\begin{equation} \label{eq:condition-3-2} \mathbb{E}_{\theta_0}\left[ \psi(X,\theta_0) U_\theta^\top(X, \theta_0) \right] = \Gamma(\theta_0) \end{equation}\]If the parameter space can be partitioned as $\theta = (\beta^\top, \eta^\top)^\top$ where $\beta$ is $q$-dimensional and $\eta$ is $r$-dimensional, then:
\[\begin{equation} \label{eq:corollary-1} \mathbb{E}_{\theta_0}\left[ \psi(X, \theta_0) U_\beta^\top(X, \theta_0)\right] = \mathbb{I}_{q \times q}, \hspace{10mm} \mathbb{E}_{\theta_0}\left[ \psi(X, \theta_0) U_\eta^\top(X, \theta_0) \right] = \mathbf{0}_{q \times r} \end{equation}\]where:
\[U_\beta(x, \theta_0) = \frac{\partial \log(p_X(x; \theta))}{\partial \beta} \bigg\rvert_{\theta = \theta_0}, \hspace{10mm} U_\eta(x, \theta_0) = \frac{\partial \log(p_X(x; \theta))}{\partial \eta} \bigg\rvert_{\theta = \theta_0}\]See pg. 34 in Tsiatis (2006) for a proof.
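Here is a quick Monte Carlo check of Eq. \eqref{eq:corollary-1} in a toy model of my own (not one of Tsiatis’ worked examples): take $X \sim \mathcal{N}(\beta, \eta)$ with $\eta$ the variance, so the sample mean is an RAL estimator of $\beta$ with influence function $\psi(X, \theta_0) = X - \beta_0$.

```python
import numpy as np

# Toy model: X ~ N(beta, eta) with eta the variance, theta = (beta, eta).
# The sample mean has influence function psi(X) = X - beta_0; check Eq. (corollary-1).
rng = np.random.default_rng(3)
beta_0, eta_0 = 2.0, 3.0
x = rng.normal(loc=beta_0, scale=np.sqrt(eta_0), size=2_000_000)

psi = x - beta_0                    # influence function of the sample mean
u_beta = (x - beta_0) / eta_0       # score for beta, evaluated at theta_0
u_eta = -1.0 / (2 * eta_0) + (x - beta_0) ** 2 / (2 * eta_0**2)   # score for eta

print(np.mean(psi * u_beta))   # ~1: E[psi U_beta^T] is the (1x1) identity
print(np.mean(psi * u_eta))    # ~0: E[psi U_eta^T] is the (1x1) zero matrix
```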
Note that the influence functions for RAL estimators lie in the set of elements of $\mathcal{H}$ (the Hilbert space of mean-zero measurable random functions) satisfying Eq. \eqref{eq:corollary-1}. Conversely, each element of that set is the influence function of some RAL estimator.
From this geometric perspective, we can see that the asymptotic variance of an RAL estimator is basically the squared distance between the origin and its influence function in our special Hilbert space.
We can show under what conditions extremum estimators are consistent ($\hat{\theta}_n \overset{p}{\rightarrow} \theta_0$). This is called the Basic Consistency Theorem by Newey and McFadden. In what follows, we use $Q_0(\theta)$ to denote the probability limit of $\hat{Q}_n(X, \theta)$, which is the quantity described in the appendix.
Let $\hat{Q}_n(X,\theta)$ be the objective function for an extremum estimator, $\hat{\theta}_n$. If there exists a function $Q_0(\theta)$ satisfying:

1. $Q_0(\theta)$ is uniquely maximized at $\theta_0$,
2. the parameter space $\Omega$ is compact,
3. $Q_0(\theta)$ is continuous on $\Omega$, and
4. $\hat{Q}_n(X, \theta)$ converges uniformly in probability to $Q_0(\theta)$ over $\Omega$,
then $\hat{\theta}_n$ is consistent; i.e.:
\[\hat{\theta}_n \overset{p}{\rightarrow} \theta_0\]See pg. 2121 in Newey and McFadden (1994).
The authors note that some of the conditions can be relaxed. Instead of assuming that $\hat{\theta}_n$ maximizes $\hat{Q}_n(X, \theta)$, we can assume that it “nearly” maximizes it:
\[\hat{Q}_n(X, \hat{\theta}_n) \geq \underset{\theta \in \Omega}{\sup} \hat{Q}_n(X, \theta) - o_p(1)\]The second condition can be relaxed if the objective function, $\hat{Q}_n(X, \theta)$, is concave. Then the assumption of compactness of the parameter space, $\Omega$, can be exchanged for just convexity.
In addition, the third condition can be relaxed from continuity to upper semi-continuity. That is, we assume that, for any $\theta \in \Omega$ and any $\epsilon > 0$, there exists an open subset $\mathcal{B} \subset \Omega$ such that $\theta \in \mathcal{B}$ and such that:
\[Q_0(\theta') < Q_0(\theta) + \epsilon \hspace{5mm} \forall \theta' \in \mathcal{B}\]The fourth condition can be changed to just require that:
\[\hat{Q}_n(X, \theta_0) \overset{p}{\rightarrow} Q_0(\theta_0), \hspace{5mm} \text{and} \hspace{5mm} \hat{Q}_n(X, \theta) < Q_0(\theta) + \epsilon \hspace{5mm} \forall \epsilon > 0, \hspace{1mm} \forall \theta \in \Omega\]with probability approaching $1$. If we make the stronger assumption that:
\[\underset{\theta \in \Omega}{\sup} \big\rvert \hat{Q}_n(X, \theta) - Q_0(\theta) \big\rvert \overset{as}{\rightarrow} 0\]instead of the fourth condition, then we have that $\hat{\theta}_n \overset{as}{\rightarrow} \theta_0$ (i.e. $\hat{\theta}_n$ is strongly consistent).
In order to use Theorem 2.1, one must be able to show that the conditions (or their relaxations) hold. This can be difficult in practice, so we often instead verify other conditions that are sufficient for them; Newey and McFadden call these primitive conditions.
Let $A(X, \theta)$ be a matrix of functions of observation $X$ and parameter $\theta$. Let $\rvert \rvert A \rvert \rvert$ denote the Euclidean norm. Let $\Omega$ be a compact parameter space, and suppose our data are independent and identically distributed.
Suppose that each element of $A(X, \theta)$ is continuous at each $\theta \in \Omega$ with probability one, and suppose that there exists a function $d(X)$ such that $\rvert \rvert A(X, \theta) \rvert \rvert \leq d(X)$ for all $\theta \in \Omega$. Assume $\mathbb{E}[d(X)] < \infty$. Then:
\[\mathbb{E}[A(X, \theta)] \text{ is continuous}\]and:
\[\underset{\theta \in \Omega}{\sup} \left\vert \left\vert\frac{1}{n} \sum_{i = 1}^n \left(A(X_i, \theta) - \mathbb{E}[A(X, \theta)] \right) \right\vert \right\vert \overset{p}{\rightarrow} 0\]Proof to be completed.
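To see the uniform convergence in action, here is a small simulation (a toy setup of mine, not from Newey and McFadden) with $A(X, \theta) = \log p_X(X; \theta)$ for an $\mathcal{N}(\theta, 1)$ model, a compact parameter space approximated by a grid, and data drawn from $\mathcal{N}(0, 1)$:

```python
import numpy as np

# Uniform LLN illustration: A(X, theta) = log-density of N(theta, 1), data X ~ N(0, 1),
# compact parameter space Omega = [-2, 2] approximated by a grid of theta values.
rng = np.random.default_rng(4)
theta_grid = np.linspace(-2.0, 2.0, 201)

def sup_deviation(n):
    x = rng.normal(size=n)
    # Sample average of A(X_i, theta) for each theta on the grid.
    avg = np.array(
        [np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (x - t) ** 2) for t in theta_grid]
    )
    # E[A(X, theta)] under the true N(0, 1): E[(X - theta)^2] = 1 + theta^2.
    expected = -0.5 * np.log(2 * np.pi) - 0.5 * (1.0 + theta_grid**2)
    return np.max(np.abs(avg - expected))

for n in (100, 1_000, 10_000, 100_000):
    print(n, sup_deviation(n))    # the supremum deviation shrinks toward 0
```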
The above lemma can be used for sample averages, which is exactly what we deal with in maximum likelihood estimation (and several other estimation settings). This leads us to the next theorem that states that, under certain conditions, the MLE is consistent:
Let $X_1, X_2, \dots$ be i.i.d. data with probability density function $p_X(x; \theta_0)$. If the following conditions are met, then $\hat{\theta}_n \overset{p}{\rightarrow} \theta_0$ (the MLE is consistent):
The result follows from ensuring the conditions of Theorem 2.1 are satisfied and then applying Lemma 2.4.
To construct confidence intervals, we often rely upon the asymptotic normality of an estimator.
Let $\hat{\theta}_n$ be an extremum estimator; that is, it maximizes some objective function $\hat{Q}_n(X, \theta)$ subject to $\theta \in \Omega$ given a sample size of $n$. If the following conditions are satisfied:

1. $\hat{\theta}_n \overset{p}{\rightarrow} \theta_0$,
2. $\theta_0$ is in the interior of $\Omega$,
3. $\hat{Q}_n(X, \theta)$ is twice continuously differentiable in a neighborhood $\mathcal{N}$ of $\theta_0$,
4. $\sqrt{n} \left[ \frac{\partial \hat{Q}_n(X, \theta)}{\partial \theta} \right] \bigg\rvert_{\theta = \theta_0} \rightsquigarrow \mathcal{N}(\mathbf{0}, \Sigma)$, and
5. $\frac{\partial^2 \hat{Q}_n(X, \theta)}{\partial \theta \partial \theta^\top}$ converges uniformly in probability over $\mathcal{N}$ to a matrix $H(\theta)$ that is continuous at $\theta_0$, with $H = H(\theta_0)$ nonsingular.
Then:
\[\begin{equation} \label{eq:asymptotic-dist} \sqrt{n}(\hat{\theta}_n - \theta_0) \rightsquigarrow \mathcal{N}\left(\mathbf{0}, H^{-1} \Sigma H^{-1}\right) \end{equation}\]where $\Sigma$ is the asymptotic variance of \(\sqrt{n} \left[ \frac{\partial \hat{Q}_n(X, \theta)}{\partial \theta} \right] \bigg\rvert_{\theta = \theta_0}\), and \(H = \underset{n \rightarrow \infty}{\text{plim}} \left[ \left. \frac{\partial^2 \hat{Q}_n(\theta)}{\partial \theta \partial \theta^\top} \right\vert_{\theta = \theta_0} \right]\).
See Section 3.5 in Newey and McFadden (1994).
Applied to maximum likelihood estimators, Theorem 3.1 gives:
\[\sqrt{n}(\hat{\theta}_n - \theta_0) \rightsquigarrow \mathcal{N}(\mathbf{0}, \mathcal{I}^{-1}(\theta_0))\]Newey and McFadden make some important points about the conditions stated in Theorem 3.1. Perhaps most notable is the need for the true parameter value to be located in the interior of the parameter space. When this condition is not met ($\theta_0$ lies on the boundary of $\Omega$), the asymptotic normality result is no longer guaranteed (though it could still hold). This comes up in variance estimation, since we usually constrain variance parameters to the positive reals. We also need the average score over the sample to satisfy a central limit theorem (this gives us the “base” Normal distribution), and the inverse Hessian needs to converge to a constant (so we can apply Slutsky’s theorem).
For confidence intervals, we need a consistent estimator of the asymptotic variance, $H^{-1} \Sigma H^{-1}$, of the estimator. Usually we do this by estimating the components and then plugging them in; i.e. we find $\hat{H}^{-1}$ and $\hat{\Sigma}$ and use $\hat{H}^{-1} \hat{\Sigma} \hat{H}^{-1}$.
Under the conditions of Theorem 3.1, if \(\hat{H} = \left. \left[ \frac{\partial^2 \hat{Q}_n(X, \theta)}{\partial \theta \partial \theta^\top}\right]\right\vert_{\theta = \hat{\theta}_n} \overset{p}{\rightarrow} H\) and \(\hat{\Sigma} \overset{p}{\rightarrow} \Sigma\), then:
\[\hat{H}^{-1} \hat{\Sigma} \hat{H}^{-1} \overset{p}{\rightarrow} H^{-1} \Sigma H^{-1}\]Proof to be completed.
This estimator is sometimes called a sandwich estimator since the middle term, $\hat{\Sigma}$, is “sandwiched” between two copies of $\hat{H}^{-1}$.
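Here is a minimal sketch of the plug-in sandwich estimator for the toy exponential-rate MLE used earlier (again a model of my own choosing, not from the references): $\hat{H}$ is the sample average of the second derivative of the log-density at $\hat{\theta}_n$, and $\hat{\Sigma}$ is the sample average of the squared score at $\hat{\theta}_n$.

```python
import numpy as np

# Plug-in sandwich variance for the exponential-rate MLE theta_hat = 1 / mean(x).
# For a scalar theta, H_hat and Sigma_hat are scalars: H_hat is the average Hessian
# of log p(x; theta) at theta_hat, Sigma_hat the average squared score at theta_hat.
rng = np.random.default_rng(5)
theta_true = 2.0
x = rng.exponential(scale=1.0 / theta_true, size=10_000)

theta_hat = 1.0 / x.mean()
score = 1.0 / theta_hat - x        # d/dtheta log p(x; theta), evaluated at theta_hat
H_hat = -1.0 / theta_hat**2        # d^2/dtheta^2 log p (constant in x for this model)
Sigma_hat = np.mean(score**2)      # average outer product of the score

sandwich = Sigma_hat / H_hat**2    # H^{-1} Sigma H^{-1} in one dimension
print(sandwich)                    # ~theta_true^2, the inverse Fisher information
print(theta_hat**2)                # plug-in inverse information, for comparison
```

Because the model is correctly specified in this toy example, the sandwich agrees (up to Monte Carlo error) with the plug-in inverse information.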
For maximum likelihood estimators, one can estimate the asymptotic variance with any consistent estimator of the inverse Fisher information matrix. Some examples provided by Newey and McFadden are to use the negative Hessian (if regularity conditions are met) or the sample average of the outer product of the score. However, one should use caution because, under model misspecification, estimators of the inverse information matrix may not be consistent.
We can compare different asymptotically normal estimators by their efficiency. But before we begin, we need to define a few new quantities.
Recall that the tangent space, $\mathcal{T}$, is the linear subspace of $\mathcal{H}$ spanned by the full score vector, $\mathcal{T} = \left\{ B U_\theta(X, \theta_0): B \in \mathbb{R}^{q \times p} \right\}$. We can define a similar space when the parameter vector can be partitioned as $\theta = (\beta^\top, \eta^\top)^\top$. The nuisance tangent space is the linear subspace of $\mathcal{H}$ spanned by the nuisance score vector $U_\eta(X, \theta_0)$:
\[\begin{equation} \label{eq:nuisance-space} \Lambda = \left\{ B U_\eta(X, \theta_0): B \in \mathbb{R}^{q \times r}\right\} \end{equation}\]The tangent space generated by $U_\beta(X, \theta_0)$ is:
\[\begin{equation} \label{eq:interest-space} \mathcal{T}_\beta = \left\{ B U_\beta(X, \theta_0): B \in \mathbb{R}^{q \times q} \right\} \end{equation}\]Notably, the direct sum of these two spaces equals the tangent space generated by the entire score vector:
\[\mathcal{T} = \mathcal{T}_\beta \oplus \Lambda\]The set of elements of $\mathcal{H}$ that are orthogonal to $\Lambda$ is given by the residuals $h - \Pi(h \rvert \Lambda)$ for $h \in \mathcal{H}$, where $\Pi(h \rvert \Lambda)$ denotes the projection of $h$ onto $\Lambda$.
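The projection onto $\Lambda$ has a convenient closed form that follows from the orthogonality condition of the Projection theorem: $\Pi(h \rvert \Lambda) = B_0 U_\eta(X, \theta_0)$ with $B_0 = \mathbb{E}_{\theta_0}\left[h U_\eta^\top\right] \left( \mathbb{E}_{\theta_0}\left[U_\eta U_\eta^\top\right] \right)^{-1}$. Here is a Monte Carlo sketch in the toy model $X \sim \mathcal{N}(\beta, \eta)$ (my own example; $h$ is chosen so that its residual works out to $X - \beta_0$):

```python
import numpy as np

# Projection onto the nuisance tangent space Lambda = {B * U_eta} in the toy model
# X ~ N(beta, eta) (eta = variance).  The orthogonality condition gives
# B_0 = E[h U_eta^T] (E[U_eta U_eta^T])^{-1}.
rng = np.random.default_rng(6)
beta_0, eta_0 = 0.0, 2.0
x = rng.normal(loc=beta_0, scale=np.sqrt(eta_0), size=2_000_000)

u_eta = -1.0 / (2 * eta_0) + (x - beta_0) ** 2 / (2 * eta_0**2)   # nuisance score
h = (x - beta_0) + (x - beta_0) ** 2 - eta_0                      # a mean-zero h(X)

B_0 = np.mean(h * u_eta) / np.mean(u_eta * u_eta)
proj = B_0 * u_eta     # Pi(h | Lambda)
resid = h - proj       # the part of h orthogonal to the nuisance tangent space

print(np.mean(resid * u_eta))                 # ~0: residual is orthogonal to U_eta
print(np.mean((resid - (x - beta_0)) ** 2))   # ~0: here the residual is X - beta_0
```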
We call the influence function with the smallest variance matrix the efficient influence function. For RAL estimators, we say that $\psi^{(1)}(X, \theta_0)$ has smaller (asymptotic) variance than $\psi^{(2)}(X, \theta_0)$ if $\text{Var}(\psi^{(2)}(X, \theta_0)) - \text{Var}(\psi^{(1)}(X, \theta_0))$ is positive semidefinite; equivalently, if for all $q \times 1$ constant vectors $a$:
\[\text{Var}(a^\top \psi^{(1)}(X, \theta_0)) \leq \text{Var}(a^\top \psi^{(2)}(X, \theta_0))\]This brings us to a theorem that characterizes the set of influence functions.
The set of all influence functions is the linear variety \(\psi^*(X, \theta_0) + \mathcal{T}^\perp\) where \(\psi^*(X, \theta_0)\) is any influence function, and \(\mathcal{T}^\perp\) is the space perpendicular to the tangent space.
See pg. 45-46 of Tsiatis (2006).
This result states that we can construct all influence functions of RAL estimators by taking an arbitrary influence function and adding any element from the orthogonal complement of the tangent space to it. Theorem 3.4 can be used to define the efficient influence function.
Let \(\psi^*(X, \theta_0)\) be any influence function, and let $\mathcal{T}$ be the tangent space generated by the score vector. The efficient influence function is given by the projection of $\psi^*(X, \theta_0)$ onto the tangent space:
\[\psi_{\text{eff}}(X, \theta_0) = \psi^*(X, \theta_0) - \Pi(\psi^*(X, \theta_0) \rvert \mathcal{T}^\perp) = \Pi(\psi^*(X, \theta_0) \rvert \mathcal{T})\]The efficient influence function can be written as:
\[\psi_{\text{eff}}(X, \theta_0) = \Gamma(\theta_0) \mathcal{I}^{-1}(\theta_0) U_\theta(X, \theta_0)\]where $\Gamma(\theta_0)$ is the matrix defined in Eq. \eqref{eq:gamma} and $\mathcal{I}(\theta_0)$ is the information matrix.
See pg. 46-47 in Tsiatis (2006).
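As a quick worked example (mine, not one from the book): in the model $X \sim \mathcal{N}(\beta, \eta)$ with $\eta$ the (nuisance) variance and $\beta(\theta) = \beta$ the parameter of interest, we have
\[\Gamma(\theta_0) = \begin{bmatrix} 1 & 0 \end{bmatrix}, \hspace{8mm} \mathcal{I}(\theta_0) = \begin{bmatrix} 1/\eta_0 & 0 \\ 0 & 1/(2\eta_0^2) \end{bmatrix}, \hspace{8mm} U_\theta(X, \theta_0) = \begin{bmatrix} (X - \beta_0)/\eta_0 \\ -\frac{1}{2\eta_0} + \frac{(X - \beta_0)^2}{2\eta_0^2} \end{bmatrix}\]so that $\psi_{\text{eff}}(X, \theta_0) = \Gamma(\theta_0) \mathcal{I}^{-1}(\theta_0) U_\theta(X, \theta_0) = X - \beta_0$, the influence function of the sample mean, which has variance $\eta_0$.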
We can define an efficient version of the score if we are in the setting where we can partition $\theta = (\beta^\top, \eta^\top)^\top$. The efficient score is the residual of the score for $\beta$ after projecting it onto the nuisance tangent space:
\[U_\text{eff}(X, \theta_0) = U_\beta(X, \theta_0) - \Pi\left(U_\beta(X, \theta_0) \rvert \Lambda\right)\]
In this setting, we can construct the efficient influence function as:
\[\begin{equation} \label{eq:partitioned-efficient-influence} \psi_\text{eff}(X, \theta_0) = \left[ \mathbb{E}_{\theta_0}\left[ U_\text{eff}(X, \theta_0) U^\top_\text{eff}(X, \theta_0) \right] \right]^{-1} U_\text{eff}(X, \theta_0) \end{equation}\]and has variance equal to:
\[\text{Var}(\psi_{\text{eff}}(X, \theta_0)) = \left[ \mathbb{E}_{\theta_0} \left[ U_\text{eff}(X, \theta_0) U^\top_\text{eff}(X, \theta_0) \right] \right]^{-1}\]which is the inverse variance matrix of the efficient score. If we partition the variance of the score vector as:
\[\text{Var}(U_\theta(X, \theta_0)) = \mathcal{I} = \begin{bmatrix} \mathcal{I}_{\beta, \beta} = \mathbb{E}_{\theta_0}\left[ U_\beta(X, \theta_0) U^\top_\beta(X, \theta_0) \right] & \mathcal{I}_{\beta, \eta} = \mathbb{E}_{\theta_0}\left[ U_\beta(X, \theta_0) U^\top_\eta(X, \theta_0) \right] \\ \mathcal{I}_{\eta, \beta} = \mathbb{E}_{\theta_0}\left[ U_\eta(X, \theta_0) U^\top_\beta(X, \theta_0) \right] & \mathcal{I}_{\eta, \eta} = \mathbb{E}_{\theta_0}\left[ U_\eta(X, \theta_0) U^\top_\eta(X, \theta_0) \right] \end{bmatrix}\]then we can use the Schur complement formula to show that the variance of the efficient influence function is:
\[\begin{equation} \label{eq:var-eff-influence} \text{Var}(\psi_{\text{eff}}(X, \theta_0)) = \left[\mathcal{I}_{\beta, \beta} - \mathcal{I}_{\beta, \eta} \mathcal{I}^{-1}_{\eta, \eta} \mathcal{I}^\top_{\beta, \eta} \right]^{-1} \end{equation}\]
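Eq. \eqref{eq:var-eff-influence} is just the $(\beta, \beta)$ block of the inverse of the full information matrix. Here is a small numerical check in a toy Gamma model of my own (shape $\beta$ as the parameter of interest, rate $\eta$ as the nuisance), whose information matrix is not block diagonal:

```python
import numpy as np
from scipy.special import polygamma

# Check Eq. (var-eff-influence) in a toy Gamma(shape=beta, rate=eta) model, where
# beta is the parameter of interest and eta the nuisance parameter.
beta_0, eta_0 = 2.5, 1.5

# Fisher information blocks for the Gamma model (shape/rate parameterization).
I_bb = polygamma(1, beta_0)    # trigamma(beta)
I_be = -1.0 / eta_0
I_ee = beta_0 / eta_0**2
I = np.array([[I_bb, I_be],
              [I_be, I_ee]])

# Variance of the efficient influence function via the Schur complement ...
var_schur = 1.0 / (I_bb - I_be * (1.0 / I_ee) * I_be)
# ... which equals the (beta, beta) block of the inverse information matrix.
var_inverse_block = np.linalg.inv(I)[0, 0]

print(var_schur, var_inverse_block)   # the two agree
```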
The following result is called the information inequality. It states that if $\theta_0$ is identified, then the limiting objective function for maximum likelihood has a unique maximum at the true value. Formally: if $\theta \neq \theta_0$ implies that $p_X(x; \theta) \neq p_X(x; \theta_0)$ (i.e. $\theta_0$ is identified) and $\mathbb{E}[\rvert \log p_X(x; \theta) \rvert ] < \infty$ for all $\theta$, then the limiting objective function $Q_0(\theta) = \mathbb{E}\left[ \log(p_X(x; \theta)) \right]$ has a unique maximum at $\theta_0$.
Proof to be completed.