A Primer
A large part of statistical inference is hypothesis testing, in which we wish to learn, via a sample, about some aspect of how a set of random variables behaves. For example, we might be interested in learning about the mean of a random variable or about whether it follows a Gaussian distribution. We can frame testing as a decision we must make, and there is a whole branch of the statistical/machine learning literature, decision theory, devoted to making such decisions.
In this post, I’m going to cover the very basics (mostly first principles and definitions) to provide a base for some of my other posts. Hopefully this will be a self-contained reference for my future self. I’ll mostly use Berger’s Statistical Decision Theory.
The basic set-up is as follows. We have some unknown quantity, denoted by $\theta$, and we want to reach some conclusion about its state (e.g. if it is positive or not, if it is exactly equal to zero or not, etc.). This quantity can take on many different possible states, and we denote the set of these states with $\Theta$.
We make decisions, or take actions, denoted by $d$, which lead us to incur some sort of penalty that measures how bad our decision was. We denote the set of all possible decisions with $\mathcal{D}$. A loss function specifies the penalty incurred when we make decision $d$ and $\theta$ is the true state of nature:
\[\mathcal{L}: \Theta \times \mathcal{D} \rightarrow \mathbb{R}\]In general, $\theta$ is referred to as the state of nature, but when $\theta$ labels a probability distribution, then we call it a parameter. In most statistical settings, we will be dealing with the latter case, and the decisions we make are usually estimates or conclusions about the value of $\theta$.
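To make the loss function concrete, here is a minimal sketch in Python of two standard choices: squared-error loss for estimating $\theta$, and $0$-$1$ loss for a binary conclusion about it (the function names are mine).

```python
def squared_error_loss(theta, d):
    """Squared-error loss: penalize the estimate d by its squared distance from theta."""
    return (theta - d) ** 2

def zero_one_loss(theta, d):
    """0-1 loss for a binary conclusion; here d = 1 encodes the decision 'theta is positive'."""
    return float(d != (theta > 0))

print(squared_error_loss(1.0, 0.9))  # ~0.01: a close estimate incurs a small penalty
print(squared_error_loss(1.0, 3.0))  # 4.0: a poor estimate incurs a large penalty
print(zero_one_loss(-0.5, 1))        # 1.0: the wrong conclusion incurs the full penalty
```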
We will make our decisions about the parameter using a random sample of a random variable $X$ that follows a probability distribution $P_\theta$ depending on $\theta$. We observe, say, $n$ realizations of $X$ as $x_1, \dots, x_n$. We denote the sample space (the set of values $X$ can take) by $\mathcal{X}$.
Statistical inference and testing theory relies upon the idea that we can learn things about the state of nature by looking at data. In a way, we can consider a sample as containing information about the state of nature. By obtaining a sample, we gather evidence to help guide us in our decision making.
It is also nice to consider functions of a sample that “render down” all of the information about $\theta$ into a single value. These are called sufficient statistics.
Given a statistic, $T$, we can partition the sample space into subsets that yield the same value of the statistic. Define the range of $T$ as $\mathcal{T} = \{ T(x) : x \in \mathcal{X} \}$. The partition then consists of the subsets:
\[\mathcal{X}_t = \{ x \in \mathcal{X}: T(x) = t \} \hspace{5mm} \text{for } t \in \mathcal{T}\]Rather than the standard way of thinking about data generation, we can instead consider getting a sufficient statistic value of $t$, then selecting a value of $x \in \mathcal{X}_t$ according to the probability density/mass function of $X$ over this subset.
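Here is a small simulation sketch of that two-stage view for Bernoulli data, where the sum of the observations is a sufficient statistic. Conditional on $T(x) = t$, the distribution over $\mathcal{X}_t$ comes out (approximately) uniform whatever the value of $p$, which is exactly why the second stage carries no information about the parameter. The names here are mine.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def conditional_dist(p, n=4, t=2, reps=200_000):
    """Empirical distribution of a Bernoulli(p) sample of size n, given T(x) = sum(x) = t."""
    samples = rng.binomial(1, p, size=(reps, n))
    hits = samples[samples.sum(axis=1) == t]
    arrangements = [a for a in itertools.product([0, 1], repeat=n) if sum(a) == t]
    counts = {a: 0 for a in arrangements}
    for row in hits:
        counts[tuple(int(v) for v in row)] += 1
    return {a: round(c / len(hits), 3) for a, c in counts.items()}

# The conditional distribution over X_t is roughly 1/6 for each of the
# C(4, 2) = 6 arrangements, no matter the value of p:
print(conditional_dist(0.3))
print(conditional_dist(0.8))
```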
How do we make our decisions? We use a decision rule: a function $\delta: \mathcal{X} \rightarrow \mathcal{D}$ that maps each possible sample to a decision.
Note that the loss function is a function of both the decision taken and the true state of nature. It is therefore useful to define the risk of a decision rule, which averages the loss over samples drawn from $P_\theta$:
\[R(\theta, \delta) = \mathbb{E}_{X \sim P_\theta}\left[ \mathcal{L}(\theta, \delta(X)) \right]\]
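As a quick sketch, the risk can be approximated by Monte Carlo: fix $\theta$, simulate many samples from $P_\theta$, and average the loss of the resulting decisions. Here I assume squared-error loss and a Gaussian model; the names are mine.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_risk(theta, rule, n=10, reps=100_000):
    """Monte Carlo estimate of R(theta, delta) under squared-error loss,
    where X is a sample of n draws from N(theta, 1)."""
    x = rng.normal(theta, 1.0, size=(reps, n))
    return np.mean((theta - rule(x)) ** 2)

sample_mean = lambda x: x.mean(axis=1)
print(mc_risk(2.0, sample_mean))  # ~0.1: for this model, R(theta, mean) = 1/n for every theta
```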
The definition of a decision rule we have above is one for a non-randomized rule. That is, when observing a sample, the rule outputs the appropriate decision with probability $1$. In contrast, we can define a randomized decision rule.
We use the notation \(\delta^*(x, d)\) to denote the probability that we choose decision $d$ upon observing $X = x$, and we’ll use \(\delta^*(x)\) to denote the probability distribution over decisions generated when observing $X = x$. As such, a non-randomized decision rule can be thought of as a randomized decision rule that places probability $1$ on the single decision $\delta(x)$.
Because a randomized decision rule is a function of both the random sample and the decision space, its loss and risk must be defined in slightly different ways: we first average the loss over the distribution of decisions, then average over samples as before:
\[\mathcal{L}(\theta, \delta^*(x)) = \mathbb{E}_{d \sim \delta^*(x)}\left[ \mathcal{L}(\theta, d) \right], \hspace{5mm} R(\theta, \delta^*) = \mathbb{E}_{X \sim P_\theta}\left[ \mathcal{L}(\theta, \delta^*(X)) \right]\]
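A toy sketch of these definitions, reusing the Gaussian model and squared-error loss from before: a randomized rule that flips a fair coin between two non-randomized rules has risk equal to the average of their risks.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_rule(x):
    """Toy randomized rule: return the sample mean with probability 1/2
    and the sample median with probability 1/2."""
    coin = rng.random(len(x)) < 0.5
    return np.where(coin, x.mean(axis=1), np.median(x, axis=1))

def mc_risk(theta, rule, n=10, reps=200_000):
    x = rng.normal(theta, 1.0, size=(reps, n))
    return np.mean((theta - rule(x)) ** 2)

print(mc_risk(0.0, lambda x: x.mean(axis=1)))        # ~0.100
print(mc_risk(0.0, lambda x: np.median(x, axis=1)))  # ~0.138
print(mc_risk(0.0, randomized_rule))                 # ~0.119: the average of the two
```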
Going back to the idea of sufficiency, it is enough to consider only those decision rules that are based on sufficient statistics. Intuitively, this makes sense because a sufficient statistic should contain all of the information about $\theta$ that we could get from a sample.
Let $T$ be a sufficient statistic for $\theta$, and let \(\delta^*_0(x)\) be a randomized decision rule. Then, under certain conditions, there exists a randomized rule \(\delta_1^*(t)\) that only depends upon $T(x)$ such that:
\[R(\theta, \delta_1^*) = R(\theta, \delta_0^*)\]Proof to be completed. See Berger pg. 32.
Though we’ve defined two types of decision rules and their risks, we still have not explained how to judge decision rules and decisions. A decision principle is sort of like a philosophy that specifies what makes a decision rule good or bad.
Suppose we have some idea about what the true state of nature is before we begin our experiment. We formalize this knowledge with a prior distribution, denoted by $\pi(\theta)$, which is a distribution over $\Theta$. The Bayes Principle states that a decision rule, $\delta_1$, is better than another decision rule, $\delta_2$, if its average risk with respect to $\pi(\theta)$ is smaller. That is:
\[\mathbb{E}_{\theta \sim \pi(\theta)}\left[ R(\theta, \delta_1) \right] < \mathbb{E}_{\theta \sim \pi(\theta)}\left[ R(\theta, \delta_2)\right]\]We can then define the best rule according to this principle: a Bayes rule is any rule that minimizes this average risk, and the minimum value itself is called the Bayes risk.
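As a sketch of the Bayes principle, consider estimating the success probability of a $\text{Binomial}(n, p)$ observation $S$ under squared-error loss. The two rules below have closed-form risks; the shrinkage rule $\delta_2$ happens to be the posterior mean under a uniform prior. The setup is mine, not Berger’s.

```python
import numpy as np

n = 10
p = np.linspace(0.0, 1.0, 100_001)  # fine grid standing in for a uniform prior on [0, 1]

# Closed-form risks under squared-error loss for S ~ Binomial(n, p):
risk_mle = p * (1 - p) / n                                         # delta_1(S) = S / n
risk_shrunk = (n * p * (1 - p) + (1 - 2 * p) ** 2) / (n + 2) ** 2  # delta_2(S) = (S + 1) / (n + 2)

# Bayes principle: compare risks averaged over the prior pi(p).
print(risk_mle.mean())     # ~0.0167
print(risk_shrunk.mean())  # ~0.0139: delta_2 is the better rule under this prior
```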
Another way to rank decision rules is by the penalty they incur in the worst case scenario. The Minimax Principle states that a decision rule, $\delta_1$, is better than another decision rule, $\delta_2$, if its risk in the worst possible case is smaller. That is:
\[\underset{\theta \in \Theta}{\sup} \left\{ R(\theta, \delta_1) \right\} < \underset{\theta \in \Theta}{\sup} \left\{ R(\theta, \delta_2) \right\}\]We can then define the best rule according to this principle: a minimax rule is any rule that minimizes this worst-case risk.
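Reusing the Binomial sketch from above, the minimax comparison looks at the peak of each risk curve rather than its average. (Neither rule is the exact minimax rule for this problem; the example only illustrates the ranking.)

```python
import numpy as np

n = 10
p = np.linspace(0.0, 1.0, 100_001)

risk_mle = p * (1 - p) / n
risk_shrunk = (n * p * (1 - p) + (1 - 2 * p) ** 2) / (n + 2) ** 2

# Minimax principle: compare worst-case risks over all values of p.
print(risk_mle.max())     # 0.025 = 1/(4n), attained at p = 1/2
print(risk_shrunk.max())  # ~0.0174: delta_2 also has the smaller worst-case risk here
```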
In this section, we’ll just go through several important results concerning decision rules and related concepts.
First, let $T$ be a sufficient statistic for $\theta$. The conditional distribution of $X$ given $T(X) = t$ gives probability $1$ to $\mathcal{X}_t$ (the subset of $\mathcal{X}$ that yields a value of $t$ for the sufficient statistic), since the event $X \notin \mathcal{X}_t$ has probability zero when $T(X) = t$. Thus, we can define $f_t(x)$ to be the probability density/mass function of $X$ restricted to $\mathcal{X}_t$. By the definition of a sufficient statistic, these functions do not depend on $\theta$.
Furthermore, using these functions, we can define the conditional expectation of a quantity given $T(X) = t$ as:
\[\mathbb{E}_{X \rvert t}\left[ h(X) \right] = \begin{cases} \int_{\mathcal{X}_t} h(x) f_t(x) dx & \text{continuous}\\ \sum_{x \in \mathcal{X}_t} h(x) f_t(x) & \text{discrete} \end{cases}\]
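As a discrete sketch, take Bernoulli data with $T(x) = \sum_i x_i$, for which $f_t$ is uniform over $\mathcal{X}_t$. Conditioning the first coordinate on $T(X) = t$ gives $t/n$:

```python
import itertools

n, t = 4, 2

# X_t: every sample x in {0,1}^n whose sufficient statistic T(x) = sum(x) equals t.
X_t = [x for x in itertools.product([0, 1], repeat=n) if sum(x) == t]

# For a Bernoulli sample, f_t is uniform on X_t: 1 / C(n, t), free of p.
f_t = 1.0 / len(X_t)

# E[h(X) | T(X) = t] for h(x) = x_1, via the discrete sum over X_t:
h = lambda x: x[0]
print(sum(h(x) * f_t for x in X_t))  # 0.5 = t/n
```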
The conditional expectation we defined above is helpful in the following theorems. The first implies that we need only consider non-randomized rules when the loss is convex.
Let $\mathcal{D} \subseteq \mathbb{R}^m$ be convex, and assume that $\mathcal{L}(\theta, \mathbf{d})$ is convex in $\mathbf{d} \in \mathcal{D}$ for each $\theta \in \Theta$. Let $\delta^*$ be a randomized decision rule satisfying:
\[\mathbb{E}_{\mathbf{d} \sim \delta^*(x)}\left[ \lvert \mathbf{d} \rvert \right] < \infty \hspace{5mm} \forall x \in \mathcal{X}\]Then, under certain conditions, the non-randomized rule defined by:
\[\delta(x) = \mathbb{E}_{\mathbf{d} \sim \delta^*(x)} \left[ \mathbf{d} \right]\]satisfies:
\[\mathcal{L}(\theta, \delta(x)) \leq \mathcal{L}(\theta, \delta^*(x)) \hspace{5mm} \forall x, \theta\]Proof to be completed. See Berger pg. 35.
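For fixed $x$ and $\theta$, the result is essentially Jensen’s inequality: a convex loss evaluated at the mean decision is no larger than the mean of the loss. A quick numeric check, with an arbitrary hypothetical distribution standing in for \(\delta^*(x)\):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 1.0
loss = lambda d: (theta - d) ** 2  # convex in d

# Pretend these are draws from some randomized rule's distribution delta*(x):
d = rng.normal(0.5, 2.0, size=100_000)

print(loss(d.mean()))  # L(theta, delta(x)): loss at the mean decision, ~0.25
print(loss(d).mean())  # L(theta, delta*(x)): mean loss, ~4.25, always at least as large
```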
We also have the famous Rao-Blackwell Theorem.
Let $\mathcal{D} \subseteq \mathbb{R}^m$ be convex, and assume that $\mathcal{L}(\theta, \mathbf{d})$ is convex in $\mathbf{d}$ for all $\theta \in \Theta$. Let $T$ be a sufficient statistic for $\theta$, and let $\delta_0(x)$ be a non-randomized decision rule in $\mathcal{D}$. Then, if the following rule exists:
\[\delta_1(t) = \mathbb{E}_{X \sim P_\theta \rvert t}\left[ \delta_0(X) \right]\]it will satisfy:
\[R(\theta, \delta_0) \geq R(\theta, \delta_1)\]Proof to be completed. See Berger pg. 36.
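A simulation sketch of the theorem for Bernoulli data under squared-error loss (the rule names are mine): start with the crude unbiased rule that reports only the first observation; conditioning it on the sufficient statistic $T(X) = \sum_i X_i$ yields the sample mean, whose risk is uniformly smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n, reps = 0.3, 10, 200_000
x = rng.binomial(1, p, size=(reps, n))

delta_0 = x[:, 0]         # crude unbiased rule: report only the first observation
delta_1 = x.mean(axis=1)  # E[delta_0(X) | T = sum(X)] = T / n: the Rao-Blackwellized rule

print(np.mean((p - delta_0) ** 2))  # ~0.21  = p(1 - p)
print(np.mean((p - delta_1) ** 2))  # ~0.021 = p(1 - p) / n: uniformly smaller risk
```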