A Primer
A large part of statistical inference is hypothesis testing, in which we wish to learn about some aspect of how a set of random variables behaves via a sample. For example, we might be interested in learning about the mean of a random variable or about whether it follows a Gaussian distribution.
In this post, I’ll cover some of the basics of hypothesis testing using Lehmann (2008).
We’ll focus on parametric settings, so we’ll assume we have a random variable, $X$, taking on values $x \in \mathcal{X}$ with distribution $P_{\theta}$. We assume $P_{\theta}$ falls within some class of distributions, $\mathcal{P} = \{ P_{\theta} \mid \theta \in \Theta \}$, parametrized by $\theta$ with parameter space $\Theta$.
Hypothesis testing involves constructing a decision rule, which is a function that takes in data and outputs a decision that relates to the inferential goals at hand. Letting $\mathcal{D}$ be the set of all possible decisions, we can denote a decision rule with:
\[\begin{equation} \label{eq:decision-rule} \delta: \mathcal{X} \rightarrow \mathcal{D} \end{equation}\]
We quantify how good a decision rule is with a loss function, which is a function of the choice of parameter (and, therefore, of $P_{\theta}$, since $\theta$ uniquely labels each distribution within $\mathcal{P}$) and the decision, defined as:
\[\begin{equation} \label{eq:loss-function} \mathcal{L}: \Theta \times \mathcal{D} \rightarrow \mathbb{R}_{\geq 0} \end{equation}\]
We define the risk of $\delta$ as the average loss under the assumption that $P_{\theta}$ is the true distribution of $X$:
\[\begin{equation} \label{eq:risk} R(\theta, \delta) = \mathbb{E}_{X \sim P_{\theta}}\left[ \mathcal{L}(\theta, \delta(X)) \right] \end{equation}\]
The question then becomes identifying the decision rule that performs best with respect to the choice of loss function (e.g., minimizes the risk). It’s important to keep in mind that the solution to the problem described above depends on the assumed distribution class, $\mathcal{P}$, the choice of loss function, $\mathcal{L}$, and the decision space, $\mathcal{D}$. There is a bit of art in making assumptions about these three aspects that are restrictive enough to allow us to identify a solution but not so restrictive as to make the problem trivial.
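The risk is just an expectation, so when it has no closed form we can approximate it by simulation. Here is a minimal sketch, in an assumed setting not taken from the text: $X$ is a sample of $n$ i.i.d. $\mathcal{N}(\theta, 1)$ draws, $\delta$ is the sample mean, and $\mathcal{L}$ is squared error, so the exact risk is $1/n$ for every $\theta$.

```python
import numpy as np

def loss(theta, decision):
    """Squared-error loss L(theta, d) = (theta - d)^2."""
    return (theta - decision) ** 2

def delta(x):
    """Decision rule: report the sample mean (axis=-1 covers batched samples)."""
    return x.mean(axis=-1)

def risk(theta, n=25, n_sims=100_000, rng=None):
    """Monte Carlo estimate of R(theta, delta) = E_{X ~ P_theta}[L(theta, delta(X))]."""
    rng = rng or np.random.default_rng(0)
    samples = rng.normal(loc=theta, scale=1.0, size=(n_sims, n))
    return loss(theta, delta(samples)).mean()

print(risk(theta=2.0, n=25))  # close to the exact risk 1/25 = 0.04
```

The estimate should land near $0.04$ regardless of the value of `theta`, illustrating that in this toy setting the risk function is constant in $\theta$.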
Sometimes, it is too restrictive to use risk minimization as our selection criterion. For one, the true value of $\theta$ is often unknown, making $R(\theta, \delta)$ impossible to compute. One could also imagine that, for many values of $\theta$ besides the true one, the decision rule performs very badly.
To combat this, we define decision procedures, which are methods that tell us which decision rule to prefer, even in cases when one rule does not always have smaller risk than another. A decision procedure is defined by how we want to judge the risk functions of our decision rules.
Suppose we do not know the true value of $\theta$, but we do know a priori that it follows some distribution with density function $\rho(\theta)$. We can then use the Bayes risk, the risk averaged over this prior:
\[\begin{equation} \label{eq:bayes-risk} r(\rho, \delta) = \int_{\Theta} R(\theta, \delta) \, \rho(\theta) \, d\theta \end{equation}\]
A decision rule that minimizes a Bayes risk is called a Bayes solution. One can also imagine identifying the Bayes solution subject to some constraint on its Bayes risk. This is called a restricted Bayes solution.
If we don’t know anything about the distribution of $\theta$, then we could instead use the maximum risk, $\underset{\theta \in \Theta}{\sup} \, R(\theta, \delta)$, as our criterion; a rule that minimizes the maximum risk is called a minimax solution.
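The two criteria can disagree about which rule to prefer. The sketch below uses a discrete parameter grid with made-up risk values (not from the text): rule A has moderate risk everywhere, while rule B is excellent for most $\theta$ but terrible at one value.

```python
import numpy as np

# Hypothetical risk functions on a discrete parameter grid.
thetas = np.array([0.0, 1.0, 2.0, 3.0])
risk_A = np.array([0.30, 0.30, 0.30, 0.30])   # flat, moderate risk
risk_B = np.array([0.05, 0.05, 0.05, 0.90])   # great except at theta = 3

# Assumed prior rho over theta (here a p.m.f. rather than a density).
rho = np.array([0.25, 0.25, 0.25, 0.25])

# Bayes risk: prior-weighted average of the risk function.
bayes_A, bayes_B = (rho * risk_A).sum(), (rho * risk_B).sum()
# Maximum risk: worst case over the parameter grid.
max_A, max_B = risk_A.max(), risk_B.max()

print(f"Bayes risk:   A = {bayes_A:.4f}, B = {bayes_B:.4f}")  # prefers B
print(f"Maximum risk: A = {max_A:.2f}, B = {max_B:.2f}")      # prefers A
```

The Bayes criterion prefers rule B ($0.2625 < 0.30$), while the maximum-risk criterion prefers rule A ($0.30 < 0.90$) because it guards against the worst case.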
A hypothesis test is just a type of decision procedure: we want to decide whether some preconceived notion (our hypothesis) is true. We’ll denote our hypothesis with $H$.
We make this actionable by assuming that, if we knew the true value of $\theta$, then we would know whether to accept or reject our hypothesis. This induces a partition of the distribution class, $\mathcal{P}$: $\mathcal{H} \subseteq \mathcal{P}$ is the subset of distributions labelled by values of $\theta$ for which we accept our hypothesis, and $\mathcal{K} \subseteq \mathcal{P}$ (the class of alternatives) is the subset for which we reject it.
If we let $\Theta_H, \Theta_K \subseteq \Theta$ be the corresponding subsets of the parameter space, then we have that:
\[\mathcal{H} \cup \mathcal{K} = \mathcal{P} \hspace{15mm} \text{and} \hspace{15mm} \Theta_H \cup \Theta_K = \Theta\]
We denote the decision of accepting $H$ with $d_0$ and the decision of rejecting it with $d_1$.
A non-randomized test will assign either $d_0$ or $d_1$ to each value $x \in \mathcal{X}$ (with probability $1$). Thus, we can define $S_0$ and $S_1$ as the subsets of $\mathcal{X}$ that contain the values for which the test assigns $d_0$ and $d_1$, respectively. $S_0$ is the acceptance region, and $S_1$ is the critical region or the rejection region.
A randomized test will assign either $d_1$ or $d_0$ to each value $x \in \mathcal{X}$ with probabilities $\phi(x)$ and $1 - \phi(x)$, respectively. We call $0 \leq \phi(x) \leq 1$ the critical function of the test, and it completely characterizes the test.
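A small sketch of a randomized test carried out via its critical function. The thresholds and randomization probability below are made up for illustration: $\phi$ rejects with certainty for large $x$, never for small $x$, and flips a biased coin in between.

```python
import numpy as np

def phi(x):
    """Critical function: probability of deciding d_1 (reject) given X = x."""
    if x >= 2.0:
        return 1.0   # x lies in the rejection region S_1
    elif x >= 1.5:
        return 0.4   # randomize: reject with probability 0.4
    return 0.0       # x lies in the acceptance region S_0

def run_test(x, rng):
    """Carry out the test: return 'd_1' with probability phi(x), else 'd_0'."""
    return "d_1" if rng.random() < phi(x) else "d_0"

rng = np.random.default_rng(0)
print(run_test(2.5, rng))  # always d_1
print(run_test(0.5, rng))  # always d_0
print(run_test(1.7, rng))  # d_1 about 40% of the time
```

A non-randomized test is the special case where $\phi(x)$ only takes the values $0$ and $1$, which recovers the partition of $\mathcal{X}$ into $S_0$ and $S_1$.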
With a randomized test, there is some uncertainty associated with the assignment of decisions; a non-randomized test, in contrast, “knows” which decision to assign to every possible realization of $X$.
A class of distributions is called simple if it contains only a single distribution. Otherwise, it is called composite. The same concept extends to hypotheses. For example, if $\mathcal{H}$ contains only a single distribution, then it is called simple.
One can imagine conducting a test and coming to the incorrect conclusion. There are two different ways this can happen: we decide to reject $H$ in favor of $K$ when $H$ is true, or we decide to accept $H$ when it is not true. The former is called a Type I error, and the latter is a Type II error.
In a perfect world, we could choose a test procedure that minimizes the probability of both types of error, but this is generally not possible. Instead, we design our test so that the probability of a Type I error is no greater than some value, called the significance level.
A test that has a significance level of $\alpha$ will (usually) satisfy:
\[\underset{\theta \in \Theta_H}{\sup} \left\{ \mathbb{P}_{X \sim P_{\theta}}\left(X \in S_1 \right) \right\} = \alpha\]
We then minimize the probability of a Type II error subject to this constraint. That is, we minimize $\mathbb{P}_{X \sim P_\theta}\left(\delta(X) = d_0 \right)$ for all $\theta \in \Theta_K$. This is equivalent to maximizing the probability of rejecting the hypothesis for all $\theta \in \Theta_K$.
The probability of rejection as a function of $\theta$, $\beta(\theta) = \mathbb{P}_{X \sim P_\theta}\left(\delta(X) = d_1\right)$, is called the power function of the test. Evaluating the power function at a particular value of $\theta \in \Theta_K$ yields the power of the test against the alternative, $\theta$.
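Both the significance level and the power function can be estimated by simulation. A minimal sketch, in an assumed setting not taken from the text: $X$ is $n = 25$ i.i.d. $\mathcal{N}(\theta, 1)$ draws, $H: \theta = 0$ against $K: \theta > 0$, and we reject when the sample mean exceeds $c = z_{0.95} / \sqrt{n}$, giving level $\alpha = 0.05$.

```python
import numpy as np

def power(theta, n=25, c=1.6449 / 5, n_sims=100_000, rng=None):
    """Monte Carlo estimate of the power function:
    P_theta(reject H) = P_theta(sample mean > c)."""
    rng = rng or np.random.default_rng(0)
    means = rng.normal(loc=theta, scale=1.0, size=(n_sims, n)).mean(axis=1)
    return (means > c).mean()

print(power(0.0))  # probability of a Type I error; close to alpha = 0.05
print(power(0.5))  # power against the alternative theta = 0.5 (about 0.8)
```

Evaluated at $\theta = 0 \in \Theta_H$, the power function gives the Type I error probability; evaluated at alternatives in $\Theta_K$, it gives the power, which grows toward $1$ as $\theta$ moves away from the hypothesis.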
We can characterize tests in a number of ways. Here, we cover a few of the simpler ones. Let $\phi$ be a level $\alpha$ (randomized) test. A test whose probability of rejecting the hypothesis is at least $\alpha$ for every alternative in $\Theta_K$ and at most $\alpha$ for every $\theta \in \Theta_H$ is called unbiased; informally, it is never more likely to reject a true hypothesis than a false one.
A related concept is the exact test, whose Type I error probability is controlled exactly at the significance level in finite samples, rather than only approximately or asymptotically.
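With a discrete distribution, the Type I error probability of a non-randomized test can be computed exactly by summing the probability mass function. A sketch in an assumed setting not taken from the text: $X$ counts successes in $n = 20$ Bernoulli$(p)$ trials, $H: p = 0.5$ against $K: p > 0.5$, and we reject when $X \geq k$.

```python
from math import comb

def size(k, n=20, p=0.5):
    """Exact P_p(X >= k) under the hypothesis, with X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

# Largest rejection region whose exact Type I error stays at or below alpha:
alpha = 0.05
k_star = min(k for k in range(21) if size(k) <= alpha)
print(k_star, size(k_star))  # k = 15, exact size ~ 0.0207 < 0.05
```

Note the effect of discreteness: no cutoff attains the level $\alpha = 0.05$ exactly (rejecting at $X \geq 14$ would give size $\approx 0.058$), which is one motivation for the randomized tests introduced earlier, since randomizing at the boundary value lets the size hit $\alpha$ exactly.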