Home

Betting-Based Confidence Sequences

× These are my own notes, so errors and typos are bound to appear!

A colleague introduced me to some recent work from Waudby-Smith and Ramdas here at Carnegie Mellon. Since I’ve been working on applications of concentration bounds, it certainly seems important to review their paper Estimating means of bounded random variables by betting1, which can often achieve better bounds than the older work from Maurer and Pontil2. Here, I’m going to use the preprint, but their work has been published in JRSSB3.

This paper (among many others4 5 6 7) set inference in the context of playing a betting game with two players. For each “round”, we obtain an observation (i.e. a realization of some random variable). The Forecaster assigns a probability to the observations, and the Skeptic places a bet of the form of a real-valued function of the observations. In each “round”, the Skeptic pays a cost equal to the expected value of the observation and earns a payout of the observed value.

A lot of the discussion relies on measure and probability theory. I’ve covered most of the basic concepts in my post on measure theory, so I’ll refer the reader to that if you are looking for definitions and intuition for the relevant concepts here.

Note: Not all of the proofs are finished/included. I am hoping to find the time to return to thist post and complete them.


Background

The paper focuses on constructing valid (and tight) confidence sets for the mean of a sequence of bounded random variables.


Notation

Here, we’ll recap the important notation introduced in Waudby-Smith and Ramdas1. We denote the set of all distributions on $[0, 1]$ with mean $m$ with $\mathcal{Q}^m$.

Let \((X_t)_{t = 1}^\infty \sim P\) be a sequence of random variables taking values in $[0, 1]$ with (unknown) conditional mean $\mu \in [0, 1]$ where $P \in \mathcal{P}^\mu$ (we use the notation we defined earlier; i.e. $\mathcal{P}^\mu$ is the set of all distributions on $[0, 1]^\infty$ such that \(\mathbb{E}_P[X_t \rvert X_1, \dots, X_{t-1}] = \mu\)).

We’ll use $(a \pm b)$ as an abbreviation of the interval $(a-b, a+b)$. We’ll also use $X_1^t$ to denote $(X_1, \dots, X_t)$. Let \(\mathcal{F}_t = \sigma(X_1^t)\) denote the $\sigma$-field generated by $X_1^t$. We’ll suppose that $\mathcal{F}_0$ is the trivial $\sigma$-field (that is, the $\sigma$-field containing the empty set and the entire set itself). We’ll use \(\mathcal{F} = \{ \mathcal{F}_t \}_{t = 0}^\infty\) to denote the canonical filtration of $X_1^t$ where \(\mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}_2 \subset \dots\).

A stochastic process $(M_t)_{t = 0}^\infty$ adapted to canonical filtration $\mathcal{F}$ with $M_0 = 1$ is termed a test supermartingale for $P$ if:

\[\mathbb{E}_P[M_t \rvert \mathcal{F}_{t - 1}] \leq M_{t-1} \hspace{5mm} \forall t \geq 1 \nonumber\]

If instead, \(\mathbb{E}_P[M_t \rvert \mathcal{F}_{t - 1}] = M_{t-1}\), then the process is called a test martingale. If either of these hold for all $P \in \mathcal{P}$, then we say that \((M_t)_{t = 0}^\infty\) is a test (super)martingale for $\mathcal{P}$. Finally, a sequence \((\lambda_t)_{t = 1}^\infty\) is called predictable if $\lambda_t$ is $\mathcal{F}_{t-1}$-measurable for all $t \geq 1$.


Confidence Sets and Sequences

Recall that for some random sample $(X_t)_1^n \sim P$ and error tolerance $\alpha \in (0, 1)$, a $(1 - \alpha)$ confidence interval (CI) for $\mu$ is given by the random set (which is dependent upon our sample) such that the true conditional $\mu$ is contained in our interval with probability at least $1 - \alpha$ for any distribution in family $\mathcal{P}^\mu$.

\[C_n = C(X_1, \dots, X_n) \subseteq [0, 1] \hspace{5mm} \text{such that } \forall n \geq 1, \underset{P \in \mathcal{P}^\mu}{\inf} \left\{ P(\mu \in C_n) \right\} \geq 1 - \alpha\]

Hoeffding’s inequality leads to the CI:

\[C_n = \left( \bar{X}_n \pm \sqrt{\frac{\log(2/\alpha)}{2n}} \right) \cap [0, 1]\]

where the $n$ comes from the fact that $X_i \in [0, 1]$, and the intersection restricts our interval to the known possible values that $\mu$ can take on.

An interval implies a contiguous region of $\mathbb{R}$, but we could have the union of disjoint subsets. This is why we use the term “confidence set”.

A confidence set is constructed for some sample of fixed size $n$. An alternative scenario is that we are observing data in batches, and we want to construct a new confidence set for each batch such that the sequence of sets $(C_t)_{t=1}^\infty$ satisfies:

\[\underset{P \in \mathcal{P}^\mu}{\sup} \left\{ P(\exists t \geq 1: \mu \neq C_t) \right\} \leq \alpha\]

In words, we want the sequence of sets to be such that the probability of there being some set that does not contain $\mu$ to be at most $\alpha$ for any distribution in the family $\mathcal{P}^\mu$. We call such a $(C_t)_{t=1}^\infty$ a confidence sequence, and, ideally, we want $\underset{t \rightarrow \infty}{\lim} C_t = { \mu }$ (that is, we want the sequence to converge to the confidence set that is only the true value as we observe more and more data).

Notably, the above definitions hold for arbitrary stopping times, $\tau$. Since these can, and are often desired to be, data-dependent, we need a way to decide when our data is sufficient enough to stop constructing our sequences (we have achieved the result we wanted, we have enough data, etc.).

Definition (Level-$\alpha$ Sequential Test).
Let $H_0$ be some hypothesis, and let $\alpha \in [0, 1]$ be some tolerance. A level-$\alpha$ sequential test is a sequence $(\psi_t)_{t \in T}$ (with index set $T$) taking values in $\{ 0, 1 \}$ such that $\mathbb{P}_{H_0}(\psi_\tau = 1) \leq \alpha$ for any arbitrary stopping time, $\tau$.

A level-$\alpha$ sequential test will control the Type I error rate at $\alpha$ for any stopping time. This is by definition; the probability that $\psi_\tau$ outputs a $1$ is at most $\alpha$ under the null hypothesis.


Results

The first theorem is a procedure to construct a confidence set for the mean of a sequence of bounded random variables using supermartingales.

Theorem (Theorem 1 in Waudby-Smith and Ramdas (2022)).
Let $(X_t)_{t = 1}^\infty \sim P$ be a sequence of (infinitely or finitely many) random variables taking values in $[0, 1]$ with unknown conditional mean $\mu$ and distribution $P \sim \mathcal{P}^\mu$ where $\mathcal{P}^\mu$ is the collection of all distributions on $[0, 1]^\infty$ such that $\mathbb{E}_P\left[ X_t \rvert X_1 \dots X_{t-1}\right]$.
  1. For each $m \in [0, 1]$, we have a null hypothesis $H_0^m: P \in \mathcal{P}^m$.
  2. For each $m \in [0, 1]$, construct the non-negative process $M_t^m = M^m(X_1, \dots, X_t)$ (a function of $X_1, \dots, X_t$) such that $(M_t^\mu)_{t = 0}^\infty$ satisfies for each $P \in \mathcal{P}^\mu$, $(M_t^\mu)_{t = 0}^\infty$ is upperbounded by a test supermartingale for $P$. Note that this test supermartingale can change with $P$.
  3. For each $m \in [0, 1]$, let $(\phi_t^m)_{t = 1}^\infty$ denote the sequential test where $\phi_t^m = \mathbb{1}\left\{ M_t^m \geq \frac{1}{\alpha} \right\}$. Here, $\phi_t^m = 1$ indicates rejection of $H_0^m$ after $t$ observations.
  4. Define $C_t = \left\{ m \in [0, 1] \rvert \phi_t^{m} = 0 \right\}$ as the set of $m \in [0, 1]$ such that $\phi_t^m$ failes to reject $H_0^m$.
$(C_t)_{t = 1}^\infty$ forms a $(1 - \alpha)$-confidence sequence fo $\mu$. That is, the following holds: $$ \underset{P \in \mathcal{P}}{\sup} \left\{ \mathbb{P}(\exists t \geq 1 \rvert \mu \not\in C_t) \right\} \leq \alpha \nonumber $$
Proof. The proof is fairly simple and relies mainly on Ville's inequality.
For any $m \in [0, 1]$, by Step 2, if $M_t^m \geq \frac{1}{\alpha}$, then we know $\phi_t^m = 1$. Thus, under any $P \in \mathcal{P}^\mu$, the probability $\mathbb{P}_P(\exists t \geq 1 \rvert \phi_t^m = 1)$ is the same as the probability $\mathbb{P}_P\left(\exists t \geq 1 \rvert M_t^m \geq \frac{1}{\alpha}\right)$.
By construction, $(M_t^\mu)_{t = 0}^\infty$ is upperbounded by some test supermartingale for each $P \in \mathcal{P}^\mu$, so we can apply Ville's inequality to get: $$ \mathbb{P}_P(\exists t \geq 1 \rvert \phi_t^m = 1) = \mathbb{P}_P\left(\exists t \geq 1 \rvert M_t^m \geq \frac{1}{\alpha}\right) \leq \alpha \nonumber $$ Thus, $(\phi_t^m)_{t = 1}^\infty$ is a level $\alpha$ sequential hypothesis test.
The construction of $(C_t)_{t=0}^\infty$ entails only choosing those $m \in [0, 1]$ such that $H_0^m$ is rejected (for a particular choice of $t$). Thus, there is some $t \geq 1$ such that $\mu \not\in C_t$ if, and only if, there is some $t \geq 1$ such that $\phi_t^\mu = 1$.
The above holds for any $P \in \mathcal{P}^\mu$, which yields the desired result: $$ \underset{P \in \mathcal{P}^\mu}{\sup} \left\{ \mathbb{P}_P(\exists t \geq 1 \rvert \mu \not\in C_t) \right\} = \underset{P \in \mathcal{P}^\mu}{\sup} \left\{ \mathbb{P}_P(\exists t \geq 1 \rvert \phi_t^\mu = 1) \right\} \leq \alpha \nonumber $$

A few notes are worth mentioning. First, the process $(M_t^m)_{t = 0}^\infty$ that is constructed in Step 2 of the above theorem is an example of an $e$-process for $\mathcal{P}$ (see my post on martingales). Secondly, the power and utility of confidence sequences constructed in this way depends upon the choice of upperbounding test supermartingales. Poor choices can lead to bad performance (statistically speaking) and computational infeasibility. Lastly, because the confidence sequence is time-uniform (holds over all time points), we can pretend we have a fixed stopping point to derive a valid $(1-\alpha)$-confidence set/interval for $\mu$ for a fixed sample size (no more batches are gathered).


Predictable Plug-Ins

The main results of this paper are based on the predictable plug-in, which is a sequence of predictable real-valued random variables that are chosen specially to be plugged into other sequences in order to form test (super)martingales. The choice of predictable plug-in depends upon the following Lemma.

Lemma 1: Predictable Plug-In Chernoff Supermartingales (Waudby-Smith and Ramdas (2022)).
Let $(X_t)_{t = 1}^\infty \sim P$ be a sequence of random variables. Let $\mathcal{F}$ denote the canonical filtration with respect to $(X_t)_{t =1}^\infty$.
Suppose that, for some choice of $\mu$, $v_t$, and $\psi(\lambda)$, the following is satisfied, for each $t \geq 1$, for any $\lambda \in \Lambda \subseteq \mathbb{R}$: $$ \mathbb{E}_P\left[ \exp\left(\lambda(X_t - \mu) - v_t \psi(\lambda) \right) \rvert \mathcal{F}_{t-1} \right] \leq 1 \nonumber $$ For any $(\lambda_t)_{t=1}^\infty$ with $\lambda_t \in \Lambda$ for all $t$ that is predictable with respect to $\mathcal{F}$, we have that: $$ M_t^\psi(\mu) = \prod_{i = 1}^t \exp\left( \lambda_i(X_i - \mu) - v_i \psi(\lambda_i) \right) \nonumber $$ is a test supermartingale with respect to $\mathcal{F}$.
Proof. I am not sure how to prove it, but I think that the process defined above as $M_t^\psi(\mu)$ is an example of Bernstein's supermartingale (see Shafer & Vovk (2019)) and is therefore a supermartingale. It suffices now to show that $M_t^\psi(\mu)$ has expectation at most $1$ at $t = 1$ and that $\mathbb{E}_P [ M_t^\psi(\mu) \rvert \mathcal{F}_{t-1} ] \leq M_{t-1}^\psi(\mu)$.
Notice that, for $t = 1$, the conditional expectation of $M_t^\psi(\mu)$ is: $$ \mathbb{E}_P \left[M_1^\psi(\mu) \bigg\rvert \mathcal{F}_0 \right] = \mathbb{E}_P\left[ \exp\left(\lambda_1(X_1 - \mu) - v_1\psi(\lambda_1) \right) \right] \leq 1 \nonumber $$ by assumption. Because $\mathcal{F}_0$ is the trivial $\sigma$-field, which gives us no information, we drop the conditioning.
Now consider $t \geq 2$. For any $i \leq t - 1$, $\exp(\lambda_i(X_i - \mu) - v_i \psi(\lambda_i))$ is $\mathcal{F}_{t-1}$-measurable. This is because $(\lambda_t)_{t=1}^\infty$ is predictable with respect to $\mathcal{F}$, implying $\lambda_i$ is $\mathcal{F}_{t-1}$-measurable. To see why, consider probability space $(\Omega, \mathcal{A}, P)$ and sub-$\sigma$-field $\mathcal{A}_1 \subset \mathcal{A}$. If a random variable $X$ is $\mathcal{A}_1$-measurable, then for any Borel set $B$, $X^{-1}(B) \in \mathcal{A}_1 \subset \mathcal{A}$. Thus, $X$ is also $\mathcal{A}$-measurable. In addition, $\mathcal{F}$ is the canonical filtration, so $X_i$ must be $\mathcal{F}_{t - 1}$-measurable since $\mathcal{F}_{t-1} = \sigma(X_1, \dots, X_{t-1})$.
Since $\exp(\lambda_i(X_i - \mu) - v_i \psi(\lambda_i))$ is $\mathcal{F}_{t-1}$-measurable for all $i \leq t-1$, we know that $\exp(\lambda_i(X_i - \mu) - v_i \psi(\lambda_i)) \in \mathcal{F}_{t-1}$. Since we are dealing with bounded random variables, we can use Theorem 4.1.14 from Durrett (2019) to write the conditional expectation of $M_t^\psi(\mu)$ as: $$ \begin{aligned} \mathbb{E}_P \left[ M_t^\psi(\mu) \bigg\rvert \mathcal{F}_{t-1} \right] &= \mathbb{E}_P \left[ \prod_{i = 1}^t \exp\left(\lambda_i(X_i - \mu) - v_i \psi(\lambda_i) \right) \bigg\rvert \mathcal{F}_{t-1}\right] \\ &= \prod_{i = 1}^{t-1} \exp(\lambda_i(X_i - \mu) - v_i\psi(\lambda_i)) \underbrace{\mathbb{E}_P\left[ \exp\left( \lambda_t (X_t - \mu) - v_t \psi(\lambda_t)\right) \rvert \mathcal{F}_{t-1} \right]}_{\leq 1} \\ &\leq \prod_{i = 1}^{t-1} \exp(\lambda_i(X_i - \mu) - v_i\psi(\lambda_i)) \\ &= M_{t-1}^\psi(\mu) \end{aligned} \nonumber $$ Sicne $\mathbb{E}_P[M_t^\psi(\mu) \rvert \mathcal{F}_{t-1}] \leq M_{t-1}^\psi(\mu)$ for all $t \geq 1$, and $\mathbb{E}_P[M_1^\psi(\mu) \rvert \mathcal{F}_0] \leq 1$, $(M_t^\psi(\mu))_{t = 1}^\infty$ is a test supermartingale.

Hoeffding

Let’s first define the Hoeffding process.

Definition (Hoeffding Process).
Let $(X_t)_{t = 1}^\infty \sim P$ be a sequence of random variables taking values in $[0, 1]$ with distribution $P \in \mathcal{P}^\mu$ where $\mathcal{P}$ is the set of distributions on $[0, 1]^\infty$ such that $\mathbb{E}_P[X_t \vert \mathcal{F}_{t-1}] = \mu$ for each $t$.
The Hoeffding process for (candidate mean) $m \in [0, 1]$, denoted by $(M_t^H(m))_{t=0}^\infty$ is defined as: $$ M_t^{\text{PrPl-H}}(m) = \prod_{i = 1}^t \exp\left( \lambda_i(X_i - m) - \phi_H(\lambda_i)\right) \nonumber $$ where $\psi_H(\lambda) = \frac{\lambda^2}{8}$ is an upper bound on the cumulant generating function for random variables in $[0, 1]$, and $(\lambda_t)_{t = 1}^\infty$ is a sequence of predictable $\lambda_t \in \mathbb{R}$ that is constructed. We also assume $M_0^{\text{PrPl-H}}(m) = 1$.

This sequence is a test supermartingale for $\mathcal{P}^m$, and we call $(\lambda_t)_{t=1}^\infty$ a predictable plug-in.

This result implies that if one can be clever about their choice of $(\lambda_t)_{t=1}^\infty$, one can construct a test supermartingale that can be used in Step 2 of Theorem 1 to get a confidence sequence for $\mu$. This is formalized in the following:

Proposition 1: Predictable Plug-In Hoeffding CS (Waudby-Smith and Ramdas (2022)).
Let $(X_t)_{t=1}^\infty \sim P$ be a stochastic process with distribution $P \in \mathcal{P}^\mu$ where $\mathcal{P}^\mu$ is the set of distributions on $[0, 1]^\infty$ such that $\mathbb{E}_P[X \rvert \mathcal{F}_{t-1}] = \mu$ for each $t$. For any predictable $(\lambda_t)_{t=1}^\infty$ where $\lambda_t \in \mathbb{R}$ for all $t$, we can construct a $(1-\alpha)$ confidence sequence for $\mu$ as: $$ C_t^{\text{PrPl-H}} = \left(\frac{\sum_{i =1 }^t \lambda_i X_i }{\sum_{i = 1}^t \lambda_i} \pm \frac{\log(2/\alpha) + \sum_{i = 1}^t \psi_H(\lambda_i)}{\sum_{i = 1}^t \lambda_i} \right) \nonumber $$ The running intersection, $\cap_{i \leq t} C_t^\text{PrPl-H}$, is also a valid $(1-\alpha)$ confidence sequence for $\mu$.

The authors recommend choosing $(\lambda_t)_{t=1}^\infty$ such that:

\[\lambda_t^{PrPl-H} = \min \left\{ \sqrt{\frac{8 \log(2/\alpha)}{t \log(t + 1)}}, 1 \right\}\]

Empirical Bernstein

Similar to the Hoeffding process, we can define the empirical Bernstein process.

Definition (Empirical Bernstein Process).
Let $(X_t)_{t = 1}^\infty \sim P$ be a sequence of random variables taking values in $[0, 1]$ with distribution $P \in \mathcal{P}^\mu$ where $\mathcal{P}$ is the set of distributions on $[0, 1]^\infty$ such that $\mathbb{E}_P[X_t \vert \mathcal{F}_{t-1}] = \mu$ for each $t$.
The empirical Bernstein process for (candidate mean) $m \in [0, 1]$, denoted by $(M_t^{EB}(m))_{t=0}^\infty$ is defined as: $$ M_t^{EB}(m) = \prod_{i = 1}^t \exp\left( \lambda_i(X_i - m) - v_i \psi_{EB}(\lambda_i) \right) \nonumber $$ where $v_i = 4(X_i - \hat{\mu}_{i - 1})^2$, $\psi_{EB}(\lambda) = (-\log(1-\lambda) - \lambda)/4$ for $\lambda \in [0, 1)$, and $(\lambda_t)_{t = 1}^\infty$ is a sequence of predictable $\lambda_t \in \mathbb{R}$ that is constructed.

One can perform a similar procedure as before, using $(M_t^\text{PrPl-EB}(m))$ in Step 2 of Theorem 1 to get a confidence sequence.

Theorem 2: Predictable Plug-In Empirical Bernstein CS (Waudby-Smith and Ramdas (2022)).
Let $(X_t)_{t=1}^\infty \sim P$ be a stochastic process with distribution $P \in \mathcal{P}^\mu$ where $\mathcal{P}^\mu$ is the set of distributions on $[0, 1]^\infty$ such that $\mathbb{E}_P[X \rvert \mathcal{F}_{t-1}] = \mu$ for each $t$. For any predictable $(\lambda_t)_{t=1}^\infty$ where $\lambda_t \in (0, 1)$ for all $t$, we can construct a $(1-\alpha)$ confidence sequence for $\mu$ as: $$ C_t^{\text{PrPl-EB}} = \left(\frac{\sum_{i =1 }^t \lambda_i X_i }{\sum_{i = 1}^t \lambda_i} \pm \frac{\log(2/\alpha) + \sum_{i = 1}^t v_i \psi_{EB}(\lambda_i)}{\sum_{i = 1}^t \lambda_i} \right) \nonumber $$ The running intersection, $\cap_{i \leq t} C_t^\text{PrPl-EB}$, is also a valid $(1-\alpha)$ confidence sequence for $\mu$.

The authors recommend choosing $(\lambda_t)_{t=1}^\infty$ such that:

\[\lambda_t^{PrPl-EB} = \min \left\{ \sqrt{\frac{2\log(2/\alpha)}{\hat{\sigma}_{t-1}^2 t \log(1 + t)}}, c \right\} \hspace{5mm} \text{ where } \hspace{2mm} \hat{\sigma}_t^2 = \frac{\frac{1}{4} + \sum_{i = 1}^t (X_i - \hat{\mu}_i)^2}{t+1} \hspace{2mm}\text{ and }\hspace{2mm} \hat{\mu}_t = \frac{\frac{1}{2} + \sum_{i = 1}^t X_i}{t + 1} \hspace{2mm}\text{ and } \hspace{2mm} c = \frac{1}{2} \text{ or } \frac{3}{4} \label{eq:empirical-bernstein-choices}\]

This procedure yields two sequences, \((\hat{\mu}_t)_{t=1}^\infty\) and \((\hat{\sigma}_t^2)_{t=1}^\infty\), which can be thought of as predictable, regularized sample means and variances.

For fixed time, we can simply use the running intersection to form a $(1-\alpha)$ confidence interval for $\mu$. Suppose we have $n$ samples. Then we can form a confidence interval as $C_n^{\text{PrPl-EB_CI}} = \cap_{i \leq n} C_i^{\text{PrPl-EB}}$ with any predictable sequence $(\lambda_t)_{t = 1}^\infty$. The authors recommend choosing:

\[\lambda_t^{\text{PrPl-EB}(n)} = \min \left\{ \sqrt{\frac{2 \log(2/\alpha)}{n \hat{\sigma}_{t-1}^2}} \right\}\]

with $\hat{\sigma}_i^2$ and $c$ as in Eq. \ref{eq:empirical-bernstein-choices}.

Something very interesting is that the width of the above confidence interval goes to $\sigma \sqrt{2 \log(2/\alpha)}$ for i.i.d. data (where $\sigma$ is the true standard deviation). In contrast, the original empriical Bernstein confidence intervals introduced by Maurer and Pontil (2009) only go to $\sigma \sqrt{2 \log(4 / \alpha)}$.


A Betting Perspective

In order to improve the confidence sequences derived above, the authors dive deeper into the gambling analogy. Suppose we are playing a game in which we can accumulate wealth by making bets against some hypothesis, $H_0^m$, being true. We start off with one dollar, and, in each “round”, we will make a bet that is determined by our predictable sequence $(\lambda_t^m)_{t=1}^\infty$. More specifically, we make bet $b_t = s_t\rvert \lambda_t^m \rvert$ where $s_t = -1$ if we think $\mu < m$ and $s_t = 1$ if we think $\mu > m$. The second term, $\rvert \lambda_t^m \rvert$, is the amount we are willing to lose/win in our bet (e.g. $\rvert \lambda_t^m \rvert = 0$ implies we are risking nothing).

We can conceptualize the total wealth we have at any given time with the capital process.

Definition (Capital Process).
Let $(X_t)_{t = 1}^\infty \sim P$ be a sequence of random variables taking values in $[0, 1]$ with distribution $P \in \mathcal{P}^\mu$ where $\mathcal{P}$ is the set of distributions on $[0, 1]^\infty$ such that $\mathbb{E}_P[X_t \vert \mathcal{F}_{t-1}] = \mu$ for each $t$.
The capital process for any $m \in [0, 1]$ is defined as: $$ \mathcal{K}_t(m) = \prod_{i = 1}^t ( 1 + \lambda_i(m) \cdot (X_i - m)) \nonumber $$ where $\mathcal{K}_0(m) = 1$ and $(\lambda_t(m))_{t = 1}^\infty$ is a predictable sequence taking values in $\left(-\frac{1}{1-m}, \frac{1}{m}\right)$ (where $\frac{1}{m} = \infty$ if $m = 0$ and $\frac{1}{1-m} = \infty$ if $m = 1$).

There is a deep connection between test (super)martingales, bounded random variables, and the capital process that the authors emphasize. This is formalized below.

Proposition 2 (Waudby-Smith and Ramdas (2022)).
Let $X_1, X_2, \dots \sim P$ be a sequence of random variables taking values in $[0, 1]$, and let $\mu \in [0, 1]$. The following are equivalent:
  1. $\mathbb{E}_P[X_t \rvert \mathcal{F}_{t-1}] = \mu$ for all $t \in \mathbb{N}$ where $\mathcal{F}_{t-1} = \sigma(X_1, \dots, X_{t-1})$
  2. There exists $\lambda \in \mathbb{R} \setminus \{ 0 \}$ such that the capital process $(\mathcal{K}_t(\mu))_{t = 0}^\infty$ is a strictly positive test martingale for $P$
  3. For any $\lambda \in \left(-\frac{1}{1-\mu}, \frac{1}{\mu} \right)$, the capital process $(\mathcal{K}_t(\mu))_{t = 0}^\infty$ is a test martingale for $P$
  4. For all $\left(-\frac{1}{1-\mu}, \frac{1}{\mu} \right)$-valued predictable sequence $(\lambda_t)_{t = 1}^\infty$, the capital process $(\mathcal{K}_t(\mu))_{t = 0}^\infty$ is a test martingale for $P$

The main takeaway is that \((\mathcal{K}_t(\mu))_{t=0}^\infty\) is a test martingale (not supermartingale) for $P$ if the random variables have conditional expectation equal to $\mu$. If the hypothesis $H_0^m$ is true (i.e. $m = \mu$), then, by Proposition 2, we expect to not make any money since the conditional expectation is constant at every round (i.e. \(\mathbb{E}_P[X_t \rvert \mathcal{F}_{t-1}] = \mu\) for all $t \in \mathbb{N}$). However, if nature is wrong and $m \neq \mu$, then, by Ville’s inequality, the probability of ever earning at least $\frac{1}{\alpha}$ capital is at most $\alpha$.

This naturally extends to hypothesis testing: under the null hypothesis, the probability of earning a lot of money is very low, so we should reject $H_0^m$ if our capital becomes too large because it goes against the expectation of constant wealth. The authors then explain that one could imagine playing a game for every value of $m \in [0, 1]$. For a time $t$, one could the construct a confidence sequence as the collection of games with small capital at $t$.

The reason the authors introduce this concept is because the capital process is a test martingale when $m = \mu$. Unlike the Hoeffding and empirical Bernstein processes, which are test supermartingales. Since test supermartingales expect the capital to decrease, test constructed from them can be more conservative (this is because we expect to be losing wealth over time, so $\mathbb{E}[X_0]$ will be the maximal value, and Ville’s inequality will yield a relatively larger bound).

Hedging Our Bets

Since the capital process describes the wealth we accumulate during a better game, we can ask the question: Is there a way to mitigate our risk? In general betting games, one can hedge one’s bets. This means that, after placing an initial bet, the player bets again on a different outcome (or multiple different outcomes) as they see the game play out. This can reduce their net loss because they could possibly win back some money with their later bets that are based on more information about the game.

Now, consider making two simultaneous bets: one with some part $(\theta)$ of one’s current wealth and the other with the rest. One bet will be made for the case that $\mu \geq m$, and the other will be made for the case that $\mu < m$. We can describe separate capital processes for these two bets and use these to describe the wealth accumulated across both betting strategies. This is called the hedged capital process.

Definition (Hedged Capital Process).
Let $(X_t)_{t = 1}^\infty \sim P$ be a sequence of random variables taking values in $[0, 1]$ with distribution $P \in \mathcal{P}^\mu$ where $\mathcal{P}$ is the set of distributions on $[0, 1]^\infty$ such that $\mathbb{E}_P[X_t \vert \mathcal{F}_{t-1}] = \mu$ for each $t$.
Let $(\mathcal{K}_t^+(m))_{t = 1}^\infty$ and $(\mathcal{K}_t^-(m))_{t=1}^\infty$ be sequences where: $$ \mathcal{K}_t^+ = \prod_{i = 1}^t (1 + \lambda_i^+(m) \cdot (X_i - m)) \hspace{5mm} \text{ and } \hspace{5mm} \mathcal{K}_t^- = \prod_{i = 1}^t (1 + \lambda_i^-(m) \cdot (X_i - m)) \nonumber $$ for predictable sequences $(\lambda_t^+(m))_{t=1}^\infty$ and $(\lambda_t^-(m))_{t=1}^\infty$ with $\lambda_t^+(m) \in \left[0, \frac{1}{m}\right)$ and $\lambda_t^-(m) \in \left[0, \frac{1}{1-m}\right)$.
The hedged capital process for any $m \in [0, 1]$ and any $\theta \in [0, 1]$ is defined as: $$ \mathcal{K}_t^\pm(m) = \max \left\{ \theta \mathcal{K}_t^+(m), (1-\theta) \mathcal{K}_t^-(m) \right\} \nonumber $$

In the event that $m = \mu$, we would expect our hedged bets to both yield no earnings. However, if $m \neq \mu$, then one bet should lose and the other should win money. With this capital process, we can construct a confidence sequence.

Theorem 3: Hedged Capital CS (Waudby-Smith and Ramdas (2022)).
Let $(X_t)_{t = 1}^\infty \sim P$ be a sequence of random variables taking values in $[0, 1]$ with distribution $P \in \mathcal{P}^\mu$ where $\mathcal{P}$ is the set of distributions on $[0, 1]^\infty$ such that $\mathbb{E}_P[X_t \vert \mathcal{F}_{t-1}] = \mu$ for each $t$.
Let $(\tilde{\lambda}_t^+)_{t=1}^\infty$ and $(\tilde{\lambda}_t^-)_{t=1}^\infty$ be predictable, real-valued sequences, independent of $m$, and define, for each $t \geq 1$: $$ \lambda_t^+(m) = \min\left\{ \rvert \tilde{\lambda}_t^+\rvert, \frac{c}{m} \right\} \hspace{5mm} \text{ and } \hspace{5mm} \lambda_t^-(m) = \min\left\{ \rvert \tilde{\lambda}_t^-\rvert, \frac{c}{1-m} \right\} \nonumber $$ and for some chosen $c \in [0, 1)$.
The sequence $(\mathcal{B}_t^{\pm})_{t = 1}^\infty$ forms a $(1-\alpha)$ confidence sequence for $\mu$ where: $$ \mathcal{B}_t^{\pm} = \left\{ m \in [0, 1] \bigg\rvert \mathcal{K}_t^\pm(m) < \frac{1}{\alpha} \right\} \nonumber $$ The running intersection, $\cap_{i \leq t} \mathcal{B}_i^\pm$ is also a valid $(1-\alpha)$ confidence sequence for $\mu$. Furthermore, $\mathcal{B}_t^{\pm}$ is a valid $(1-\alpha)$ confidence interval for each $t \geq 1$.

The authors recommend using $\tilde{\lambda}_t^+ = \tilde{\lambda}_t^- = \lambda_t^{\text{PrPl}\pm}$ for the predictable plug-ins, where:

\[\lambda_t^{\text{PrPl}\pm} = \sqrt{\frac{2 \log(2/\alpha)}{\hat{\sigma}_{t-1}^2 t \log(t+1)}} \hspace{5mm} \text{ where } \hspace{2mm} \hat{\sigma}_t^2 = \frac{\frac{1}{4} + \sum_{i = 1}^t (X_i - \hat{\mu}_i)^2}{t+1} \hspace{2mm}\text{ and }\hspace{2mm} \hat{\mu}_t = \frac{\frac{1}{2} + \sum_{i = 1}^t X_i}{t + 1} \hspace{2mm}\text{ and }\hspace{2mm} c = \frac{1}{2} \text{ or } \frac{3}{4} \label{eq:hedged-capital-choices}\]

Similar to the empirical Bernstein case above, we can use the running intersection, $\cap_{i \leq n} \mathcal{B}_i^\pm$, to form a $(1- \alpha)$ confidence interval for a fixed sample size $n$, where we choose:

\[\tilde{\lambda}_t^+ = \tilde{\lambda}_t^- = \tilde{\lambda}_t^\pm = \sqrt{\frac{2 \log(2/\alpha)}{n \hat{\sigma}_{t - 1}^2}}\]

for the same $\hat{\sigma}^2$ as defined in Eq. \ref{eq:hedged-capital-choices}.

A key finding of Waudby-Smith and Ramdas is that, for large sample sizes, the hedged capital confidence intervals are almost surely better than the confidence intervals one can create from Hoeffding’s inequality. Even more, the hedged capital confidence intervals converge at a rate of $O(1/\sqrt{n})$, which is optimal.


References

  1. Waudby-Smith, I., & Ramdas, A. (2022). Estimating means of bounded random variables by betting. arXiv [Math.ST]. Retrieved from http://arxiv.org/abs/2010.09686  2

  2. Maurer, A., & Pontil, M. (2009). Empirical Bernstein Bounds and Sample Variance Penalization (arXiv:0907.3740). arXiv. https://doi.org/10.48550/arXiv.0907.3740 

  3. Waudby-Smith, I., & Ramdas, A. (2023). Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1), 1–27. doi:10.1093/jrsssb/qkad009 

  4. Ramdas, A., Ruf, J., Larsson, M., & Koolen, W. (2022). Admissible anytime-valid sequential inference must rely on nonnegative martingales (arXiv:2009.03167). arXiv. https://doi.org/10.48550/arXiv.2009.03167 

  5. Ramdas, A., Grunwald, P., Vovk, V., & Shafer, G. (2023). Game-Theoretic Statistics and Safe Anytime-Valid Inference. 

  6. Shafer, G., & Vovk, V. (2019). Game-theoretic foundations for probability and finance (First edition). John Wiley & Sons, Inc. 

  7. Howard, S. R., Ramdas, A., McAuliffe, J., & Sekhon, J. (2021). Time-uniform, nonparametric, nonasymptotic confidence sequences. The Annals of Statistics, 49(2). https://doi.org/10.1214/20-AOS1991 

This project is maintained by aerosengart