A colleague introduced me to some recent work from Waudby-Smith and Ramdas here at Carnegie Mellon. Since I’ve been working on applications of concentration bounds, it seemed worthwhile to review their paper, Estimating means of bounded random variables by betting, which can often achieve tighter bounds than the older work of Maurer and Pontil.
This paper (among many others in this line of work) approaches the problem through test (super)martingales and a betting interpretation of hypothesis testing.
A lot of the discussion relies on measure and probability theory. I’ve covered most of the basic concepts in my post on measure theory, so I’ll refer the reader to that if you are looking for definitions and intuition for the relevant concepts here.
Note: Not all of the proofs are finished/included. I am hoping to find the time to return to this post and complete them.
The paper focuses on constructing valid (and tight) confidence sets for the mean of a sequence of bounded random variables.
Here, we’ll recap the important notation introduced in Waudby-Smith and Ramdas.
Let \((X_t)_{t = 1}^\infty \sim P\) be a sequence of random variables taking values in $[0, 1]$ with (unknown) conditional mean $\mu \in [0, 1]$ where $P \in \mathcal{P}^\mu$ (we use the notation we defined earlier; i.e. $\mathcal{P}^\mu$ is the set of all distributions on $[0, 1]^\infty$ such that \(\mathbb{E}_P[X_t \rvert X_1, \dots, X_{t-1}] = \mu\)).
We’ll use $(a \pm b)$ as an abbreviation of the interval $(a-b, a+b)$. We’ll also use $X_1^t$ to denote $(X_1, \dots, X_t)$. Let \(\mathcal{F}_t = \sigma(X_1^t)\) denote the $\sigma$-field generated by $X_1^t$. We’ll suppose that $\mathcal{F}_0$ is the trivial $\sigma$-field (that is, the $\sigma$-field containing the empty set and the entire set itself). We’ll use \(\mathcal{F} = \{ \mathcal{F}_t \}_{t = 0}^\infty\) to denote the canonical filtration of $X_1^t$ where \(\mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}_2 \subset \dots\).
A stochastic process $(M_t)_{t = 0}^\infty$ adapted to canonical filtration $\mathcal{F}$ with $M_0 = 1$ is termed a test supermartingale for $P$ if:
\[\mathbb{E}_P[M_t \rvert \mathcal{F}_{t - 1}] \leq M_{t-1} \hspace{5mm} \forall t \geq 1\]If instead, \(\mathbb{E}_P[M_t \rvert \mathcal{F}_{t - 1}] = M_{t-1}\), then the process is called a test martingale. If either of these hold for all $P \in \mathcal{P}$, then we say that \((M_t)_{t = 0}^\infty\) is a test (super)martingale for $\mathcal{P}$. Finally, a sequence \((\lambda_t)_{t = 1}^\infty\) is called predictable if $\lambda_t$ is $\mathcal{F}_{t-1}$-measurable for all $t \geq 1$.
Recall that, for a random sample \((X_t)_{t=1}^n \sim P\) and error tolerance $\alpha \in (0, 1)$, a $(1 - \alpha)$ confidence interval (CI) for $\mu$ is a random set (dependent upon our sample) that contains the true conditional mean $\mu$ with probability at least $1 - \alpha$ under any distribution in the family $\mathcal{P}^\mu$.
\[C_n = C(X_1, \dots, X_n) \subseteq [0, 1] \hspace{5mm} \text{such that } \forall n \geq 1, \underset{P \in \mathcal{P}^\mu}{\inf} \left\{ P(\mu \in C_n) \right\} \geq 1 - \alpha\]Hoeffding’s inequality leads to the CI:
\[C_n = \left( \bar{X}_n \pm \sqrt{\frac{\log(2/\alpha)}{2n}} \right) \cap [0, 1]\]where the $2$ in the denominator comes from the fact that $X_i \in [0, 1]$ (the range factor $(b - a)^2$ in Hoeffding’s inequality equals $1$), and the intersection restricts our interval to the known possible values that $\mu$ can take on.
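As a quick sanity check, here is a minimal Python sketch of this fixed-$n$ Hoeffding interval; the Uniform$(0,1)$ sample is an illustrative choice, not from the paper:

```python
import math
import random

def hoeffding_ci(xs, alpha=0.05):
    """Fixed-n Hoeffding confidence interval for the mean of
    [0, 1]-valued observations, intersected with [0, 1]."""
    n = len(xs)
    xbar = sum(xs) / n
    half_width = math.sqrt(math.log(2 / alpha) / (2 * n))
    return max(0.0, xbar - half_width), min(1.0, xbar + half_width)

# Illustrative data: 1000 draws from Uniform(0, 1), so mu = 0.5.
random.seed(0)
xs = [random.random() for _ in range(1000)]
lo, hi = hoeffding_ci(xs)
```

Note that the half-width depends only on $n$ and $\alpha$, not on the data; this is exactly the variance-insensitivity that the empirical Bernstein and betting approaches later improve upon.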
An interval implies a contiguous region of $\mathbb{R}$, but in general we could have a union of disjoint subsets. This is why we use the broader term “confidence set”.
A confidence set is constructed for some sample of fixed size $n$. An alternative scenario is that we are observing data in batches, and we want to construct a new confidence set for each batch such that the sequence of sets $(C_t)_{t=1}^\infty$ satisfies:
\[\underset{P \in \mathcal{P}^\mu}{\sup} \left\{ P(\exists t \geq 1: \mu \not\in C_t) \right\} \leq \alpha\]In words, we want the probability that any set in the sequence fails to contain $\mu$ to be at most $\alpha$ for every distribution in the family $\mathcal{P}^\mu$. We call such a $(C_t)_{t=1}^\infty$ a confidence sequence, and, ideally, we want \(\underset{t \rightarrow \infty}{\lim} C_t = \{ \mu \}\) (that is, we want the sequence of sets to shrink to just the true value as we observe more and more data).
Notably, the above definitions hold for arbitrary stopping times, $\tau$. Since these can be, and often are desired to be, data-dependent, we need a way to decide when our data is sufficient to stop constructing our sequences (we have achieved the result we wanted, we have enough data, etc.).
A level-$\alpha$ sequential test controls the Type I error rate at $\alpha$ for any stopping time. This is by definition: the probability that $\psi_\tau$ outputs a $1$ under the null hypothesis is at most $\alpha$.
The first theorem is a procedure to construct a confidence set for the mean of a sequence of bounded random variables using supermartingales.
Let \((X_t)_{t = 1}^\infty \sim P\) be a sequence of (finitely or infinitely many) random variables taking values in $[0, 1]$ with unknown conditional mean $\mu$ and distribution \(P \in \mathcal{P}^\mu\), where \(\mathcal{P}^\mu\) is the collection of all distributions on \([0, 1]^\infty\) such that \(\mathbb{E}_P\left[ X_t \rvert X_1, \dots, X_{t-1}\right] = \mu\).
$(C_t)_{t = 1}^\infty$ forms a $(1 - \alpha)$-confidence sequence for $\mu$. That is, the following holds:
\[\underset{P \in \mathcal{P}^\mu}{\sup} \left\{ \mathbb{P}_P(\exists t \geq 1 \rvert \mu \not\in C_t) \right\} \leq \alpha\]The proof is fairly simple and relies mainly on Ville’s inequality.
For any $m \in [0, 1]$, by Step 2, if $M_t^m \geq \frac{1}{\alpha}$, then we know $\phi_t^m = 1$. Thus, under any $P \in \mathcal{P}^m$, the probability $\mathbb{P}_P(\exists t \geq 1 \rvert \phi_t^m = 1)$ is the same as the probability $\mathbb{P}_P\left(\exists t \geq 1 \rvert M_t^m \geq \frac{1}{\alpha}\right)$.
By construction, $(M_t^m)_{t = 0}^\infty$ is upper bounded by some test supermartingale for each $P \in \mathcal{P}^m$, so we can apply Ville’s inequality to get:
\[\mathbb{P}_P(\exists t \geq 1 \rvert \phi_t^m = 1) = \mathbb{P}_P\left(\exists t \geq 1 \rvert M_t^m \geq \frac{1}{\alpha}\right) \leq \alpha \nonumber\]Thus, $(\phi_t^m)_{t = 1}^\infty$ is a level $\alpha$ sequential hypothesis test.
The construction of $(C_t)_{t=0}^\infty$ entails keeping only those $m \in [0, 1]$ such that $H_0^m$ has not been rejected by time $t$ (i.e., $\phi_t^m = 0$). Thus, there is some $t \geq 1$ such that $\mu \not\in C_t$ if, and only if, there is some $t \geq 1$ such that $\phi_t^\mu = 1$.
The above holds for any $P \in \mathcal{P}^\mu$, which yields the desired result:
\[\underset{P \in \mathcal{P}^\mu}{\sup} \left\{ \mathbb{P}_P(\exists t \geq 1 \rvert \mu \not\in C_t) \right\} = \underset{P \in \mathcal{P}^\mu}{\sup} \left\{ \mathbb{P}_P(\exists t \geq 1 \rvert \phi_t^\mu = 1) \right\} \leq \alpha \nonumber\]A few notes are worth mentioning. First, the process $(M_t^m)_{t = 0}^\infty$ that is constructed in Step 2 of the above theorem is an example of an $e$-process for $\mathcal{P}^m$ (see my post on martingales). Secondly, the power and utility of confidence sequences constructed in this way depend upon the choice of upper-bounding test supermartingales; poor choices can lead to bad statistical performance and computational infeasibility. Lastly, because the confidence sequence is time-uniform (it holds over all time points), we can treat any fixed sample size as a stopping point and obtain a valid $(1-\alpha)$-confidence set/interval for $\mu$ (no more batches are gathered).
The main results of this paper are based on the predictable plug-in, which is a sequence of predictable real-valued random variables that are chosen specially to be plugged into other sequences in order to form test (super)martingales. The choice of predictable plug-in depends upon the following Lemma.
Let \((X_t)_{t = 1}^\infty \sim P\) be a sequence of random variables. Let $\mathcal{F}$ denote the canonical filtration with respect to \((X_t)_{t =1}^\infty\).
Suppose that, for some choice of $\mu$, $v_t$, and $\psi(\lambda)$, the following is satisfied, for each $t \geq 1$, for any $\lambda \in \Lambda \subseteq \mathbb{R}$:
\[\mathbb{E}_P\left[ \exp\left(\lambda(X_t - \mu) - v_t \psi(\lambda) \right) \rvert \mathcal{F}_{t-1} \right] \leq 1\]For any $(\lambda_t)_{t=1}^\infty$ with $\lambda_t \in \Lambda$ for all $t$ that is predictable with respect to $\mathcal{F}$, we have that:
\[M_t^\psi(\mu) = \prod_{i = 1}^t \exp\left( \lambda_i(X_i - \mu) - v_i \psi(\lambda_i) \right)\]is a test supermartingale with respect to $\mathcal{F}$.
I am not sure how to prove it, but I think that the process defined above as \(M_t^\psi(\mu)\) is an example of Bernstein’s supermartingale.
Notice that, for $t = 1$, the conditional expectation of $M_t^\psi(\mu)$ is:
\[\mathbb{E}_P \left[M_1^\psi(\mu) \bigg\rvert \mathcal{F}_0 \right] = \mathbb{E}_P\left[ \exp\left(\lambda_1(X_1 - \mu) - v_1\psi(\lambda_1) \right) \right] \leq 1\]by assumption. Because $\mathcal{F}_0$ is the trivial $\sigma$-field, which gives us no information, we drop the conditioning.
Now consider $t \geq 2$. For any $i \leq t - 1$, \(\exp(\lambda_i(X_i - \mu) - v_i \psi(\lambda_i))\) is \(\mathcal{F}_{t-1}\)-measurable. This is because \((\lambda_t)_{t=1}^\infty\) is predictable with respect to $\mathcal{F}$, implying $\lambda_i$ is \(\mathcal{F}_{t-1}\)-measurable. To see why, consider probability space \((\Omega, \mathcal{A}, P)\) and sub-$\sigma$-field \(\mathcal{A}_1 \subset \mathcal{A}\). If a random variable $X$ is \(\mathcal{A}_1\)-measurable, then for any Borel set $B$, \(X^{-1}(B) \in \mathcal{A}_1 \subset \mathcal{A}\). Thus, $X$ is also \(\mathcal{A}\)-measurable. In addition, \(\mathcal{F}\) is the canonical filtration, so $X_i$ must be \(\mathcal{F}_{t - 1}\)-measurable since \(\mathcal{F}_{t-1} = \sigma(X_1, \dots, X_{t-1})\).
Since \(\exp(\lambda_i(X_i - \mu) - v_i \psi(\lambda_i))\) is $\mathcal{F}_{t-1}$-measurable for all $i \leq t-1$, the first $t - 1$ factors of \(M_t^\psi(\mu)\) are “known” at time $t - 1$. Since we are dealing with bounded random variables, we can use Theorem 4.1.14 from Durrett to pull these measurable factors out of the conditional expectation, leaving only the conditional expectation of the $t$-th factor, which is at most $1$ by assumption.
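Writing this step out explicitly:

\[\begin{aligned} \mathbb{E}_P\left[M_t^\psi(\mu) \big\rvert \mathcal{F}_{t-1} \right] &= \left( \prod_{i = 1}^{t-1} \exp\left( \lambda_i(X_i - \mu) - v_i \psi(\lambda_i) \right) \right) \mathbb{E}_P\left[ \exp\left( \lambda_t(X_t - \mu) - v_t \psi(\lambda_t) \right) \big\rvert \mathcal{F}_{t-1} \right] \\ &\leq M_{t-1}^\psi(\mu) \cdot 1 = M_{t-1}^\psi(\mu) \end{aligned}\]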
Since \(\mathbb{E}_P[M_t^\psi(\mu) \rvert \mathcal{F}_{t-1}] \leq M_{t-1}^\psi(\mu)\) for all $t \geq 1$, and \(\mathbb{E}_P[M_1^\psi(\mu) \rvert \mathcal{F}_0] \leq 1\), \((M_t^\psi(\mu))_{t = 1}^\infty\) is a test supermartingale.
Let’s first define the Hoeffding process. For a candidate mean $m \in [0, 1]$ and predictable sequence \((\lambda_t)_{t=1}^\infty\), it is defined as:
\[M_t^{\text{PrPl-H}}(m) = \prod_{i = 1}^t \exp\left( \lambda_i(X_i - m) - \psi_H(\lambda_i) \right)\]where \(\psi_H(\lambda) = \lambda^2/8\). This sequence is a test supermartingale for $\mathcal{P}^m$, and we call $(\lambda_t)_{t=1}^\infty$ a predictable plug-in.
This result implies that if one can be clever about their choice of $(\lambda_t)_{t=1}^\infty$, one can construct a test supermartingale that can be used in Step 2 of Theorem 1 to get a confidence sequence for $\mu$. This is formalized in the following:
Let \((X_t)_{t=1}^\infty \sim P\) be a stochastic process with distribution \(P \in \mathcal{P}^\mu\) where \(\mathcal{P}^\mu\) is the set of distributions on \([0, 1]^\infty\) such that \(\mathbb{E}_P[X_t \rvert \mathcal{F}_{t-1}] = \mu\) for each $t$. For any predictable \((\lambda_t)_{t=1}^\infty\) where \(\lambda_t \in \mathbb{R}\) for all $t$, we can construct a $(1-\alpha)$ confidence sequence for $\mu$ as:
\[C_t^{\text{PrPl-H}} = \left(\frac{\sum_{i =1 }^t \lambda_i X_i }{\sum_{i = 1}^t \lambda_i} \pm \frac{\log(2/\alpha) + \sum_{i = 1}^t \psi_H(\lambda_i)}{\sum_{i = 1}^t \lambda_i} \right)\]The running intersection, \(\cap_{i \leq t} C_t^\text{PrPl-H}\), is also a valid $(1-\alpha)$ confidence sequence for $\mu$.
Proof to be completed.
The authors recommend choosing $(\lambda_t)_{t=1}^\infty$ such that:
\[\lambda_t^{PrPl-H} = \min \left\{ \sqrt{\frac{8 \log(2/\alpha)}{t \log(t + 1)}}, 1 \right\}\]Similar to the Hoeffding process, we can define the empirical Bernstein process.
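Before doing so, here is a minimal Python sketch of $C_t^{\text{PrPl-H}}$ using this recommended $\lambda_t$, with $\psi_H(\lambda) = \lambda^2/8$ (Hoeffding’s sub-Gaussian $\psi$ for $[0,1]$-valued data); the Uniform$(0,1)$ data is illustrative:

```python
import math
import random

def psi_h(lam):
    # Hoeffding's sub-Gaussian psi for [0, 1]-valued data.
    return lam ** 2 / 8

def prpl_h_cs(xs, alpha=0.05):
    """Predictable plug-in Hoeffding confidence sequence C_t^{PrPl-H},
    clipped to [0, 1]; returns one (lower, upper) pair per observation."""
    s_lx = s_l = s_psi = 0.0  # running sums: lambda_i * X_i, lambda_i, psi_H(lambda_i)
    intervals = []
    for t, x in enumerate(xs, start=1):
        # lambda_t depends only on t and constants, so it is predictable
        lam = min(math.sqrt(8 * math.log(2 / alpha) / (t * math.log(t + 1))), 1.0)
        s_lx += lam * x
        s_l += lam
        s_psi += psi_h(lam)
        center = s_lx / s_l
        half = (math.log(2 / alpha) + s_psi) / s_l
        intervals.append((max(0.0, center - half), min(1.0, center + half)))
    return intervals

# Illustrative data: Uniform(0, 1), so mu = 0.5.
random.seed(1)
cs = prpl_h_cs([random.random() for _ in range(2000)])
```

Each interval is valid simultaneously over time, so the sequence can be monitored as data arrives without any correction for peeking.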
One can perform a similar procedure as before, using $(M_t^\text{PrPl-EB}(m))$ in Step 2 of Theorem 1 to get a confidence sequence.
Let \((X_t)_{t=1}^\infty \sim P\) be a stochastic process with distribution \(P \in \mathcal{P}^\mu\) where \(\mathcal{P}^\mu\) is the set of distributions on \([0, 1]^\infty\) such that \(\mathbb{E}_P[X_t \rvert \mathcal{F}_{t-1}] = \mu\) for each $t$. For any predictable \((\lambda_t)_{t=1}^\infty\) where $\lambda_t \in (0, 1)$ for all $t$, we can construct a $(1-\alpha)$ confidence sequence for $\mu$ as:
\[C_t^{\text{PrPl-EB}} = \left(\frac{\sum_{i =1 }^t \lambda_i X_i }{\sum_{i = 1}^t \lambda_i} \pm \frac{\log(2/\alpha) + \sum_{i = 1}^t v_i \psi_{EB}(\lambda_i)}{\sum_{i = 1}^t \lambda_i} \right)\]The running intersection, \(\cap_{i \leq t} C_t^\text{PrPl-EB}\), is also a valid $(1-\alpha)$ confidence sequence for $\mu$.
Proof to be completed.
The authors recommend choosing $(\lambda_t)_{t=1}^\infty$ such that:
\[\begin{equation} \label{eq:empirical-bernstein-choices} \lambda_t^{PrPl-EB} = \min \left\{ \sqrt{\frac{2\log(2/\alpha)}{\hat{\sigma}_{t-1}^2 t \log(1 + t)}}, c \right\} \end{equation}\]where:
\[\begin{aligned} \hat{\sigma}_t^2 &= \frac{\frac{1}{4} + \sum_{i = 1}^t (X_i - \hat{\mu}_i)^2}{t+1} \\ \hat{\mu}_t &= \frac{\frac{1}{2} + \sum_{i = 1}^t X_i}{t + 1} \\ c &= \frac{1}{2} \text{ or } \frac{3}{4} \end{aligned}\]This procedure yields two sequences, \((\hat{\mu}_t)_{t=1}^\infty\) and \((\hat{\sigma}_t^2)_{t=1}^\infty\), which can be thought of as predictable, regularized sample means and variances.
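To make this concrete, here is a minimal Python sketch of $C_t^{\text{PrPl-EB}}$ with these recommended plug-ins. This excerpt does not restate $\psi_{EB}$ and $v_i$, so I take $\psi_{EB}(\lambda) = (-\log(1 - \lambda) - \lambda)/4$ and $v_i = 4(X_i - \hat{\mu}_{i-1})^2$ from the paper; treat those two choices as my reading of it:

```python
import math
import random

def prpl_eb_cs(xs, alpha=0.05, c=0.5):
    """Predictable plug-in empirical Bernstein confidence sequence
    C_t^{PrPl-EB}, clipped to [0, 1]."""
    mu_hat, sig2_hat = 0.5, 0.25  # regularized estimates at t = 0
    s_x = s_dev = 0.0             # running sums for mu_hat / sigma_hat
    s_lx = s_l = s_vpsi = 0.0     # running sums for the interval itself
    intervals = []
    for t, x in enumerate(xs, start=1):
        # lambda_t uses sigma_hat_{t-1}^2 and v_t uses mu_hat_{t-1}: both predictable
        lam = min(math.sqrt(2 * math.log(2 / alpha) / (sig2_hat * t * math.log(t + 1))), c)
        v = 4 * (x - mu_hat) ** 2
        psi_eb = (-math.log(1 - lam) - lam) / 4
        s_lx += lam * x
        s_l += lam
        s_vpsi += v * psi_eb
        center = s_lx / s_l
        half = (math.log(2 / alpha) + s_vpsi) / s_l
        intervals.append((max(0.0, center - half), min(1.0, center + half)))
        # update the regularized mean / variance estimates for round t + 1
        s_x += x
        mu_hat = (0.5 + s_x) / (t + 1)
        s_dev += (x - mu_hat) ** 2
        sig2_hat = (0.25 + s_dev) / (t + 1)
    return intervals

# Illustrative data: Uniform(0, 1), so mu = 0.5.
random.seed(1)
cs = prpl_eb_cs([random.random() for _ in range(2000)])
```

Because $\lambda_t$ scales inversely with $\hat{\sigma}_{t-1}$, the bets adapt to the observed variance, which is where the improvement over the Hoeffding plug-in comes from.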
For fixed time, we can simply use the running intersection to form a $(1-\alpha)$ confidence interval for $\mu$. Suppose we have $n$ samples. Then we can form a confidence interval as $C_n^{\text{PrPl-EB_CI}} = \cap_{i \leq n} C_i^{\text{PrPl-EB}}$ with any predictable sequence $(\lambda_t)_{t = 1}^\infty$. The authors recommend choosing:
\[\lambda_t^{\text{PrPl-EB}(n)} = \min \left\{ \sqrt{\frac{2 \log(2/\alpha)}{n \hat{\sigma}_{t-1}^2}}, c \right\}\]with $\hat{\sigma}_i^2$ and $c$ as in Eq. \eqref{eq:empirical-bernstein-choices}.
Something very interesting is that, for i.i.d. data, the width of the above confidence interval (scaled by $\sqrt{n}$) converges to $\sigma \sqrt{2 \log(2/\alpha)}$, where $\sigma$ is the true standard deviation. In contrast, the original empirical Bernstein confidence intervals introduced by Maurer and Pontil (2009) only reach $\sigma \sqrt{2 \log(4 / \alpha)}$.
In order to improve the confidence sequences derived above, the authors dive deeper into the gambling analogy. Suppose we are playing a game in which we can accumulate wealth by making bets against some hypothesis, $H_0^m$, being true. We start off with one dollar, and, in each “round”, we will make a bet that is determined by our predictable sequence $(\lambda_t^m)_{t=1}^\infty$. More specifically, we make bet $b_t = s_t\rvert \lambda_t^m \rvert$ where $s_t = -1$ if we think $\mu < m$ and $s_t = 1$ if we think $\mu > m$. The second term, $\rvert \lambda_t^m \rvert$, is the amount we are willing to lose/win in our bet (e.g. $\rvert \lambda_t^m \rvert = 0$ implies we are risking nothing).
We can conceptualize the total wealth we have at any given time with the capital process, \(\mathcal{K}_t(m) = \prod_{i=1}^t \left( 1 + \lambda_i^m (X_i - m) \right)\), with initial wealth \(\mathcal{K}_0(m) = 1\).
There is a deep connection between test (super)martingales, bounded random variables, and the capital process that the authors emphasize. This is formalized below.
Let $X_1, X_2, \dots \sim P$ be a sequence of random variables taking values in $[0, 1]$, and let $\mu \in [0, 1]$. The following are equivalent:
Proof to be completed.
The main takeaway is that \((\mathcal{K}_t(\mu))_{t=0}^\infty\) is a test martingale (not just a supermartingale) for $P$ if the random variables have conditional expectation equal to $\mu$. If the hypothesis $H_0^m$ is true (i.e. $m = \mu$), then, by Proposition 2, we expect to not make any money, since the conditional expectation of our wealth is constant at every round (i.e. \(\mathbb{E}_P[X_t \rvert \mathcal{F}_{t-1}] = \mu\) for all $t \in \mathbb{N}$); moreover, by Ville’s inequality, the probability of ever accumulating capital of at least $\frac{1}{\alpha}$ is at most $\alpha$. However, if nature is wrong and $m \neq \mu$, a well-chosen betting strategy will tend to grow our wealth.
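To see this behavior numerically, here is a small simulation of the capital process \(\mathcal{K}_t(m) = \prod_{i \leq t}(1 + \lambda_i (X_i - m))\) with a constant bet $\lambda_i = 0.5$ on Uniform$(0,1)$ data (so $\mu = 0.5$); the data and bet size are illustrative choices, not from the paper:

```python
import random

def capital(xs, m, lam=0.5):
    """Wealth K_n after betting a constant fraction lam against H_0: mean = m."""
    wealth = 1.0
    for x in xs:
        wealth *= 1.0 + lam * (x - m)  # stays positive since lam * |x - m| < 1
    return wealth

random.seed(2)
xs = [random.random() for _ in range(5000)]  # mu = 0.5
k_null = capital(xs, m=0.5)   # true null: wealth should stay modest
k_wrong = capital(xs, m=0.2)  # misspecified null: wealth grows exponentially
```

With this seed, the wealth against the true mean never threatens the $1/\alpha$ rejection threshold, while the wealth against the wrong mean blows far past it.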
This naturally extends to hypothesis testing: under the null hypothesis, the probability of earning a lot of money is very low, so we should reject $H_0^m$ if our capital becomes too large, because that contradicts the expectation of constant wealth. The authors then explain that one could imagine playing a game for every value of $m \in [0, 1]$. For a time $t$, one could then construct a confidence set as the collection of games with small capital at $t$.
The reason the authors introduce this concept is that the capital process is a test martingale when $m = \mu$, unlike the Hoeffding and empirical Bernstein processes, which are test supermartingales. Since a test supermartingale’s expected wealth can only shrink over time, tests constructed from one can be more conservative: the initial wealth $M_0 = 1$ upper bounds the expected wealth at every later time, so Ville’s inequality yields a relatively looser bound.
Since the capital process describes the wealth we accumulate during a betting game, we can ask the question: is there a way to mitigate our risk? In general betting games, one can hedge one’s bets. This means that, after placing an initial bet, the player bets again on a different outcome (or multiple different outcomes) as they see the game play out. This can reduce their net loss because their later bets, based on more information about the game, may win back some money.
Now, consider making two simultaneous bets: one with some part $(\theta)$ of one’s current wealth and the other with the rest. One bet will be made for the case that $\mu \geq m$, and the other will be made for the case that $\mu < m$. We can describe separate capital processes for these two bets and use these to describe the wealth accumulated across both betting strategies. This is called the hedged capital process.
In the event that $m = \mu$, we would expect our hedged bets to both yield no earnings. However, if $m \neq \mu$, then one bet should lose and the other should win money. With this capital process, we can construct a confidence sequence.
Let \((X_t)_{t = 1}^\infty \sim P\) be a sequence of random variables taking values in \([0, 1]\) with distribution \(P \in \mathcal{P}^\mu\) where \(\mathcal{P}^\mu\) is the set of distributions on \([0, 1]^\infty\) such that \(\mathbb{E}_P[X_t \vert \mathcal{F}_{t-1}] = \mu\) for each $t$.
Let \((\tilde{\lambda}_t^+)_{t=1}^\infty\) and \((\tilde{\lambda}_t^-)_{t=1}^\infty\) be predictable, real-valued sequences, independent of $m$, and define, for each $t \geq 1$:
\[\lambda_t^+(m) = \min\left\{ \rvert \tilde{\lambda}_t^+\rvert, \frac{c}{m} \right\} \hspace{5mm} \text{ and } \hspace{5mm} \lambda_t^-(m) = \min\left\{ \rvert \tilde{\lambda}_t^-\rvert, \frac{c}{1-m} \right\}\]for some chosen \(c \in [0, 1)\). The sequence \((\mathcal{B}_t^{\pm})_{t = 1}^\infty\) forms a $(1-\alpha)$ confidence sequence for $\mu$ where:
\[\mathcal{B}_t^{\pm} = \left\{ m \in [0, 1] \bigg\rvert \mathcal{K}_t^\pm(m) < \frac{1}{\alpha} \right\}\]The running intersection, \(\cap_{i \leq t} \mathcal{B}_i^\pm\) is also a valid $(1-\alpha)$ confidence sequence for $\mu$. Furthermore, \(\mathcal{B}_t^{\pm}\) is a valid $(1-\alpha)$ confidence interval for each $t \geq 1$.
Proof to be completed.
The authors recommend using $\tilde{\lambda}_t^+ = \tilde{\lambda}_t^- = \lambda_t^{\text{PrPl}\pm}$ for the predictable plug-ins, where:
\[\begin{equation} \label{eq:hedged-capital-choices} \lambda_t^{\text{PrPl}\pm} = \sqrt{\frac{2 \log(2/\alpha)}{\hat{\sigma}_{t-1}^2 t \log(t+1)}} \end{equation}\]where:
\[\begin{aligned} \hat{\sigma}_t^2 &= \frac{\frac{1}{4} + \sum_{i = 1}^t (X_i - \hat{\mu}_i)^2}{t+1} \\ \hat{\mu}_t &= \frac{\frac{1}{2} + \sum_{i = 1}^t X_i}{t + 1} \\ c &= \frac{1}{2} \text{ or } \frac{3}{4} \end{aligned}\]Similar to the empirical Bernstein case above, we can use the running intersection, $\cap_{i \leq n} \mathcal{B}_i^\pm$, to form a $(1- \alpha)$ confidence interval for a fixed sample size $n$, where we choose:
\[\tilde{\lambda}_t^+ = \tilde{\lambda}_t^- = \tilde{\lambda}_t^\pm = \sqrt{\frac{2 \log(2/\alpha)}{n \hat{\sigma}_{t - 1}^2}}\]with $\hat{\sigma}_{t-1}^2$ as in the definitions accompanying Eq. \eqref{eq:hedged-capital-choices}.
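Since this excerpt doesn’t restate the hedged capital process itself, the sketch below takes it from the paper, \(\mathcal{K}_t^\pm(m) = \max\{\theta \mathcal{K}_t^+(m), (1-\theta)\mathcal{K}_t^-(m)\}\) with $\theta = 1/2$, where $\mathcal{K}^+$ bets on $\mu > m$ and $\mathcal{K}^-$ on $\mu < m$, and approximates \(\mathcal{B}_n^\pm\) by a grid search over $m$ (a practical simplification, not the authors’ exact algorithm):

```python
import math
import random

def hedged_conf_set(xs, alpha=0.05, theta=0.5, c=0.5, grid_size=1000):
    """Grid-search sketch of the hedged-capital confidence set at n = len(xs).
    Returns (min, max) of the accepted grid points, or None if all rejected."""
    n = len(xs)
    # predictable plug-in lambda-tilde for a fixed sample size n
    mu_hat, sig2_hat, s_x, s_dev = 0.5, 0.25, 0.0, 0.0
    lam_tilde = []
    for t, x in enumerate(xs, start=1):
        lam_tilde.append(math.sqrt(2 * math.log(2 / alpha) / (n * sig2_hat)))
        s_x += x
        mu_hat = (0.5 + s_x) / (t + 1)
        s_dev += (x - mu_hat) ** 2
        sig2_hat = (0.25 + s_dev) / (t + 1)
    accepted = []
    for i in range(grid_size + 1):
        m = i / grid_size
        k_plus = k_minus = 1.0
        for x, lt in zip(xs, lam_tilde):
            # truncate bets so the wealth can never go non-positive
            lam_p = min(abs(lt), c / m) if m > 0 else abs(lt)
            lam_m = min(abs(lt), c / (1 - m)) if m < 1 else abs(lt)
            k_plus *= 1.0 + lam_p * (x - m)
            k_minus *= 1.0 - lam_m * (x - m)
        if max(theta * k_plus, (1 - theta) * k_minus) < 1 / alpha:
            accepted.append(m)
    return (min(accepted), max(accepted)) if accepted else None

random.seed(3)
xs = [random.random() for _ in range(1000)]  # mu = 0.5
interval = hedged_conf_set(xs)
```

On this kind of data the accepted region comes out as a short interval around the sample mean, considerably narrower than the fixed-$n$ Hoeffding interval for the same $n$.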
A key finding of Waudby-Smith and Ramdas is that, for large enough sample sizes, the hedged capital confidence intervals are almost surely tighter than the confidence intervals one can create from Hoeffding’s inequality. Moreover, the hedged capital confidence intervals shrink at a rate of $O(1/\sqrt{n})$, which is optimal.