This post covers the pertinent results in Empirical Bernstein Bounds and Sample Variance Penalization by Maurer and Pontil.
We’ll assume we have some random variable, $X$, under study that comes from some sample space, $\mathcal{X}$. Let $\mu$ denote the law of $X$, and let $\ell$ be some loss function. We’ll use $\ell_h(X)$ to denote the loss of hypothesis $h$ for observed value $X$.
Assuming that the boundedness conditions are satisfied by $\ell_h$, we can use Hoeffding’s inequality to get, with probability $1 - \delta$ (for any $\delta > 0$):
\[\big\lvert R_n(\ell_h, \mathbf{X}) - R(\ell_h, \mu) \big \rvert \leq \sqrt{\frac{1}{2n^2}\sum_{i = 1}^n (b_i - a_i)^2 \log\left(\frac{2}{\delta} \right)}\]Recall that machine learning is often concerned with hypothesis selection (e.g. model fitting, parameter estimation, etc.) via empirical risk minimization, which is a procedure that picks the hypothesis that minimizes the empirical risk $R_n(\ell_f, \mathbf{X})$ over all $f \in \mathcal{F}$. The basic idea is that if the empirical risk is close to the true risk, then we will end up picking a near-optimal hypothesis. If we have some guarantee on the difference between the empirical and true risk, then learning will be possible with large enough sample sizes!
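To make the Hoeffding width concrete, here is a minimal sketch in Python (assuming numpy; the function name and test values are mine, not from the paper):

```python
import numpy as np

def hoeffding_width(a, b, delta):
    """Two-sided Hoeffding deviation width for the empirical mean of
    independent X_i with X_i in [a_i, b_i], at confidence level 1 - delta."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n = a.size
    # |R_n - R| <= sqrt(sum_i (b_i - a_i)^2 log(2/delta) / (2 n^2)) w.p. >= 1 - delta
    return np.sqrt(np.sum((b - a) ** 2) * np.log(2.0 / delta) / (2.0 * n ** 2))

# 1000 losses bounded in [0, 1], at the 95% level:
print(hoeffding_width(np.zeros(1000), np.ones(1000), delta=0.05))  # ~0.043
```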
In later proofs, we will need to use some measure of the complexity of a function class. For a fixed $\epsilon > 0$, function class $\mathcal{F}$, and sample size $n$, we’ll define the growth function, which is an extension of covering numbers. However, we’ll need some intermediate definitions first.
Intuitively, a subset $\mathcal{N}$ is an $\epsilon$-net of $K$ if every point of $K$ lies within a ball of radius $\epsilon$ centered at some point of $\mathcal{N}$ (i.e. $K$ can be covered by such balls).
Vershynin provides a very good alternative explanation of the covering number: it is the smallest number of balls with centers in $K$ and radius $\epsilon$ that completely cover $K$.
The growth function is the least upper bound (over all samples of size $n$) on the covering number of the set of all hypotheses in $\mathcal{F}$ applied to the sample $\mathbf{x}$ with respect to the distance metric induced by $\lVert \cdot \rVert_{\infty}$.
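As a toy illustration (my own, not from the paper or Vershynin), the following sketch greedily builds an $\epsilon$-net of a finite set $K \subset [0, 1]$ under the sup-norm (plain absolute value in one dimension); the size of the resulting net upper-bounds the covering number:

```python
import numpy as np

def greedy_epsilon_net(K, eps):
    """Greedily select centers from K until every point of K lies within
    eps of some chosen center; the result is an eps-net of K."""
    net = []
    for x in K:
        if all(abs(x - c) > eps for c in net):
            net.append(x)
    return net

K = np.linspace(0.0, 1.0, 101)        # a finite subset of [0, 1]
net = greedy_epsilon_net(K, eps=0.1)
print(len(net))                       # covering number of K at scale 0.1 is <= len(net)
```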
This section will rely on common notation. Let $[0, 1]^n$ be the $n$-dimensional unit cube, and let \(\mathbf{x} := (x_1, \dots, x_n) \in [0, 1]^n\) be a point in the cube. Define the sample mean and sample variance as \(P_n(\mathbf{x}) := \frac{1}{n}\sum_{i = 1}^n x_i\) and \(V_n(\mathbf{x}) := \frac{1}{n(n-1)} \sum_{i = 1}^n \sum_{j = 1}^n \frac{(x_i - x_j)^2}{2}\), respectively.
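In code, these two statistics look as follows (a numpy sketch of my own; note that the pairwise form of $V_n$ coincides with the usual unbiased sample variance, which the last line checks):

```python
import numpy as np

def P_n(x):
    """Sample mean P_n(x)."""
    return np.mean(x)

def V_n(x):
    """Sample variance via the pairwise form (1/(n(n-1))) sum_ij (x_i - x_j)^2 / 2."""
    n = x.size
    diffs = x[:, None] - x[None, :]
    return np.sum(diffs ** 2) / (2.0 * n * (n - 1))

rng = np.random.default_rng(0)
x = rng.uniform(size=50)
assert np.isclose(V_n(x), np.var(x, ddof=1))  # pairwise form == unbiased variance
```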
For a function $f$, we will use \(f(\mathbf{x}) := (f(x_1), \dots, f(x_n))\) to denote the function applied element-wise to the sample vector $\mathbf{x}$. Similarly, we will use \(P_n(f, \mathbf{x}) := P_n(f(\mathbf{x}))\) and \(V_n(f, \mathbf{x}) := V_n(f(\mathbf{x}))\).
For distribution $\mu$, the $n$-fold product is $\mu^n$. General product measures are denoted with $\times$ or $\prod$. For $X \sim \mu$ and function $f$, \(P(f, \mu) := \mathbb{E}_{X \sim \mu}[f(X)]\) and \(V(f, \mu) := V_{X \sim \mu} f(X)\) which will be notated more simply as $\mathbb{E}[f(X)]$ and $V(f(X))$, respectively.
Notice that, although the Bernstein inequalities and Bennett’s inequality can provide better bounds, they require knowledge of the unknown variance. Maurer and Pontil first derive an inequality similar to those of Bernstein and Bennett that only requires calculation of the sample variance.
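For reference, the Bennett-type width used in the proof below (the form of Corollary 2) depends on the true average variance $W$, which is exactly what we typically cannot compute. A sketch (numpy assumed; names and test values are mine):

```python
import numpy as np

def bennett_width_oracle(w, n, delta):
    """Bennett-type width for [0, 1] variables with *known* average variance w:
    sqrt(2 w log(1/delta) / n) + log(1/delta) / (3 n)."""
    L = np.log(1.0 / delta)
    return np.sqrt(2.0 * w * L / n) + L / (3.0 * n)

# Small true variance => much tighter than Hoeffding's sqrt(log(2/delta) / (2n)) ...
print(bennett_width_oracle(w=0.001, n=1000, delta=0.05))  # ~0.0035
print(np.sqrt(np.log(2.0 / 0.05) / (2 * 1000)))           # ~0.043
# ... but w is unknown in practice, which is the gap Theorem 11 closes.
```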
Let $\mathbf{X} = (X_1, \dots, X_n)$ be a vector with independent random variables taking values in $[0, 1]$. Let $\delta > 0$. With probability at least $1 - \delta$ in $\mathbf{X}$:
\[\mathbb{E}\left[ P_n(\mathbf{X}) \right] \leq P_n(\mathbf{X}) + \sqrt{\frac{2V_n(\mathbf{X}) \log(2/\delta)}{n}} + \frac{7 \log(2/\delta)}{3(n-1)}\]where we recall that $P_n(\mathbf{X}) := \frac{1}{n} \sum_{i = 1}^n X_i$, the sample average.
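Before the proof, here's the bound as code (a numpy sketch of my own; it uses `np.var(x, ddof=1)`, which equals the pairwise $V_n$, and compares against the Hoeffding width):

```python
import numpy as np

def empirical_bernstein_width(x, delta):
    """Theorem 11 width for independent X_i in [0, 1]:
    sqrt(2 V_n log(2/delta) / n) + 7 log(2/delta) / (3 (n - 1))."""
    n = x.size
    v = np.var(x, ddof=1)            # equals the pairwise V_n(x)
    L = np.log(2.0 / delta)
    return np.sqrt(2.0 * v * L / n) + 7.0 * L / (3.0 * (n - 1))

rng = np.random.default_rng(1)
x = rng.beta(2, 50, size=2000)       # low-variance sample in [0, 1]
print(empirical_bernstein_width(x, delta=0.05))      # ~0.006: adapts to the variance
print(np.sqrt(np.log(2.0 / 0.05) / (2 * x.size)))    # ~0.030: Hoeffding, variance-blind
```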
We’ll first write $W := \frac{1}{n} \sum_{i = 1}^n V(X_i)$, the average of the variances of the $X_i$ (which aren’t necessarily identically distributed). Note that $W \leq \mathbb{E}[V_n(\mathbf{X})]$:
\[\begin{aligned} \mathbb{E}[V_n(\mathbf{X})] &= \frac{1}{2n(n-1)} \sum_{i = 1}^n \sum_{j = 1}^n \mathbb{E}\left[ (X_i - X_j)^2 \right] \\ &= \frac{1}{2n(n-1)} \sum_{i = 1}^n \sum_{j \neq i} \mathbb{E}\left[ (X_i - X_j)^2 \right] \hspace{10mm} \left( X_i - X_j = 0 \text{ for } i = j \right) \\ &= \frac{1}{2n(n-1)} \sum_{i = 1}^n \sum_{j \neq i} \left( \mathbb{E}\left[ X_i^2 \right] + \mathbb{E}\left[ X_j^2 \right] - 2\mathbb{E}[X_i]\mathbb{E}[X_j] \right) \hspace{10mm} \text{(independence)} \\ &= \frac{1}{2n(n-1)} \sum_{i = 1}^n \sum_{j \neq i} \left( V(X_i) + (\mathbb{E}[X_i])^2 + V(X_j) + (\mathbb{E}[X_j])^2 - 2\mathbb{E}[X_i]\mathbb{E}[X_j] \right) \\ &= \frac{1}{2n(n-1)} \sum_{i = 1}^n \sum_{j \neq i} \left( V(X_i) + V(X_j) \right) + \frac{1}{2n(n-1)} \sum_{i = 1}^n \sum_{j \neq i} \left( \mathbb{E}[X_i] - \mathbb{E}[X_j] \right)^2 \\ &= \frac{1}{n} \sum_{i = 1}^n V(X_i) + \frac{1}{2n(n-1)} \sum_{i = 1}^n \sum_{j \neq i} \left( \mathbb{E}[X_i] - \mathbb{E}[X_j] \right)^2 \hspace{10mm} \left( \text{each } V(X_i) \text{ appears } 2(n-1) \text{ times} \right) \\ &\geq W \end{aligned}\]where the last inequality holds because the second term is nonnegative. Since $X_i \in [0, 1]$ for all $i$, we can use a corollary of Bennett’s inequality (see Corollary 2) to get:
\[\mathbb{E}[P_n(\mathbf{X})] \leq P_n(\mathbf{X}) + \sqrt{\frac{2 W \log \left(\frac{1}{\delta}\right)}{n}} + \frac{\log \left( \frac{1}{\delta} \right)}{3n} \leq P_n(\mathbf{X}) + \sqrt{\frac{2 \mathbb{E}[V_n(\mathbf{X})] \log \left(\frac{1}{\delta}\right)}{n}} + \frac{\log \left( \frac{1}{\delta} \right)}{3n}\]Looking at Theorem 10 (claim $(a)$), we also have that $\sqrt{\mathbb{E}[V_n(\mathbf{X})]} \leq \sqrt{V_n(\mathbf{X})} + \sqrt{\frac{2\log\left(\frac{1}{\delta}\right)}{n-1}}$ with probability at least $1 - \delta$ for any $\delta > 0$.
By the union bound, the probability that at least one of the two statements above fails is at most the sum of their individual failure probabilities. So for any $\epsilon > 0$ we can run both statements at level $\epsilon$ and set $\delta = 2\epsilon$ to ensure that, with probability at least $1 - 2\epsilon = 1 - \delta$, both hold simultaneously, and therefore:
\[\begin{aligned} \mathbb{E}[P_n(\mathbf{X})] &\leq P_n(\mathbf{X}) + 2\sqrt{\mathbb{E}[V_n(\mathbf{X})]} \sqrt{\frac{\log\left(\frac{1}{\epsilon}\right)}{2n}} + \frac{\log\left(\frac{1}{\epsilon}\right)}{3n} \\ &\leq P_n(\mathbf{X}) + 2\left(\sqrt{V_n(\mathbf{X})} + \sqrt{\frac{2 \log\left(\frac{1}{\epsilon}\right)}{n-1}} \right)\sqrt{\frac{\log\left(\frac{1}{\epsilon}\right)}{2n}} + \frac{\log\left(\frac{1}{\epsilon}\right)}{3n} \\ &= P_n(\mathbf{X}) + \sqrt{\frac{2 V_n(\mathbf{X}) \log\left(\frac{1}{\epsilon}\right)}{n}} + \sqrt{\frac{4\log^2\left(\frac{1}{\epsilon}\right)}{n(n-1)}} + \frac{\log\left(\frac{1}{\epsilon}\right)}{3n} \\ &= P_n(\mathbf{X}) + \sqrt{\frac{2 V_n(\mathbf{X}) \log\left(\frac{1}{\epsilon}\right)}{n}} + \frac{2\log\left(\frac{1}{\epsilon}\right)}{\sqrt{n(n-1)}} + \frac{\log\left(\frac{1}{\epsilon}\right)}{3n} \\ &\leq P_n(\mathbf{X}) + \sqrt{\frac{2 V_n(\mathbf{X}) \log\left(\frac{1}{\epsilon}\right)}{n}} + \frac{2\log\left(\frac{1}{\epsilon}\right)}{n-1} + \frac{\log\left(\frac{1}{\epsilon}\right)}{3(n-1)} & \left( \sqrt{n(n-1)} \geq n-1 \text{ and } n \geq n-1 \right)\\ &= P_n(\mathbf{X}) + \sqrt{\frac{2 V_n(\mathbf{X}) \log\left(\frac{1}{\epsilon}\right)}{n}} + \frac{7\log\left(\frac{1}{\epsilon}\right)}{3(n-1)} \\ &= P_n(\mathbf{X}) + \sqrt{\frac{2 V_n(\mathbf{X}) \log\left(\frac{2}{\delta}\right)}{n}} + \frac{7\log\left(\frac{2}{\delta}\right)}{3(n-1)} &\left(\text{set } \epsilon = \frac{\delta}{2} \right) \end{aligned}\]If we want a probability inequality that matches the form of Markov’s, Chebyshev’s, and Hoeffding’s, we need to do some rearranging.
Let $\mathbf{X} = (X_1, \dots, X_n)$ be a vector with independent random variables taking values in $[0, 1]$. Then for $t > 0$:
\[\mathbb{P}\left(\mathbb{E}[P_n(\mathbf{X})]- P_n(\mathbf{X}) \geq t \right) \leq 2 \exp\left(- \left[ \sqrt{\frac{3(n-1)t}{7} + \left(\frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2} - \frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}} \right]^2\right)\]We have that:
\[\begin{aligned} & &\mathbb{P}\left( \mathbb{E}\left[ \frac{1}{n}\sum_{i = 1}^n X_i \right] - \frac{1}{n} \sum_{i = 1}^n X_i \leq \sqrt{\frac{2 V_n(\mathbf{X})\log(2/\delta)}{n}} + \frac{7 \log(2/\delta)}{3(n-1)} \right) &\geq 1 - \delta \\ &\implies &1 - \mathbb{P}\left( \mathbb{E}\left[ \frac{1}{n}\sum_{i = 1}^n X_i \right] - \frac{1}{n} \sum_{i = 1}^n X_i \leq \sqrt{\frac{2 V_n(\mathbf{X})\log(2/\delta)}{n}} + \frac{7 \log(2/\delta)}{3(n-1)} \right) &\leq \delta \\ &\implies &\mathbb{P}\left( \mathbb{E}\left[ \frac{1}{n}\sum_{i = 1}^n X_i \right] - \frac{1}{n} \sum_{i = 1}^n X_i \geq \sqrt{\frac{2 V_n(\mathbf{X})\log(2/\delta)}{n}} + \frac{7 \log(2/\delta)}{3(n-1)} \right) &\leq \delta \end{aligned}\]We then set the righthand side of the first inequality equal to $t$ and solve for $\delta$:
\[\begin{aligned} & &t &= \sqrt{\frac{2 V_n(\mathbf{X})\log(2/\delta)}{n}} + \frac{7 \log(2/\delta)}{3(n-1)} \\ &\implies &t &= 2 \sqrt{\log(2/\delta)} \sqrt{\frac{V_n(\mathbf{X})}{2n}} + \frac{7}{3(n-1)}\log(2/\delta) \\ &\implies &\frac{3(n-1)}{7} t &= 2 \frac{3(n-1)}{7} \sqrt{\log(2/\delta)} \sqrt{\frac{V_n(\mathbf{X})}{2n}} + \log(2/\delta) \\ &\implies &\frac{3(n-1)}{7} t + \left(\frac{3(n - 1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2 &= \log(2/\delta) + 2\frac{3(n-1)}{7} \sqrt{\log(2/\delta)} \sqrt{\frac{V_n(\mathbf{X})}{2n}} + \left(\frac{3(n - 1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2 \\ &\implies &\frac{3(n-1)}{7} t + \left(\frac{3(n - 1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2 &= \left(\sqrt{\log(2/\delta)} + \frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2 \\ &\implies &\sqrt{\log(2/\delta)} + \frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}} &= \sqrt{\frac{3(n-1)t}{7} + \left(\frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2} \\ &\implies &\sqrt{\log(2/\delta)} &= \sqrt{\frac{3(n-1)t}{7} + \left(\frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2} - \frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}} \\ &\implies &\log(2/\delta) &= \left(\sqrt{\frac{3(n-1)t}{7} + \left(\frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2} - \frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2 \\ &\implies &\frac{2}{\delta} &= \exp\left(\left(\sqrt{\frac{3(n-1)t}{7} + \left(\frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2} - \frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2\right) \\ &\implies &\delta &= 2\exp\left(- \left[ \sqrt{\frac{3(n-1)t}{7} + \left(\frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}}\right)^2} - \frac{3(n-1)}{7}\sqrt{\frac{V_n(\mathbf{X})}{2n}} \right]^2\right) \end{aligned}\]With a little more algebra, we can rewrite the result above for a random vector with coordinates in arbitrary closed intervals.
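Before generalizing, here is a quick numerical round-trip check of the inversion above (a numpy sketch of my own; the test values are arbitrary): compute $t$ from Theorem 11's width at some $\delta$, push it through the inverted expression, and confirm $\delta$ comes back.

```python
import numpy as np

def t_of_delta(delta, v, n):
    """Theorem 11 width for [0,1] variables: sqrt(2 v log(2/d)/n) + 7 log(2/d)/(3(n-1))."""
    L = np.log(2.0 / delta)
    return np.sqrt(2.0 * v * L / n) + 7.0 * L / (3.0 * (n - 1))

def delta_of_t(t, v, n):
    """Inverted form: delta = 2 exp(-[sqrt(3(n-1)t/7 + c^2) - c]^2), c = (3(n-1)/7) sqrt(v/(2n))."""
    c = (3.0 * (n - 1) / 7.0) * np.sqrt(v / (2.0 * n))
    u = np.sqrt(3.0 * (n - 1) * t / 7.0 + c ** 2) - c
    return 2.0 * np.exp(-u ** 2)

n, v, delta = 500, 0.03, 0.05
assert np.isclose(delta_of_t(t_of_delta(delta, v, n), v, n), delta)  # round trip recovers delta
```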
Let $\mathbf{X} = (X_1, \dots, X_n)$ be a vector with independent random variables taking values in $[a, b]$. Then with probability at least $1 - \delta$:
\[\mathbb{E}[P_n(\mathbf{X})]- P_n(\mathbf{X}) \leq \sqrt{\frac{2V_n(\mathbf{X}) \log(2 / \delta)}{n}} + \frac{7(b-a)\log(2/\delta)}{3(n-1)}\]Suppose that the $X_i \in [a, b]$. We can repeat the argument for Theorem 11 with \(Z_i := \frac{X_i - a}{b - a}\) so that $Z_i \in [0,1]$ and then rearrange the result to get a bound for $X_i$.
\[\begin{aligned} V_n(\mathbf{Z}) &= \frac{1}{2n(n-1)}\sum_{i = 1}^n \sum_{j = 1}^n (Z_i - Z_j)^2 \\ &= \frac{1}{2n(n-1)}\sum_{i = 1}^n \sum_{j = 1}^n \left( \frac{X_i - a}{b-a} - \frac{X_j - a}{b-a}\right)^2 \\ &= \frac{1}{2n(n-1)(b-a)^2}\sum_{i = 1}^n \sum_{j = 1}^n \left( X_i - X_j \right)^2 \\ &= \frac{1}{(b-a)^2} V_n(\mathbf{X}) \end{aligned}\]Applying Theorem 11 to the $Z_i$’s, we get that, with probability at least $1 - \delta$:
\[\begin{aligned} & &\mathbb{E}\left[ \frac{1}{n}\sum_{i = 1}^n \frac{X_i - a}{b-a} \right] &\leq \frac{1}{n} \sum_{i = 1}^n \frac{X_i - a}{b-a} + \sqrt{\frac{2V_n(\mathbf{Z}) \log(2 / \delta)}{n}} + \frac{7 \log(2/\delta)}{3(n-1)} \\ &\implies &\frac{1}{n(b-a)}\sum_{i = 1}^n \left[ \mathbb{E}\left[ X_i \right] - a \right] &\leq \frac{1}{n(b-a)}\sum_{i = 1}^n (X_i - a) + \sqrt{\frac{2V_n(\mathbf{Z}) \log(2 / \delta)}{n}} + \frac{7 \log(2/\delta)}{3(n-1)} \\ &\implies &\frac{1}{n(b-a)}\sum_{i = 1}^n \left[ \mathbb{E}\left[ X_i \right] - a - X_i + a\right] &\leq \sqrt{\frac{2V_n(\mathbf{Z}) \log(2 / \delta)}{n}} + \frac{7\log(2/\delta)}{3(n-1)} \\ &\implies &\frac{1}{n}\sum_{i = 1}^n \left[ \mathbb{E}\left[ X_i \right] - X_i \right] &\leq (b-a)\sqrt{\frac{2V_n(\mathbf{Z}) \log(2 / \delta)}{n}} + \frac{7 (b-a)\log(2/\delta)}{3(n-1)} \\ &\implies & \frac{1}{n}\sum_{i = 1}^n \left[ \mathbb{E}\left[ X_i \right] - X_i \right] &\leq (b-a)\sqrt{\frac{2V_n(\mathbf{X}) \log(2 / \delta)}{(b-a)^2n}} + \frac{7 (b-a)\log(2/\delta)}{3(n-1)} \\ &\implies &\frac{1}{n}\sum_{i = 1}^n \left[ \mathbb{E}\left[ X_i \right] - X_i \right] &\leq \sqrt{\frac{2V_n(\mathbf{X}) \log(2 / \delta)}{n}} + \frac{7 (b-a)\log(2/\delta)}{3(n-1)} \\ \end{aligned}\]We can also write Theorem 11 in a slightly different way.
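Before that, here's the rescaled bound as code (a sketch of my own; `a` and `b` are the known endpoints of the range):

```python
import numpy as np

def empirical_bernstein_width_ab(x, delta, a, b):
    """Corollary width for independent X_i in [a, b]:
    sqrt(2 V_n log(2/delta) / n) + 7 (b - a) log(2/delta) / (3 (n - 1))."""
    n = x.size
    v = np.var(x, ddof=1)   # V_n already carries the (b - a)^2 scale
    L = np.log(2.0 / delta)
    return np.sqrt(2.0 * v * L / n) + 7.0 * (b - a) * L / (3.0 * (n - 1))

rng = np.random.default_rng(2)
x = rng.uniform(-3.0, 5.0, size=1000)
print(empirical_bernstein_width_ab(x, delta=0.05, a=-3.0, b=5.0))
```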
Again assume $X_i \in [a, b]$. Then, for any $t > 0$:
\[\mathbb{P}\left( \bigg\lvert \sum_{i = 1}^n \left[X_i - \mathbb{E}[X_i] \right]\bigg\rvert \geq t\right) \leq 4\exp\left(- \left(\sqrt{\frac{3(n-1)t}{7 (b-a) n} + \left(\frac{3(n-1)\sqrt{\frac{n V_n(\mathbf{X})}{2}}}{7 (b-a) n}\right)^2} - \frac{3(n-1)\sqrt{\frac{nV_n(\mathbf{X})}{2}}}{7(b-a)n} \right)^2\right)\]Theorem 11 (extended to $[a, b]$ by the corollary above) states that, with probability at least $1 - \delta$:
\[\mathbb{E}[P_n(\mathbf{X})]- P_n(\mathbf{X}) \leq \sqrt{\frac{2V_n(\mathbf{X}) \log(2 / \delta)}{n}} + \frac{7 (b-a)\log(2/\delta)}{3(n-1)}\]We set $t$ equal to the righthand side of the above and solve for $\delta$:
\[\begin{aligned} & &t &= \sqrt{\frac{2 V_n(\mathbf{X}) \log(2/\delta)}{n}} + \frac{7 (b-a) \log(2/\delta)}{3(n-1)} \\ &\implies &t &= 2\sqrt{\log(2/\delta)}\sqrt{\frac{V_n(\mathbf{X})}{2n}} + \frac{7 (b-a) \log(2/\delta)}{3(n-1)} \\ &\implies &\frac{3(n-1)t}{7 (b-a)} &= 2\sqrt{\log(2/\delta)}\left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right) + \log(2/\delta) \\ &\implies &\frac{3(n-1)t}{7 (b-a)} + \left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right)^2 &= 2\sqrt{\log(2/\delta)}\left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right) + \log(2/\delta) + \left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right)^2 \\ &\implies &\frac{3(n-1)t}{7 (b-a)} + \left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right)^2 &= \left(\sqrt{\log(2/\delta)} + \frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7(b-a)}\right)^2 \\ &\implies &\sqrt{\frac{3(n-1)t}{7 (b-a)} + \left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right)^2} &= \sqrt{\log(2/\delta)} + \frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7(b-a)} \\ &\implies &\sqrt{\log(2/\delta)} &= \sqrt{\frac{3(n-1)t}{7 (b-a)} + \left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right)^2} - \frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7(b-a)}\\ &\implies &\log(2/\delta) &= \left(\sqrt{\frac{3(n-1)t}{7(b-a)} + \left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right)^2} - \frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7(b-a)} \right)^2 \\ &\implies &2 / \delta &= \exp \left( \left(\sqrt{\frac{3(n-1)t}{7 (b-a)} + \left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right)^2} - \frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7(b-a)} \right)^2 \right) \\ &\implies &\delta &= 2\exp\left(- \left(\sqrt{\frac{3(n-1)t}{7 (b-a)} + \left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right)^2} - \frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7(b-a)} \right)^2\right) \end{aligned}\]Thus, for any $t > 0$:
\[\mathbb{P}\left( \mathbb{E}[P_n(\mathbf{X})] - P_n(\mathbf{X}) \geq t\right) \leq 2\exp\left(- \left(\sqrt{\frac{3(n-1)t}{7 (b-a)} + \left(\frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7 (b-a)}\right)^2} - \frac{3(n-1)\sqrt{\frac{V_n(\mathbf{X})}{2n}}}{7(b-a)} \right)^2\right)\]To make the above statement two-sided, we just follow similar steps to bound:
\[\mathbb{P}\left(\mathbb{E}[P_n(\mathbf{X})] - P_n(\mathbf{X}) \leq -t \right)\]The authors then extend this result to finite function classes and present a uniform bound for particular function classes. See the original paper for proofs.
The previous theorem essentially bounds (with high probability) the difference between the empirical and true risk of all hypotheses in the defined function class as a function of the sample variance and the function class’s complexity (as measured by the growth function).
Some additional theorems/lemmata are necessary for the proofs of the main results.
Suppose $X$ and $Y$ are i.i.d. random variables that are bounded in $[a, a + 1]$. We have that:
\[\mathbb{E}_X\left[ \left( \mathbb{E}_Y \left[ (X - Y)^2 \right] \right)^2 \right] \leq \frac{1}{2}\mathbb{E}\left[ (X - Y)^2 \right]\]First, note that the righthand side is equal to the variance of $X$:
\[\begin{aligned} \frac{1}{2}\mathbb{E}\left[ (X - Y)^2 \right] &= \frac{1}{2} \mathbb{E}\left[ X^2 + Y^2 - 2XY \right] \\ &= \frac{1}{2}\mathbb{E}\left[ X^2 \right] + \frac{1}{2}\mathbb{E}\left[ Y^2 \right] - \mathbb{E}\left[ X\right] \mathbb{E}\left[ Y \right] \hspace{10mm} (X, Y \text{ are independent})\\ &= \mathbb{E}\left[ X^2 \right] - \left(\mathbb{E}\left[ X \right]\right)^2 \hspace{10mm} (X, Y \text{ are identically distributed}) \\ &= \mathbb{E}\left[ X^2 - XY \right] \hspace{10mm} (X, Y \text{ are identically distributed}) \end{aligned}\]Expanding out the lefthand side yields:
\[\begin{aligned} \mathbb{E}_X\left[ \left( \mathbb{E}_Y \left[ (X - Y)^2 \right] \right)^2 \right] &= \mathbb{E}_X\left[ \left( X^2 + \mathbb{E}[Y^2] - 2X\mathbb{E}[Y] \right)^2 \right] \\ &= \mathbb{E}_X \left[ X^4 + X^2 \mathbb{E}[Y^2] - 2X^3\mathbb{E}[Y] + X^2 \mathbb{E}[Y^2] + \left( \mathbb{E}[Y^2] \right)^2 - 2X \mathbb{E}[Y] \mathbb{E}[Y^2] - 2X^3\mathbb{E}[Y] - 2X \mathbb{E}[Y] \mathbb{E}[Y^2] + 4 X^2\left(\mathbb{E}[Y]\right)^2\right] \\ &= \mathbb{E}_X\left[ X^4 + 2X^2\mathbb{E}[Y^2] - 4X^3 \mathbb{E}[Y] + (\mathbb{E}[Y^2])^2 -4X\mathbb{E}[Y]\mathbb{E}[Y^2] + 4X^2 (\mathbb{E}[Y])^2\right] \\ &= \mathbb{E}[X^4] + 2 \mathbb{E}[X^2]\mathbb{E}[Y^2] - 4 \mathbb{E}[X^3]\mathbb{E}[Y] + (\mathbb{E}[Y^2])^2 - 4 \mathbb{E}[X] \mathbb{E}[Y] \mathbb{E}[Y^2] + 4 \mathbb{E}[X^2] (\mathbb{E}[Y])^2 \\ &= \mathbb{E}[X^4] + 3\mathbb{E}[X^2]\mathbb{E}[Y^2] - 4\mathbb{E}[X^3]\mathbb{E}[Y] \hspace{10mm} (X,Y \text{ are identically distributed}) \end{aligned}\]Using the above facts, it suffices to show that the difference between the righthand side and the lefthand side is greater than or equal to zero. That is, to show:
\[\mathbb{E}[g(X,Y)] \geq 0 \hspace{5mm} \text{ where } \hspace{5mm} g(X,Y) := X^2 - XY - X^4 - 3X^2Y^2 + 4X^3Y\]Consider the following:
\[\begin{aligned} (X - Y + 1)(Y - X + 1)(Y-X)^2 &= (Y-X)^2(XY - X^2 + X - Y^2 + XY - Y + Y - X + 1) \\ &= (Y - X)^2(2XY - X^2 - Y^2 + 1) \\ &= (Y^2 + X^2 - 2XY)(2XY - X^2 - Y^2 + 1) \\ &= 2Y^3X - X^2 Y^2 - Y^4 + Y^2 + 2X^3Y - X^4 - X^2 Y^2 + X^2 - 4X^2Y^2 + 2X^3Y + 2Y^3X - 2XY \\ &= 4Y^3X + 4X^3Y -3X^2Y^2 - 3X^2Y^2 - XY - XY + Y^2 + X^2 - Y^4 - X^4 \\ &= (4X^3Y - X^4 - 3X^2Y^2 - XY + X^2) + (4Y^3X -3X^2Y^2 - XY + Y^2 - Y^4) \\ &= g(X,Y) + g(Y,X) \end{aligned}\]Since $\lvert X - Y \rvert \leq 1$ due to their boundedness, $(X - Y + 1) \geq 0$ and $(Y - X + 1) \geq 0$ as well. Thus, $g(X,Y) + g(Y,X) \geq 0$.
\[2\mathbb{E}[g(X,Y)] = \mathbb{E}[g(X,Y) + g(Y,X)] \geq 0 \implies \mathbb{E}[g(X,Y)] \geq 0\]where the first equality holds because $(X, Y)$ and $(Y, X)$ have the same joint distribution. Lemma 8 yields the following corollary if $X$ and $Y$ are uniformly distributed over some set of values $\{ x_1, \dots, x_n \}$ with finite cardinality. If we assume $\{ x_1, \dots, x_n \} \subset [0, 1]$, then:
\[\frac{1}{n} \sum_{k = 1}^n \left( \frac{1}{n}\sum_{j = 1}^n (x_k - x_j)^2 \right)^2 \leq \frac{1}{2n^2} \sum_{k = 1}^n \sum_{j = 1}^n (x_k - x_j)^2\]We use the above corollary in the proof of the following theorem.
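Before that, a quick Monte Carlo sanity check of the corollary on random points (my own sketch, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(5):
    x = rng.uniform(size=20)                 # points in [0, 1]
    d2 = (x[:, None] - x[None, :]) ** 2      # all pairwise squared differences
    lhs = np.mean(np.mean(d2, axis=1) ** 2)  # (1/n) sum_k ((1/n) sum_j d2_kj)^2
    rhs = np.sum(d2) / (2.0 * x.size ** 2)   # (1/(2 n^2)) sum_k sum_j d2_kj
    assert lhs <= rhs                        # the corollary's inequality
```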
Let $n \geq 2$, and let $\mathbf{X} = (X_1, \dots, X_n)$ be a vector of independent random variables such that $X_i \in [0, 1]$ for all $i$. Let $\mathbb{E}[V_n] := \mathbb{E}_\mathbf{X}[V_n(\mathbf{X})]$ where \(V_n := V_n(\mathbf{x}) = \frac{1}{n(n-1)} \sum_{i = 1}^n \sum_{j = 1}^n \frac{(x_i - x_j)^2}{2}\) is the sample variance. For any $\delta > 0$, we have:
\[\begin{aligned} (a) &\hspace{10mm} \mathbb{P}\left( \sqrt{\mathbb{E}[V_n]} > \sqrt{V_n(\mathbf{X})} + \sqrt{\frac{2 \log(1/\delta)}{n - 1}}\right) \leq \delta \\ (b) &\hspace{10mm} \mathbb{P} \left( \sqrt{V_n(\mathbf{X})} > \sqrt{\mathbb{E}[V_n]} + \sqrt{\frac{2\log(1/\delta)}{n - 1}} \right) \leq \delta \end{aligned}\]Let \(Z(\mathbf{X}) := n V_n(\mathbf{X})\). Choose some $k$ and any $y \in [0, 1]$, and let $\mathbf{X}_{y,k}$ denote $\mathbf{X}$ with its $k$-th coordinate replaced by $y$. It follows that:
\[\begin{aligned} Z(\mathbf{X}) - Z(\mathbf{X}_{y,k}) &= \frac{1}{n-1}\sum_{i = 1}^n \sum_{j = 1}^n \frac{(X_i - X_j)^2}{2} - \frac{1}{n-1} \left(\frac{(y - y)^2}{2} + \sum_{j \neq k} \frac{(y - X_j)^2}{2} + \sum_{i \neq k} \left( \frac{(X_i - y)^2}{2} + \sum_{j \neq k} \frac{(X_i - X_j)^2}{2}\right) \right) \\ &= \frac{2}{n-1}\sum_{j \neq k} \left[ \frac{(X_k - X_j)^2}{2} - \frac{(y - X_j)^2}{2} \right] \\ &= \frac{1}{n-1} \sum_{j \neq k} \left[ (X_k - X_j)^2 - \underbrace{(y - X_j)^2}_{\geq 0} \right] \\ &\leq \frac{1}{n-1} \sum_{j = 1}^n (X_k - X_j)^2 \hspace{10mm} \left( (X_k - X_k)^2 = 0 \right) \\ \end{aligned}\]Note that the $j = k$ term vanishes, so the last line is the average of the $n - 1$ terms $(X_k - X_j)^2$ with $j \neq k$, each of which is at most $1$. Since each $X_i \in [0, 1]$ and $y \in [0, 1]$, it is clear that $Z(\mathbf{X}) - \underset{y \in [0, 1]}{\inf} Z(\mathbf{X}_{y,k}) \leq 1$. Then:
\[\begin{aligned} \sum_{k = 1}^n \left( Z(\mathbf{X}) - \underset{y \in [0, 1]}{\inf} Z(\mathbf{X}_{y,k}) \right)^2 &\leq \sum_{k = 1}^n \left( \frac{1}{n-1} \sum_{j = 1}^n (X_k - X_j)^2 \right)^2 \hspace{10mm} \text{(above result)} \\ &\leq \frac{n}{2(n-1)^2} \sum_{k = 1}^n \sum_{j = 1}^n (X_k - X_j)^2 \hspace{10mm} \text{(corollary of Lemma 8)}\\ &= \frac{n}{n-1}\left[ \frac{1}{n-1} \sum_{k = 1}^n \sum_{j = 1}^n \frac{(X_k - X_j)^2}{2} \right] \\ &= \frac{n}{n-1}Z(\mathbf{X}) \end{aligned}\]The above steps show that $Z(\mathbf{X})$ satisfies conditions $(a)$ and $(b)$ in Theorem 7 with $a = \frac{n}{n-1}$. Now, notice that:
\[\begin{aligned} \mathbb{P}\left(\mathbb{E}\left[ V_n(\mathbf{X}) \right] - V_n(\mathbf{X}) > t\right) &= \mathbb{P}\left(\mathbb{E}\left[ \frac{1}{n}Z(\mathbf{X})\right] - \frac{1}{n}Z(\mathbf{X})> t \right) = \mathbb{P}\left( \mathbb{E}\left[ Z(\mathbf{X}) \right] - Z(\mathbf{X}) > nt \right) \\ \mathbb{P}\left(-\mathbb{E}\left[ V_n(\mathbf{X}) \right] + V_n(\mathbf{X}) > t\right) &= \mathbb{P}\left(-\mathbb{E}\left[ \frac{1}{n}Z(\mathbf{X})\right] + \frac{1}{n}Z(\mathbf{X})> t \right) = \mathbb{P}\left( -\mathbb{E}\left[ Z(\mathbf{X}) \right] + Z(\mathbf{X}) > nt \right) \\ \end{aligned}\]Applying Theorem 7, we have, for any $t > 0$:
\[\begin{aligned} \mathbb{P}\left(\mathbb{E}[Z(\mathbf{X})] - Z(\mathbf{X}) > nt \right) &\leq \exp\left( -\frac{t^2n^2(n-1)}{2n\mathbb{E}[Z(\mathbf{X})]}\right) = \exp \left(- \frac{t^2n(n-1)}{2\mathbb{E}\left[n V_n(\mathbf{X}) \right]} \right) = \exp \left(-\frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X}) \right]} \right)\\ \mathbb{P}\left(Z(\mathbf{X}) - \mathbb{E}[Z(\mathbf{X})] > nt\right) &\leq \exp \left( - \frac{t^2n^2(n-1)}{2n\mathbb{E}[Z(\mathbf{X})] + n^2t} \right) = \exp \left(-\frac{t^2n(n-1)}{2\mathbb{E}\left[nV_n(\mathbf{X})\right] + nt} \right) = \exp \left( - \frac{t^2(n-1)}{2\mathbb{E}[V_n(\mathbf{X})] + t} \right) \end{aligned}\]This implies that:
\[\begin{equation} \label{eq:intermed-probs} \begin{aligned} &\mathbb{P}\left(\mathbb{E}\left[ V_n(\mathbf{X}) \right] - V_n(\mathbf{X}) > t\right) \leq \exp \left(-\frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X}) \right]} \right) \\ &\mathbb{P}\left(-\mathbb{E}\left[ V_n(\mathbf{X}) \right] + V_n(\mathbf{X}) > t\right) \leq \exp \left( - \frac{t^2(n-1)}{2\mathbb{E}[V_n(\mathbf{X})] + t} \right) \end{aligned} \end{equation}\]Solving:
\[\mathbb{P}\left(\mathbb{E}\left[ V_n(\mathbf{X}) \right] - V_n(\mathbf{X}) > t\right) \leq \delta = \exp \left(-\frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X}) \right]} \right)\]for $t$ yields:
\[\begin{aligned} & &\delta &= \exp \left(-\frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X}) \right]} \right) \\ &\implies &\log (\delta) &= -\frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X}) \right]} \\ &\implies &\log \left( \frac{1}{\delta} \right) &= \frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X}) \right]} \\ &\implies &t^2(n-1) &= 2 \mathbb{E}\left[V_n(\mathbf{X}) \right] \log \left( \frac{1}{\delta} \right) \\ &\implies &t &= \sqrt{\frac{2 \mathbb{E}\left[V_n(\mathbf{X}) \right] \log \left( \frac{1}{\delta} \right)}{n-1}} \end{aligned}\]which means that the first line in Eq. \eqref{eq:intermed-probs} implies that with probability at least $1 - \delta$:
\[\begin{aligned} & &\mathbb{E}\left[ V_n(\mathbf{X}) \right] - V_n(\mathbf{X}) &\leq \sqrt{\frac{2 \mathbb{E}\left[V_n(\mathbf{X}) \right] \log \left( \frac{1}{\delta} \right)}{n-1}} \\ &\implies &V_n(\mathbf{X}) &\geq \mathbb{E}\left[ V_n(\mathbf{X}) \right] - \sqrt{\frac{4 \mathbb{E}\left[V_n(\mathbf{X}) \right] \log \left( \frac{1}{\delta} \right)}{2(n-1)}} \\ &\implies &V_n(\mathbf{X}) &\geq \mathbb{E}\left[ V_n(\mathbf{X}) \right] - 2\sqrt{\mathbb{E}\left[V_n(\mathbf{X})\right]}\sqrt{ \frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)}} \end{aligned}\]Notice that:
\[\begin{aligned} & &V_n(\mathbf{X}) &\geq \mathbb{E}\left[ V_n(\mathbf{X}) \right] - 2\sqrt{\mathbb{E}\left[V_n(\mathbf{X})\right]}\sqrt{ \frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)}} \\ &\implies &V_n(\mathbf{X}) + \frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)} &\geq \mathbb{E}\left[ V_n(\mathbf{X}) \right] - 2\sqrt{\mathbb{E}\left[V_n(\mathbf{X})\right]}\sqrt{ \frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)}} + \frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)} \\ &\implies &V_n(\mathbf{X}) + \frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)} &\geq \left(\sqrt{\mathbb{E}\left[ V_n(\mathbf{X}) \right]} - \sqrt{\frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)}}\right)^2 \\ &\implies &\sqrt{\mathbb{E}\left[ V_n(\mathbf{X}) \right]} - \sqrt{\frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)}} &\leq \sqrt{V_n(\mathbf{X}) + \frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)}} \\ & & &\overset{(i)}{\leq} \sqrt{V_n(\mathbf{X})} + \sqrt{\frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)}} \\ &\implies &\sqrt{\mathbb{E}\left[ V_n(\mathbf{X}) \right]} &\leq \sqrt{V_n(\mathbf{X})} + 2\sqrt{\frac{\log \left( \frac{1}{\delta} \right)}{2(n-1)}} \\ & & &= \sqrt{V_n(\mathbf{X})} + \sqrt{\frac{4\log \left( \frac{1}{\delta} \right)}{2(n-1)}} \\ & & &= \sqrt{V_n(\mathbf{X})} + \sqrt{\frac{2\log \left( \frac{1}{\delta} \right)}{n-1}} \end{aligned}\]where we use the fact that $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$ in $(i)$. Thus, with probability at least $1 - \delta$:
\[\sqrt{\mathbb{E}\left[ V_n(\mathbf{X}) \right]} \leq \sqrt{V_n(\mathbf{X})} + \sqrt{\frac{2\log \left( \frac{1}{\delta} \right)}{n-1}}\]This proves Claim $(a)$ in the Theorem statement. To show Claim $(b)$, we do something similar. Solving:
\[\mathbb{P}\left(V_n(\mathbf{X}) - \mathbb{E}\left[ V_n(\mathbf{X}) \right] > t\right) \leq \delta = \exp \left(-\frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X})\right] + t} \right)\]for $t$, we see:
\[\begin{aligned} & &\delta &= \exp \left(-\frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X})\right] + t} \right) \\ &\implies &\log (\delta) &= -\frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X}) \right] + t} \\ &\implies &\log \left( \frac{1}{\delta} \right) &= \frac{t^2(n-1)}{2\mathbb{E}\left[V_n(\mathbf{X}) \right] + t} \\ &\implies &t^2 (n-1) &= \left(2 \mathbb{E}\left[V_n(\mathbf{X}) \right] + t \right) \log \left( \frac{1}{\delta} \right) \\ &\implies &t^2 &= \frac{\left(2 \mathbb{E}\left[V_n(\mathbf{X}) \right] + t \right) \log \left( \frac{1}{\delta} \right)}{n-1}\\ &\implies &\frac{2 \mathbb{E}\left[ V_n(\mathbf{X}) \right]\log\left(\frac{1}{\delta} \right)}{n-1} &= t^2 - \frac{2\log\left(\frac{1}{\delta}\right)t}{2(n-1)} \\ &\implies &t^2 - \frac{2\log\left(\frac{1}{\delta}\right)t}{2(n-1)} + \left(\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}\right)^2 &= \frac{2 \mathbb{E}\left[ V_n(\mathbf{X}) \right]\log\left(\frac{1}{\delta} \right)}{n-1} + \left(\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}\right)^2 \\ &\implies &\left(t - \frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}\right)^2 &= \frac{2 \mathbb{E}\left[ V_n(\mathbf{X}) \right]\log\left(\frac{1}{\delta} \right)}{n-1} + \left(\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}\right)^2 \\ &\implies &t - \frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)} &= \sqrt{\frac{2 \mathbb{E}\left[ V_n(\mathbf{X}) \right]\log\left(\frac{1}{\delta} \right)}{n-1} + \left(\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}\right)^2} \\ &\implies &t &= \sqrt{\frac{2 \mathbb{E}\left[ V_n(\mathbf{X}) \right]\log\left(\frac{1}{\delta} \right)}{n-1} + \left(\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}\right)^2} + \frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)} \end{aligned}\]Thus, with probability at least $1 - \delta$:
\[\begin{aligned} & & V_n(\mathbf{X}) - \mathbb{E}\left[ V_n(\mathbf{X}) \right] &\leq \sqrt{\frac{2 \mathbb{E}\left[ V_n(\mathbf{X}) \right]\log\left(\frac{1}{\delta} \right)}{n-1} + \left(\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}\right)^2} + \frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)} \\ & & &\overset{(i)}{\leq} \sqrt{\frac{2 \mathbb{E}\left[ V_n(\mathbf{X}) \right]\log\left(\frac{1}{\delta}\right)}{n-1}} + \frac{\log\left(\frac{1}{\delta} \right)}{2(n - 1)} + \frac{\log\left(\frac{1}{\delta} \right)}{2(n - 1)} \\ &\implies &V_n(\mathbf{X}) &\leq \mathbb{E}\left[ V_n(\mathbf{X}) \right] + 2 \sqrt{\mathbb{E}\left[ V_n(\mathbf{X}) \right]}\sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}} + \frac{2\log\left(\frac{1}{\delta} \right)}{2(n - 1)} \\ & \implies &V_n(\mathbf{X}) + \frac{\log\left(\frac{1}{\delta} \right)}{2(n-1)} &\leq \mathbb{E}\left[ V_n(\mathbf{X}) \right] + 2 \sqrt{\mathbb{E}\left[ V_n(\mathbf{X}) \right]}\sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}} +\frac{\log\left(\frac{1}{\delta} \right)}{2(n-1)} + \frac{\log\left(\frac{1}{\delta} \right)}{n - 1} \\ & & &= \left(\sqrt{\mathbb{E}[V_n(\mathbf{X})]} + \sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}}\right)^2 + \frac{\log\left(\frac{1}{\delta}\right)}{n-1} \\ & \implies &V_n(\mathbf{X}) &\leq \left(\sqrt{\mathbb{E}[V_n(\mathbf{X})]} + \sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}}\right)^2 + \frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)} \\ & \implies &\sqrt{V_n(\mathbf{X})} &\leq \sqrt{\left(\sqrt{\mathbb{E}[V_n(\mathbf{X})]} + \sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}}\right)^2 + \frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}} \\ & & &\overset{(ii)}{\leq} \sqrt{\mathbb{E}[V_n(\mathbf{X})]} + \sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}} + \sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}} \\ & & &= \sqrt{\mathbb{E}[V_n(\mathbf{X})]} + 2\sqrt{\frac{\log\left(\frac{1}{\delta}\right)}{2(n-1)}} \\ & & &= \sqrt{\mathbb{E}[V_n(\mathbf{X})]} + \sqrt{\frac{2\log\left(\frac{1}{\delta}\right)}{n-1}} \end{aligned}\]where, in $(i)$ and $(ii)$, we use the fact that $\sqrt{a + b} \leq \sqrt{a} + \sqrt{b}$. Thus, the second line in Eq. \eqref{eq:intermed-probs} implies that, with probability $1 - \delta$:
\[\begin{aligned} \sqrt{V_n(\mathbf{X})} &\leq \sqrt{\mathbb{E}[V_n(\mathbf{X})]} + \sqrt{\frac{2\log\left(\frac{1}{\delta}\right)}{n-1}} \end{aligned}\]Finally, we just need an extension of Bennett’s inequality that is integral to the proof of Theorem 11.