What Does It Mean To Be Stable?
My last post related to clustering discussed how to describe a “good” clustering algorithm. One way to measure this is by stability, which I’ll define more rigorously later.
The main paper I’ll be summarizing is A Sober Look at Clustering Stability by Shai Ben-David, Ulrike von Luxburg, and Dávid Pál.
Similar to the general idea of a robust estimation procedure, a stable clustering algorithm should produce similar results for small perturbations of the data. As Ben-David, von Luxburg, and Pál point out, stability is often used for tuning $k$, the number of clusters the algorithm should yield. Intuitively, choosing $k$ too large will result in arbitrary splitting of true clusters. On the other hand, a $k$ that is too small will do the opposite and arbitrarily merge true clusters.
For optimization-based clustering algorithms, there are two sources of instability:
Sampling: Variability in our observations will result in different values of our objective function. This, in turn, will influence the clustering (see the sketch after this list).
Data Distribution: If the data distribution is such that there are multiple optimizers of the objective function, then there is ambiguity in which clustering is the best.
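To make the sampling source concrete, here is a minimal sketch (my own illustration, not from the paper) using k-means via scikit-learn: two i.i.d. samples from the same three-component Gaussian mixture give different empirical objectives and therefore slightly different fitted clusterings. The mixture, the sample sizes, and the choice of k-means as the algorithm are all assumptions made for the sake of the example.

```python
# Illustration (not from the paper): sampling variability alone changes
# the empirical k-means objective and the fitted clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
true_means = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])

def sample(n):
    """Draw n points from an equal-weight mixture of three unit-variance Gaussians."""
    components = rng.integers(0, 3, size=n)
    return true_means[components] + rng.normal(size=(n, 2))

S1, S2 = sample(200), sample(200)
km1 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(S1)
km2 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(S2)

# The optimizers of the two empirical objectives (and their achieved risks) differ.
print("centers from S1:\n", np.round(km1.cluster_centers_, 2))
print("centers from S2:\n", np.round(km2.cluster_centers_, 2))
print("empirical risks:", round(km1.inertia_ / len(S1), 3), round(km2.inertia_ / len(S2), 3))
```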
I’ll be following the notation in Ben-David, von Luxburg, and Pál fairly closely throughout this post with a few slight tweaks to make it align better with my habits. Let’s do a quick run-down of their main definitions.
Our observations $S = \{x_1, \dots, x_n\}$ will be assumed to be i.i.d. from some sample space, $X$, with probability measure $P$. If $X$ is a metric space, its metric will be denoted by $\ell$. $P_S$ denotes the uniform distribution over the finite sample $S$.
We’ll denote the cluster of point $x$ with $C(x)$. If $C(x) = C(y)$, then we write $x \underset{\mathcal{C}}{\sim} y$.
Note that the above definition differs from the traditional definition of a distance metric in that it is not required that if $d_P(\mathcal{C}_1, \mathcal{C}_2) = 0$ then $\mathcal{C}_1 = \mathcal{C}_2$.
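The clustering distance $d_P$ is treated abstractly here, but one natural instance (my choice for illustration, not necessarily the one the paper emphasizes) is a pair-disagreement, Hamming-style distance: the probability that a random pair of points is grouped together by one clustering but not by the other. The sketch below estimates it on a finite set of labeled points and shows the behaviour noted above: two label vectors that induce the same partition are at distance zero even though the labelings themselves differ.

```python
# Illustration: an empirical pair-disagreement distance between two clusterings,
# one concrete instance of a clustering distance (values in [0, 1]).
import numpy as np

def pair_disagreement(labels_a, labels_b):
    """Fraction of point pairs grouped together by exactly one of the two clusterings."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    iu = np.triu_indices(len(labels_a), k=1)       # each unordered pair exactly once
    return float(np.mean(same_a[iu] != same_b[iu]))

c1 = [0, 0, 1, 1, 2, 2]
c2 = [2, 2, 0, 0, 1, 1]   # same partition as c1, different label names
c3 = [0, 0, 0, 1, 1, 1]   # a genuinely different partition

print(pair_disagreement(c1, c2))  # 0.0 -- distance zero even though c1 != c2
print(pair_disagreement(c1, c3))  # > 0
```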
With these definitions out of the way, we can finally define stability formally.
In words, $\text{stab}(A, P, n)$ is the expected distance between the clusterings output by $A$ when given two independent i.i.d. samples of size $n$ drawn from $P$.
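Here is a hedged Monte Carlo sketch of this quantity (my own construction, not the paper’s procedure): I take $A$ to be k-means, approximate $d_P$ by the pair-disagreement distance from the previous sketch evaluated on a fresh reference sample, and extend each fitted clustering to the reference points by nearest-center assignment. The mixture and all the tuning constants are assumptions for illustration.

```python
# Illustration: a Monte Carlo estimate of stab(A, P, n) for A = k-means,
# using an empirical pair-disagreement distance on a reference sample.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
true_means = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])

def sample(n):
    components = rng.integers(0, len(true_means), size=n)
    return true_means[components] + rng.normal(size=(n, 2))

def pair_disagreement(a, b):
    # repeated here so the snippet is self-contained
    same_a, same_b = a[:, None] == a[None, :], b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float(np.mean(same_a[iu] != same_b[iu]))

def estimate_stability(n, k, n_pairs=20, n_ref=300):
    ref = sample(n_ref)                      # fixed reference points stand in for P
    dists = []
    for _ in range(n_pairs):
        km1 = KMeans(n_clusters=k, n_init=10).fit(sample(n))
        km2 = KMeans(n_clusters=k, n_init=10).fit(sample(n))
        # extend each clustering to the reference sample by nearest-center assignment
        dists.append(pair_disagreement(km1.predict(ref), km2.predict(ref)))
    return float(np.mean(dists))

print(estimate_stability(n=200, k=3))
```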
An $R$-minimizing algorithm achieves empirical risk equal to the optimal risk for the empirical distribution $P_S$ of the sample it is given.
For a clustering distance $d$, a probability distribution $P$ has a unique minimizer \(\mathcal{C}^*\) if for every $\eta > 0$ there exists \(\epsilon > 0\) such that \(R(P, \mathcal{C}) < opt(P) + \epsilon\) implies \(d_P(\mathcal{C}^*, \mathcal{C}) < \eta\). In words, any clustering whose risk is within $\epsilon$ of the optimal risk (the risk achieved by the unique minimizer) must be within $\eta$, in clustering distance, of the unique minimizer.
$P$ is said to have $n$ distinct minimizers if there are \(\mathcal{C}^*_1, \dots, \mathcal{C}^*_n\) with \(d_P(\mathcal{C}^*_i, \mathcal{C}^*_j) > 0\) for all \(i \neq j\), and for every \(\eta > 0\) there exists \(\epsilon > 0\) such that \(R(P, \mathcal{C}) < opt(P) + \epsilon\) implies \(d_P(\mathcal{C}^*_i, \mathcal{C}) < \eta\) for some \(1 \leq i \leq n\).
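A toy example of distinct minimizers (my own construction, in the spirit of the symmetric distributions discussed later): three identical, well-separated components at the vertices of an equilateral triangle, clustered with $k = 2$. Any optimal 2-means clustering isolates one vertex and merges the other two, so there are three distinct optimizers with (essentially) equal risk, and which one a given run lands on depends on the sample.

```python
# Illustration: a symmetric distribution whose 2-means problem has several
# distinct minimizers.  Three identical components sit at the vertices of an
# equilateral triangle; with k = 2 the optimal clustering isolates one vertex,
# giving three equally good optima.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
side = 8.0
vertices = np.array([[0.0, 0.0],
                     [side, 0.0],
                     [side / 2, side * np.sqrt(3) / 2]])

def sample(n):
    components = rng.integers(0, 3, size=n)
    return vertices[components] + 0.5 * rng.normal(size=(n, 2))

for trial in range(6):
    X = sample(300)
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    # The vertex whose label below is unique is the isolated one; which vertex
    # that is typically varies from sample to sample, while the achieved risk
    # per point stays essentially the same across trials.
    print(f"trial {trial}: vertex labels {km.predict(vertices)}, "
          f"risk per point {km.inertia_ / len(X):.2f}")
```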
The above definitions give us a way to characterize risk functions and clustering distances.
Ben-David et al. present their first theorem as follows:
Any risk converging, $R$-minimizing clustering algorithm will be stable on distribution $P$ if $P$ has a unique minimizer.
The basic idea is to show that, for any value $\xi > 0$, the stability of an algorithm $A$ satisfying the stated properties is less than $\xi$ for sufficiently large sample size $m$. This implies that $A$ is stable on $P$.
Define $A$ as a risk converging, $R$-minimizing algorithm, and fix some $\xi > 0$. Let \(\mathcal{C}^*\) be the unique minimizer of $P$. First, set some $\delta \in (0, 1)$ and $\eta > 0$ such that $2(\eta + \delta) < \xi$.
Since \(\mathcal{C}^*\) is the unique minimizer of $P$, we have that $\exists \epsilon > 0$ such that:
\[R(P, \mathcal{C}) < opt(P) + \epsilon \implies d_P(\mathcal{C}^*, \mathcal{C}) < \eta \nonumber\]by the definition of a unique minimizer for a distribution. Furthermore, by the definition of risk convergence, $\exists m_0$ such that for all $m > m_0$ we have:
\[\underset{S \sim P^m}{\mathbb{P}} \left( R(P, A(S)) \geq opt(P) + \epsilon \right) < \delta \nonumber\]Consider $m > m_0$ and $S \sim P^m$. If $R(P, A(S)) < opt(P) + \epsilon$, then \(d_P(A(S), \mathcal{C}^*) < \eta\). Since the implication only runs one way, the probability of the event $R(P, A(S)) < opt(P) + \epsilon$ is at most the probability of the event \(d_P(A(S), \mathcal{C}^*) < \eta\) (the latter may hold even when the former does not). Thus:
\[\underset{S \sim P^m}{\mathbb{P}}\left( d_P(A(S), \mathcal{C}^*) \geq \eta \right) \leq \underset{S \sim P^m}{\mathbb{P}}\left( R(P, A(S)) \geq opt(P) + \epsilon\right) < \delta \nonumber\]The bound then follows:
\[\begin{aligned} \text{stab}(A, P, m) &= \underset{S_1, S_2 \sim P^m}{\mathbb{E}} \left[ d_P(A(S_1), A(S_2)) \right] \\ &\leq \underset{S_1, S_2 \sim P^m}{\mathbb{E}} \left[ d_P(A(S_1), \mathcal{C}^*) + d_P(\mathcal{C}^*, A(S_2)) \right] \hspace{15mm} \text{(triangle ineq.)}\\ &= 2 \underset{S \sim P^m}{\mathbb{E}}\left[ d_P(A(S), \mathcal{C}^*) \right] \hspace{15mm} \text{(i.i.d. samples)} \\ &\overset{(i)}{\leq} 2\left( \eta \cdot \underset{S \sim P^m}{\mathbb{P}}\left( d_P(A(S), \mathcal{C}^*) < \eta \right) + 1 \cdot \underset{S \sim P^m}{\mathbb{P}} \left( d_P(A(S), \mathcal{C}^*) \geq \eta \right) \right) \hspace{15mm} \text{(split the expectation)} \\ &\leq 2\left( \eta + \underset{S \sim P^m}{\mathbb{P}} \left( R(P, A(S)) \geq opt(P) + \epsilon \right) \right) \\ &\leq 2(\eta + \delta) \\ &< \xi \end{aligned} \nonumber\]In $(i)$, the expectation is split over the events \(\{d_P(A(S), \mathcal{C}^*) < \eta\}\) and \(\{d_P(A(S), \mathcal{C}^*) \geq \eta\}\): on the first event $d_P$ is bounded by $\eta$, and on the second it is bounded by $1$, since clustering distances take values in $[0, 1]$. Since $\xi > 0$ was arbitrary, $\text{stab}(A, P, m) \rightarrow 0$ as $m \rightarrow \infty$, so $A$ is stable on $P$.
The authors emphasize a couple points (see the paper for illustrations):
With these points in mind, the authors prove a related theorem.
Suppose we have an ODD risk function $R$, an ODD clustering distance $d$, a probability distribution $P$ with some number $n$ of distinct minimizers, and a $P$-symmetry $g$ such that \(d_P(\mathcal{C}^*, g(\mathcal{C}^*)) > 0\) for every $R$-minimizer \(\mathcal{C}^*\). Then any risk converging, $R$-minimizing clustering algorithm is unstable on $P$.
Denote the $n$ distinct $R$-minimizing clusterings by \(\{ \mathcal{C}_1^*, \dots, \mathcal{C}_n^*\}\), and let \(r := \underset{1 \leq i \leq n}{\min} \left( d_P(\mathcal{C}^*_i, g(\mathcal{C}^*_i)) \right)\). By the theorem's assumption on $g$, $r > 0$.
Choose $\epsilon > 0$ such that a clustering achieving risk within $\epsilon$ of the best possible risk implies the existence of some $R$-minimizer that is close to $\mathcal{C}$ (less than $r/4$ distance between them). That is, $R(P, \mathcal{C}) < opt(P) + \epsilon$ implies $\exists 1 \leq i \leq n$ such that $d_P(\mathcal{C}^*_i, \mathcal{C}) < r/4$.
Define \(T = \{ S \in X^m \mid R(P, A(S)) < opt(P) + \epsilon \}\), the set of samples from $X$ of size $m$ whose clustering using $A$ achieves risk within $\epsilon$ of $opt(P)$. $A$ is assumed to be risk converging, so there exists $m_0$ such that for $m > m_0$, $P(T) > 0.9$ (by choosing $\delta = 0.1$).
Now, for $1 \leq i \leq n$, define \(T_i = \{ S \in T \mid d_P(\mathcal{C}^*_i, A(S)) \leq r/4 \}\), the samples in $T$ whose clusterings under $A$ are within $r/4$ of the $i$-th $R$-minimizing clustering, \(\mathcal{C}^*_i\). By the choice of $\epsilon$, every $S \in T$ has $A(S)$ within $r/4$ of some $R$-minimizer, so the sets $T_1, \dots, T_n$ cover $T$.
Since the $T_i$ cover $T$ and $P(T) > 0.9$ for sufficiently large $m$, the pigeonhole principle gives some $i_0$ such that \(P(T_{i_0}) \geq 0.9/n\). Since $g$ is a $P$-preserving symmetry and $R$ is assumed to be ODD, we know that for any $S \in T$, $g(S) \in T$ (by the definition of ODD). The fact that $g$ is a $P$-preserving symmetry also implies that \(P(g[T_{i_0}]) \geq 0.9/n\), since \(P(T_{i_0}) \geq 0.9/n\).
By the definition of $r$, we have that \(d_P(\mathcal{C}^*_{i_0}, g(\mathcal{C}^*_{i_0})) \geq r\), and by the construction of $T_{i_0}$, we also have that for all $S \in T_{i_0}$, \(d_P(\mathcal{C}^*_{i_0}, A(S)) \leq r/4\). Recall that $d$ is ODD as well, so for all \(S \in T_{i_0}\), \(d_P(g(\mathcal{C}^*_{i_0}), A(g(S))) \leq r/4\).
By our definition of clustering distances, $d$ must satisfy the triangle inequality. Applying it to \(d_P(\mathcal{C}^*_{i_0}, g(\mathcal{C}^*_{i_0}))\) and rearranging, for any $S \in T_{i_0}$ and $S' \in g[T_{i_0}]$:
\[d_P(A(S), A(S')) \geq d_P(\mathcal{C}^*_{i_0}, g(\mathcal{C}^*_{i_0})) - d_P(\mathcal{C}^*_{i_0}, A(S)) - d_P(g(\mathcal{C}^*_{i_0}), A(S')) \geq r - \frac{r}{4} - \frac{r}{4} = \frac{r}{2} \nonumber\]All of the above leads to the following. For any $m \geq m_0$:
\[\begin{aligned} \text{stab}(A, P, m) &= \underset{S, S' \sim P^m}{\mathbb{E}} \left[ d_P(A(S), A(S')) \right] \\ &\geq \frac{r}{2} \underset{S, S' \sim P^m}{\mathbb{P}} \left( d_P(A(S), A(S')) \geq \frac{r}{2} \right) \hspace{15mm} \text{(drop the other case, which is non-negative)} \\ &\geq \frac{r}{2}\underset{S, S' \sim P^m}{\mathbb{P}} \left( S \in T_{i_0},\ S' \in g[T_{i_0}] \right) \\ &= \frac{r}{2} \underset{S \sim P^m}{\mathbb{P}} \left( S \in T_{i_0} \right) \underset{S' \sim P^m}{\mathbb{P}}\left( S' \in g[T_{i_0}] \right) \hspace{15mm} (S, S' \text{ independent}) \\ &\geq \frac{0.9^2 \, r}{2 n^2} \end{aligned} \nonumber\]For every $m \geq m_0$, $\text{stab}(A, P, m)$ is therefore bounded below by the positive constant $\frac{0.9^2 r}{2n^2}$ (recall that $r > 0$), so $\text{stab}(A, P) > 0$ as well. Thus, $A$ is unstable on $P$.
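As an empirical sanity check on both theorems (my own experiment, with the same k-means and pair-disagreement stand-ins as before, not something from the paper), the sketch below estimates stability for a distribution with a unique minimizer (two well-separated components, $k = 2$) and for the symmetric three-component distribution from earlier (also with $k = 2$): the first estimate is essentially zero, while the second stays bounded away from zero as the sample size grows.

```python
# Illustration: estimated stability for a distribution with a unique minimizer
# versus a symmetric distribution with several distinct minimizers.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)

def sample_two_blobs(n):
    """Two well-separated components; with k = 2 there is a unique minimizer."""
    means = np.array([[0.0, 0.0], [8.0, 0.0]])
    return means[rng.integers(0, 2, size=n)] + 0.5 * rng.normal(size=(n, 2))

def sample_triangle(n):
    """Three symmetric components; with k = 2 there are three distinct minimizers."""
    side = 8.0
    vertices = np.array([[0.0, 0.0], [side, 0.0], [side / 2, side * np.sqrt(3) / 2]])
    return vertices[rng.integers(0, 3, size=n)] + 0.5 * rng.normal(size=(n, 2))

def pair_disagreement(a, b):
    same_a, same_b = a[:, None] == a[None, :], b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), k=1)
    return float(np.mean(same_a[iu] != same_b[iu]))

def estimate_stability(sampler, m, k=2, n_pairs=20, n_ref=300):
    ref = sampler(n_ref)
    dists = []
    for _ in range(n_pairs):
        km1 = KMeans(n_clusters=k, n_init=10).fit(sampler(m))
        km2 = KMeans(n_clusters=k, n_init=10).fit(sampler(m))
        dists.append(pair_disagreement(km1.predict(ref), km2.predict(ref)))
    return float(np.mean(dists))

for m in (50, 200, 800):
    print(f"m={m:4d}  unique minimizer: {estimate_stability(sample_two_blobs, m):.3f}  "
          f"symmetric minimizers: {estimate_stability(sample_triangle, m):.3f}")
```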