Though the journey that brought me here is a bit convoluted, I have recently become interested in clustering metrics and evaluation. In this post, I’ll work through a couple of papers that describe how good a clustering function is based upon a set of axioms. These include Kleinberg’s An Impossibility Theorem for Clustering and Ackerman and Ben-David’s Measures of Clustering Quality: A Working Set of Axioms for Clustering.
First, let’s more rigorously define what we mean by clustering and our problem setting as defined by Kleinberg.
We can evaluate a clustering function with what the author terms a clustering quality metric.
Finally, we’ll define some helpful terms and quantities that will be used later. Let $\alpha \cdot d$ denote scaling the distance function $d$ by $\alpha$. That is, $\alpha \cdot d$ is the distance function that assigns distance $\alpha d(i,j)$ between points $i$ and $j$. We’ll denote all possible output partitions of $f$ as $\text{Range}(f)$. We denote the fact that points $i$ and $j$ are in the same cluster in partition $\Gamma$ with $i \underset{\Gamma}{\sim} j$, and we use the notation $i \underset{\Gamma}{\not \sim} j$ if they are not in the same cluster.
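To make these objects concrete, here’s a minimal Python sketch of one way to represent them (the dictionary encoding and the helper names `scale` and `same_cluster` are my own, not from the paper):

```python
# A distance function over S = {0, ..., n-1}: a dict mapping each
# unordered pair (i, j) with i < j to a positive distance.
d = {(0, 1): 1.0, (0, 2): 5.0, (1, 2): 4.0}

# A partition Gamma of S: a list of disjoint sets covering S.
gamma = [{0, 1}, {2}]

def scale(d, alpha):
    """Return the scaled distance function alpha * d."""
    return {pair: alpha * dist for pair, dist in d.items()}

def same_cluster(gamma, i, j):
    """True iff i and j share a cluster of gamma (i ~_Gamma j)."""
    return any(i in c and j in c for c in gamma)

print(scale(d, 2.0))              # {(0, 1): 2.0, (0, 2): 10.0, (1, 2): 8.0}
print(same_cluster(gamma, 0, 1))  # True
print(same_cluster(gamma, 0, 2))  # False
```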
We call a set $G$ a representative set of $\Gamma$ if it contains a single observation from each cluster in $\Gamma$.
A distance-preserving isomorphism can be thought of as a mapping between our source set, $S$, and some target set $\phi(S)$ such that for any $i \in S$, there exists $\phi(i) \in \phi(S)$; for any $i, j \in S$, $d(i, j) = d(\phi(i), \phi(j))$; and there exists an inverse mapping, $\phi'$, such that $\phi'(\phi(i)) = i$.
A $\Gamma$-transformation of a distance function $d$ assigns distances that are no larger between points in the same cluster and no smaller between points in different clusters.
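Reusing `same_cluster` from the sketch above, checking whether a candidate $d'$ is a $\Gamma$-transformation of $d$ might look like this (again, a sketch with my own naming):

```python
def is_gamma_transformation(d_new, d, gamma):
    """True iff d_new shrinks (or keeps) every within-cluster distance
    and grows (or keeps) every between-cluster distance relative to d."""
    for (i, j), dist in d.items():
        if same_cluster(gamma, i, j):
            if d_new[(i, j)] > dist:
                return False
        elif d_new[(i, j)] < dist:
            return False
    return True

# Pull {0, 1} together and push 2 away: a Gamma-transformation of d.
d_new = {(0, 1): 0.5, (0, 2): 6.0, (1, 2): 4.5}
print(is_gamma_transformation(d_new, d, gamma))  # True
```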
More intuitively, a refinement is just a finer partition, which means each cluster in $\Gamma$ is either also in $\Gamma'$ or is split into multiple smaller clusters in $\Gamma'$. We can also define the partial order $\Gamma' \preccurlyeq \Gamma$ if $\Gamma'$ is a refinement of $\Gamma$.
One way I like to conceptualize an antichain is like a collection of subsets where no subset is a subset of any other…but instead of subsets we have partitions.
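Continuing the sketch, refinement and the antichain condition translate directly into code (the `canon` helper just makes partition equality ignore cluster order):

```python
def is_refinement(gamma_new, gamma):
    """True iff every cluster of gamma_new lies inside some cluster of
    gamma, i.e. gamma_new is at least as fine as gamma."""
    return all(any(c_new <= c for c in gamma) for c_new in gamma_new)

def is_antichain(partitions):
    """True iff no partition in the collection properly refines another."""
    canon = lambda p: frozenset(frozenset(c) for c in p)
    return not any(
        canon(p) != canon(q) and is_refinement(p, q)
        for p in partitions for q in partitions
    )

pairs = [[{0, 1}, {2}], [{0, 2}, {1}], [{1, 2}, {0}]]
print(is_antichain(pairs))                      # True
print(is_antichain(pairs + [[{0}, {1}, {2}]]))  # False: singletons refine everything
```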
Kleinberg’s impossibility theorem is based upon a set of three axioms that characterize a “good” clustering function, $f$.
Kleinberg goes on to show that these axioms imply a semi-surprising result.
For $n \geq 2$, there does not exist a clustering function $f$ that is scale-invariant, rich, and consistent.
The proof of Theorem 2.1 rests upon the claim that a scale-invariant and consistent clustering function $f$ cannot possibly be rich, as $\text{Range}(f)$ forms an antichain (Theorem 3.1 in the paper).
First, suppose we have a consistent clustering function $f$, and let $\Gamma$ be some partition in $\text{Range}(f)$. Because $\Gamma \in \text{Range}(f)$, $\exists d$ such that $f(d) = \Gamma$ (by definition). Define $a'$ as the minimum distance between any two points in the same cluster over all clusters in $\Gamma$, and define $b'$ as the maximum distance between any two points in different clusters over all clusters in $\Gamma$.
Select positive real numbers $a, b$ such that $a < b$, $a \leq a'$, and $b \geq b'$. Notice that any distance function $d'$ that $(a,b)$-conforms to $\Gamma$ is a $\Gamma$-transformation of $d$. This is due to the fact that $d'(i,j) \leq a \leq a' \leq d(i,j)$ for $i,j \in S$ such that $i \underset{\Gamma}{\sim} j$ and $d'(i,j) \geq b \geq b' \geq d(i,j)$ for $i,j \in S$ such that $i \underset{\Gamma}{\not \sim} j$, which is precisely the definition of a $\Gamma$-transformation.
Since $d'$ is a $\Gamma$-transformation of $d$ and $f$ is consistent by assumption, $(a,b)$ is $\Gamma$-forcing.
Now we further assume that $f$ is scale-invariant and that $\exists \Gamma_0, \Gamma_1 \in \text{Range}(f)$ such that $\Gamma_0$ is a refinement of $\Gamma_1$ (with $\Gamma_0 \neq \Gamma_1$). Let $(a_0, b_0)$ and $(a_1, b_1)$ be pairs of positive reals that are $\Gamma_0$-forcing and $\Gamma_1$-forcing, respectively, such that $a_0 < b_0$ and $a_1 < b_1$. Such pairs exist by applying the argument above to $\Gamma_0$ and $\Gamma_1$.
Let $a_2$ be some positive number such that $a_2 \leq a_1$, and set $\epsilon$ such that $0 < \epsilon < \frac{a_0 a_2}{b_0}$. We now construct a distance function $d$ that satisfies the following:
According to the above conditions, $d$ $(a_1, b_1)$-conforms to $\Gamma_1$. Since $(a_1, b_1)$ is $\Gamma_1$-forcing, this implies $f(d) = \Gamma_1$.
Let $\alpha = \frac{b_0}{a_2}$ and $d' = \alpha \cdot d$. Since $f$ is scale-invariant, $f(d') = f(d) = \Gamma_1$. However, for $i,j$ such that $\Gamma_0^i = \Gamma_0^j$, $d'(i,j) = \alpha d(i,j) \leq \alpha \epsilon < \frac{a_0 a_2 b_0}{a_2 b_0} = a_0$. In addition, for $i,j$ such that $\Gamma_0^i \neq \Gamma_0^j$, $d'(i,j) = \alpha d(i,j) \geq \alpha a_2 = \frac{b_0 a_2}{a_2} = b_0$. This implies that $d'$ $(a_0, b_0)$-conforms to $\Gamma_0$, which implies that $f(d') = \Gamma_0$.
Since $\Gamma_0$ is a proper refinement of $\Gamma_1$, the two partitions are distinct, so $f(d')$ cannot equal both, and we arrive at a contradiction.
Kleinberg proves an additional theorem that describes the partitions achievable by scale-invariant, consistent clustering functions $f$.
Let $\mathcal{A}$ be any antichain of partitions. There exists a scale-invariant, consistent clustering function $f$ such that $\text{Range}(f) = \mathcal{A}$.
Kleinberg’s proof uses the sum-of-pairs clustering method which outputs the partition $\Gamma \in \mathcal{A}$ that minimizes $\Phi_d(\Gamma) = \sum_{(i,j) \sim \Gamma} d(i,j)$ where the notation $(i,j) \sim \Gamma$ indicates $i \underset{\Gamma}{\sim} j$.
Notice that for any $\alpha > 0$, $\Phi_{\alpha \cdot d}(\Gamma) = \sum_{(i,j) \sim \Gamma} \alpha d(i,j) = \alpha \sum_{(i,j) \sim \Gamma} d(i,j) = \alpha \Phi_{d}(\Gamma)$. Since $\alpha$ is positive, the argmin of $\Phi_{\alpha \cdot d}(\Gamma)$ is equivalent to the argmin of $\Phi_d(\Gamma)$, which implies $f(d) = f(\alpha \cdot d)$ so $f$ is scale-invariant.
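Here’s a sketch of the sum-of-pairs method built on the helpers above; note that ties in the `min` are broken arbitrarily here, a detail the formal argument avoids by construction. The final line illustrates the scale-invariance argument numerically:

```python
def phi(d, gamma):
    """Sum-of-pairs objective: total within-cluster distance."""
    return sum(dist for (i, j), dist in d.items()
               if same_cluster(gamma, i, j))

def sum_of_pairs_clustering(d, antichain):
    """f(d): the partition in the antichain minimizing phi."""
    return min(antichain, key=lambda g: phi(d, g))

# Scaling d leaves the argmin unchanged, illustrating scale-invariance.
print(sum_of_pairs_clustering(d, pairs) ==
      sum_of_pairs_clustering(scale(d, 3.0), pairs))  # True
```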
Fix some $\Gamma \in \mathcal{A}$. Let $d$ be a distance function satisfying:
Notice that $\Phi_d(\Gamma) = \sum_{(i,j) \sim \Gamma} d(i,j) < 1$: the summation runs only over pairs of points in the same cluster, of which there can be, at most, $n^2$ (if there is only one cluster), and each such pair contributes less than $\frac{1}{n^3}$, so $\Phi_d(\Gamma) < n^2 \cdot \frac{1}{n^3} = \frac{1}{n} < 1$.
Also notice that $\Phi_d(\Gamma') < 1$ only if $\Gamma'$ is a refinement of $\Gamma$. To see why, consider a $\Gamma'$ that is not a refinement of $\Gamma$. Then there is some cluster $C' \in \Gamma'$ such that $C' \not\subseteq C$ for all $C \in \Gamma$, which means there exist points $i,j$ such that $i \underset{\Gamma'}{\sim} j$, so they are included in the summation in $\Phi_d(\Gamma')$, but $i \underset{\Gamma}{\not \sim} j$, so $d(i,j) > 1$ and hence $\Phi_d(\Gamma') > 1$. If instead $\Gamma'$ is a refinement of $\Gamma$, then every pair with $i \underset{\Gamma'}{\sim} j$ also satisfies $i \underset{\Gamma}{\sim} j$, so every term is less than $\frac{1}{n^3}$ and $\Phi_d(\Gamma') < 1$.
Because $\mathcal{A}$ is an antichain, there is no refinement $\Gamma' \in \mathcal{A}$ of $\Gamma$ other than $\Gamma$ itself. Thus, $\Gamma = \underset{\Gamma^* \in \mathcal{A}}{\arg\min} \Phi_d(\Gamma^*)$, which implies $f(d) = \Gamma$. Since $\Gamma \in \mathcal{A}$ was arbitrary, there exists a $d$ with $f(d) = \Gamma$ for every $\Gamma \in \mathcal{A}$, so $\text{Range}(f) = \mathcal{A}$.
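A quick numeric check of this construction using the sketch above; the particular values are mine, and anything below $\frac{1}{n^3}$ within clusters and above $1$ across clusters would do:

```python
n = 3
target = [{0, 2}, {1}]  # the partition in the antichain we want f to return
d_rich = {(i, j): 1 / (2 * n**3) if same_cluster(target, i, j) else 2.0
          for (i, j) in [(0, 1), (0, 2), (1, 2)]}
print(sum_of_pairs_clustering(d_rich, pairs) == target)  # True
```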
Let $d$ be a distance function such that $f(d) = \Gamma$, and let $d'$ be a $\Gamma$-transformation of $d$. Furthermore, let $\Gamma'$ be some other partition, and define $\Delta(\Gamma') := \Phi_d(\Gamma') - \Phi_{d'}(\Gamma')$. We have that $\Delta(\Gamma) = \sum_{(i,j) \sim \Gamma}d(i,j) - \sum_{(i,j) \sim \Gamma} d'(i,j)$ and $\Delta(\Gamma') = \sum_{(i,j) \sim \Gamma'} d(i,j) - \sum_{(i,j) \sim \Gamma'} d'(i,j)$. Then:
\[\begin{aligned} \Delta(\Gamma) &= \sum_{(i,j) \sim \Gamma} d(i,j) - \sum_{(i,j) \sim \Gamma} d'(i,j) \\ &= \sum_{(i,j) \sim \Gamma} d(i,j) - d'(i,j) \\ &\overset{(i)}{\geq} \sum_{(i,j) \sim \Gamma, (i,j) \sim \Gamma'} d(i,j) - d'(i,j) \\ &\overset{(ii)}{\geq} \sum_{(i,j) \sim \Gamma'} d(i,j) - d'(i,j) \\ &= \Delta(\Gamma') \end{aligned}\]We know $d'$ is a $\Gamma$-transformation of $d$, so $d'(i,j) \leq d(i,j)$ for $(i,j) \sim \Gamma$, meaning every term $d(i,j) - d'(i,j)$ in the sum is nonnegative. $(i)$ then follows because restricting the sum to the pairs that are also co-clustered in $\Gamma'$ only drops nonnegative terms.
Similarly, the summation in $(ii)$ is the summation in $(i)$ plus the terms for pairs with $(i,j) \sim \Gamma'$ but $(i,j) \not\sim \Gamma$. For these pairs, $d'(i,j) \geq d(i,j)$, since $d'$ is a $\Gamma$-transformation of $d$. This means $d(i,j) - d'(i,j) \leq 0$ for these pairs, which implies $(ii)$.
The above argument implies $\Delta(\Gamma) \geq \Delta(\Gamma')$ for any $\Gamma' \in \mathcal{A}$. This implies $\Phi_d(\Gamma) - \Phi_{d'}(\Gamma) \geq \Phi_d(\Gamma') - \Phi_{d'}(\Gamma')$, which implies $\Phi_d(\Gamma) - \Phi_d(\Gamma') \geq \Phi_{d'}(\Gamma) - \Phi_{d'}(\Gamma')$.
Now let $\Gamma' = f(d')$; that is, $\Gamma'$ minimizes $\Phi_{d'}$. Since $\Gamma$ minimizes $\Phi_{d}$, we have that $\Phi_d(\Gamma) - \Phi_d(\Gamma') \leq 0$. Chaining this together with the previous inequality yields $0 \geq \Phi_{d'}(\Gamma) - \Phi_{d'}(\Gamma')$, which implies $\Phi_{d'}(\Gamma') \geq \Phi_{d'}(\Gamma)$. Since $\Gamma'$ minimizes $\Phi_{d'}$, it must be the case that $\Gamma' = \Gamma$. Thus, $f(d') = \Gamma$, and therefore $f$ is consistent.
Due to the impossibility theorem, it might be worthwhile to look into easing up the conditions in the axioms. Kleinberg provides a few examples.
Theorem 3.2 is an example of a relaxation of the richness property. If we were satisfied with a clustering function that is scale-invariant, consistent, but only achieves an antichain as its range, then the sum-of-pairs method will work.
Kleinberg proposes Refinement-Consistency, which requires that if $d'$ is an $f(d)$-transformation of $d$, then $f(d')$ is a refinement of $f(d)$. Unfortunately, this is not yet enough; a scale-invariant, rich, and refinement-consistent clustering function does not exist. However, if one also relaxes richness to say that all but one (trivial) partition can be achieved by $f$ (termed Near-Richness), then the impossibility result no longer holds.
An alternative is to relax consistency to something I’ll call Weak Consistency: if $d'$ is an $f(d)$-transformation of $d$, then either $f(d')$ is a refinement of $f(d)$ or $f(d)$ is a refinement of $f(d')$. There do exist clustering functions that satisfy all three of scale-invariance, richness, and weak consistency.
In some ways, this relaxation may be more reasonable. For example, consider some partition that results in four clusters arranged in a square. Now, construct a distance function that just puts more distance between the left clusters and the right clusters. Although this new distance function is a $\Gamma$-transformation, it might be better to combine the left clusters and right clusters to have a partition with two groups. Ackerman and Ben-David provide this example as an illustration in Figure 1 of their paper.
Kleinberg does not discuss a relaxation of scale-invariance at length. He mentions that single-linkage clustering, where we stop merging clusters once the distance between them exceeds some value $r$, satisfies consistency and richness. It satisfies a weaker scale-invariance in which $f(\alpha \cdot d)$ is a refinement of $f(d)$ when $\alpha > 1$. Beyond this, I don’t see how relaxing scale-invariance further would lead to an attractive clustering function.
Kleinberg’s main result is a theorem (and its proof) stating that there does not exist a clustering function that satisfies three simple and desirable properties. This seems quite disappointing, as clustering is a popular subtopic in unsupervised learning, and it seems to imply that clustering is at least extremely difficult to define precisely, if not impossible to actually do.
Ackerman and Ben-David push back against this interpretation and state that the way in which Kleinberg axiomatized clustering was part of the reason for the impossibility result. They provide a slightly different perspective and provide a set of axioms for the function used to assess clustering quality rather than the clustering function itself.
Ackerman and Ben-David adjust Kleinberg’s work to apply to CQMs rather than clustering functions, which results in a consistent set of axioms. They also add one additional property in order to make their set of axioms satisfy two properties (soundness and completeness, which I won’t go into here) that make it more useful for defining what methods should be used for clustering.
Ackerman and Ben-David then prove that their axioms are consistent, i.e., that CQMs satisfying all of them exist.
There exists a clustering quality measure that satisfies all four of scale-invariance, richness, consistency, and isomorphism invariance. That is, the four properties comprise a consistent set of axioms.
It suffices to construct a CQM and prove that it satisfies all four axioms.
First, we define the Relative Point Margin. For a distance function $d$, a clustering $\Gamma$, and a subset $G \subseteq S$, the $G$-Relative Point Margin is defined as \(G\text{-}RM_{S, d}(x) = \frac{d(x, g_x)}{d(x, g'_{x})}\), where $g_x \in G$ is the closest point in $G$ to $x$ and \(g'_{x} \in G\) is the second-closest point in $G$ to $x$.
Next, define the Relative Margin. Let $\mathcal{G}$ denote the set of all possible representative sets of $\Gamma$. Then the relative margin of $\Gamma$ over $(S, d)$ is $RM_{S, d}(\Gamma) = \underset{G \in \mathcal{G}}{\min} \left\{ \underset{x \in S \setminus G}{\textit{avg}}\; G\text{-}RM_{S, d}(x) \right\}$. That is, we find the representative set that achieves the minimum average relative point margin, where the average is taken over all points not in the representative set. The relative margin assigns lower values to better clusterings (so the inequalities in the richness and consistency definitions will be reversed).
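Here is a brute-force sketch of the relative margin (enumerating representative sets with `itertools.product`; it assumes at least two clusters and at least one point outside each representative set):

```python
from itertools import product
from statistics import mean

def relative_margin(S, d, gamma):
    """RM_{S,d}(Gamma): minimize, over representative sets G, the average
    over points x outside G of d(x, nearest rep) / d(x, second-nearest rep)."""
    def dist(i, j):
        return d[(min(i, j), max(i, j))]
    best = float("inf")
    # A representative set picks exactly one point from each cluster.
    for G in product(*[sorted(c) for c in gamma]):
        ratios = []
        for x in S:
            if x in G:
                continue
            nearest, second = sorted(dist(x, g) for g in G)[:2]
            ratios.append(nearest / second)
        best = min(best, mean(ratios))
    return best

S = [0, 1, 2]
gamma = [{0, 1}, {2}]
d = {(0, 1): 1.0, (0, 2): 10.0, (1, 2): 10.0}
print(relative_margin(S, d, gamma))  # 0.1
```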
Let $\Gamma$ be an arbitrary clustering of the set $S$ with distance function $d$ in the following.
Scale-Invariance. Let $d'$ be the distance function satisfying $d'(i,j) = \alpha d(i, j)$ for all $i,j \in S$, for some $\alpha > 0$. For any $i,j,k \in S$, we have:
\[\frac{d'(i,j)}{d'(i,k)} = \frac{\alpha d(i,j)}{\alpha d(i,k)} = \frac{d(i,j)}{d(i,k)} \implies \frac{d'(i, g_i)}{d'(i, g'_i)} = \frac{d(i, g_i)}{d(i, g'_i)}\]since scaling all distances preserves each point’s nearest and second-nearest representatives. This implies that $RM_{S, d'}(\Gamma) = RM_{S, d}(\Gamma)$.
Consistency. Let $d'$ be a $\Gamma$-transformation of $d$. We have:
\[\begin{aligned} \begin{cases} d'(i,j) \leq d(i,j) & \text{for } i \underset{\Gamma}{\sim} j \\ d'(i,j) \geq d(i,j) & \text{for } i \underset{\Gamma}{\not \sim} j \end{cases} &\implies \frac{d'(i,j)}{d'(i,k)} \leq \frac{d(i,j)}{d(i,k)} \text{ for } i \underset{\Gamma}{\sim} j; i \underset{\Gamma}{\not \sim} k \\ &\implies G\text{-}RM_{S,d'}(i) \leq G\text{-}RM_{S,d}(i) \text{ for any } G \in \mathcal{G} \end{aligned}\]The first implication follows from the fact that the closest representative to $i$ will be a point in the same cluster, and the second-closest representative to $i$ will be a point in a different cluster (by the definition of a representative set). Averaging over the points outside $G$ and minimizing over $G$ then gives $RM_{S,d'}(\Gamma) \leq RM_{S,d}(\Gamma)$, which is exactly consistency (recall that lower relative margin means a better clustering).
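As a quick sanity check of consistency with the sketch above: shrinking a within-cluster distance, which yields a $\Gamma$-transformation, should not increase the relative margin.

```python
# d2 shrinks the within-cluster distance, so it is a Gamma-transformation of d.
d2 = {(0, 1): 0.5, (0, 2): 10.0, (1, 2): 10.0}
assert relative_margin(S, d2, gamma) <= relative_margin(S, d, gamma)
print(relative_margin(S, d2, gamma))  # 0.05
```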
Richness. Let $\Gamma$ be an arbitrary non-trivial clustering of $S$. Define the distance function $d$ as:
\[\begin{cases} d(i,j) = 1 & \text{for } i \underset{\Gamma}{\sim} j \\ d(i,j) = 10 & \text{for } i \underset{\Gamma}{\not \sim} j \end{cases} \nonumber\]It follows that:
\[\begin{aligned} \underset{\Gamma'}{\arg \min} \left\{ RM_{S, d}(\Gamma') \right\} &= \underset{\Gamma'}{\arg \min} \left\{ \underset{G \in \mathcal{G}}{\min} \left[ \underset{x \in S \setminus G}{\textit{avg}} G\text{-}RM_{S,d}(x) \right] \right\} \\ &= \underset{\Gamma'}{\arg \min} \left\{ \underset{G \in \mathcal{G}}{\min} \left[ \frac{1}{\lvert S \setminus G \rvert} \sum_{x \in S \setminus G} \frac{d(x, g_x)}{d(x, g'_x)} \right] \right\} \\ &= \underset{\Gamma'}{\arg \min} \left\{ \underset{G \in \mathcal{G}}{\min} \left[ \frac{1}{\lvert S \setminus G \rvert} \sum_{x \in S \setminus G} \frac{1}{10} \right] \right\} \\ &= \underset{\Gamma'}{\arg \min} \left\{ \underset{G \in \mathcal{G}}{\min} \left[ \frac{1}{10} \right] \right\} \end{aligned}\]The substitution in the third line holds for $\Gamma' = \Gamma$: every point’s nearest representative lies in its own cluster at distance $1$, and its second-nearest lies in another cluster at distance $10$, which is the smallest ratio any clustering can achieve under this $d$. Thus, $\Gamma = \underset{\Gamma'}{\arg \min} \left\{ RM_{S, d}(\Gamma') \right\}$, and richness holds.
Isomorphism Invariance. Let $\Gamma'$ be a partition such that $\Gamma \underset{d}{\approx} \Gamma'$. Since they are isomorphic, there exists a distance-preserving isomorphism $\phi$. Let $G' := \{ \phi(x) : x \in G \}$, and let $\mathcal{G}'$ be the set of all such $G'$. Thus:
\[\begin{aligned} \frac{d(i, g_i)}{d(i, g'_i)} = \frac{d(\phi(i), g_{\phi(i)})}{d(\phi(i), g'_{\phi(i)})} &\implies G\text{-}RM_{S,d}(i) = G'\text{-}RM_{S,d}(\phi(i)) \\ &\implies \underset{i \in S \setminus G}{avg} \left\{ G\text{-}RM_{S,d}(i)\right\} = \underset{i \in S \setminus G'}{avg} \left\{ G'\text{-}RM_{S,d}(i) \right\} \\ &\implies \underset{G \in \mathcal{G}}{\min} \left[ \underset{i \in S \setminus G}{avg} \left\{ G\text{-}RM_{S,d}(i)\right\}\right] = \underset{G' \in \mathcal{G}'}{\min} \left[ \underset{i \in S \setminus G'}{avg} \left\{ G'\text{-}RM_{S,d}(i)\right\}\right] \\ &\implies RM_{S,d}(\Gamma) = RM_{S,d}(\Gamma') \end{aligned}\]Ackerman and Ben-David provide a few different examples of CQMs that satisfy their axioms.
Suppose we are in a linkage-based regime. Define the Weakest Link Between Points as follows. For $i \underset{\Gamma}{\sim} j$, let $\gamma_k$ be the cluster of $\Gamma$ containing both $i$ and $j$. Then:
\[\Gamma\text{-}WL_{S, d}(i, j) = \underset{x \in \gamma_k}{\min} \left\{ \max \left[ d(i, x), d(x, j) \right] \right\}\]The Weakest Link of $\Gamma$ over $S$ with $d$ is:
\[WL(\Gamma) = \frac{\underset{i \underset{\Gamma}{\sim} j}{\max} \Gamma\text{-}WL_{S,d}(i,j)}{\underset{i \underset{\Gamma}{\not \sim} j}{\min} d(i,j)}\]This is calculable in $O(n^3)$ time.
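Here’s a brute-force sketch under my single-intermediate-point reading of the $\Gamma\text{-}WL$ definition above, reusing the running example (it assumes at least two clusters and at least one non-singleton cluster):

```python
def weakest_link(S, d, gamma):
    """WL(Gamma): the largest within-cluster weakest link divided by the
    smallest between-cluster distance."""
    def dist(i, j):
        return d[(min(i, j), max(i, j))] if i != j else 0.0
    def wl_pair(i, j, cluster):
        # Best single intermediate point x in the pair's shared cluster.
        return min(max(dist(i, x), dist(x, j)) for x in cluster)
    num = max(wl_pair(i, j, c)
              for c in gamma for i in c for j in c if i != j)
    den = min(dist(i, j) for i in S for j in S
              if i != j and not same_cluster(gamma, i, j))
    return num / den

print(weakest_link(S, d, gamma))  # 0.1 for the running example
```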
Suppose we are in a center-based regime (like $k$-means). Define the Additive Point Margin as:
\[G\text{-}AM_{S, d}(i) = d(i, g'_i) - d(i, g_i)\]where $g_i, g'_i \in G$ are, as before, the nearest and second-nearest representatives to $i$. The Additive Margin of $\Gamma$ over $S$ with $d$ is:
\[AM_{S,d}(\Gamma) = \underset{G \in \mathcal{G}}{\min} \left\{ \frac{ \frac{1}{\lvert S \rvert} \sum_{i \in S} G\text{-}AM_{S, d}(i) }{ \frac{1}{\left\lvert \left\{ \{ i, j\} \subseteq S : i \underset{\Gamma}{\sim} j \right\} \right\rvert} \sum_{i \underset{\Gamma}{\sim} j} d(i, j) } \right\}\]This is calculable in $O(n^{k+1})$ time.
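And a matching sketch of the additive margin, implementing the displayed formula directly (the same brute-force enumeration of representative sets; that enumeration is what drives the $O(n^{k+1})$ behavior noted above):

```python
from itertools import product
from statistics import mean

def additive_margin(S, d, gamma):
    """AM_{S,d}(Gamma): minimize, over representative sets G, the mean
    additive point margin normalized by the mean within-cluster distance."""
    def dist(i, j):
        return d[(min(i, j), max(i, j))] if i != j else 0.0
    within = [dist(i, j) for (i, j) in d if same_cluster(gamma, i, j)]
    best = float("inf")
    for G in product(*[sorted(c) for c in gamma]):
        margins = []
        for i in S:
            nearest, second = sorted(dist(i, g) for g in G)[:2]
            margins.append(second - nearest)  # d(i, g'_i) - d(i, g_i)
        best = min(best, mean(margins) / mean(within))
    return best

print(additive_margin(S, d, gamma))  # ~9.67 for the running example
```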
One can also use functions of clustering-quality measures to create a new one. If one had a CQM defined for partitions of $k$ clusters, one could consider taking the minimum, maximum, or average over all subsets of size $k$ for a clustering of arbitrary size greater than $k$.
The above CQMs do not depend on the number of clusters in a partition, which makes it easy to compare clusterings that are of different sizes. Ackerman and Ben-David extend their framework to CQMs that do depend on cluster number, such as those that are based upon the objective functions of some clustering methods (such as $k$-means).
In order to make these CQMs compliant with the scale-invariance property, the quality scores must be normalized in some way. An example is $\mathcal{L}$-normalization, which scales the loss of a clustering by the loss of a trivial clustering that has a single cluster of all observations.
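For instance, here’s a small sketch of $\mathcal{L}$-normalization, using the sum-of-pairs objective `phi` from earlier as a stand-in for a generic loss (the helper name `l_normalized` is mine):

```python
def l_normalized(loss, S, d, gamma):
    """Divide a clustering loss by the loss of the trivial single-cluster
    partition. If loss(alpha * d) = alpha**p * loss(d), the ratio is
    invariant to scaling d."""
    return loss(d, gamma) / loss(d, [set(S)])

print(l_normalized(phi, S, d, gamma))              # ~0.0476 (= 1/21)
print(l_normalized(phi, S, scale(d, 5.0), gamma))  # unchanged by scaling
```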
Loss-based CQMs also tend to either reward or punish having more clusters. Ackerman and Ben-David term CQMs based on a loss function that “prefers” more clusters refinement-preferring and those based on losses that prefer fewer clusters coarsening-preferring. More explicitly, a refinement-preferring CQM will assign a better quality score to refinements of $\Gamma$ than to $\Gamma$ itself, and a coarsening-preferring CQM will assign better quality scores to $\Gamma$ than to its refinements. These CQMs do not satisfy the richness property.