- Given: $N$ unlabeled examples $\{x_1,\cdots, x_N\}$ and the number of desired partitions $K$.
- Goal: Group the examples into $K$ "homogeneous" partitions.
Def.
Given a set of observations $X = \{x_1, x_2, \cdots, x_N\}$ ($x_i\in\mathbb{R}^D$), partition the $N$ observations into $K$ sets ($K \le N$) $\{\mathcal{C}_k\}_{k=1,\cdots,K}$ such that the sets minimize the within-cluster sum of squares: $$ \{\mathcal{C}_k\}=\arg\min_{\{\mathcal{C}_k\}}\sum_{k=1}^K\sum_{x\in\mathcal{C}_k}\Vert x-\mu_k\Vert^2 $$ where $\mu_k$ is the mean of the points in set $\mathcal{C}_k$.
Algorithm
- (Re)-assign each example $x_{i}$ to its closest cluster center (based on the smallest Euclidean distance): $$ \mathcal{C}_{k}=\left\{x_{i} \mid\left\Vert x_{i}-\mu_{k}\right\Vert^{2} \leq\left\Vert x_{i}-\mu_{k^{\prime}}\right\Vert^{2}, \ \forall k^{\prime} \neq k\right\} $$
- ($\mathcal{C}_{k}$ is the set of examples assigned to cluster $k$ with center $\mu_{k}$)
- Update the cluster means:
$$ \mu_{k}=\operatorname{mean}\left(\mathcal{C}_{k}\right)=\frac{1}{\left|\mathcal{C}_{k}\right|} \sum_{x \in \mathcal{C}_{k}} x $$
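The two alternating steps above can be sketched in NumPy (a minimal illustration; function and variable names are my own, not from the source):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking K distinct examples at random
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: squared distance of every point to every center
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        z = d2.argmin(axis=1)                                     # closest center index
        # Update step: each center becomes the mean of its assigned points
        # (keep the old center if a cluster is temporarily empty)
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # converged: assignments no longer change
            break
        mu = new_mu
    return mu, z
```

On well-separated data this converges in a few iterations; in general K-means only finds a local minimum of the objective, so restarts with different seeds are common.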
- Let $z_{i,k}$ be an indicator of whether example $x_i$ is assigned to cluster $k$,
- and $z_i=[z_{i,1},\cdots,z_{i,K}]^T$ the one-hot encoding of $x_i$'s assignment.
- The loss is
$L(\mu, \mathbf{X}, \mathbf{Z})=\sum_{i=1}^{N} \sum_{k=1}^{K} z_{i, k}\left\Vert x_{i}-\mu_{k}\right\Vert^{2}=\Vert\mathbf{X}-\mathbf{Z} \mu\Vert^{2}$
- where $\mathbf{X}\in\mathbb{R}^{N\times D}$, $\mathbf{Z}\in\mathbb{R}^{N\times K}$, $\mu\in\mathbb{R}^{K\times D}$
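The two forms of the loss are equal: since each row of $\mathbf{Z}$ is one-hot, row $i$ of $\mathbf{Z}\mu$ is exactly $\mu_{k}$ for $x_i$'s cluster. A quick numerical check (all names here are illustrative):

```python
import numpy as np

# Verify: sum_i sum_k z_ik ||x_i - mu_k||^2 == ||X - Z mu||^2 (squared Frobenius norm)
rng = np.random.default_rng(0)
N, D, K = 6, 3, 2
X = rng.normal(size=(N, D))
mu = rng.normal(size=(K, D))
z = rng.integers(0, K, size=N)   # hard assignments z_i in {0, ..., K-1}
Z = np.eye(K)[z]                 # (N, K) one-hot matrix

loss_sum = sum(np.sum((X[i] - mu[z[i]]) ** 2) for i in range(N))
loss_mat = np.sum((X - Z @ mu) ** 2)
assert np.isclose(loss_sum, loss_mat)
```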
- Makes hard assignments of points to clusters
- Works well only if the clusters are of roughly equal size
- K-means also works well only when the clusters are roughly spherical, and performs poorly when the clusters have non-convex shapes
- Basic idea: Replace the Euclidean distance/similarity computations in K-means with their kernelized versions
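Expanding $\Vert\phi(x_i)-\mu_k\Vert^2$ with $\mu_k=\frac{1}{|\mathcal{C}_k|}\sum_{x_j\in\mathcal{C}_k}\phi(x_j)$ shows the distance depends only on kernel evaluations: $k(x_i,x_i)-\frac{2}{|\mathcal{C}_k|}\sum_{j\in\mathcal{C}_k}k(x_i,x_j)+\frac{1}{|\mathcal{C}_k|^2}\sum_{j,l\in\mathcal{C}_k}k(x_j,x_l)$. A minimal sketch of that computation, assuming an RBF kernel (function names are illustrative):

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def kernel_distances(Kmat, z, K):
    """Squared distance of each point to each cluster mean in feature space:
    ||phi(x_i) - mu_k||^2 = K_ii - (2/|C_k|) * sum_{j in C_k} K_ij
                                 + (1/|C_k|^2) * sum_{j,l in C_k} K_jl
    """
    N = Kmat.shape[0]
    d2 = np.zeros((N, K))
    for k in range(K):
        idx = np.where(z == k)[0]          # members of cluster k
        n_k = len(idx)
        d2[:, k] = (np.diag(Kmat)
                    - 2.0 / n_k * Kmat[:, idx].sum(axis=1)
                    + Kmat[np.ix_(idx, idx)].sum() / n_k**2)
    return d2
```

One kernel K-means assignment step is then `z = kernel_distances(Kmat, z, K).argmin(axis=1)`; no explicit cluster centers are ever formed, which is what lets the method capture non-convex clusters.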