-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathmulticut.tex
127 lines (105 loc) · 9.08 KB
/
multicut.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
\section{Multicut}
\label{sec:multicut}
The multicut problem~\cite{chopra1993partition} (also known as correlation clustering~\cite{demaine2006correlation}) is to cluster graph nodes based on edge preferences.
\begin{definition}[Multicut]
Given an undirected weighted graph $G=(V,E,c)$ the multicut problem is to find a partition $\Pi = (\Pi_1,\ldots,\Pi_k)$ such that $\Pi_i \cap \Pi_j = \varnothing$ for $i \neq j$ and $\dot{\cup}_{i\in[k]} \Pi_j = V$ where $k$ is computed as part of the optimization process.
The \emph{cut} $\delta(\Pi_1,\ldots,\Pi_k)$ induced by a decomposition is the subset of those edges that straddle distinct clusters.
The space of all multicuts is denoted as
\begin{equation}
\mathcal{M}_G = \left\{ \delta(V_1,\ldots,V_k) :
\begin{array}{c}
k \in \N \\
% V_i \cap V_j = \varnothing \quad \forall i\neq j\\
V_1 \dot\cup \ldots \dot\cup V_k = V
\end{array} \right\}\,.
\end{equation}
The minimum cost multicut problem is
\begin{equation}
\label{eq:multicut}
\min_{y \in \mathcal{M}_G} \la c, x \ra\,.
\end{equation}
\end{definition}
A multicut illustration can be seen in Figure~\ref{fig:multicut-illustration}
\begin{figure}[H]
\centering
\scalebox{0.7}{ \includestandalone{figures/multicut} }
\caption{
An exemplary multicut on a grid graph.
The multicut consists of three partitions induced by the cut edges (red, dashed).
}
\label{fig:multicut-illustration}
\end{figure}
\subsection{File Format}
The multicut problem is given in the text file format as follows:
\begin{fileformat}
MULTICUT
(*$i_1$*) (*$j_1$*) (*$c_1$*)
.
.
.
(*$i_m$*) (*$j_m$*) (*$c_m$*)
\end{fileformat}
where $E = \{i_1 j_1, \ldots, i_m j_m\}$ and $c_1,\ldots,c_m$ are the corresponding edge weights.
\subsection{Benchmark datasets}
\subsubsection{CREMI}
The CREMI-challenge~\cite{cremi} is to group voxels from 3D-volumes of fruit-fly brain matter together whenever they belong to the same neuron.
The raw image data was acquired by~\cite{zheng2018complete} and converted to multiple multicut instances as detailed below.
\paragraph{Superpixel\footnote{\url{https://keeper.mpdl.mpg.de/f/811b88d4c97644d39ea9/?dl=1}}}
For converting the data into multicut instances the authors of~\cite{pape2017solving} first created super-pixels and then computed affinities between these for estimating probabilities that superpixels belong to the same neuron.
Instances are different crops of one global problem.
There are 3 small ($400000-600000$ edges), 3 medium ($4-5$ million edges) and 5 large ($28-650$ million edges) multicut instances.
\paragraph{Raw\footnote{\url{https://keeper.mpdl.mpg.de/f/3916d2da6aa840139206/?dl=1}}}
Multicut instances are derived directly from the voxel grid without conversion to superpixels.
Three test volumes sample A+, B+ and C+ from~\cite{cremi} were used.
Edge weights are computed by~\cite{torch_EM}.
There are two types of instances:
(i)~The three full problems where the underlying volumes have size $1250 \times 1250 \times 125$ with around $700$ million edges and
(ii)~six cropped problems created by halving each volume and creating the corresponding multicut instances each containing almost $340$ million edges.
\subsubsection[Cityscapes Instance Segmentation]{Cityscapes Instance Segmentation\footnote{\url{https://keeper.mpdl.mpg.de/f/80686a004ff84d96aaeb/?dl=1}}}
Unsupervised image segmentation on $59$ high resolution images ($2048 \times 1024$) taken from the validation set~\cite{cordts2016cityscapes}.
Conversion to multicut instances is done by computing the edge affinities produced by~\cite{abbas2021combinatorial} on a grid graph with $4$-connectivity and additional coarsely sampled longer range edges.
Each instance contains approximately $2$ million nodes and $9$ million edges.
\subsubsection{OpenGM}
The OpenGM benchmark~\cite{kappes2015comparative} contains instances of the multicut problem from several applications.
These instances are stored in a more general graphical model format.
For convenience we provide these instances here in the multicut file format as described above.
\paragraph{Natural image segmentation\footnote{\url{https://keeper.mpdl.mpg.de/f/7af922a52a354896a5d7/?dl=1}}}
Instances of the multicut problem are constructed from natural images from the Berkley segmentation dataset~\cite{MartinFTM01} as described in~\cite{andres2011closedness}.
First, superpixels are computed by a watershed over-segmentation and a graph is constructed with one node for each superpixel and edges between adjacent superpixels.
Then, edge costs are estimated by a random forest that is trained on manually annotated data.
There are 100 instances with $156-3764$ nodes and $439-10970$ edges.
\paragraph*{Knott3D\footnote{\url{https://keeper.mpdl.mpg.de/f/822134874a8f4c2aa0a9/?dl=1}}}
Instance of the multicut problem are constructed from volume images of an adult mouse somatosensory cortex, acquired by~\cite{knott2008serial}, as described in~\cite{andres2012globally}.
For this, superpixel graphs are computed with a combination of a random forest trained on manually annotated data and a watershed over-segmentation.
Edge costs are also estimated by a random forest.
There are three types of instances corresponding to different volume sizes.
In total there are 24 instances with $571-17073$ nodes and $3381-107060$ edges.
\paragraph*{Modularity clustering\footnote{\url{https://keeper.mpdl.mpg.de/f/33bb96c32ae04395a1e7/?dl=1}}}
Modularity is a measure for quality of a clustering of the nodes of an undirected graph.
Modularity clustering is the problem of finding a clustering of maximal modularity.
The modularity clustering problem can be formulated as a multicut problem on a complete graph~\cite{brandes2008modularity}.
The OpenGM benchmark contains instances of the modularity clustering problem for six graphs that are publicly available.
The sources of these six graphs as well as additional instances of the modularity clustering problem can be found in~\cite{cafieri2011locally}.
The six instances contain $34-115$ nodes and $561-6555$.
\subsubsection[Bird sound clustering]{Bird sound clustering\footnote{\url{https://keeper.mpdl.mpg.de/f/4803b50f0ab14e75b01c/?dl=1}}}
A Siamese network is trained by~\cite{stein2023correlation} to estimate wether the same species of bird can be heard in two different sound recordings.
Based on these estimates a multicut problem on the complete graph is solved to cluster sound recordings.
The names of the instances in this dataset indicate different settings: ``2s''/``3s'' indicates whether the data points are 2 or 3 second chunks of bird sound recordings; ``small''/``big'' indicates whether the dataset with 23 or 68 species is used; ``with-augments''/``no-augments'' indicates whether data augmentation is used for training; the rest of the file name indicates the underlying dataset as described in~\cite{stein2023correlation}.
In total there are $30$ instance with $3326-22252$ nodes and $5529475-247564626$ edges.
\subsection{Algorithms}
\begin{description}
\item[CGC~\cite{beier2014cut}:] Cut, Glue \& Cut, a heuristic that alternatingly bipartitions the graph and exchanges nodes in pairs of clusters via max-cut computed by a reduction to perfect matching.
\item[Fusion Moves~\cite{beier2017multicut}:] Explore subspaces of multicuts generated randomly via solving it with ILP-solvers.
\item[GAEC \& KLj~\cite{keuper2015efficient}:] The greedy additive edge contraction (GAEC) iteratively takes the most attractive edge and contracts it until no attractive edge is present.
Kernighan \& Lin with joins (KLj) computes sequences of node exchanges between clusters and cluster joins that improve the objective.
\item[Balanced Edge Contraction~\cite{kardoost2018solving}:] Like GAEC but additionally preferring edges with endpoints that contain fewer original nodes.
\item[Greedy Edge Fixation~\cite{levinkov2019comparative}:] Like GAEC but additionally prevent contractions on certain repulsive edges.
\item[Multi-stage Multicuts (MSM)~\cite{ho2021msm}:] Solve multiple minimum cost multicut problems across distributed compute units.
\item[Benders decomposition~\cite{lukasik2020benders}:] A Benders decomposition algorithm with node subproblems solved in parallel and accelerated through Magnanti-Wong Benders rows.
\item[RAMA~\cite{abbas2021combinatorial}:] Primal/dual algorithm using parallel edge contraction and fast separation/message passing for computing lower bounds and reduced costs.
\item[Message Passing~\cite{swoboda2017message}:] Sequential separation/message passing.
\item[Cycle Packing \& Persistency~\cite{lange2018partial}:] Fast cycle separation with interleaved greedy cycle packing for obtaining dual lower bounds. Simple persistency criteria for fixing variables to optimal values without optimizing the whole problem.
\item[Combinatorial Persistency Criteria~\cite{lange2019combinatorial}:] A number of persistency criteria for fixing variable to their optimal values ranging from easy enumerative to more involved ones which require optimization of larger subproblems.
\item[LP-rounding~\cite{demaine2006correlation}:]
Solving an LP-relaxation and subsequent region growing rounding with approximation guarantees for the correlation clustering formulation.
\end{description}