diff --git a/tex/main.tex b/tex/main.tex
index 27131cf..1ec45ae 100644
--- a/tex/main.tex
+++ b/tex/main.tex
@@ -34,6 +34,8 @@
 \newcommand{\epstol}{\mathbf{u}}
 \newcommand{\cactive}{\mathcal{B}}
 \newcommand{\cinactive}{\mathcal{N}}
+\newcommand{\ma}[1]{{\color{red} Mihai: #1}}
+
 \title{Condensed-space methods for nonlinear programming on GPUs}

 \author{François Pacaud \and
diff --git a/tex/sections/conditioning.tex b/tex/sections/conditioning.tex
index b15fcdc..3afbdf2 100644
--- a/tex/sections/conditioning.tex
+++ b/tex/sections/conditioning.tex
@@ -31,28 +31,29 @@ \section{Conditioning of the condensed KKT system}
 by exploiting the structured ill-conditioning of the condensed matrix $K$.
 We base our analysis on \cite{wright1998ill}, where the author
 has put a particular emphasis on the condensed KKT
-system~\eqref{eq:kkt:condensed} without equality constraints. We generalize her results to the
+system~\eqref{eq:kkt:condensed} without equality constraints. We generalize their results to the
 matrix $K_\gamma$, which incorporates both equality and inequality constraints.
 The results extend directly to $K_\tau$ (by letting the number of equalities be zero).
-To ease the notation, we suppose that the primal variable
-$x$ is unconstrained, leaving only a bounded slack variable $s$ in \eqref{eq:slack_problem}.
+To simplify the notation, we suppose that the primal variable
+$x$ is unconstrained \ma{Not sure why we needed bound constraints on $x$ in the first place}, leaving only a bounded slack variable $s$ in \eqref{eq:slack_problem}.
 This is equivalent to including the bounds on the variable $x$ in the
 inequality constraints, as $\bar{h}(x) \leq 0$ with $\bar{h}(x) := (h(x), -x)$.

 \subsection{Centrality conditions}
-We start the discussion by recalling important results porting
-on the iterates of the interior-point algorithm.
-Let $p = (x, s, y, z)$ the current primal-dual iterate (the bound multiplier $v$ is considered apart),
-and $p^\star$ a solution of the KKT conditions~\eqref{eq:kktconditions}. We suppose
+We start the discussion by recalling important results about the iterates of the interior-point algorithm.
+Let $p = (x, s, y, z)$ be the current primal-dual iterate (the bound multiplier $v$ is treated separately),
+and $p^\star$ be a solution of the KKT conditions~\eqref{eq:kktconditions}. We suppose
 that the solution $p^\star$ is regular and satisfies the following assumptions.
-We denote $\cactive = \cactive(x^\star)$ the active-set at the optimal solution $x^\star$,
+We denote by $\cactive = \cactive(x^\star)$ the active set at the optimal solution $x^\star$,
+\ma{As I noted in the beginning: it does not make sense to denote the active set as you did unless you have no inequality constraints on $x$. Maybe you want to define the active set here, since here is where you use it, after you make the assumption that you have no inequality constraints on $x$}
 and $\cinactive = \cinactive(x^\star)$ the inactive set.
 % Sungho: not sure if this should be introduced here.
 \begin{assumption}
 \label{hyp:ipm}
 Let $p^\star = (x^\star, s^\star, y^\star, z^\star)$ be a primal-dual solution
+ \ma{Most people would include the $v^\star$ in the primal-dual solution}
 satisfying the KKT conditions~\eqref{eq:kktconditions}.
 Let the following hold:
 \begin{itemize}
 \item Continuity: The Hessian $\nabla^2_{x x} L(\cdot)$ is Lipschitz continuous
@@ -67,7 +68,7 @@ \subsection{Centrality conditions}
 We denote $\delta(p, v) = \| (p, v) - (p^\star, v^\star) \|$ the Euclidean distance to
 the primal-dual stationary point $(p^\star, v^\star)$.
 From \cite[Theorem 2.2]{wright2001effects}, if Assumption~\ref{hyp:ipm}
-holds at $p^\star$ and $v > 0$,
+holds at $p^\star$ and $v > 0$ \ma{Pretty sure that is true only locally},
 \begin{equation}
 \delta(p, v) = \Theta\left( \left\Vert \begin{bmatrix} \nabla_p L(p, v) \\ \min(v, s)
@@ -93,9 +94,9 @@ \subsection{Centrality conditions}
 \end{align}
 \end{subequations}
 for some constants $C > 0$ and $\alpha \in (0, 1)$.
-Conditions~\eqref{eq:centralitycond:complement} ensures that the products
+Conditions~\eqref{eq:centralitycond:complement} ensure that the products
 $s_i v_i$ are not too disparate in the diagonal term $D_s$.
-This condition is satisfied (despite rather loosely)
+This condition is satisfied (even if rather loosely)
 in the solver Ipopt (see \cite[Equation (16)]{wachter2006implementation}).

 \begin{proposition}[\cite{wright2001effects}, Lemma 3.2 and Theorem 3.3]
@@ -136,10 +137,10 @@ \subsubsection{Invariant subspaces in $K_\gamma$}
 s_{min} = \min_{i \in \cactive} s_i \; , \quad
 s_{max} = \max_{i \in \cactive} s_i \; .
 \end{equation}
-We remind that $m_e$ is the number of equality constraints,
+We recall that $m_e$ is the number of equality constraints,
 and define $\ell := m_e + m_a$.

-We explicit the structured ill-conditioning of $K_\gamma$ by
+We express the structured ill-conditioning of $K_\gamma$ by
 modifying the approach outlined in \cite[Theorem 3.2]{wright1998ill}
 to account for the additional term $\gamma G^\top G$ arising from the equality constraints.
 We show that the matrix $K_\gamma$ has two invariant subspaces
diff --git a/tex/sections/introduction.tex b/tex/sections/introduction.tex
index 9a5d444..e766d21 100644
--- a/tex/sections/introduction.tex
+++ b/tex/sections/introduction.tex
@@ -3,7 +3,7 @@ \section{Introduction}
 GPUs offer two practical advantages: (1) massive parallel computing capability for applications that can exploit coarse-grain parallelism and high-memory bandwidth and (2) power efficiency due to requiring fewer transistors to process multiple tasks in parallel utilizing ``single instruction, multiple data'' (SIMD) parallelism.
 While GPUs have made significant strides in enhancing machine learning applications, their adoption in the mathematical programming community has been relatively limited.
-This limitation stems primarily from the fact that most optimization solvers were developed in the 1990s and are heavily optimized for CPU architectures.
+This limitation stems primarily from the fact that most second-order methods for constrained optimization rely on a form of Newton's method that uses direct linear algebra, since finding good iterative solvers for the Newton direction has proved elusive.
 Additionally, the utilization of GPUs has been impeded by the challenges associated with sparse matrix factorization routines, which are inherently difficult to parallelize on SIMD architectures.
 Nevertheless, recent years have witnessed notable advancements that are reshaping this landscape.
 \begin{enumerate}
 \item \textbf{Improved sparse matrix operations}: The performance of sparse matrix operations has seen substantial improvements in the CUDA library, largely attributed to the integration of novel tensor cores in recent GPUs~\cite{markidis2018nvidia}.
@@ -13,7 +13,7 @@ \section{Introduction}
 \end{enumerate}

 \subsection{Solving optimization problems on GPU: current state-of-the-art}
-For all the reasons listed before, there is an increasing interest to solve optimization problems on GPUs.
+For all the reasons listed above, there is increasing interest in solving optimization problems on GPUs.
 We now summarize the previous work on solving classical---large-scale, sparse, constrained---mathematical programs on GPUs.

 \paragraph{GPU for mathematical programming.}
@@ -49,7 +49,7 @@ \subsection{Solving optimization problems on GPU: current state-of-the-art}
 \paragraph{GPU for nonlinear programming.}
 The success of first-order algorithms in classical mathematical programming relies on the convexity of the problem.
 Thus, this approach is nontrivial to replicate
-in nonlinear programming: Most engineering problems encode complex
+in general nonlinear programming: Most engineering problems encode complex
 physical equations that are likely to break any convex structure in the problem.
 Previous experiments on the alternating current (AC) optimal power flow (OPF) problem
 have shown that even a simple algorithm such as the alternating direction method of
 multipliers (ADMM) has trouble converging as soon as the convergence
@@ -68,7 +68,7 @@ \subsection{Solving optimization problems on GPU: current state-of-the-art}
 % SS: removed due to irrelevance.
 % The new NVIDIA Grace CPU architecture could also be a game changer in the future of sparse linear solvers, thanks to fast communication between the CPU and GPU.
 Furthermore, variants of interior point methods have been proposed
-that does not depend on numerical pivoting in the linear solves,
+that do not depend on numerical pivoting in the linear solves,
 opening the door to parallelized sparse solvers.
 Coupled with a GPU-accelerated automatic differentiation library and a sparse
 Cholesky solver, these nonlinear programming solvers can solve
@@ -90,7 +90,7 @@ \subsection{Solving optimization problems on GPU: current state-of-the-art}
 the interior-point methods, with promising results on the GPU.
 On the other hand, null-space methods (also known as reduced Hessian methods)
 reduce the KKT system down to a dense matrix, a setting also favorable for GPUs.
-Our previous research has shown that the method plays nicely with the interior-point
+Our previous research has shown that the approach is suitable for interior-point
 methods if the number of degrees of freedom in the problem remains relatively small~\cite{pacaud2022condensed}.
diff --git a/tex/sections/ipm.tex b/tex/sections/ipm.tex
index 5a208ab..81a31aa 100644
--- a/tex/sections/ipm.tex
+++ b/tex/sections/ipm.tex
@@ -76,7 +76,7 @@ \subsection{Problem formulation and KKT conditions}
 The notation $x \perp u$ is a shorthand for the complementarity
 condition $x_i u_i = 0$ (for all $i=1,\cdots, n$).
-The set of active constraints at a point $x$ is denoted by
+The set of active constraints at a point $x$ is denoted by \ma{This is odd. Don't you need to worry about the activity of $x$ itself?}
 \begin{equation}
 \mathcal{B}(x) := \{ i \in\{ 1, \cdots, m_i\} \; | \; h_i(x) = 0 \} \; .
 \end{equation}
@@ -174,7 +174,7 @@ \subsection{Solving the KKT conditions with the interior-point method}
 In addition, we define $X_k$, $S_k$, $U_k$ and $V_k$ as the diagonal matrices built respectively from the vectors $x_k$, $s_k$, $u_k$ and $v_k$.
 Note that~\eqref{eq:kkt:unreduced} can be symmetrized by performing simple block row and column operations.
-In what follows, we will omit the index $k$ to alleviate the notations.
+In what follows, we will omit the index $k$ to simplify the notation.

 \paragraph{Augmented KKT system.}
 It is usual to remove in \eqref{eq:kkt:unreduced} the blocks associated
@@ -210,7 +210,7 @@ \subsection{Solving the KKT conditions with the interior-point method}
 $r_2 := z - \mu S^{-1} e$,
 $r_3 := g(x)$,
 $r_4 := h(x) + s$.
-Once \eqref{eq:kkt:augmented} solved, we recover the updates on bound multipliers with
+Once \eqref{eq:kkt:augmented} is solved, we recover the updates on the bound multipliers with
 $d_u = - X^{-1}(U d_x - \mu e) - u$ and
 $d_v = - S^{-1}(V d_s - \mu e) - v$.
@@ -221,6 +221,7 @@ \subsection{Solving the KKT conditions with the interior-point method}
 if and only if the Jacobian $J = \begin{bmatrix} G \; &\; 0 \\ H \;&\; I \end{bmatrix}$
 is full row-rank and the matrix $\begin{bmatrix} W + D_x & 0 \\ 0 & D_s \end{bmatrix}$
 projected onto the null-space of the Jacobian $J$ is positive-definite~\cite{benzi2005numerical}.
+\ma{I do not think that is an if-and-only-if condition. The Jacobian needs to be full rank, but the other matrix need not be positive definite in the projection for the system to be nonsingular}
 The condition is satisfied if the inertia (defined as the respective numbers of positive,
 negative and zero eigenvalues) of the matrix~\eqref{eq:kkt:augmented} is $(n + m_i, m_i + m_e, 0)$.
 We use the inertia-controlling method introduced in \cite{wachter2006implementation}
@@ -231,7 +232,7 @@ \subsection{Solving the KKT conditions with the interior-point method}
 an inertia-revealing \lblt factorization~\cite{duff1983multifrontal}.
 Krylov methods are often not competitive when solving~\eqref{eq:kkt:augmented},
 as the block diagonal terms $D_x$ and $D_s$ are getting increasingly
-ill-conditioned near the solution. Their use in IPM has been limited in
+ill-conditioned near the solution. Their use in IPM has been limited to
 linear and convex quadratic programming \cite{gondzio-2012}
 (when paired with a suitable preconditioner).
 We also refer to \cite{cao2016augmented} for an efficient implementation of a preconditioned conjugate gradient
@@ -289,7 +290,7 @@ \subsection{Solving the KKT conditions with the interior-point method}
 the elements in the diagonal tend to infinity if a variable converges to its bound,
 and to $0$ if the variable is inactive.
 To address the numerical error arising from such ill-conditioning, most of the
-implementation of IPM employs Richardson iterations on the original system~\eqref{eq:kkt:unreduced} to refine the solution returned by the direct sparse linear solver (see \cite[Section 3.10]{wachter2006implementation}).
+implementations of IPM employ Richardson iterations on the original system~\eqref{eq:kkt:unreduced} to refine the solution returned by the direct sparse linear solver (see \cite[Section 3.10]{wachter2006implementation}).
 \subsection{Discussion}
@@ -303,10 +304,10 @@ \subsection{Discussion}
 The system~\eqref{eq:kkt:augmented} is usually factorized using a \lblt factorization:
 for sparse matrices, the Duff and Reid multifrontal algorithm~\cite{duff1983multifrontal} is the favored method
 (as implemented in the HSL linear solvers MA27 and MA57~\cite{duff2004ma57}).
 The condensed KKT system~\eqref{eq:kkt:condensed} is often discarded,
-as its conditioning is higher
+as its conditioning is worse
 than \eqref{eq:kkt:augmented} (implying less accurate solutions).
 Additionally, condensation may result in increased fill-in within the condensed system \eqref{eq:kkt:condensed}~\cite[Section 19.3, p.571]{nocedal_numerical_2006}.
-In the worst cases \eqref{eq:kkt:condensed} itself may become fully dense if an inequality row is completely dense (fortunately, a case rarer than in the normal equations used in linear programming).
+In the worst cases, \eqref{eq:kkt:condensed} itself may become fully dense if an inequality row is completely dense (fortunately, a case rarer than in the normal equations used in linear programming \ma{Do you really mean normal equations as in regression or do you mean equations commonly encountered in linear programming?}).
 Consequently, condensation methods are not commonly utilized in practical optimization settings.
 To the best of our knowledge, Artelys Knitro~\cite{waltz2006interior} is the only solver that supports computing the descent direction with \eqref{eq:kkt:condensed}.
diff --git a/tex/sections/kkt_systems.tex b/tex/sections/kkt_systems.tex
index c8479d5..49a5495 100644
--- a/tex/sections/kkt_systems.tex
+++ b/tex/sections/kkt_systems.tex
@@ -106,7 +106,7 @@ \subsection{Equality relaxation strategy: LiftedKKT}
 \subsection{Discussion}
 We have introduced two algorithms to solve
-KKT systems on the GPU. On the contrary to classical implementations,
+KKT systems on the GPU. As opposed to classical implementations,
 the two methods do not require computing a sparse \lblt factorization of
 the KKT system and use instead alternate reformulations based on the
 condensed KKT system~\eqref{eq:kkt:condensed}. Both strategies
diff --git a/tex/sections/numerics.tex b/tex/sections/numerics.tex
index 945787c..3613404 100644
--- a/tex/sections/numerics.tex
+++ b/tex/sections/numerics.tex
@@ -17,7 +17,7 @@ \subsection{Implementation}
 We have implemented the two condensed-space methods in our nonlinear IPM solver MadNLP~\cite{shin2021graph}.
 This implementation utilizes the abstraction {\tt AbstractKKTSystem} in MadNLP to represent the various formulations of the KKT linear systems.
-MadNLP can deport most of the IPM algorithm to the GPU, except for basic IPM operations used for the coordination (e.g., the filter line-search algorithm).
+MadNLP can offload most of the IPM algorithm to the GPU, except for basic IPM operations used for coordination (e.g., the filter line-search algorithm).
 In particular, any operation that involves the manipulation of array entries is performed by GPU kernels without transferring data to host memory.
 We refer to \cite{shin2023accelerating} for a detailed description of the GPU implementation in MadNLP.
@@ -383,7 +383,7 @@ \subsection{Benchmark on OPF instances}
 \subsection{Benchmark on COPS instances}
 \label{sec:num:cops}
 We have observed in the previous section that both LiftedKKT
-and HyKKT outperforms HSL MA27 when running on the GPU.
+and HyKKT outperform HSL MA27 when running on the GPU.
 However, the OPF instances are specific nonlinear instances.
 For that reason, we complement our analysis by looking at the performance of LiftedKKT and HyKKT on large-scale COPS instances~\cite{dolan2004benchmarking}.
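
As an illustration of the refinement step discussed in the ipm.tex hunk above: the diagonal blocks $D_x$ and $D_s$ become increasingly ill-conditioned near the solution (entries tending to infinity for variables converging to their bound and to zero for inactive ones), which is why most IPM implementations apply Richardson iterations to refine the direction returned by the direct solver. The NumPy sketch below is only a toy model under illustrative assumptions: the matrix, the barrier value mu, and the single-precision solve standing in for an inexact sparse factorization are invented here, this is not the MadNLP or Ipopt implementation, and for simplicity it refines the same matrix it factorizes rather than the unreduced system used in practice.

import numpy as np

# Hedged sketch: iterative refinement (Richardson iterations) on a system whose
# diagonal mimics the ill-conditioned barrier terms described above. Entries
# blow up for "slacks" approaching their bound and vanish for inactive ones,
# so their ratio grows like 1/mu**2 as the barrier parameter mu shrinks.
rng = np.random.default_rng(0)
n, mu = 50, 1e-3                                    # mu is an illustrative value

active = rng.random(n) < 0.3                        # which toy slacks are "active"
d = np.where(active, 1.0 / mu, mu)                  # huge vs. tiny diagonal entries
A = np.diag(d) + 1e-5 * rng.standard_normal((n, n)) # ill-conditioned test matrix
x_true = rng.standard_normal(n)
b = A @ x_true

# Emulate an inexact direct solve by solving in single precision
# (a real solver would reuse one factorization of the sparse KKT matrix).
A32 = A.astype(np.float32)
def inexact_solve(r):
    return np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)

x = inexact_solve(b)
for k in range(5):
    r = b - A @ x                                   # residual in double precision
    print(f"refinement {k}: relative residual {np.linalg.norm(r) / np.linalg.norm(b):.2e}")
    x += inexact_solve(r)                           # one Richardson correction step

With a much smaller mu the inexact solve no longer defines a contraction and the refinement stalls, which is one way to see why the conditioning of these systems warrants the dedicated analysis carried out in conditioning.tex.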