
WIP: Derivatives interface #38

Merged: 19 commits into osqp:master on Aug 7, 2020
Conversation

@bamos (Collaborator) commented Nov 2, 2019

Hi all -- here's a first attempt at moving some of the derivative code from https://github.com/oxfordcontrol/osqpth/ into here. What do you think of this interface?

If it's good I can also start adding:

Two other things:

  1. Something that may be worth adding to diff_mode is the linear system solver type (direct/LSQR/sparse/dense)
  2. This follows the interface of https://github.com/cvxgrp/diffcp/blob/master/diffcp/cone_program.py -- do you also want to stub in a derivative function here?

cc @sbarratt @akshayka

@bamos requested a review from bstellato on November 2, 2019 18:39
@CLAassistant commented Nov 2, 2019

CLA assistant check: all committers have signed the CLA.

@bstellato (Contributor) commented:

This is great, thanks! Just a couple of comments/suggestions:

  • I would focus on the sparse version first since the solver is anyway sparse. Then we can work out the dense version after we have the sparse one running.

  • I totally agree with having a direct solver (QR decomposition) and indirect solver (LSQR). I am not sure LSQR always works for badly conditioned systems. By the way, if we end up using the functions from the internal C code we will benefit from the problem data scaling in OSQP. This should help the LSQR convergence.

  • What do you think about having the same solve function for osqp and then adding some boolean option to compute the derivatives? I mean something like derivative=True. This would be more coherent with the current osqp interface, where we pass polish=True to perform the final polish step at the end of the ADMM algorithm.

@bamos (Collaborator, Author) commented Nov 4, 2019

  • I totally agree with having a direct solver (QR decomposition) and indirect solver (LSQR). I am not sure LSQR always works for badly conditioned systems. By the way, if we end up using the functions from the internal C code we will benefit from the problem data scaling in OSQP. This should help the LSQR convergence.

How would you like to specify these modes? Maybe we can put the solver into diff_mode as full_qr, full_lsqr?

  • What do you think about having the same solve function for osqp and then adding some boolean option to compute the derivatives? I mean something like derivative=True. This would be more coherent with the current osqp interface, where we pass polish=True to perform the final polish step at the end of the ADMM algorithm.

Yes this sounds good too. How would you like the derivative functions to be returned? Maybe we could add derivative and adjoint_derivative fields to the OSQP_results struct that point to Python functions?

@bamos (Collaborator, Author) commented Nov 4, 2019

Maybe we could add derivative and adjoint_derivative fields to the OSQP_results struct that point to Python functions?

It may be easy to accidentally leak memory if we add pointers to Python functions to the struct, since those Python functions keep data around.

@bstellato (Contributor) commented:

How would you like to specify these modes? Maybe we can put the solver into diff_mode as full_qr, full_lsqr?

Happy to use the parameter diff_mode. We could also just set it to qr and lsqr. The active-set mode could be a special feature we run only in debug (not very robust yet) as active_qr and active_lsqr.

Yes this sounds good too. How would you like the derivative functions to be returned? Maybe we could add derivative and adjoint_derivative fields to the OSQP_results struct that point to Python functions?

It may be easy to accidentally leak memory if we add pointers to Python functions to the struct, since those Python functions keep data around.

I think we need to discuss this a bit more. I would really like to have a solution that does not necessarily depend on Python function pointers. This would enable the use of differentiable OSQP from other languages as well (e.g., Julia).

Here is an idea that goes against what I suggested before (no derivative=True option anymore) but that could work.

We could have lower level functions in C called osqp_derivative and osqp_adjoint_derivative that take the workspace structure as input, check that the problem status is solved and compute the derivative. To start, these functions could be prototyped in Python and then ported to C later.

However, this approach would need the workspace to be stored somewhere in Python. That does not happen with the one-shot solve function (which releases the GIL), but it does happen when we call OSQP in the standard way through the Python object. We can fix this by using the OSQP object methods instead of the one-shot solve and releasing the GIL there (making them thread-unsafe).

@bstellato (Contributor) commented:

I have started creating the differentiable interface from Python. It works from the object interface now. In particular, I have

  • Released GIL on OSQP object functions
  • Removed OSQP one-shot solve function (useless now and repeats code)
  • Added test for 'qr_active' differentiation mode
  • Dropped Python 2 support (next release will not support it)

The tests are now using numdifftools but I would prefer not to install additional packages. Let me know what you think.

results = self._model.solve()

# TODO(bart): this will be unnecessary when the derivative will be in C
self._derivative_cache['results'] = results
@bamos (Collaborator, Author) commented:

Should we also set this to None if update is called and then make sure it's populated when the derivative functions are called? This way somebody can't accidentally change the data and then try to differentiate through an old problem.

Also, it's OK not to support the case where the user wants to solve a batch of problems by calling update on a single OSQP instance and then differentiate through all the solutions, right?

@bstellato (Contributor) replied:

You are definitely right. I was thinking that we will not have this issue later anyway, once we port it to C. I have changed the code to set it to None when the update function gets called.

Review comment:

Can we remove the self._derivative_cache = None line and add the following code to the bottom of the update function? This way the _derivative_cache is updated properly and it removes results from it, meaning the problem hasn't been solved.

        # update problem data in self._derivative_cache
        if q is not None:
            self._derivative_cache["q"] = q

        if l is not None:
            self._derivative_cache["l"] = l

        if u is not None:
            self._derivative_cache["u"] = u

        if Px is not None:
            if Px_idx.size == 0:
                self._derivative_cache["P"].data = Px
            else:
                self._derivative_cache["P"].data[Px_idx] = Px
        
        if Ax is not None:
            if Ax_idx.size == 0:
                self._derivative_cache["A"].data = Ax
            else:
                self._derivative_cache["A"].data[Ax_idx] = Ax

        # delete results from self._derivative_cache to prohibit
        # taking the derivative of unsolved problems
        if "results" in self._derivative_cache.keys():
            del self._derivative_cache["results"]

@bstellato (Contributor) commented:

I have implemented the backward derivative also with the lsqr method. All the tests pass except dA.

It is weird: the derivative does not match the numerically estimated one, but we get the same output for both the lu_active and lsqr methods. There is probably an issue either in the math or in the numerical estimation.

@sbarratt commented Nov 8, 2019

The tests are now using numdifftools but I would prefer not to install additional packages. Let me know what you think.

Could we use check_grad? (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.check_grad.html)
This is part of scipy, so should already be included.
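As a sketch of the suggestion, here is a minimal use of check_grad on a toy quadratic (the function is an illustration, standing in for the actual OSQP solution map):

```python
# check_grad returns the 2-norm of the difference between an analytic gradient
# and a finite-difference estimate; a value near zero means they agree.
import numpy as np
from scipy.optimize import check_grad

def f(x):
    return 0.5 * x @ x          # toy objective

def grad_f(x):
    return x                    # its analytic gradient

err = check_grad(f, grad_f, np.array([1.0, -2.0, 3.0]))
print(err)                      # small, on the order of the finite-difference step
```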

Comment on lines 350 to 352
dA_vals = \
y_u[rows] * r_x[cols] + y_u[rows] * (r_yu[rows] * x[cols]) - \
(y_l[rows] * r_x[cols] + y_l[rows] * (r_yl[rows] * x[cols]))
Review comment:

Can this just be:

dA_vals = (y_u[rows] - y_l[rows]) * r_x[cols] + \
    (y_u[rows] * r_yu[rows] - y_l[rows] * r_yl[rows]) * x[cols]

@bstellato (Contributor) replied:

Good point. Fixed. Looks neater now.
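For the record, a quick numerical check with random stand-in vectors (the real ones come from the solver cache) confirms the factored expression is identical to the original three-term one:

```python
# Verify the simplified dA_vals expression matches the original on random data.
# rows/cols index the nonzero pattern of A.
import numpy as np

rng = np.random.default_rng(0)
m, n, nnz = 6, 4, 10
rows = rng.integers(0, m, nnz)
cols = rng.integers(0, n, nnz)
y_u, y_l, r_yu, r_yl = (rng.standard_normal(m) for _ in range(4))
x, r_x = (rng.standard_normal(n) for _ in range(2))

original = y_u[rows] * r_x[cols] + y_u[rows] * (r_yu[rows] * x[cols]) - \
    (y_l[rows] * r_x[cols] + y_l[rows] * (r_yl[rows] * x[cols]))
factored = (y_u[rows] - y_l[rows]) * r_x[cols] + \
    (y_u[rows] * r_yu[rows] - y_l[rows] * r_yl[rows]) * x[cols]
print(np.allclose(original, factored))  # True
```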


A = self._derivative_cache['A']
l, u = self._derivative_cache['l'], self._derivative_cache['u']

results = self._derivative_cache['results']
Review comment:

Can we do:

try:
    results = self._derivative_cache['results']
except KeyError:
    raise ValueError("Problem has not been solved. You cannot take derivatives. Please call the solve function.")

@bstellato (Contributor) replied:

Fixed both. I did not spend too much time fixing these issues since I think they will be much easier to deal with from the C code, where the cache will correspond to our workspace.

@sbarratt commented Nov 9, 2019

I double checked and the math seems right. Here's a write-up:
diff_osqp.pdf

@bstellato (Contributor) commented:

Tests should be working now (at least from my machine) for all the adjoint derivatives. I ended up using approx_fprime from scipy and tuning its precision accordingly.

I think we should add the forward derivative and the multithreaded code from this interface directly (just as it happens for diffcp).
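The approx_fprime pattern looks roughly like this (the toy function and tolerance are illustrative; the actual tests tune epsilon to the solver accuracy):

```python
# approx_fprime computes a forward-difference gradient with an explicit step
# size epsilon, which is what makes its precision tunable.
import numpy as np
from scipy.optimize import approx_fprime

def f(x):
    return float(np.sum(x ** 2))

x0 = np.array([1.0, 2.0, 3.0])
g_num = approx_fprime(x0, f, 1e-7)   # forward differences with step 1e-7
g_true = 2.0 * x0                    # analytic gradient for comparison
print(np.max(np.abs(g_num - g_true)))  # roughly the size of the step
```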

@bstellato changed the title from "WIP proposal for solve_and_derivative interface" to "WIP: Derivatives interface" on Nov 9, 2019
@bamos (Collaborator, Author) commented Nov 28, 2019

I just quickly added back the full sparse LU option too. I think we'll need to tune the LSQR settings a bit to make sure it converges for most problems; here's an ill-conditioned problem where LSQR returns early but the direct solver gives something reasonable: t.pkl.zip

import pickle as pkl
import torch
from osqpth.osqpth import OSQP

Q_idx, Q_shape, A_idx, A_shape, Q_data, p, A_data, l, u = pkl.load(open('/private/home/bda/repos/optnet/sudoku/t.pkl', 'rb'))
p.requires_grad_()

y_lsqr = OSQP(Q_idx, Q_shape, A_idx, A_shape, diff_mode='lsqr')(
    Q_data, p, A_data, l, u
)
print('==== lsqr')
print(torch.autograd.grad(y_lsqr[0,0], p)[0])

y_lu = OSQP(Q_idx, Q_shape, A_idx, A_shape, diff_mode='lu')(
    Q_data, p, A_data, l, u
)
print('===== lu')
print(torch.autograd.grad(y_lu[0,0], p)[0])

Output:

==== lsqr
 
LSQR            Least-squares solution of  Ax = b
The matrix A has      272 rows  and      272 cols
damp = 0.00000000000000e+00   calc_var =        0
atol = 1.00e-08                 conlim = 1.00e+08
btol = 1.00e-08               iter_lim =      544
 
   Itn      x[0]       r1norm     r2norm   Compatible    LS      Norm A   Cond A
     0  0.00000e+00   1.000e+00  1.000e+00    1.0e+00  5.8e-01
     1  4.70547e-03   9.920e-01  9.920e-01    9.9e-01  1.0e+05   4.6e+00  1.0e+00
     2  4.71200e-03   9.920e-01  9.920e-01    9.9e-01  7.9e-09   1.0e+08  2.2e+07
 
LSQR finished
The least-squares solution is good enough, given atol     
 
istop =       2   r1norm = 9.9e-01   anorm = 1.0e+08   arnorm = 7.9e-01
itn   =       2   r2norm = 9.9e-01   acond = 2.2e+07   xnorm  = 2.7e-02
 
tensor([[-4.7120e-03,  9.1007e-18,  7.7816e-18,  9.2699e-18,  9.8331e-18,
          1.1310e-17,  9.2798e-18,  9.8769e-18,  9.1475e-18,  8.3885e-18,
          7.9731e-18,  1.0012e-17,  9.4575e-18,  9.7441e-18,  8.5112e-18,
          9.1525e-18,  8.9611e-18,  1.0282e-17,  8.6824e-18,  8.2783e-18,
          7.9033e-18,  6.1820e-18,  8.8372e-18,  8.7393e-18,  8.5133e-18,
          8.4072e-18,  9.3348e-18,  6.6223e-18,  8.8388e-18,  7.8768e-18,
          7.1385e-18,  7.4183e-18,  8.0707e-18,  7.9934e-18,  7.0882e-18,
          8.7638e-18,  8.5200e-18,  9.0570e-18,  8.7410e-18,  8.4229e-18,
          8.2337e-18,  5.8012e-18,  8.3921e-18,  6.9027e-18,  9.5226e-18,
          9.1675e-18,  7.7723e-18,  8.3988e-18,  7.4415e-18,  7.5445e-18,
          8.8055e-18,  9.4669e-18,  8.3989e-18,  8.3193e-18,  9.0031e-18,
          7.4605e-18,  1.0165e-17,  7.0037e-18,  8.8694e-18,  9.6764e-18,
          9.5390e-18,  9.1936e-18,  9.7664e-18,  1.0186e-17]], device='cuda:0',
       dtype=torch.float64)
===== lu
tensor([[-1.2817e+00,  1.7240e-04, -7.5087e-05,  7.0880e-02, -2.5255e-02,
          2.9073e-04, -1.2829e+00,  8.6578e-04, -5.7333e-01, -7.3684e-01,
         -1.5084e-01,  3.2826e-01, -3.7487e-01,  6.1822e-01,  2.9244e-01,
         -1.5862e-01,  4.6635e-01,  2.0958e-01,  1.9206e-01,  1.5975e-01,
         -3.8013e-03, -8.8888e-02, -4.0045e-01,  1.8145e-01,  5.0644e-01,
         -4.2531e-01, -2.3564e-02,  4.9506e-01,  6.2890e-01, -5.0957e-01,
         -7.4680e-01, -5.2668e-02, -1.4794e-02,  3.4635e-01,  1.6762e-05,
          1.1221e-01,  4.5385e-01,  6.7591e-02, -5.5885e-01, -2.9073e-01,
         -5.2149e-01, -2.5710e-01, -2.6545e-01, -3.4104e-01, -2.4808e-01,
         -2.8137e-02,  6.1538e-01,  5.7590e-06,  2.3811e-04, -3.5384e-01,
         -7.7634e-02, -3.7694e-01,  6.0934e-01, -1.8100e-04,  4.5111e-01,
          4.4656e-01,  1.9813e-03, -2.0191e-02,  1.6115e-02,  1.6816e-01,
          1.3684e+00,  8.1238e-01,  1.0328e-01,  3.1379e-01]], device='cuda:0',
       dtype=torch.float64)

@bstellato (Contributor) commented:

Just to update on this: there has been much progress on the CVXPY QP interface https://github.com/cvxgrp/cvxpy/pull/960 thanks to @SteveDiamond.

On the OSQP side, I am working on a formulation that does not need LSQR. We now have qdldl-python, which can be quickly hooked in and supports factorization updates and multi-threading.

@bstellato (Contributor) commented:

I have just pushed the derivatives with the symmetrized M matrix, working with the qdldl solver. @bamos @akshayka would you have time to take a look at it? I think it should be more reliable than LSQR. It requires qdldl.
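For intuition, the symmetrize-then-factorize step can be sketched with scipy's dense ldl as a stand-in for qdldl (the matrix here is a random placeholder, not the actual KKT system):

```python
# Symmetrize a matrix and LDL^T-factorize it; qdldl performs the analogous
# factorization on the sparse quasi-definite system.
import numpy as np
from scipy.linalg import ldl

rng = np.random.default_rng(0)
k = 4
M = rng.standard_normal((k, k))
M_sym = 0.5 * (M + M.T)                 # symmetrized M

L, D, perm = ldl(M_sym)                 # factors satisfy M_sym == L @ D @ L.T
print(np.allclose(L @ D @ L.T, M_sym))  # True
```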

@bstellato mentioned this pull request Mar 25, 2020
@akshayka commented:

Awesome! The interface for adjoint_derivative looks good to me. It'd be great if a derivative method was also added at some point, but for CVXPY Layers, the adjoint is all we need.

Do the derivative tests all pass? (Just curious, I haven't run them myself.)

The dependency on QDLDL is fine by me.

Besides style things, such as commented-out dead code (which I'm sure you're already planning to fix), it looks good to me.

@bamos (Collaborator, Author) commented Mar 27, 2020

Awesome! I've pulled the latest code and updated this osqpth PR/branch to use it: osqp/osqpth#7

The OptNet sudoku experiment is still not converging with osqpth as it does with osqp, and I've been debugging it a bit over the past few days. It may be related to gradient scaling (summing vs. averaging across the broadcasted input tensors), the derivatives of equality constraints, or the -inf/large-negative inequality constraints, but I'm not quite sure. I've also bumped up the accuracy of OSQP, as these problems are pretty poorly scaled, which may make them a bad test suite for this...

I've been debugging by doing a single epoch of training with a minibatch size of 1:

(screenshot omitted)

If you think it should work with equality constraints and -inf/large negative inequality constraints, I can try to pull out a few concrete examples of where the mismatch is happening.

Also I ran into this error when factorizing M and patched it by adding a small diagonal term:

RuntimeError: Error in computing elimination tree. Matrix not properly upper-triangular, sum_Lnz = -1
> /private/home/bda/repos/osqp-python/osqp/interface.py(374)adjoint_derivative()
    372                                 spa.diags(y_l) @ dy_l])
    373 
--> 374         r_sol = - qdldl.Solver(self._derivative_cache['M']).solve(d_sol)
    375 
    376         r_x, r_yu, r_yl = np.split(r_sol, [n, n+m

@bstellato (Contributor) commented:

The OptNet sudoku experiment is still not converging with osqpth as it does with osqp, and I've been debugging it a bit over the past few days. It may be related to gradient scaling (summing vs. averaging across the broadcasted input tensors), the derivatives of equality constraints, or the -inf/large-negative inequality constraints, but I'm not quite sure. I've also bumped up the accuracy of OSQP, as these problems are pretty poorly scaled, which may make them a bad test suite for this...

It might be worth using this example to check the derivatives correctness since it is more challenging than our unittests.

Also I ran into this error when factorizing M and patched it by adding a small diagonal term:

RuntimeError: Error in computing elimination tree. Matrix not properly upper-triangular, sum_Lnz = -1
> /private/home/bda/repos/osqp-python/osqp/interface.py(374)adjoint_derivative()
    372                                 spa.diags(y_l) @ dy_l])
    373 
--> 374         r_sol = - qdldl.Solver(self._derivative_cache['M']).solve(d_sol)
    375 
    376         r_x, r_yu, r_yl = np.split(r_sol, [n, n+m

This error should not happen since we have a convex program. I wonder if this is related to scipy sparse matrices not necessarily having sorted indices. Maybe I should check has_sorted_indices before factoring. Do you have a MWE reproducing it?
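The sorted-indices check is cheap: in scipy, has_sorted_indices reports the state and sort_indices() repairs it in place (the tiny matrix here is illustrative):

```python
# scipy CSC/CSR matrices may carry unsorted row/column indices; a routine that
# assumes sorted (e.g. upper-triangular) storage can fail on them.
# sort_indices() fixes the ordering in place.
import numpy as np
import scipy.sparse as spa

rows = np.array([1, 0, 0, 1])
cols = np.array([0, 0, 1, 1])
vals = np.array([1.0, 3.0, 1.0, 4.0])
M = spa.csc_matrix((vals, (rows, cols)), shape=(2, 2))

if not M.has_sorted_indices:
    M.sort_indices()
print(bool(M.has_sorted_indices))  # True
```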

@bstellato (Contributor) commented:

Awesome! The interface for adjoint_derivative looks good to me. It'd be great if a derivative method was also added at some point, but for CVXPY Layers, the adjoint is all we need.

Do the derivative tests all pass? (Just curious, I haven't run them myself.)

The dependency on QDLDL is fine by me.

Besides style things, such as commented out dead code (which I'm sure you're already planning to fix), it looks good to me.

The tests pass on my computer but do not pass on CI because qdldl-python is not available on conda-forge yet. I just created an issue conda-forge/osqp-feedstock#23 hoping that they can help me create a feedstock. I will keep you posted and clean up the code once we have something finalized ;-)

@bamos (Collaborator, Author) commented Apr 4, 2020

Do you have a MWE reproducing it?

Yes, I just looked a bit closer and it looks like this is happening when the solution hits the boundary exactly and makes those residuals in M exactly zero. Here's a MWE:

import scipy.sparse as sp
import numpy as np
import numpy.random as npr

import osqp

n = 100

npr.seed(1)

P = 0.01*npr.randn(n,n)
P = P.T.dot(P)
P = sp.csr_matrix(P)
q = npr.randn(n)
A = sp.csr_matrix(npr.randn(n, n))
l = npr.randn(n)
u = l  # equality constraints: the solution hits the boundary exactly

s = osqp.OSQP()
s.setup(P, q, A, l, u, eps_abs=1e-10, eps_rel=1e-10, max_iter=int(1e8))
results = s.solve()

s.adjoint_derivative(dx=np.ones(n))
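The small-diagonal patch mentioned earlier can be sketched like this (delta is an assumed value, and scipy's splu stands in for the qdldl factorization just to show the regularized solve):

```python
# Regularize a symmetric indefinite matrix by adding delta*I before
# factorizing. qdldl's LDL^T without pivoting can fail on exactly-zero
# diagonal entries; the perturbation avoids that. splu (which pivots) is
# only a stand-in here.
import numpy as np
import scipy.sparse as spa
import scipy.sparse.linalg as spla

k = 4
# Quasi-definite-style stand-in with an exactly-zero lower-right block:
M = spa.bmat([[spa.eye(k), spa.eye(k)],
              [spa.eye(k), None]], format='csc')
delta = 1e-7                                   # assumed regularization size
M_reg = (M + delta * spa.eye(2 * k)).tocsc()

b = np.ones(2 * k)
r = spla.splu(M_reg).solve(b)
print(np.linalg.norm(M_reg @ r - b))  # near zero
```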

@bamos (Collaborator, Author) commented Apr 14, 2020

I just tried replacing the QP solver with cvxpylayers using the CP backend, and that is able to reproduce the qpth results on the sudoku experiment. From a quick glance it even seems to slightly iron out some instabilities with qpth. So this at least points to the convergence issues here being either in how I've hooked osqp up to the sudoku experiment, or in how the derivatives are being handled for these problems.

@bstellato merged commit 46dff11 into osqp:master on Aug 7, 2020