
add optional argument for specifying factorization strategy #31

Open · wants to merge 1 commit into main
Conversation

samuelpmish

Hi,

After reading about the significant performance difference between simplicial and supernodal factorization (#25), I thought it would be beneficial to expose that as an option to the user. This PR adds a new argument to the CholeskySolver class to specify which option to use. CHOLMOD also apparently has an "AUTO" option which uses some heuristic to decide which strategy is appropriate for the provided matrix, so I use that as the default argument.

Since the new argument is defaulted, this change shouldn't break existing users' code, although it may change CHOLMOD's default behavior, which could change the runtime of existing code.

The performance of two test problems (as measured on an M2 MacBook) is summarized below:

  1. get_icosphere(7) (2D connectivity): supernodal was a 1.7x speedup over simplicial
  2. a tetrahedron mesh of a sphere (3D connectivity): supernodal was a 4.1x speedup over simplicial

In both cases, the default "CHOLMOD_AUTO" option picked the faster (supernodal) approach.

The small performance test script (and relevant data for the tetrahedron mesh) is available here.

@samuelpmish
Author

Oh, also: when I ran pytest test_cholesky.py, only a small fraction of the tests were not skipped. The tests that did run passed, but I can't verify that the remaining tests are working correctly.

@wjakob
Contributor

wjakob commented Jan 5, 2024

A first question: enabling a supernodal decomposition on the CPU makes sense, but what does this mean for the CUDA implementation? I suppose that the kernels would need to be updated to deal with this case. Does it even make sense to use such a strategy on the GPU?

@samuelpmish
Author

> Enabling a supernodal decomposition on the CPU makes sense, but what does this mean for the CUDA implementation? ... Does it even make sense to use such a strategy on the GPU?

From the original talk about GPU-accelerated CHOLMOD, it sounds like supernodal was the only supported factorization strategy. I'm not sure if that's still true today. If so, I wonder what CHOLMOD does if you tell it to use the GPU and also specify the simplicial strategy.

> I suppose that the kernels would need to be updated to deal with this [supernodal option].

Which kernels need updating?

@samuelpmish
Author

samuelpmish commented Jan 6, 2024

When I run pytest test_cholesky.py on a machine with an NVIDIA GPU (where 17 of the 20 tests are not skipped, and all of them pass), nvprof reveals that none of the linear algebra in cholespy uses the GPU, except for the triangular solves.

CHOLMOD's recent user guide says that only cholmod_l_factorize (the 64-bit-integer variant) supports GPU acceleration, but cholespy uses the 32-bit-integer variant, which doesn't seem to take advantage of GPU hardware:

// Compute the Cholesky factorization
m_factor = cholmod_analyze(A, &m_common);
cholmod_factorize(A, m_factor, &m_common);

Am I goofing something up?
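If the 64-bit path really is required for GPU acceleration, the change being pointed at would look roughly like the following sketch. This is untested and not part of this PR; it also assumes the matrix A and workspace are created with the corresponding cholmod_l_* / CHOLMOD_LONG routines, which is a larger change than just the two calls shown.

```
// Sketch only: opt into the GPU and switch to the 64-bit-integer
// CHOLMOD variants, which the user guide says the GPU module requires.
m_common.useGPU = 1;
m_factor = cholmod_l_analyze(A, &m_common);
cholmod_l_factorize(A, m_factor, &m_common);
```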

@bathal1 (Collaborator) left a comment
Hi, thanks for this PR!

It looks good to me, besides a small typo in the docstrings.

Regarding GPU CHOLMOD: its implementation uses the NVIDIA runtime API, which is a tedious dependency, since not all users have it installed, and different projects may require different versions of it.

Our implementation of the sparse triangular solver instead uses the NVIDIA driver API, which is guaranteed to be present on any system with a GPU, since it ships with the driver. It is also backwards compatible, which makes it easier to maintain.

In the application we initially designed this package for, we typically compute the decomposition once and then solve the associated system many times before the matrix needs updating, so relying on the CPU for the decomposition is not a major bottleneck.
