
Update vendored finufft and add GPU support #20

Merged
merged 28 commits into from
Oct 30, 2023

Conversation

lgarrison
Member

In recent months, cufinufft has been merged into the primary finufft codebase (thanks to @blackwer), which itself has matured a lot over the last few years. The vendored finufft hasn't been updated in a long time, so this PR does that, fixes some minor compatibility issues, and removes the deprecated vendored cufinufft. It also adds GPU support.
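For context, the type-1 transform that finufft and cufinufft accelerate evaluates sums of complex exponentials at nonuniform points. A naive numpy reference of that definition (illustrative only; the libraries evaluate it approximately and much faster):

```python
import numpy as np

def nufft1_direct(N, c, x, iflag=1):
    # f_k = sum_j c_j * exp(iflag * 1j * k * x_j), for modes k = -N//2 .. (N-1)//2
    k = np.arange(-(N // 2), (N + 1) // 2)
    return np.exp(iflag * 1j * np.outer(k, x)) @ c

rng = np.random.default_rng(0)
x = 2 * np.pi * rng.uniform(size=50)            # nonuniform points in [0, 2*pi)
c = rng.normal(size=50) + 1j * rng.normal(size=50)  # complex strengths
f = nufft1_direct(8, c, x)
print(f.shape)  # (8,)
```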

The GPU support comes from #4 but with an important update, which is that we can now pass the GPU stream that JAX gives us to the cufinufft library via flatironinstitute/finufft#330. This is probably required for multi-GPU support (although I haven't tested it) and may help with performance too. The only other changes were various updates to match the new API, build fixes, etc.

Some notes about the CMake build: cufinufft no longer requires building two static libraries with different macro definitions (one for float and one for double). The multiple precisions are supported via C++ templates, so the library can be built all at once. Also, now that finufft itself uses CMake, we might prefer to include it as a CMake sub-package rather than itemizing source files. But I left the itemization approach for now, since it required fewer changes.

For the moment, the vendored finufft is maintained in my own fork while we wait for some important PRs (including flatironinstitute/finufft#330 and flatironinstitute/finufft#354) to be merged. It should be easy to re-target the primary repo later.

The tests pass on both the CPU and GPU, but more help testing the GPU in particular would be great! For anybody looking to run this on the Flatiron clusters, this is my build environment:

```sh
# env.sh
ml modules/2.2
ml gcc
ml python/3.11
ml fftw
ml cmake
ml cuda/12
ml cudnn
```

dfm and others added 25 commits November 8, 2021 15:03
Passes CPU tests; GPU compilation still needs to be fixed for finufft refactor.
…the single and double precision interfaces are compiled together now
```python
for n in range(num_repeat):
    np.testing.assert_allclose(
        calc_unmap_pt[n], func(c[n], *(x_[n] for x_ in x[:-1]), x[-1][0])
    )
with jax.experimental.enable_x64():
```
Collaborator

Do we need this here? Perhaps we could adjust the tolerance for the allclose calls below to test both? I think the jax._src.public_test_util.check_close might do what we want.
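One way to exercise both precisions without enabling x64 globally is to pick the `allclose` tolerances from the input dtype. A minimal sketch (the helper name is illustrative, not part of the test suite):

```python
import numpy as np

def assert_close_for_dtype(actual, expected):
    # Loose tolerance for single precision, tight for double.
    tol = 1e-12 if np.asarray(actual).dtype == np.float64 else 1e-5
    np.testing.assert_allclose(actual, expected, rtol=tol, atol=tol)

# float32 round-off (~1e-7) is within the single-precision tolerance:
assert_close_for_dtype(np.float32(np.pi), np.pi)
# float64 is held to the much tighter tolerance:
assert_close_for_dtype(np.float64(np.pi), np.pi)
```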

@dfm
Collaborator

dfm commented Oct 23, 2023

This is awesome! I wonder if it's worth getting the Flatiron Jenkins CI set up to test the GPU ops? I've been doing that with the JAX CUDA ops in my exoplanet-core package: https://github.com/exoplanet-dev/exoplanet-core/blob/main/ci/Jenkinsfile

I'd like to update all the custom call stuff because the best practices have changed a little in the meantime, but that should be a separate PR.

For now, I'm keen to merge this with the only question being about running a GPU-enabled CI.

@lgarrison
Member Author

Yeah, I think it definitely makes sense to set this up on Jenkins! I actually started looking into it, but @Matematija reported a non-deterministic crash that I'm investigating first; it may only occur with the latest JAX. I'll keep you posted.

…x.experimental. Point to vendored finufft with more fixes.
@lgarrison
Member Author

The problems appear at JAX 0.4.9, which is also the version where JAX starts using a non-blocking CUDA stream, according to jax-ml/jax#16580. I've gone through and fixed some cufinufft stream race conditions in the kernel launches, cudaMemcpyAsync calls, Thrust usage, and (maybe) cufft. That resolves the particular crash @Matematija reported, but there are still problems: the tests don't pass. I'll need to dig into why (though I may be delayed by jury duty).

@lgarrison
Member Author

I think this is fixed now! The relevant patch is in the finufft submodule, so don't forget to update submodules when you pull. I'll work on Jenkins next.

@lgarrison
Member Author

@dfm Can you add the Jenkins webhook to this project, or give me permissions to do so? I think I would need to be a repo admin.

@dfm dfm merged commit b2b2cd0 into main Oct 30, 2023
2 checks passed
@lgarrison lgarrison deleted the 2023-gpu branch November 2, 2023 17:43