
Add Umpire memory manager for GPU pool memory allocation #943

Draft: wants to merge 13 commits into base branch vlasiator_gpu
Conversation

hokkanen
Contributor

This PR adds the Umpire memory manager for GPU pool memory allocation. However, the implementation crashes due to a silent error in the base version; see the Zulip discussion attached below. I mark this as a draft, as it probably makes sense to fix the base-version error first.

Zulip:

"Ok, I tried to figure out what is wrong, and it looks like the problem is not the Umpire implementation, but an already existing issue in the vlasiator_gpu branch, at least since

commit f3bc0e44fcb0e763716784d3dcdfdc92f2ec20c7 (HEAD)
Author: Markus Battarbee <[email protected]>
Date:   Thu Mar 7 08:46:25 2024 +0200

    Comment out old prefetch

The reason the bug only shows up with the Umpire implementation is that the Managed class in gpu_base.hpp does not have error handling (i.e., the gpuFree() call fails silently and execution continues):


// Unified memory class for inheritance
class Managed {
public:
   void *operator new(size_t len) {
      void *ptr;
      gpuMallocManaged(&ptr, len);
      gpuDeviceSynchronize();
      return ptr;
   }
   void operator delete(void *ptr) {
      gpuDeviceSynchronize();
      gpuFree(ptr);
   }
   void* operator new[] (size_t len) {
      void *ptr;
      gpuMallocManaged(&ptr, len);
      gpuDeviceSynchronize();
      return ptr;
   }
   void operator delete[] (void* ptr) {
      gpuDeviceSynchronize();
      gpuFree(ptr);
   }
};

If I add error handling, then the program fails at exactly the same location where the Umpire implementation fails:


class Managed {
public:
   void *operator new(size_t len) {
      void *ptr;
      CHK_ERR(gpuMallocManaged(&ptr, len));
      CHK_ERR(gpuDeviceSynchronize());
      return ptr;
   }
   void operator delete(void *ptr) {
      CHK_ERR(gpuDeviceSynchronize());
      CHK_ERR(gpuFree(ptr));
   }
   void* operator new[] (size_t len) {
      void *ptr;
      CHK_ERR(gpuMallocManaged(&ptr, len));
      CHK_ERR(gpuDeviceSynchronize());
      return ptr;
   }
   void operator delete[] (void* ptr) {
      CHK_ERR(gpuDeviceSynchronize());
      CHK_ERR(gpuFree(ptr));
   }
};

with the following output (on Mahti):

(Grid) rank 0 is noderank 0 of 1
Done setting all 62 instances of device mesh wrapper handler!
(MAIN): Completed grid initialization.
(MAIN): Starting main simulation loop.
(MAIN): Completed requested simulation. Exiting.
driver shutting down in arch/gpu_base.hpp at line 90
srun: error: g1101: task 0: Exited with exit code 1
"

@markusbattarbee
Contributor

I also now built Umpire on Mahti so I can trial this - is this sufficient for building or do you think we need additional flags?

cmake .. -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/umpire -DCMAKE_CUDA_ARCHITECTURES=80

@markusbattarbee
Contributor

Ah, ok, I think I see at least one reason why this might be causing errors. In regular CUDA/HIP code, one can use the same gpuFree macro for both UM and device memory, but here we need a specific call for freeing UM memory. In vlasiator_gpu, those haven't yet been distinguished.

Also, I guess Hashinator will need to be updated to support Umpire to really benefit from it.

@kstppd
Contributor

kstppd commented Apr 17, 2024

> Ah, ok, I think I see at least one reason why this might be causing errors. In regular CUDA/HIP code, one can use the same gpuFree macro for both UM and device memory, but here we need a specific call for freeing UM memory. In vlasiator_gpu, those haven't yet been distinguished.
>
> Also, I guess Hashinator will need to be updated to support Umpire to really benefit from it.

For Hashinator we would "just" need to add a new split allocator that uses Umpire.

@hokkanen
Contributor Author

> I also now built Umpire on Mahti so I can trial this - is this sufficient for building or do you think we need additional flags?
>
> cmake .. -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/umpire -DCMAKE_CUDA_ARCHITECTURES=80

I think that should probably be ok. I didn't specify the CUDA architecture, but if it works, then you shouldn't need anything else.

@markusbattarbee
Contributor

Myep, even after fixing those two calls it still complains on exit:

(Grid) rank 0 is noderank 0 of 1
Done setting all 64 instances of device mesh wrapper handler!
(MAIN): Completed grid initialization.
(MAIN): Starting main simulation loop.
(MAIN): Completed requested simulation. Exiting.
terminate called after throwing an instance of 'umpire::runtime_error'
  what():  ! Umpire runtime_error [/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/Umpire/src/umpire/util/AllocationMap.cpp:255]: Cannot remove 0x7ff453000000
    Backtrace: 13 frames
    0 0x617a92 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x617a92]
    1 0x61931b No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x61931b]
    2 0x619948 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x619948]
    3 0x77c3be No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x77c3be]
    4 0x70d6ea No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x70d6ea]
    5 0x76050d No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x76050d]
    6 0x4b2a73 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x4b2a73]
    7 0x629373 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x629373]
    8 0x6294ea No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x6294ea]
    9 0x6178b8 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x6178b8]
    10 0x440d8b No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x440d8b]
    11 0x7fffbe4c8cf3 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fffbe4c8cf3]
    12 0x44d7ce No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x44d7ce]


[g1101:2996122] *** Process received signal ***
[g1101:2996122] Signal: Aborted (6)
[g1101:2996122] Signal code:  (-6)
[g1101:2996122] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7fffbe4dcb20]
[g1101:2996122] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fffbe4dca9f]
[g1101:2996122] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fffbe4afe05]
[g1101:2996122] [ 3] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xa27bc)[0x7fffbec787bc]
[g1101:2996122] [ 4] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xad766)[0x7fffbec83766]
[g1101:2996122] [ 5] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xad7d1)[0x7fffbec837d1]
[g1101:2996122] [ 6] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xada65)[0x7fffbec83a65]
[g1101:2996122] [ 7] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x77c537]
[g1101:2996122] [ 8] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x70d6ea]
[g1101:2996122] [ 9] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x76050d]
[g1101:2996122] [10] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x4b2a73]
[g1101:2996122] [11] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x629373]
[g1101:2996122] [12] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x6294ea]

The address 0x7ff453000000 looks like a GPU memory-space address to me.

Interestingly, as I was unable to debug this on Mahti, I then switched to my own desktop computer with a GTX 1060. Built Umpire, compiled, ran, and... no error. :)

@markusbattarbee
Contributor

I notice now that the allocators constructed here do not use the syntax for Umpire thread-safe allocators:
https://umpire.readthedocs.io/en/develop/sphinx/cookbook/thread_safe.html
Thus, we should either switch to a thread-safe allocator (which might be slow if it has to take a lock on every allocation) or implement a method that creates max_omp_n_threads allocators, where each CPU thread uses its assigned allocator. That will probably be less efficient at re-coalescing allocations, but might still be the better option.
