
Add Umpire memory manager for GPU pool memory allocation #943

Draft: wants to merge 13 commits into base branch vlasiator_gpu
Conversation

hokkanen
Contributor

This PR adds the Umpire memory manager for GPU pool memory allocation. However, the implementation crashes due to a silent error in the base version; see the Zulip discussion attached below. I mark this as a draft, as it probably makes sense to fix the base-version error first.

Zulip:

"Ok, I tried to figure out what is wrong, and it looks like the problem is not the Umpire implementation, but an already existing issue in the vlasiator_gpu branch, at least since

commit f3bc0e44fcb0e763716784d3dcdfdc92f2ec20c7 (HEAD)
Author: Markus Battarbee <[email protected]>
Date:   Thu Mar 7 08:46:25 2024 +0200

    Comment out old prefetch

The reason the bug only shows up with the Umpire implementation is that the Managed class in gpu_base.hpp does not have error handling (i.e., the gpuFree() call fails silently and execution continues):


// Unified memory class for inheritance
class Managed {
public:
   void *operator new(size_t len) {
      void *ptr;
      gpuMallocManaged(&ptr, len);
      gpuDeviceSynchronize();
      return ptr;
   }
   void operator delete(void *ptr) {
      gpuDeviceSynchronize();
      gpuFree(ptr);
   }
   void* operator new[] (size_t len) {
      void *ptr;
      gpuMallocManaged(&ptr, len);
      gpuDeviceSynchronize();
      return ptr;
   }
   void operator delete[] (void* ptr) {
      gpuDeviceSynchronize();
      gpuFree(ptr);
   }
};

If I add error handling, then the program fails at exactly the same location where the Umpire implementation fails:


class Managed {
public:
   void *operator new(size_t len) {
      void *ptr;
      CHK_ERR(gpuMallocManaged(&ptr, len));
      CHK_ERR(gpuDeviceSynchronize());
      return ptr;
   }
   void operator delete(void *ptr) {
      CHK_ERR(gpuDeviceSynchronize());
      CHK_ERR(gpuFree(ptr));
   }
   void* operator new[] (size_t len) {
      void *ptr;
      CHK_ERR(gpuMallocManaged(&ptr, len));
      CHK_ERR(gpuDeviceSynchronize());
      return ptr;
   }
   void operator delete[] (void* ptr) {
      CHK_ERR(gpuDeviceSynchronize());
      CHK_ERR(gpuFree(ptr));
   }
};

with the following output (on Mahti):

(Grid) rank 0 is noderank 0 of 1
Done setting all 62 instances of device mesh wrapper handler!
(MAIN): Completed grid initialization.
(MAIN): Starting main simulation loop.
(MAIN): Completed requested simulation. Exiting.
driver shutting down in arch/gpu_base.hpp at line 90
srun: error: g1101: task 0: Exited with exit code 1
"

@markusbattarbee
Contributor

I also now built Umpire on Mahti so I can trial this - is this sufficient for building or do you think we need additional flags?

cmake .. -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/umpire -DCMAKE_CUDA_ARCHITECTURES=80

@markusbattarbee
Contributor

Ah, ok, I think I see at least one reason why this might be causing errors. In regular CUDA/HIP code, one can use the same gpuFree macro for both UM and device memory, but here we need a specific call for freeing UM memory. In vlasiator_gpu, those haven't yet been distinguished.

Also, I guess Hashinator will need to be updated to support Umpire to really benefit from it.

@kstppd
Contributor

kstppd commented Apr 17, 2024

> Ah, ok, I think I see at least one reason why this might be causing errors. In regular CUDA/HIP code, one can use the same gpuFree macro for both UM and device memory, but here we need a specific call for freeing UM memory. In vlasiator_gpu, those haven't yet been distinguished.
>
> Also, I guess Hashinator will need to be updated to support Umpire to really benefit from it.

For Hashinator we would "just" need to add a new split allocator that uses Umpire.

@hokkanen
Contributor Author

> I also now built Umpire on Mahti so I can trial this - is this sufficient for building or do you think we need additional flags?
>
> cmake .. -DENABLE_CUDA=On -DCMAKE_INSTALL_PREFIX=/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/umpire -DCMAKE_CUDA_ARCHITECTURES=80

I think that should probably be ok. I didn't specify the CUDA architecture, but if it works, then you shouldn't need anything else.

@markusbattarbee
Contributor

Myep, even after fixing those two calls it still complains on exit:

(Grid) rank 0 is noderank 0 of 1
Done setting all 64 instances of device mesh wrapper handler!
(MAIN): Completed grid initialization.
(MAIN): Starting main simulation loop.
(MAIN): Completed requested simulation. Exiting.
terminate called after throwing an instance of 'umpire::runtime_error'
  what():  ! Umpire runtime_error [/projappl/project_2004522/libraries/gcc-10.4.0/openmpi-4.1.5-cuda/cuda-12.1.1/Umpire/src/umpire/util/AllocationMap.cpp:255]: Cannot remove 0x7ff453000000
    Backtrace: 13 frames
    0 0x617a92 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x617a92]
    1 0x61931b No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x61931b]
    2 0x619948 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x619948]
    3 0x77c3be No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x77c3be]
    4 0x70d6ea No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x70d6ea]
    5 0x76050d No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x76050d]
    6 0x4b2a73 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x4b2a73]
    7 0x629373 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x629373]
    8 0x6294ea No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x6294ea]
    9 0x6178b8 No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x6178b8]
    10 0x440d8b No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x440d8b]
    11 0x7fffbe4c8cf3 No dladdr: /lib64/libc.so.6(__libc_start_main+0xf3) [0x7fffbe4c8cf3]
    12 0x44d7ce No dladdr: /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp() [0x44d7ce]


[g1101:2996122] *** Process received signal ***
[g1101:2996122] Signal: Aborted (6)
[g1101:2996122] Signal code:  (-6)
[g1101:2996122] [ 0] /lib64/libc.so.6(+0x4eb20)[0x7fffbe4dcb20]
[g1101:2996122] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x7fffbe4dca9f]
[g1101:2996122] [ 2] /lib64/libc.so.6(abort+0x127)[0x7fffbe4afe05]
[g1101:2996122] [ 3] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xa27bc)[0x7fffbec787bc]
[g1101:2996122] [ 4] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xad766)[0x7fffbec83766]
[g1101:2996122] [ 5] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xad7d1)[0x7fffbec837d1]
[g1101:2996122] [ 6] /appl/spack/v020/install-tree/gcc-8.5.0/gcc-10.4.0-2oazqj/lib64/libstdc++.so.6(+0xada65)[0x7fffbec83a65]
[g1101:2996122] [ 7] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x77c537]
[g1101:2996122] [ 8] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x70d6ea]
[g1101:2996122] [ 9] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x76050d]
[g1101:2996122] [10] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x4b2a73]
[g1101:2996122] [11] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x629373]
[g1101:2996122] [12] /scratch/project_2004522/testpackage/vlasiator_gpu_umpire_tp[0x6294ea]

The address 0x7ff453000000 looks like a GPU memory-space address to me.

Interestingly, as I was unable to debug this on Mahti, I then switched to my own desktop computer with a GTX 1060. Built Umpire, compiled, ran, and... no error. :)

@markusbattarbee
Contributor

I notice now that the allocators constructed here do not use the syntax for Umpire thread-safe allocators:
https://umpire.readthedocs.io/en/develop/sphinx/cookbook/thread_safe.html
Thus, we should either switch to a thread-safe allocator (which might be slow if it has to take a lock on every allocation) or implement a method that creates max_omp_n_threads allocators, where each CPU thread uses its assigned allocator. That will probably be less efficient at re-coalescing allocations, but might still be the better option.
