Does setting access to the previous GPUs' pools in cudamallocasync_allocator.cc make AllocateRaw slower? #19431

zjjott opened this issue Nov 18, 2024 · 0 comments


Environment:
GPU: H20 × 8
CUDA: 12.4

export TF_CPP_MIN_LOG_LEVEL=0
export TF_CPP_VMODULE="xla_graph_executor=5,gpu_compiler=5,pjrt_computation_client=5,lazy_graph_executor=5,gpu_cudamallocasync_allocator=5"

I found that this makes allocation slower than expected.
The init log looks like this:

2024-11-18 20:18:27.924238: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:195] DRIVER VERSION: 12020
2024-11-18 20:18:27.924499: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:261] using default memory pool 0x21662a28
2024-11-18 20:18:27.924510: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:264] gpu_async_0 CudaMallocAsync initialized on platform: 0 with pool size of: 91809703526 this ptr: 0x22ef0520
2024-11-18 20:18:27.924522: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:373] gpu_async_0 GpuCudaMallocAsyncAllocator PoolSize 91809703526
2024-11-18 20:18:29.461828: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:565] gpu_async_0 GpuCudaMallocAsyncAllocator reserved the pool for 91809703526 bytes. First ptr: 0xa20000000
I0000 00:00:1731932309.461990  243046 se_gpu_pjrt_client.cc:826] XLA backend allocating 91809703526 bytes on device 1 for CudaAsyncAllocator.
2024-11-18 20:18:29.462063: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:195] DRIVER VERSION: 12020
2024-11-18 20:18:29.462236: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:261] using default memory pool 0x1f806288
2024-11-18 20:18:29.462244: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:264] gpu_async_1 CudaMallocAsync initialized on platform: 1 with pool size of: 91809703526 this ptr: 0x22eaec60
2024-11-18 20:18:29.462254: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 0
2024-11-18 20:18:29.462268: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 0 location id: 1
2024-11-18 20:18:34.272645: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:373] gpu_async_1 GpuCudaMallocAsyncAllocator PoolSize 91809703526
2024-11-18 20:18:41.702691: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:565] gpu_async_1 GpuCudaMallocAsyncAllocator reserved the pool for 91809703526 bytes. First ptr: 0x39c0000000
I0000 00:00:1731932321.703009  243046 se_gpu_pjrt_client.cc:826] XLA backend allocating 91809703526 bytes on device 2 for CudaAsyncAllocator.
2024-11-18 20:18:41.703158: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:195] DRIVER VERSION: 12020
2024-11-18 20:18:41.703441: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:261] using default memory pool 0x280b2398
2024-11-18 20:18:41.703451: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:264] gpu_async_2 CudaMallocAsync initialized on platform: 2 with pool size of: 91809703526 this ptr: 0x22e98d40
2024-11-18 20:18:41.703477: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 0
2024-11-18 20:18:41.703492: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 0 location id: 2
2024-11-18 20:18:46.548156: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 1
2024-11-18 20:18:46.548299: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 1 location id: 2
2024-11-18 20:18:51.331734: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:373] gpu_async_2 GpuCudaMallocAsyncAllocator PoolSize 91809703526
2024-11-18 20:19:05.640897: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:565] gpu_async_2 GpuCudaMallocAsyncAllocator reserved the pool for 91809703526 bytes. First ptr: 0x6960000000
I0000 00:00:1731932345.642018  243046 se_gpu_pjrt_client.cc:826] XLA backend allocating 91809703526 bytes on device 3 for CudaAsyncAllocator.
2024-11-18 20:19:05.642179: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:195] DRIVER VERSION: 12020
2024-11-18 20:19:05.642435: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:261] using default memory pool 0xc88cde8
2024-11-18 20:19:05.642445: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:264] gpu_async_3 CudaMallocAsync initialized on platform: 3 with pool size of: 91809703526 this ptr: 0x1c566a30
2024-11-18 20:19:05.642462: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 0
2024-11-18 20:19:05.642474: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 0 location id: 3
2024-11-18 20:19:10.736164: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 1
2024-11-18 20:19:10.736312: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 1 location id: 3
2024-11-18 20:19:15.597324: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 2
2024-11-18 20:19:15.597501: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 2 location id: 3
2024-11-18 20:19:20.642308: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:373] gpu_async_3 GpuCudaMallocAsyncAllocator PoolSize 91809703526
2024-11-18 20:19:42.562063: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:565] gpu_async_3 GpuCudaMallocAsyncAllocator reserved the pool for 91809703526 bytes. First ptr: 0x9900000000
I0000 00:00:1731932382.562886  243046 se_gpu_pjrt_client.cc:826] XLA backend allocating 91809703526 bytes on device 4 for CudaAsyncAllocator.
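For reference, subtracting the allocator-init timestamps in the log above gives per-device pool setup times of roughly 1.5 s, 12.2 s, 23.9 s, and 36.9 s, i.e. the time grows roughly linearly with the device index, so total startup grows quadratically with the number of GPUs. A small sketch (timestamps copied from the log):

```python
from datetime import datetime

# (init start, "reserved the pool" line) timestamps copied from the log above.
SPANS = {
    "gpu_async_0": ("20:18:27.924238", "20:18:29.461828"),
    "gpu_async_1": ("20:18:29.462063", "20:18:41.702691"),
    "gpu_async_2": ("20:18:41.703158", "20:19:05.640897"),
    "gpu_async_3": ("20:19:05.642179", "20:19:42.562063"),
}

FMT = "%H:%M:%S.%f"

def setup_seconds(spans=SPANS):
    """Seconds between allocator init start and pool reservation, per device."""
    return {
        name: (datetime.strptime(end, FMT)
               - datetime.strptime(start, FMT)).total_seconds()
        for name, (start, end) in spans.items()
    }

for name, secs in setup_seconds().items():
    print(f"{name}: {secs:.1f}s")
# gpu_async_0: 1.5s
# gpu_async_1: 12.2s
# gpu_async_2: 23.9s
# gpu_async_3: 36.9s
```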

The relevant code looks like this:

VLOG(2) << "Set access to the pool id: " << previous_pool_id
        << " location id: " << map.location.id;
if (auto status = cuDeviceCanAccessPeer(&canAccessPeer, previous_pool_id,
                                        platform_device_id.value())) {
  cuda_state_->pool = nullptr;
  LOG(FATAL)  // Crash OK.
      << "cuDeviceCanAccessPeer failed: " << cuda::ToStatus(status);
}
if (canAccessPeer == 1) {
  if (auto status = cuMemPoolSetAccess((*all_pools_)[i], &map, 1)) {
    cuda_state_->pool = nullptr;
    LOG(FATAL)  // Crash OK.
        << "Error when setting access to the pool id: " << previous_pool_id
        << " location id: " << map.location.id
        << " error: " << cuda::ToStatus(status);
  }
}

It seems that here, cuMemPoolSetAccess on gpu2 is slower than on gpu0, and the log shows the same pattern further up the chain: gpu7 is slower than gpu6, gpu6 slower than gpu5, and so on.
Why is cuMemPoolSetAccess used here at all? In the PyTorch project I found only a single cuMemSetAccess_ call, in the ExpandableSegment class.
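If I read the code and log right, each newly initialized pool exchanges access with every previously initialized pool in both directions (the log prints both a "Setting access of the current pool" and a "Set access to the pool" line per pair), so the number of cuMemPoolSetAccess calls grows quadratically with the number of devices. A rough counting sketch (the two-calls-per-pair assumption is my reading of the log, not confirmed against the full source):

```python
def pool_access_calls(num_devices: int):
    """Count cuMemPoolSetAccess-style calls made as devices initialize in order.

    Assumption: when pool i is created, access is granted in both directions
    for every previously initialized pool j < i (new pool -> location j, and
    pool j -> location i), i.e. 2 * i calls during device i's init.
    """
    per_device = [2 * i for i in range(num_devices)]
    return per_device, sum(per_device)

per_device, total = pool_access_calls(8)
print(per_device)  # calls during each device's init: [0, 2, 4, 6, 8, 10, 12, 14]
print(total)       # 56 in total for 8 GPUs, growing as N*(N-1)
```

This would explain why later GPUs take progressively longer: each one pays for peer-access setup against all of its predecessors.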
