Does setting access to the previous GPUs' pools in cudamallocasync_allocator.cc make AllocateRaw slower? #19431

zjjott opened this issue Nov 18, 2024 · 0 comments


Environment:
GPU: H20 × 8
CUDA: 12.4

export TF_CPP_MIN_LOG_LEVEL=0
export TF_CPP_VMODULE="xla_graph_executor=5,gpu_compiler=5,pjrt_computation_client=5,lazy_graph_executor=5,gpu_cudamallocasync_allocator=5"

I found that this makes allocation slower than expected.
The init log looks like this:

2024-11-18 20:18:27.924238: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:195] DRIVER VERSION: 12020
2024-11-18 20:18:27.924499: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:261] using default memory pool 0x21662a28
2024-11-18 20:18:27.924510: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:264] gpu_async_0 CudaMallocAsync initialized on platform: 0 with pool size of: 91809703526 this ptr: 0x22ef0520
2024-11-18 20:18:27.924522: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:373] gpu_async_0 GpuCudaMallocAsyncAllocator PoolSize 91809703526
2024-11-18 20:18:29.461828: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:565] gpu_async_0 GpuCudaMallocAsyncAllocator reserved the pool for 91809703526 bytes. First ptr: 0xa20000000
I0000 00:00:1731932309.461990  243046 se_gpu_pjrt_client.cc:826] XLA backend allocating 91809703526 bytes on device 1 for CudaAsyncAllocator.
2024-11-18 20:18:29.462063: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:195] DRIVER VERSION: 12020
2024-11-18 20:18:29.462236: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:261] using default memory pool 0x1f806288
2024-11-18 20:18:29.462244: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:264] gpu_async_1 CudaMallocAsync initialized on platform: 1 with pool size of: 91809703526 this ptr: 0x22eaec60
2024-11-18 20:18:29.462254: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 0
2024-11-18 20:18:29.462268: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 0 location id: 1
2024-11-18 20:18:34.272645: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:373] gpu_async_1 GpuCudaMallocAsyncAllocator PoolSize 91809703526
2024-11-18 20:18:41.702691: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:565] gpu_async_1 GpuCudaMallocAsyncAllocator reserved the pool for 91809703526 bytes. First ptr: 0x39c0000000
I0000 00:00:1731932321.703009  243046 se_gpu_pjrt_client.cc:826] XLA backend allocating 91809703526 bytes on device 2 for CudaAsyncAllocator.
2024-11-18 20:18:41.703158: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:195] DRIVER VERSION: 12020
2024-11-18 20:18:41.703441: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:261] using default memory pool 0x280b2398
2024-11-18 20:18:41.703451: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:264] gpu_async_2 CudaMallocAsync initialized on platform: 2 with pool size of: 91809703526 this ptr: 0x22e98d40
2024-11-18 20:18:41.703477: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 0
2024-11-18 20:18:41.703492: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 0 location id: 2
2024-11-18 20:18:46.548156: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 1
2024-11-18 20:18:46.548299: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 1 location id: 2
2024-11-18 20:18:51.331734: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:373] gpu_async_2 GpuCudaMallocAsyncAllocator PoolSize 91809703526
2024-11-18 20:19:05.640897: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:565] gpu_async_2 GpuCudaMallocAsyncAllocator reserved the pool for 91809703526 bytes. First ptr: 0x6960000000
I0000 00:00:1731932345.642018  243046 se_gpu_pjrt_client.cc:826] XLA backend allocating 91809703526 bytes on device 3 for CudaAsyncAllocator.
2024-11-18 20:19:05.642179: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:195] DRIVER VERSION: 12020
2024-11-18 20:19:05.642435: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:261] using default memory pool 0xc88cde8
2024-11-18 20:19:05.642445: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:264] gpu_async_3 CudaMallocAsync initialized on platform: 3 with pool size of: 91809703526 this ptr: 0x1c566a30
2024-11-18 20:19:05.642462: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 0
2024-11-18 20:19:05.642474: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 0 location id: 3
2024-11-18 20:19:10.736164: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 1
2024-11-18 20:19:10.736312: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 1 location id: 3
2024-11-18 20:19:15.597324: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:327] Setting access of the current pool to  location id: 2
2024-11-18 20:19:15.597501: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:352] Set access to the pool id: 2 location id: 3
2024-11-18 20:19:20.642308: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:373] gpu_async_3 GpuCudaMallocAsyncAllocator PoolSize 91809703526
2024-11-18 20:19:42.562063: I external/xla/xla/stream_executor/gpu/gpu_cudamallocasync_allocator.cc:565] gpu_async_3 GpuCudaMallocAsyncAllocator reserved the pool for 91809703526 bytes. First ptr: 0x9900000000
I0000 00:00:1731932382.562886  243046 se_gpu_pjrt_client.cc:826] XLA backend allocating 91809703526 bytes on device 4 for CudaAsyncAllocator.
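For reference, subtracting the allocator-init timestamps in the log above gives per-device pool setup times of roughly 1.5 s, 12.2 s, 23.9 s, and 36.9 s, i.e. the time grows roughly linearly with the device index, so total startup grows quadratically with the number of GPUs. A small sketch (timestamps copied from the log):

```python
from datetime import datetime

# (init start, "reserved the pool" line) timestamps copied from the log above.
SPANS = {
    "gpu_async_0": ("20:18:27.924238", "20:18:29.461828"),
    "gpu_async_1": ("20:18:29.462063", "20:18:41.702691"),
    "gpu_async_2": ("20:18:41.703158", "20:19:05.640897"),
    "gpu_async_3": ("20:19:05.642179", "20:19:42.562063"),
}

FMT = "%H:%M:%S.%f"

def setup_seconds(spans=SPANS):
    """Seconds between allocator init start and pool reservation, per device."""
    return {
        name: (datetime.strptime(end, FMT)
               - datetime.strptime(start, FMT)).total_seconds()
        for name, (start, end) in spans.items()
    }

for name, secs in setup_seconds().items():
    print(f"{name}: {secs:.1f}s")
# gpu_async_0: 1.5s
# gpu_async_1: 12.2s
# gpu_async_2: 23.9s
# gpu_async_3: 36.9s
```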

The relevant code looks like this:

VLOG(2) << "Set access to the pool id: " << previous_pool_id
        << " location id: " << map.location.id;
if (auto status = cuDeviceCanAccessPeer(&canAccessPeer, previous_pool_id,
                                        platform_device_id.value())) {
  cuda_state_->pool = nullptr;
  LOG(FATAL)  // Crash OK.
      << "cuDeviceCanAccessPeer failed: " << cuda::ToStatus(status);
}
if (canAccessPeer == 1) {
  if (auto status = cuMemPoolSetAccess((*all_pools_)[i], &map, 1)) {
    cuda_state_->pool = nullptr;
    LOG(FATAL)  // Crash OK.
        << "Error when setting access to the pool id: " << previous_pool_id
        << " location id: " << map.location.id
        << " error: " << cuda::ToStatus(status);
  }
}

It seems that here, cuMemPoolSetAccess on gpu2 is slower than on gpu0, and the log shows the same pattern further up the chain: gpu7 is slower than gpu6, gpu6 slower than gpu5, and so on.
Why is cuMemPoolSetAccess used here at all? In the PyTorch project I found only a single cuMemSetAccess_ call, in the ExpandableSegment class.
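If I read the code and log right, each newly initialized pool exchanges access with every previously initialized pool in both directions (the log prints both a "Setting access of the current pool" and a "Set access to the pool" line per pair), so the number of cuMemPoolSetAccess calls grows quadratically with the number of devices. A rough counting sketch (the two-calls-per-pair assumption is my reading of the log, not confirmed against the full source):

```python
def pool_access_calls(num_devices: int):
    """Count cuMemPoolSetAccess-style calls made as devices initialize in order.

    Assumption: when pool i is created, access is granted in both directions
    for every previously initialized pool j < i (new pool -> location j, and
    pool j -> location i), i.e. 2 * i calls during device i's init.
    """
    per_device = [2 * i for i in range(num_devices)]
    return per_device, sum(per_device)

per_device, total = pool_access_calls(8)
print(per_device)  # calls during each device's init: [0, 2, 4, 6, 8, 10, 12, 14]
print(total)       # 56 in total for 8 GPUs, growing as N*(N-1)
```

This would explain why later GPUs take progressively longer: each one pays for peer-access setup against all of its predecessors.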
