Enabling P2P capability on 8 RTX 4090 GPUs results in significantly lower performance in NCCL alltoall_perf tests compared to when P2P capability is disabled. #17

Open · ZP-AlwaysWin opened this issue Sep 23, 2024 · 27 comments
Labels: bug

ZP-AlwaysWin commented Sep 23, 2024

NVIDIA Open GPU Kernel Modules Version

NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Description: Ubuntu 22.04.1 LTS

Kernel Release

5.15.0-60-generic #66-Ubuntu SMP Fri Jan 20 14:29:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 4090

Describe the bug

Enabling P2P capability on 8 RTX 4090 GPUs results in significantly lower performance in NCCL alltoall_perf tests compared to when P2P capability is disabled.

To Reproduce

With two RTX 4090 GPUs, enabling P2P significantly improves the NCCL alltoall_perf results compared to running with P2P disabled. With eight GPUs the effect is reversed: enabling P2P causes a severe performance drop. The relevant test data is as follows:

  • P2P capability is enabled.
root@moons:~# nvidia-smi topo -p2p rw
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7
 GPU0   X     OK    OK    OK    OK    OK    OK    OK
 GPU1   OK    X     OK    OK    OK    OK    OK    OK
 GPU2   OK    OK    X     OK    OK    OK    OK    OK
 GPU3   OK    OK    OK    X     OK    OK    OK    OK
 GPU4   OK    OK    OK    OK    X     OK    OK    OK
 GPU5   OK    OK    OK    OK    OK    X     OK    OK
 GPU6   OK    OK    OK    OK    OK    OK    X     OK
 GPU7   OK    OK    OK    OK    OK    OK    OK    X

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
  • The simpleP2P test passes.
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 21.11GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

The alltoall_perf test data for two GPUs with P2P disabled:

root@moons:~/nccl-tests-master# ./build/alltoall_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 120986 on      moons device  0 [0x41] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid 120986 on      moons device  1 [0x42] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             1     float    none      -1    11.87    0.00    0.00      0    11.64    0.00    0.00    N/A
          16             2     float    none      -1    11.79    0.00    0.00      0    11.47    0.00    0.00    N/A
          32             4     float    none      -1    11.92    0.00    0.00      0    11.46    0.00    0.00    N/A
          64             8     float    none      -1    11.27    0.01    0.00      0    11.59    0.01    0.00    N/A
         128            16     float    none      -1    11.45    0.01    0.01      0    11.34    0.01    0.01    N/A
         256            32     float    none      -1    11.39    0.02    0.01      0    11.24    0.02    0.01    N/A
         512            64     float    none      -1    11.24    0.05    0.02      0    11.18    0.05    0.02    N/A
        1024           128     float    none      -1    11.64    0.09    0.04      0    11.32    0.09    0.05    N/A
        2048           256     float    none      -1    11.33    0.18    0.09      0    11.16    0.18    0.09    N/A
        4096           512     float    none      -1    11.33    0.36    0.18      0    11.07    0.37    0.19    N/A
        8192          1024     float    none      -1    11.70    0.70    0.35      0    10.89    0.75    0.38    N/A
       16384          2048     float    none      -1    11.51    1.42    0.71      0    11.66    1.41    0.70    N/A
       32768          4096     float    none      -1    12.96    2.53    1.26      0    12.88    2.54    1.27    N/A
       65536          8192     float    none      -1    18.67    3.51    1.75      0    18.40    3.56    1.78    N/A
      131072         16384     float    none      -1    18.12    7.23    3.62      0    17.79    7.37    3.68    N/A
      262144         32768     float    none      -1    23.19   11.31    5.65      0    22.85   11.47    5.74    N/A
      524288         65536     float    none      -1    34.97   14.99    7.50      0    34.77   15.08    7.54    N/A
     1048576        131072     float    none      -1    56.78   18.47    9.23      0    56.60   18.52    9.26    N/A
     2097152        262144     float    none      -1    101.1   20.74   10.37      0    100.6   20.85   10.42    N/A
     4194304        524288     float    none      -1    188.1   22.29   11.15      0    186.7   22.47   11.23    N/A
     8388608       1048576     float    none      -1    357.0   23.50   11.75      0    353.3   23.74   11.87    N/A
    16777216       2097152     float    none      -1    619.9   27.07   13.53      0    576.4   29.11   14.55    N/A
    33554432       4194304     float    none      -1   1214.1   27.64   13.82      0   1129.3   29.71   14.86    N/A
    67108864       8388608     float    none      -1   2410.7   27.84   13.92      0   2221.7   30.21   15.10    N/A
   134217728      16777216     float    none      -1   4813.9   27.88   13.94      0   4396.9   30.53   15.26    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.85887
#

The alltoall_perf test data for two GPUs with P2P enabled:

root@moons:~/nccl-tests-master# NCCL_P2P_LEVEL=SYS ./build/alltoall_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 121058 on      moons device  0 [0x41] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid 121058 on      moons device  1 [0x42] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             1     float    none      -1    12.28    0.00    0.00      0    11.80    0.00    0.00    N/A
          16             2     float    none      -1    12.03    0.00    0.00      0    11.36    0.00    0.00    N/A
          32             4     float    none      -1    11.90    0.00    0.00      0    11.57    0.00    0.00    N/A
          64             8     float    none      -1    11.52    0.01    0.00      0    11.72    0.01    0.00    N/A
         128            16     float    none      -1    11.59    0.01    0.01      0    11.47    0.01    0.01    N/A
         256            32     float    none      -1    11.63    0.02    0.01      0    11.60    0.02    0.01    N/A
         512            64     float    none      -1    11.78    0.04    0.02      0    11.50    0.04    0.02    N/A
        1024           128     float    none      -1    11.69    0.09    0.04      0    11.29    0.09    0.05    N/A
        2048           256     float    none      -1    11.80    0.17    0.09      0    11.52    0.18    0.09    N/A
        4096           512     float    none      -1    11.83    0.35    0.17      0    12.07    0.34    0.17    N/A
        8192          1024     float    none      -1    11.74    0.70    0.35      0    11.50    0.71    0.36    N/A
       16384          2048     float    none      -1    11.64    1.41    0.70      0    12.08    1.36    0.68    N/A
       32768          4096     float    none      -1    11.83    2.77    1.38      0    11.63    2.82    1.41    N/A
       65536          8192     float    none      -1    12.23    5.36    2.68      0    11.91    5.50    2.75    N/A
      131072         16384     float    none      -1    15.97    8.21    4.10      0    15.68    8.36    4.18    N/A
      262144         32768     float    none      -1    20.26   12.94    6.47      0    20.13   13.03    6.51    N/A
      524288         65536     float    none      -1    30.07   17.44    8.72      0    29.61   17.71    8.85    N/A
     1048576        131072     float    none      -1    42.37   24.75   12.38      0    42.20   24.85   12.42    N/A
     2097152        262144     float    none      -1    69.70   30.09   15.04      0    67.78   30.94   15.47    N/A
     4194304        524288     float    none      -1    123.4   33.99   16.99      0    118.3   35.46   17.73    N/A
     8388608       1048576     float    none      -1    223.1   37.59   18.80      0    222.1   37.77   18.88    N/A
    16777216       2097152     float    none      -1    433.9   38.67   19.33      0    423.5   39.61   19.81    N/A
    33554432       4194304     float    none      -1    849.8   39.49   19.74      0    828.4   40.51   20.25    N/A
    67108864       8388608     float    none      -1   1686.1   39.80   19.90      0   1639.1   40.94   20.47    N/A
   134217728      16777216     float    none      -1   3353.1   40.03   20.01      0   3261.5   41.15   20.58    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 6.75317
#

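From the two-GPU test, it's evident that enabling P2P results in a significant performance boost: at the largest message size (134217728 bytes), out-of-place bus bandwidth rises from 13.94 GB/s to 20.01 GB/s.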
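
For reading the tables in this issue, it may help to recall how the bandwidth columns relate to each other. As far as I understand the nccl-tests performance notes, algbw is the message size divided by the elapsed time, and for alltoall busbw is algbw scaled by (n-1)/n, where n is the number of ranks. The following is a minimal sketch that checks this against the 134217728-byte out-of-place rows quoted in this issue; the function names are my own and are not part of nccl-tests.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s: message size divided by elapsed time."""
    # bytes per microsecond equals MB/s * 1e0, i.e. 1e6 bytes/s; divide by 1e3 for GB/s.
    return size_bytes / time_us / 1e3


def alltoall_busbw_gbps(algbw, n_ranks):
    """Bus bandwidth for alltoall, assuming the (n-1)/n correction factor."""
    return algbw * (n_ranks - 1) / n_ranks


# (label, number of GPUs, size in bytes, out-of-place time in us), copied from this issue.
ROWS = [
    ("2 GPUs, default", 2, 134217728, 4813.9),
    ("2 GPUs, NCCL_P2P_LEVEL=SYS", 2, 134217728, 3353.1),
    ("8 GPUs, default", 8, 134217728, 7447.2),
    ("8 GPUs, NCCL_P2P_LEVEL=SYS", 8, 134217728, 106721.0),
]

for label, n, size, time_us in ROWS:
    alg = algbw_gbps(size, time_us)
    bus = alltoall_busbw_gbps(alg, n)
    print(f"{label}: algbw {alg:.2f} GB/s, busbw {bus:.2f} GB/s")
```

The recomputed values can be compared against the algbw and busbw columns of the corresponding table rows.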

The alltoall_perf test data for eight GPUs with P2P disabled:

root@moons:~/nccl-tests-master# ./build/alltoall_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 121126 on      moons device  0 [0x41] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid 121126 on      moons device  1 [0x42] NVIDIA GeForce RTX 4090
#  Rank  2 Group  0 Pid 121126 on      moons device  2 [0x43] NVIDIA GeForce RTX 4090
#  Rank  3 Group  0 Pid 121126 on      moons device  3 [0x44] NVIDIA GeForce RTX 4090
#  Rank  4 Group  0 Pid 121126 on      moons device  4 [0x61] NVIDIA GeForce RTX 4090
#  Rank  5 Group  0 Pid 121126 on      moons device  5 [0x62] NVIDIA GeForce RTX 4090
#  Rank  6 Group  0 Pid 121126 on      moons device  6 [0x63] NVIDIA GeForce RTX 4090
#  Rank  7 Group  0 Pid 121126 on      moons device  7 [0x64] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1    63.21    0.00    0.00      0    60.02    0.00    0.00    N/A
           0             0     float    none      -1    62.41    0.00    0.00      0    62.10    0.00    0.00    N/A
          32             1     float    none      -1    63.09    0.00    0.00      0    62.15    0.00    0.00    N/A
          64             2     float    none      -1    63.61    0.00    0.00      0    63.87    0.00    0.00    N/A
         128             4     float    none      -1    143.9    0.00    0.00      0    63.67    0.00    0.00    N/A
         256             8     float    none      -1    63.84    0.00    0.00      0    62.88    0.00    0.00    N/A
         512            16     float    none      -1    63.01    0.01    0.01      0    62.95    0.01    0.01    N/A
        1024            32     float    none      -1    63.13    0.02    0.01      0    63.05    0.02    0.01    N/A
        2048            64     float    none      -1    65.01    0.03    0.03      0    63.81    0.03    0.03    N/A
        4096           128     float    none      -1    64.14    0.06    0.06      0    63.38    0.06    0.06    N/A
        8192           256     float    none      -1    62.95    0.13    0.11      0    62.81    0.13    0.11    N/A
       16384           512     float    none      -1    64.39    0.25    0.22      0    63.06    0.26    0.23    N/A
       32768          1024     float    none      -1    63.51    0.52    0.45      0    62.81    0.52    0.46    N/A
       65536          2048     float    none      -1    64.23    1.02    0.89      0    63.03    1.04    0.91    N/A
      131072          4096     float    none      -1    65.07    2.01    1.76      0    64.43    2.03    1.78    N/A
      262144          8192     float    none      -1    75.25    3.48    3.05      0    74.81    3.50    3.07    N/A
      524288         16384     float    none      -1    66.32    7.91    6.92      0    65.05    8.06    7.05    N/A
     1048576         32768     float    none      -1    94.40   11.11    9.72      0    93.75   11.18    9.79    N/A
     2097152         65536     float    none      -1    169.6   12.36   10.82      0    169.4   12.38   10.83    N/A
     4194304        131072     float    none      -1    300.8   13.94   12.20      0    295.4   14.20   12.42    N/A
     8388608        262144     float    none      -1    562.6   14.91   13.05      0    561.0   14.95   13.08    N/A
    16777216        524288     float    none      -1   1058.8   15.85   13.86      0   1060.8   15.82   13.84    N/A
    33554432       1048576     float    none      -1   1982.6   16.92   14.81      0   1988.3   16.88   14.77    N/A
    67108864       2097152     float    none      -1   3902.6   17.20   15.05      0   3901.4   17.20   15.05    N/A
   134217728       4194304     float    none      -1   7447.2   18.02   15.77      0   7472.3   17.96   15.72    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.76019
#

The alltoall_perf test data for eight GPUs with P2P enabled:

root@moons:~/nccl-tests-master# NCCL_P2P_LEVEL=SYS ./build/alltoall_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 121524 on      moons device  0 [0x41] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid 121524 on      moons device  1 [0x42] NVIDIA GeForce RTX 4090
#  Rank  2 Group  0 Pid 121524 on      moons device  2 [0x43] NVIDIA GeForce RTX 4090
#  Rank  3 Group  0 Pid 121524 on      moons device  3 [0x44] NVIDIA GeForce RTX 4090
#  Rank  4 Group  0 Pid 121524 on      moons device  4 [0x61] NVIDIA GeForce RTX 4090
#  Rank  5 Group  0 Pid 121524 on      moons device  5 [0x62] NVIDIA GeForce RTX 4090
#  Rank  6 Group  0 Pid 121524 on      moons device  6 [0x63] NVIDIA GeForce RTX 4090
#  Rank  7 Group  0 Pid 121524 on      moons device  7 [0x64] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           0             0     float    none      -1    62.06    0.00    0.00      0    60.23    0.00    0.00    N/A
           0             0     float    none      -1    61.34    0.00    0.00      0    59.87    0.00    0.00    N/A
          32             1     float    none      -1    64.00    0.00    0.00      0    62.55    0.00    0.00    N/A
          64             2     float    none      -1    62.70    0.00    0.00      0    62.47    0.00    0.00    N/A
         128             4     float    none      -1    63.38    0.00    0.00      0    61.81    0.00    0.00    N/A
         256             8     float    none      -1    62.82    0.00    0.00      0    62.12    0.00    0.00    N/A
         512            16     float    none      -1    63.87    0.01    0.01      0    62.01    0.01    0.01    N/A
        1024            32     float    none      -1    62.26    0.02    0.01      0    62.37    0.02    0.01    N/A
        2048            64     float    none      -1    63.28    0.03    0.03      0    63.23    0.03    0.03    N/A
        4096           128     float    none      -1    63.83    0.06    0.06      0    62.95    0.07    0.06    N/A
        8192           256     float    none      -1    63.94    0.13    0.11      0    62.09    0.13    0.12    N/A
       16384           512     float    none      -1    63.99    0.26    0.22      0    63.94    0.26    0.22    N/A
       32768          1024     float    none      -1    66.91    0.49    0.43      0    65.51    0.50    0.44    N/A
       65536          2048     float    none      -1    122.3    0.54    0.47      0    120.3    0.54    0.48    N/A
      131072          4096     float    none      -1    237.7    0.55    0.48      0    235.8    0.56    0.49    N/A
      262144          8192     float    none      -1    464.4    0.56    0.49      0    459.7    0.57    0.50    N/A
      524288         16384     float    none      -1    466.4    1.12    0.98      0    460.3    1.14    1.00    N/A
     1048576         32768     float    none      -1    914.1    1.15    1.00      0    913.9    1.15    1.00    N/A
     2097152         65536     float    none      -1   1776.9    1.18    1.03      0   1786.8    1.17    1.03    N/A
     4194304        131072     float    none      -1   3445.6    1.22    1.07      0   3427.3    1.22    1.07    N/A
     8388608        262144     float    none      -1   6377.0    1.32    1.15      0   6258.2    1.34    1.17    N/A
    16777216        524288     float    none      -1    11991    1.40    1.22      0    11809    1.42    1.24    N/A
    33554432       1048576     float    none      -1    23087    1.45    1.27      0    22581    1.49    1.30    N/A
    67108864       2097152     float    none      -1    53267    1.26    1.10      0    53155    1.26    1.10    N/A
   134217728       4194304     float    none      -1   106721    1.26    1.10      0   106371    1.26    1.10    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.492673
#

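From the eight-GPU test, it's clear that enabling P2P causes a severe performance drop: at the largest message size (134217728 bytes), out-of-place bus bandwidth falls from 15.77 GB/s to 1.10 GB/s, and the average bus bandwidth falls from 4.76 to 0.49.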
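
To quantify the gap per message size, the two eight-GPU tables can be compared row by row. The following is a minimal sketch, assuming the column layout shown in the tables in this issue (size, count, type, redop, root, then out-of-place time, algbw, busbw). The helpers `parse_busbw` and `slowdown` are hypothetical names, and only three rows from each run are embedded here.

```python
def parse_busbw(output):
    """Map message size in bytes to out-of-place busbw in GB/s."""
    result = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with an integer size; header and summary lines start with '#'.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        # Columns: size count type redop root time algbw busbw ...
        result[int(fields[0])] = float(fields[7])
    return result


def slowdown(baseline, candidate):
    """Return the baseline/candidate busbw ratio for sizes present in both runs."""
    return {
        size: baseline[size] / candidate[size]
        for size in sorted(baseline)
        if size in candidate and candidate[size] > 0
    }


# Three rows copied from each eight-GPU table in this issue.
DEFAULT_RUN = """\
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     1048576         32768     float    none      -1    94.40   11.11    9.72      0    93.75   11.18    9.79    N/A
    16777216        524288     float    none      -1   1058.8   15.85   13.86      0   1060.8   15.82   13.84    N/A
   134217728       4194304     float    none      -1   7447.2   18.02   15.77      0   7472.3   17.96   15.72    N/A
"""

SYS_RUN = """\
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
     1048576         32768     float    none      -1    914.1    1.15    1.00      0    913.9    1.15    1.00    N/A
    16777216        524288     float    none      -1    11991    1.40    1.22      0    11809    1.42    1.24    N/A
   134217728       4194304     float    none      -1   106721    1.26    1.10      0   106371    1.26    1.10    N/A
"""

ratios = slowdown(parse_busbw(DEFAULT_RUN), parse_busbw(SYS_RUN))
for size, ratio in ratios.items():
    print(f"{size:>10} B: {ratio:.1f}x lower busbw with NCCL_P2P_LEVEL=SYS")
```

Feeding the complete outputs of both runs into `parse_busbw` would show at which message size the two curves start to diverge.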

Does anyone have experience in addressing this performance degradation when enabling P2P for eight GPUs?

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

If more information is needed, I can provide it at any time.

ZP-AlwaysWin added the bug label Sep 23, 2024
@mylesgoose

I have 8 RTX 4090 GPUs working fine with 550.90.07-p2p, and some of them are on NVMe x4 PCIe adapter cards.

@ZP-AlwaysWin
Author

I am using the 550.90.07-p2p driver and can still reproduce the issue where enabling P2P capability results in a performance drop in the alltoall_perf test. Could you please share your test data?

@mylesgoose

Maybe it doesn't like 8 cards :-)

NCCL_P2P_LEVEL=SYS ./alltoall_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 370722 on   ubuntu11 device  0 [0x01] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid 370722 on   ubuntu11 device  1 [0x02] NVIDIA GeForce RTX 4090
#  Rank  2 Group  0 Pid 370722 on   ubuntu11 device  2 [0x2a] NVIDIA GeForce RTX 4090
#  Rank  3 Group  0 Pid 370722 on   ubuntu11 device  3 [0x2c] NVIDIA GeForce RTX 4090
#  Rank  4 Group  0 Pid 370722 on   ubuntu11 device  4 [0x41] NVIDIA GeForce RTX 4090
#  Rank  5 Group  0 Pid 370722 on   ubuntu11 device  5 [0x42] NVIDIA GeForce RTX 4090
#  Rank  6 Group  0 Pid 370722 on   ubuntu11 device  6 [0x61] NVIDIA GeForce RTX 4090
#  Rank  7 Group  0 Pid 370722 on   ubuntu11 device  7 [0x62] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           0             0     float    none      -1    49.16    0.00    0.00      0    48.85    0.00    0.00    N/A
           0             0     float    none      -1    49.42    0.00    0.00      0    48.91    0.00    0.00    N/A
          32             1     float    none      -1    50.95    0.00    0.00      0    50.04    0.00    0.00    N/A
          64             2     float    none      -1    50.87    0.00    0.00      0    50.45    0.00    0.00    N/A
         128             4     float    none      -1    50.24    0.00    0.00      0    50.41    0.00    0.00    N/A
         256             8     float    none      -1    50.94    0.01    0.00      0    89.63    0.00    0.00    N/A
         512            16     float    none      -1    52.09    0.01    0.01      0    51.60    0.01    0.01    N/A
        1024            32     float    none      -1    51.65    0.02    0.02      0    66.24    0.02    0.01    N/A
        2048            64     float    none      -1    54.57    0.04    0.03      0    51.28    0.04    0.03    N/A
        4096           128     float    none      -1    51.27    0.08    0.07      0    51.27    0.08    0.07    N/A
        8192           256     float    none      -1    53.11    0.15    0.13      0    51.23    0.16    0.14    N/A
       16384           512     float    none      -1    60.79    0.27    0.24      0    56.50    0.29    0.25    N/A
       32768          1024     float    none      -1    113.8    0.29    0.25      0    113.5    0.29    0.25    N/A
       65536          2048     float    none      -1    301.3    0.22    0.19      0    229.7    0.29    0.25    N/A
      131072          4096     float    none      -1    426.3    0.31    0.27      0    427.6    0.31    0.27    N/A
      262144          8192     float    none      -1    774.7    0.34    0.30      0    687.1    0.38    0.33    N/A
      524288         16384     float    none      -1    599.9    0.87    0.76      0    474.4    1.11    0.97    N/A
     1048576         32768     float    none      -1   1159.5    0.90    0.79      0   1056.1    0.99    0.87    N/A
     2097152         65536     float    none      -1   2342.7    0.90    0.78      0   2205.7    0.95    0.83    N/A
     4194304        131072     float    none      -1   4459.3    0.94    0.82      0   4234.0    0.99    0.87    N/A
     8388608        262144     float    none      -1   8618.4    0.97    0.85      0   8355.6    1.00    0.88    N/A
    16777216        524288     float    none      -1    16657    1.01    0.88      0    16446    1.02    0.89    N/A
    33554432       1048576     float    none      -1    32761    1.02    0.90      0    32640    1.03    0.90    N/A
    67108864       2097152     float    none      -1    65425    1.03    0.90      0    65250    1.03    0.90    N/A
   134217728       4194304     float    none      -1   132129    1.02    0.89      0   130992    1.02    0.90    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.374518 
#

myles@ubuntu11:~/nccl-tests/build$ ./alltoall_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 370879 on   ubuntu11 device  0 [0x01] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid 370879 on   ubuntu11 device  1 [0x02] NVIDIA GeForce RTX 4090
#  Rank  2 Group  0 Pid 370879 on   ubuntu11 device  2 [0x2a] NVIDIA GeForce RTX 4090
#  Rank  3 Group  0 Pid 370879 on   ubuntu11 device  3 [0x2c] NVIDIA GeForce RTX 4090
#  Rank  4 Group  0 Pid 370879 on   ubuntu11 device  4 [0x41] NVIDIA GeForce RTX 4090
#  Rank  5 Group  0 Pid 370879 on   ubuntu11 device  5 [0x42] NVIDIA GeForce RTX 4090
#  Rank  6 Group  0 Pid 370879 on   ubuntu11 device  6 [0x61] NVIDIA GeForce RTX 4090
#  Rank  7 Group  0 Pid 370879 on   ubuntu11 device  7 [0x62] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           0             0     float    none      -1    66.31    0.00    0.00      0    49.37    0.00    0.00    N/A
           0             0     float    none      -1    50.01    0.00    0.00      0    49.44    0.00    0.00    N/A
          32             1     float    none      -1    50.88    0.00    0.00      0    51.28    0.00    0.00    N/A
          64             2     float    none      -1    64.19    0.00    0.00      0    51.21    0.00    0.00    N/A
         128             4     float    none      -1    51.64    0.00    0.00      0    51.11    0.00    0.00    N/A
         256             8     float    none      -1    51.55    0.00    0.00      0    50.90    0.01    0.00    N/A
         512            16     float    none      -1    51.44    0.01    0.01      0    50.86    0.01    0.01    N/A
        1024            32     float    none      -1    50.94    0.02    0.02      0    115.4    0.01    0.01    N/A
        2048            64     float    none      -1    51.44    0.04    0.03      0    50.84    0.04    0.04    N/A
        4096           128     float    none      -1    51.01    0.08    0.07      0    51.03    0.08    0.07    N/A
        8192           256     float    none      -1    51.89    0.16    0.14      0    51.40    0.16    0.14    N/A
       16384           512     float    none      -1    63.17    0.26    0.23      0    56.80    0.29    0.25    N/A
       32768          1024     float    none      -1    110.4    0.30    0.26      0    112.7    0.29    0.25    N/A
       65536          2048     float    none      -1    229.7    0.29    0.25      0    226.8    0.29    0.25    N/A
      131072          4096     float    none      -1    431.9    0.30    0.27      0    427.4    0.31    0.27    N/A
      262144          8192     float    none      -1    790.9    0.33    0.29      0    699.8    0.37    0.33    N/A
      524288         16384     float    none      -1    580.2    0.90    0.79      0    523.1    1.00    0.88    N/A
     1048576         32768     float    none      -1   1197.2    0.88    0.77      0   1132.2    0.93    0.81    N/A
     2097152         65536     float    none      -1   2355.8    0.89    0.78      0   2166.9    0.97    0.85    N/A
     4194304        131072     float    none      -1   4450.6    0.94    0.82      0   4277.7    0.98    0.86    N/A
     8388608        262144     float    none      -1   8625.1    0.97    0.85      0   8311.6    1.01    0.88    N/A
    16777216        524288     float    none      -1    16530    1.01    0.89      0    16491    1.02    0.89    N/A
    33554432       1048576     float    none      -1    32384    1.04    0.91      0    32438    1.03    0.91    N/A
    67108864       2097152     float    none      -1    65691    1.02    0.89      0    65766    1.02    0.89    N/A
   134217728       4194304     float    none      -1   132444    1.01    0.89      0   130140    1.03    0.90    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0.372955 
#

myles@ubuntu11:~/nccl-tests/build$ nvidia-smi topo -p2p rw
 	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	
 GPU0	X	OK	OK	OK	OK	OK	OK	OK	
 GPU1	OK	X	OK	OK	OK	OK	OK	OK	
 GPU2	OK	OK	X	OK	OK	OK	OK	OK	
 GPU3	OK	OK	OK	X	OK	OK	OK	OK	
 GPU4	OK	OK	OK	OK	X	OK	OK	OK	
 GPU5	OK	OK	OK	OK	OK	X	OK	OK	
 GPU6	OK	OK	OK	OK	OK	OK	X	OK	
 GPU7	OK	OK	OK	OK	OK	OK	OK	X	

Legend:

  X    = Self
  OK   = Status Ok
  CNS  = Chipset not supported
  GNS  = GPU not supported
  TNS  = Topology not supported
  NS   = Not supported
  U    = Unknown
myles@ubuntu11:~/nccl-tests/build$ 

@mylesgoose

I think this might be a normal result, because 8 cards could be saturating the bandwidth. I wonder why the results differ so much between our setups. What motherboard do you use, and how do you connect the 8 cards? I use PCIe 5.0 riser cables for 7 of them and an NVMe adapter for the 8th one. My average bus bandwidth is 0.372955 versus your 0.492673, but I have all the cards limited to 400 W of power.
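
As a rough sanity check on the saturation idea, the theoretical per-direction PCIe bandwidth can be estimated from the link generation and lane count. The following is a minimal sketch using the standard per-lane signalling rates (8, 16 and 32 GT/s for PCIe 3.0, 4.0 and 5.0, all with 128b/130b encoding). The lane widths in the example are assumptions, since the thread does not state the negotiated width of each link, and real throughput is lower because of protocol overhead.

```python
# Per-lane signalling rate in GT/s for each PCIe generation.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

# PCIe 3.0 and later use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def pcie_gbps(gen, lanes):
    """Theoretical one-direction bandwidth in GB/s for a PCIe link."""
    return GT_PER_S[gen] * lanes * ENCODING_EFFICIENCY / 8  # bits -> bytes


# Assumed link configurations, for illustration only.
for gen, lanes in [(4, 16), (4, 4), (5, 16)]:
    print(f"PCIe {gen}.0 x{lanes}: {pcie_gbps(gen, lanes):.2f} GB/s per direction")
```

Even an x4 link would have a ceiling several times above the roughly 1 GB/s measured with 8 GPUs, although this estimate says nothing about shared upstream links or how P2P traffic is routed through the root complex.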

@ZP-AlwaysWin
Author

My GPUs are not power-limited, and I’m using an Intel SPR motherboard with all 8 GPUs directly connected to the CPU, using PCIe 4.0. I don't think it’s related to bandwidth saturation, because even with just 3 GPUs, the performance significantly degrades when P2P is enabled.
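
Since the degradation reportedly appears with as few as 3 GPUs, sweeping the GPU count could show where it starts. The following is a minimal sketch that builds the command lines used in this issue for 2 to 8 GPUs. The binary path and flags come from the commands in this thread, while `sweep_commands` and `run_sweep` are hypothetical helpers. By default it only prints the commands; pass `execute=True` on a machine where nccl-tests is built.

```python
import os
import subprocess


def sweep_commands(binary="./build/alltoall_perf", gpu_counts=range(2, 9)):
    """Yield (env_overrides, argv) pairs for a default run and an NCCL_P2P_LEVEL=SYS run."""
    for n in gpu_counts:
        argv = [binary, "-b", "8", "-e", "128M", "-f", "2", "-g", str(n)]
        yield {}, argv                          # default NCCL settings
        yield {"NCCL_P2P_LEVEL": "SYS"}, argv   # same run with NCCL_P2P_LEVEL=SYS


def run_sweep(execute=False):
    """Print each command and optionally run it with the overrides applied."""
    for overrides, argv in sweep_commands():
        prefix = " ".join(f"{key}={value}" for key, value in overrides.items())
        print((prefix + " " if prefix else "") + " ".join(argv))
        if execute:
            subprocess.run(argv, env={**os.environ, **overrides}, check=True)


if __name__ == "__main__":
    run_sweep(execute=False)
```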

@mylesgoose

Have a look at the debug log. When more than two devices are listed it actually says:
NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
That means the traffic is going via the CPU (host shared memory) instead of directly between the GPUs. With just two devices it shows this instead:
NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
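
One way to check which transport each connection uses is to count the "via ..." suffixes in the NCCL_DEBUG=INFO output. The following is a minimal sketch, assuming the channel lines look like the two quoted in this comment. The regular expression and the `count_transports` helper are my own and may need adjusting for other NCCL versions.

```python
import re
from collections import Counter

# Matches lines such as:
#   NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
# and captures the transport name that follows "via" (here "P2P").
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel \S+ : \d+\[\d+\] -> \d+\[\d+\] via (\w+)"
)


def count_transports(log_text):
    """Count connection lines per transport name (for example SHM or P2P)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines taken from this thread; the ring layout line must not be counted.
SAMPLE = """\
ubuntu11:103631:103642 [0] NCCL INFO Channel 00/04 : 0 1 2
NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
"""

print(dict(count_transports(SAMPLE)))
```

Running this over the full log of a 3-GPU or 8-GPU run would show whether any connection still uses P2P or whether all of them fall back to SHM.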

export NCCL_P2P_DISABLE=0
myles@ubuntu11:~/nccl-tests/build$ NCCL_DEBUG=INFO ./all_reduce_perf -g 3
# nThread 1 nGpus 3 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 103631 on   ubuntu11 device  0 [0x01] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid 103631 on   ubuntu11 device  1 [0x02] NVIDIA GeForce RTX 4090
#  Rank  2 Group  0 Pid 103631 on   ubuntu11 device  2 [0x2b] NVIDIA GeForce RTX 4090
ubuntu11:103631:103631 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
ubuntu11:103631:103631 [0] NCCL INFO cudaDriverVersion 12060
ubuntu11:103631:103631 [0] NCCL INFO NCCL version 2.23.4+cuda12.4
ubuntu11:103631:103642 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
ubuntu11:103631:103642 [0] NCCL INFO NET/IB : No device found.
ubuntu11:103631:103642 [0] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
ubuntu11:103631:103642 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
ubuntu11:103631:103642 [0] NCCL INFO Using network Socket
ubuntu11:103631:103644 [2] NCCL INFO Using network Socket
ubuntu11:103631:103643 [1] NCCL INFO Using network Socket
ubuntu11:103631:103644 [2] NCCL INFO ncclCommInitAll comm 0x650203e649b0 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0xfcf1664b18e49f00 - Init START
ubuntu11:103631:103643 [1] NCCL INFO ncclCommInitAll comm 0x650203e24070 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0xfcf1664b18e49f00 - Init START
ubuntu11:103631:103642 [0] NCCL INFO ncclCommInitAll comm 0x650203de37d0 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0xfcf1664b18e49f00 - Init START
ubuntu11:103631:103643 [1] NCCL INFO Bootstrap timings total 0.001324 (create 0.000039, send 0.000161, recv 0.000395, ring 0.000121, delay 0.000000)
ubuntu11:103631:103642 [0] NCCL INFO Bootstrap timings total 0.001267 (create 0.000048, send 0.000227, recv 0.000700, ring 0.000082, delay 0.000000)
ubuntu11:103631:103644 [2] NCCL INFO Bootstrap timings total 0.001342 (create 0.000042, send 0.000175, recv 0.000675, ring 0.000234, delay 0.000001)
ubuntu11:103631:103643 [1] NCCL INFO NVLS multicast support is not available on dev 1
ubuntu11:103631:103644 [2] NCCL INFO NVLS multicast support is not available on dev 2
ubuntu11:103631:103642 [0] NCCL INFO NVLS multicast support is not available on dev 0
ubuntu11:103631:103644 [2] NCCL INFO comm 0x650203e649b0 rank 2 nRanks 3 nNodes 1 localRanks 3 localRank 2 MNNVL 0
ubuntu11:103631:103643 [1] NCCL INFO comm 0x650203e24070 rank 1 nRanks 3 nNodes 1 localRanks 3 localRank 1 MNNVL 0
ubuntu11:103631:103642 [0] NCCL INFO comm 0x650203de37d0 rank 0 nRanks 3 nNodes 1 localRanks 3 localRank 0 MNNVL 0
ubuntu11:103631:103642 [0] NCCL INFO Channel 00/04 : 0 1 2
ubuntu11:103631:103642 [0] NCCL INFO Channel 01/04 : 0 1 2
ubuntu11:103631:103642 [0] NCCL INFO Channel 02/04 : 0 1 2
ubuntu11:103631:103644 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
ubuntu11:103631:103643 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
ubuntu11:103631:103643 [1] NCCL INFO P2P Chunksize set to 131072
ubuntu11:103631:103642 [0] NCCL INFO Channel 03/04 : 0 1 2
ubuntu11:103631:103642 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
ubuntu11:103631:103642 [0] NCCL INFO P2P Chunksize set to 131072
ubuntu11:103631:103644 [2] NCCL INFO P2P Chunksize set to 131072
ubuntu11:103631:103647 [1] NCCL INFO [Proxy Service] Device 1 CPU core 118
ubuntu11:103631:103649 [2] NCCL INFO [Proxy Service] Device 2 CPU core 59
ubuntu11:103631:103652 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 7
ubuntu11:103631:103650 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 65
ubuntu11:103631:103648 [0] NCCL INFO [Proxy Service] Device 0 CPU core 39
ubuntu11:103631:103651 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 105
ubuntu11:103631:103643 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 00 : 2[2] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 01 : 2[2] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 02 : 2[2] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 03 : 1[1] -> 2[2] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 03 : 2[2] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Connected all rings
ubuntu11:103631:103644 [2] NCCL INFO Connected all rings
ubuntu11:103631:103642 [0] NCCL INFO Connected all rings
ubuntu11:103631:103644 [2] NCCL INFO Channel 00 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 02 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103644 [2] NCCL INFO Channel 03 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103643 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu11:103631:103642 [0] NCCL INFO Connected all trees
ubuntu11:103631:103644 [2] NCCL INFO Connected all trees
ubuntu11:103631:103643 [1] NCCL INFO Connected all trees
ubuntu11:103631:103658 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 80
ubuntu11:103631:103659 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 121
ubuntu11:103631:103643 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
ubuntu11:103631:103643 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:103631:103642 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
ubuntu11:103631:103642 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:103631:103642 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
ubuntu11:103631:103660 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 97
ubuntu11:103631:103644 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
ubuntu11:103631:103644 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:103631:103642 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
ubuntu11:103631:103642 [0] NCCL INFO ncclCommInitAll comm 0x650203de37d0 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0xfcf1664b18e49f00 - Init COMPLETE
ubuntu11:103631:103642 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 3 total 0.56 (kernels 0.26, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.21, rest 0.01)
ubuntu11:103631:103643 [1] NCCL INFO ncclCommInitAll comm 0x650203e24070 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0xfcf1664b18e49f00 - Init COMPLETE
ubuntu11:103631:103644 [2] NCCL INFO ncclCommInitAll comm 0x650203e649b0 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0xfcf1664b18e49f00 - Init COMPLETE
ubuntu11:103631:103643 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 3 total 0.56 (kernels 0.26, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.21, rest 0.01)
ubuntu11:103631:103644 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 3 total 0.56 (kernels 0.26, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.22, rest 0.00)
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       8388608     float     sum      -1    12835    2.61    3.49      0    12855    2.61    3.48      0
ubuntu11:103631:103631 [0] NCCL INFO comm 0x650203de37d0 rank 0 nranks 3 cudaDev 0 busId 1000 - Destroy COMPLETE
ubuntu11:103631:103631 [2] NCCL INFO comm 0x650203e649b0 rank 2 nranks 3 cudaDev 2 busId 2b000 - Destroy COMPLETE
ubuntu11:103631:103631 [1] NCCL INFO comm 0x650203e24070 rank 1 nranks 3 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 3.48293 
#

myles@ubuntu11:~/nccl-tests/build$ NCCL_DEBUG=INFO ./all_reduce_perf -g 2
# nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 104198 on   ubuntu11 device  0 [0x01] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid 104198 on   ubuntu11 device  1 [0x02] NVIDIA GeForce RTX 4090
ubuntu11:104198:104198 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>
ubuntu11:104198:104198 [0] NCCL INFO cudaDriverVersion 12060
ubuntu11:104198:104198 [0] NCCL INFO NCCL version 2.23.4+cuda12.4
ubuntu11:104198:104212 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
ubuntu11:104198:104212 [0] NCCL INFO NET/IB : No device found.
ubuntu11:104198:104212 [0] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>
ubuntu11:104198:104212 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
ubuntu11:104198:104212 [0] NCCL INFO Using network Socket
ubuntu11:104198:104213 [1] NCCL INFO Using network Socket
ubuntu11:104198:104213 [1] NCCL INFO ncclCommInitAll comm 0x64b9672f4710 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 2000 commId 0x91379a7e94fa7192 - Init START
ubuntu11:104198:104212 [0] NCCL INFO ncclCommInitAll comm 0x64b9672b4af0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x91379a7e94fa7192 - Init START
ubuntu11:104198:104213 [1] NCCL INFO Bootstrap timings total 0.001072 (create 0.000054, send 0.000212, recv 0.000337, ring 0.000068, delay 0.000000)
ubuntu11:104198:104212 [0] NCCL INFO Bootstrap timings total 0.001006 (create 0.000046, send 0.000226, recv 0.000458, ring 0.000054, delay 0.000000)
ubuntu11:104198:104213 [1] NCCL INFO comm 0x64b9672f4710 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
ubuntu11:104198:104213 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
ubuntu11:104198:104213 [1] NCCL INFO P2P Chunksize set to 131072
ubuntu11:104198:104212 [0] NCCL INFO comm 0x64b9672b4af0 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
ubuntu11:104198:104212 [0] NCCL INFO Channel 00/04 : 0 1
ubuntu11:104198:104212 [0] NCCL INFO Channel 01/04 : 0 1
ubuntu11:104198:104212 [0] NCCL INFO Channel 02/04 : 0 1
ubuntu11:104198:104212 [0] NCCL INFO Channel 03/04 : 0 1
ubuntu11:104198:104212 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
ubuntu11:104198:104212 [0] NCCL INFO P2P Chunksize set to 131072
ubuntu11:104198:104216 [1] NCCL INFO [Proxy Service] Device 1 CPU core 80
ubuntu11:104198:104217 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 43
ubuntu11:104198:104218 [0] NCCL INFO [Proxy Service] Device 0 CPU core 49
ubuntu11:104198:104219 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 22
ubuntu11:104198:104212 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
ubuntu11:104198:104212 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
ubuntu11:104198:104212 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
ubuntu11:104198:104212 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
ubuntu11:104198:104213 [1] NCCL INFO Connected all rings
ubuntu11:104198:104213 [1] NCCL INFO Connected all trees
ubuntu11:104198:104212 [0] NCCL INFO Connected all rings
ubuntu11:104198:104212 [0] NCCL INFO Connected all trees
ubuntu11:104198:104220 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 15
ubuntu11:104198:104221 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 115
ubuntu11:104198:104212 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ubuntu11:104198:104212 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:104198:104212 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
ubuntu11:104198:104213 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
ubuntu11:104198:104213 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
ubuntu11:104198:104212 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
ubuntu11:104198:104212 [0] NCCL INFO ncclCommInitAll comm 0x64b9672b4af0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x91379a7e94fa7192 - Init COMPLETE
ubuntu11:104198:104212 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 2 total 0.29 (kernels 0.19, alloc 0.07, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.02, rest 0.00)
ubuntu11:104198:104213 [1] NCCL INFO ncclCommInitAll comm 0x64b9672f4710 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 2000 commId 0x91379a7e94fa7192 - Init COMPLETE
ubuntu11:104198:104213 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 2 total 0.29 (kernels 0.20, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.02, rest 0.00)
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       8388608     float     sum      -1   1398.2   24.00   24.00      0   1395.1   24.05   24.05      0
ubuntu11:104198:104198 [0] NCCL INFO comm 0x64b9672b4af0 rank 0 nranks 2 cudaDev 0 busId 1000 - Destroy COMPLETE
ubuntu11:104198:104198 [1] NCCL INFO comm 0x64b9672f4710 rank 1 nranks 2 cudaDev 1 busId 2000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 24.0255 
#
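As a sanity check on the two result rows above, the reported bandwidth figures can be reproduced by hand. This is a minimal sketch, assuming the nccl-tests convention that algbw is bytes divided by elapsed time and that all_reduce bus bandwidth is algbw scaled by 2(n-1)/n; the function names are illustrative and not part of nccl-tests.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s, taking 1 GB as 1e9 bytes."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbps(algbw, n_ranks):
    """Bus bandwidth for all_reduce: algbw scaled by 2(n-1)/n (assumed convention)."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


size = 33554432                      # message size from both runs
three = algbw_gbps(size, 12835)      # 3 GPUs, SHM/direct/direct run
two = algbw_gbps(size, 1398.2)       # 2 GPUs, P2P/direct pointer run

print(f"3 GPUs: algbw {three:.2f} GB/s, busbw {allreduce_busbw_gbps(three, 3):.2f} GB/s")
print(f"2 GPUs: algbw {two:.2f} GB/s, busbw {allreduce_busbw_gbps(two, 2):.2f} GB/s")
```

The values line up with the out-of-place columns in the two logs, so the gap between the runs comes from the measured time per iteration (12835 us versus 1398.2 us) rather than from how the bandwidth is computed.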

@mylesgoose

NCCL_P2P_LEVEL=PHB

@mylesgoose

mylesgoose commented Oct 20, 2024

@ZP-AlwaysWin hey, I found a solution. NCCL chooses CPU transfer when NCCL_P2P_LEVEL is SYS and more than 2 GPUs are selected. Export the setting above (NCCL_P2P_LEVEL=PHB) and try it. I tested on CUDA 12.6 with NVIDIA driver 560.35.03, and it used P2P for all the transfers with 7 GPUs. I'm going to test with 8 GPUs in a few minutes, once I plug another GPU back in. With NCCL_P2P_LEVEL=PHB, 7 GPUs show very high bandwidth, about 24 GB/s.

@ZP-AlwaysWin
Author

Here is the log output for the command NCCL_DEBUG=INFO NCCL_P2P_LEVEL=SYS ./alltoall_perf -b 8 -e 128M -f 2 -g 8

@ZP-AlwaysWin
Author

ZP-AlwaysWin commented Oct 21, 2024

Please provide detailed test logs, or leave contact information for further communication. @mylesgoose

@ZP-AlwaysWin
Author

I just tested the P2P capability using version 560.35.03, and the result is the same as the previous version, with no improvement.

@mylesgoose

Did you try NCCL_P2P_LEVEL=PHB? When I tested my system with SYS, it fell back to CPU transfer. With PHB it used P2P across 7 cards. However, if I enable a card that does not have PCIe x16 bandwidth, it falls back to CPU copy. You can see this if you run the command with NCCL_DEBUG=INFO. My point is that if one device does not have the same bandwidth as the rest, NCCL falls back to the CPU. I cannot get my PC to boot with 8 cards at the moment because of PCIe 4.0 x16 issues on my ASUS WRX80E motherboard. Can you try running with that setting, and are all your devices PCIe 4.0 x16? @ZP-AlwaysWin, try using CUDA_VISIBLE_DEVICES to hide any devices that are not PCIe x16.
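The SYS versus PHB comparison described above could be scripted so both runs use identical arguments. This is a rough sketch only: the helper names are made up for illustration, the binary path and GPU count are placeholders, and the only settings it relies on are the NCCL_DEBUG and NCCL_P2P_LEVEL environment variables and the -g flag already used in this thread.

```python
import os
import subprocess


def build_run(p2p_level, n_gpus, binary="./all_reduce_perf"):
    """Return (env, argv) for one nccl-tests run at the given NCCL_P2P_LEVEL."""
    env = dict(os.environ)            # copy, so the caller's environment is untouched
    env["NCCL_DEBUG"] = "INFO"        # needed to see the per-channel transport lines
    env["NCCL_P2P_LEVEL"] = p2p_level
    argv = [binary, "-g", str(n_gpus)]
    return env, argv


def run_and_capture(p2p_level, n_gpus, binary="./all_reduce_perf"):
    """Run the benchmark once and return its combined stdout and stderr."""
    env, argv = build_run(p2p_level, n_gpus, binary)
    result = subprocess.run(argv, env=env, capture_output=True, text=True, check=False)
    return result.stdout + result.stderr


# Example (not executed here, since it needs the nccl-tests binary and GPUs):
#   for level in ("SYS", "PHB"):
#       print(level, run_and_capture(level, 7).count("via P2P"))
```

Keeping the environment construction separate from the subprocess call makes it easy to print and compare the exact settings of each run before launching anything.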

@ZP-AlwaysWin
Author

ZP-AlwaysWin commented Oct 22, 2024

NCCL_P2P_LEVEL=PHB

I tested NCCL_P2P_LEVEL=PHB, and the result is the same as with P2P capability disabled, which means P2P is effectively not being used. @mylesgoose

@mylesgoose

Yeah, but did you try disabling the cards that are not PCIe x16? For example, if device number 2 is PCIe x8 and the rest are x16, export CUDA_VISIBLE_DEVICES=0,1,3,4,5,6,7 together with NCCL_P2P_LEVEL=PHB, and test whether P2P is still enabled at that level. It should be, as long as all devices are on the same NUMA node: I have seen it work just fine with 7 cards at x16.

When I ran that test with only 7 cards, excluding the 8th card on PCIe x4, the system worked with full P2P. When I ran the same test with 7 cards including the x4 card, it sent everything via the CPU, as shown in the NCCL INFO output. If I had 8 cards with P2P enabled and one of them at x4, it would go through the CPU, because it already falls back to the CPU with 7 cards when one is at x4 and the rest are at x16. Yet it works fine with 7 cards at P2P level PHB as long as none of them is the x4 card.

This leads me to think the problem lies in the NCCL software, which we have the source code for, so we could find out why. Perhaps it does a quick test to see whether all devices are PCIe 4.0 x16 equivalent. I can't get my motherboard working, so it's being returned, and I can't prove that 8 cards work as long as they are all at the same speed. Hopefully I can order a new motherboard soon, but that ASUS Sage WRX80E is playing up with that many cards.

My point is this: it is not a driver issue. It is an issue with how NCCL handles that specific request depending on your hardware configuration. Maybe you have 8 cards at PCIe 4.0 x16. If you do, and you have tried NCCL_P2P_LEVEL=PHB and the INFO output still does not show P2P, then I may be wrong. But I can replicate this issue with 5, 6, or 7 devices, depending on whether one of them has lower PCIe bandwidth than the other 4, 5, or 6 cards, and I feel that points to the culprit. I think we should investigate that path, so it would be good if you could test that idea.
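The check suggested above, keeping only the cards that run at x16, might be automated along these lines. This is a sketch under assumptions: it relies on the nvidia-smi query fields pcie.link.gen.current and pcie.link.width.current, the helper names are hypothetical, and the sample text is made-up data rather than output from either machine in this thread.

```python
import subprocess

QUERY = "index,pcie.link.gen.current,pcie.link.width.current"


def query_links():
    """Ask nvidia-smi for each GPU's current PCIe generation and lane width."""
    return subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout


def full_width_devices(csv_text, want_width=16):
    """Return the indices of GPUs whose current link width equals want_width."""
    keep = []
    for line in csv_text.strip().splitlines():
        index, _gen, width = (field.strip() for field in line.split(","))
        if int(width) == want_width:
            keep.append(int(index))
    return keep


def visible_devices_value(indices):
    """Format indices as a CUDA_VISIBLE_DEVICES value."""
    return ",".join(str(i) for i in indices)


# Hypothetical sample: GPU 2 negotiated x8, the others x16.
sample = "0, 4, 16\n1, 4, 16\n2, 4, 8\n3, 4, 16\n"
print("CUDA_VISIBLE_DEVICES=" + visible_devices_value(full_width_devices(sample)))
```

On a real system, `full_width_devices(query_links())` would replace the sample. One caveat: nvidia-smi lists GPUs in PCI bus order, while CUDA's default enumeration can differ, so setting CUDA_DEVICE_ORDER=PCI_BUS_ID is probably needed for the indices to line up.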

@mylesgoose

Above I showed the settings NCCL_DEBUG=INFO and NCCL_P2P_LEVEL=PHB. Can you try them with only the cards that are at full PCIe 4.0 x16 bandwidth? It does not work if one card has lower bandwidth than the others: all cards must be at the same speed, or NCCL disables P2P. Do the above and see whether the terminal shows "via SHM/direct/direct". If it does, you have the answer: NCCL is forcing CPU transfer on those devices rather than P2P, which is why performance degrades. All 8 cards are trying to copy to CPU RAM at once and then back to each other. @ZP-AlwaysWin
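Rather than scanning the terminal by eye, the transport on each channel line can be tallied from a saved log. This is a minimal sketch that assumes the line format seen in the logs pasted in this thread (NCCL 2.23.4); the regular expression and function name are my own, and other NCCL versions may format these lines differently.

```python
import re
from collections import Counter

# Matches per-connection lines such as
#   "... NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct"
#   "... NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer"
# but not ring-order lines such as "... NCCL INFO Channel 00/04 : 0 1 2".
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel (\d+)(?:/\d+)? : (\d+)\[\w+\] -> (\d+)\[\w+\] via (.+)$"
)


def transports(log_text):
    """Count how many channel connections use each transport string."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CHANNEL_RE.search(line)
        if match:
            counts[match.group(4).strip()] += 1
    return counts


sample_log = """\
ubuntu11:103631:103642 [0] NCCL INFO Channel 00/04 : 0 1 2
ubuntu11:103631:103643 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
ubuntu11:104198:104212 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
"""
for transport, count in transports(sample_log).items():
    print(f"{count:3d}  {transport}")
```

Any SHM entry in the tally for a single-node run would indicate that at least some connections are going through host memory instead of P2P.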

@mylesgoose

mylesgoose commented Oct 22, 2024

Here are the outputs showing that the transfers use P2P, provided all the cards have PCIe x16 bandwidth. If any of the cards is PCIe x4 or x8, NCCL defaults to the CPU; if the cards are all on the same device and at the same speed, it works. As you can see, the speeds are excellent and do not vary as I go right up to my maximum number of PCIe x16 slots. When I change CUDA_VISIBLE_DEVICES to enable the x4 card and disable one of the x16 cards, NCCL falls back to the CPU.

# nThread 1 nGpus 3 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

#

# Using devices

#  Rank  0 Group  0 Pid  11800 on myles-System-Product-Name device  0 [0x01] NVIDIA GeForce RTX 4090

#  Rank  1 Group  0 Pid  11800 on myles-System-Product-Name device  1 [0x02] NVIDIA GeForce RTX 4090

#  Rank  2 Group  0 Pid  11800 on myles-System-Product-Name device  2 [0x2b] NVIDIA GeForce RTX 4090

myles-System-Product-Name:11800:11800 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>

myles-System-Product-Name:11800:11800 [0] NCCL INFO cudaDriverVersion 12060

myles-System-Product-Name:11800:11800 [0] NCCL INFO NCCL version 2.23.4+cuda12.6

myles-System-Product-Name:11800:11832 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.

myles-System-Product-Name:11800:11832 [1] NCCL INFO Failed to open libibverbs.so[.1]

myles-System-Product-Name:11800:11832 [1] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>

myles-System-Product-Name:11800:11832 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.

myles-System-Product-Name:11800:11832 [1] NCCL INFO Using network Socket

myles-System-Product-Name:11800:11833 [2] NCCL INFO Using network Socket

myles-System-Product-Name:11800:11831 [0] NCCL INFO Using network Socket

myles-System-Product-Name:11800:11833 [2] NCCL INFO ncclCommInitAll comm 0x5b56b2a08a00 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x47a48d37963338d8 - Init START

myles-System-Product-Name:11800:11831 [0] NCCL INFO ncclCommInitAll comm 0x5b56b2987820 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0x47a48d37963338d8 - Init START

myles-System-Product-Name:11800:11832 [1] NCCL INFO ncclCommInitAll comm 0x5b56b29c80c0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0x47a48d37963338d8 - Init START

myles-System-Product-Name:11800:11831 [0] NCCL INFO Bootstrap timings total 0.001027 (create 0.000063, send 0.000146, recv 0.000453, ring 0.000187, delay 0.000000)

myles-System-Product-Name:11800:11832 [1] NCCL INFO Bootstrap timings total 0.000949 (create 0.000048, send 0.000169, recv 0.000485, ring 0.000058, delay 0.000000)

myles-System-Product-Name:11800:11833 [2] NCCL INFO Bootstrap timings total 0.001082 (create 0.000049, send 0.000179, recv 0.000312, ring 0.000073, delay 0.000000)

myles-System-Product-Name:11800:11831 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS

myles-System-Product-Name:11800:11831 [0] NCCL INFO NVLS multicast support is not available on dev 0

myles-System-Product-Name:11800:11833 [2] NCCL INFO NVLS multicast support is not available on dev 2

myles-System-Product-Name:11800:11832 [1] NCCL INFO NVLS multicast support is not available on dev 1

myles-System-Product-Name:11800:11831 [0] NCCL INFO comm 0x5b56b2987820 rank 0 nRanks 3 nNodes 1 localRanks 3 localRank 0 MNNVL 0

myles-System-Product-Name:11800:11832 [1] NCCL INFO comm 0x5b56b29c80c0 rank 1 nRanks 3 nNodes 1 localRanks 3 localRank 1 MNNVL 0

myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 00/04 : 0 1 2

myles-System-Product-Name:11800:11833 [2] NCCL INFO comm 0x5b56b2a08a00 rank 2 nRanks 3 nNodes 1 localRanks 3 localRank 2 MNNVL 0

myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 01/04 : 0 1 2

myles-System-Product-Name:11800:11833 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1

myles-System-Product-Name:11800:11833 [2] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 02/04 : 0 1 2

myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 03/04 : 0 1 2

myles-System-Product-Name:11800:11832 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0

myles-System-Product-Name:11800:11832 [1] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11800:11831 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1

myles-System-Product-Name:11800:11831 [0] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11800:11836 [2] NCCL INFO [Proxy Service] Device 2 CPU core 32

myles-System-Product-Name:11800:11838 [0] NCCL INFO [Proxy Service] Device 0 CPU core 118

myles-System-Product-Name:11800:11837 [1] NCCL INFO [Proxy Service] Device 1 CPU core 41

myles-System-Product-Name:11800:11839 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 57

myles-System-Product-Name:11800:11840 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 124

myles-System-Product-Name:11800:11841 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 41

myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 00/0 : 2[2] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 01/0 : 2[2] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 02/0 : 2[2] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11800:11831 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 03/0 : 2[2] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11800:11831 [0] NCCL INFO Connected all rings

myles-System-Product-Name:11800:11833 [2] NCCL INFO Connected all rings

myles-System-Product-Name:11800:11832 [1] NCCL INFO Connected all rings

myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11800:11833 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11800:11832 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11800:11831 [0] NCCL INFO Connected all trees

myles-System-Product-Name:11800:11833 [2] NCCL INFO Connected all trees

myles-System-Product-Name:11800:11832 [1] NCCL INFO Connected all trees

myles-System-Product-Name:11800:11842 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 118

myles-System-Product-Name:11800:11843 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 97

myles-System-Product-Name:11800:11844 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 30

myles-System-Product-Name:11800:11831 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512

myles-System-Product-Name:11800:11831 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11800:11832 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512

myles-System-Product-Name:11800:11832 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11800:11833 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512

myles-System-Product-Name:11800:11833 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11800:11831 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576

myles-System-Product-Name:11800:11831 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.

myles-System-Product-Name:11800:11831 [0] NCCL INFO ncclCommInitAll comm 0x5b56b2987820 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0x47a48d37963338d8 - Init COMPLETE

myles-System-Product-Name:11800:11831 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.03, rest 0.00)

myles-System-Product-Name:11800:11832 [1] NCCL INFO ncclCommInitAll comm 0x5b56b29c80c0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0x47a48d37963338d8 - Init COMPLETE

myles-System-Product-Name:11800:11832 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)

myles-System-Product-Name:11800:11833 [2] NCCL INFO ncclCommInitAll comm 0x5b56b2a08a00 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x47a48d37963338d8 - Init COMPLETE

myles-System-Product-Name:11800:11833 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)

#

#                                                              out-of-place                       in-place          

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong

#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

    33554432       8388608     float     sum      -1   1988.0   16.88   22.50      0   1933.6   17.35   23.14      0

myles-System-Product-Name:11800:11800 [0] NCCL INFO comm 0x5b56b2987820 rank 0 nranks 3 cudaDev 0 busId 1000 - Destroy COMPLETE

myles-System-Product-Name:11800:11800 [2] NCCL INFO comm 0x5b56b2a08a00 rank 2 nranks 3 cudaDev 2 busId 2b000 - Destroy COMPLETE

myles-System-Product-Name:11800:11800 [1] NCCL INFO comm 0x5b56b29c80c0 rank 1 nranks 3 cudaDev 1 busId 2000 - Destroy COMPLETE

# Out of bounds values : 0 OK

# Avg bus bandwidth    : 22.8207 

#
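
For anyone reading the result row above: the `algbw` and `busbw` columns can be reproduced from the `size` and `time` columns. The sketch below assumes the usual nccl-tests convention for all-reduce, where algorithm bandwidth is bytes divided by time and bus bandwidth scales that by `2*(n-1)/n` for `n` ranks. The function name is made up for illustration and is not part of nccl-tests.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for an all-reduce measurement.

    Assumes the nccl-tests convention: algbw = size / time, and
    busbw = algbw * 2 * (n - 1) / n for all-reduce.
    """
    # Convert microseconds to seconds, then bytes/s to GB/s (1e9 bytes).
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Out-of-place numbers from the 3-GPU row above: 33554432 B in 1988.0 us.
algbw, busbw = allreduce_bandwidth(33554432, 1988.0, 3)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

The printed values should agree with the 16.88 and 22.50 reported in the row, which is a quick sanity check that the table is being read correctly.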



myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 3

# nThread 1 nGpus 3 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

#

# Using devices

#  Rank  0 Group  0 Pid  11860 on myles-System-Product-Name device  0 [0x01] NVIDIA GeForce RTX 4090

#  Rank  1 Group  0 Pid  11860 on myles-System-Product-Name device  1 [0x02] NVIDIA GeForce RTX 4090

#  Rank  2 Group  0 Pid  11860 on myles-System-Product-Name device  2 [0x2b] NVIDIA GeForce RTX 4090

myles-System-Product-Name:11860:11860 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>

myles-System-Product-Name:11860:11860 [0] NCCL INFO cudaDriverVersion 12060

myles-System-Product-Name:11860:11860 [0] NCCL INFO NCCL version 2.23.4+cuda12.6

myles-System-Product-Name:11860:11892 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.

myles-System-Product-Name:11860:11892 [1] NCCL INFO Failed to open libibverbs.so[.1]

myles-System-Product-Name:11860:11892 [1] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>

myles-System-Product-Name:11860:11892 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.

myles-System-Product-Name:11860:11892 [1] NCCL INFO Using network Socket

myles-System-Product-Name:11860:11891 [0] NCCL INFO Using network Socket

myles-System-Product-Name:11860:11893 [2] NCCL INFO Using network Socket

myles-System-Product-Name:11860:11891 [0] NCCL INFO ncclCommInitAll comm 0x5d1f738e9820 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0x64d31cd9129b6291 - Init START

myles-System-Product-Name:11860:11893 [2] NCCL INFO ncclCommInitAll comm 0x5d1f7396aa00 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x64d31cd9129b6291 - Init START

myles-System-Product-Name:11860:11892 [1] NCCL INFO ncclCommInitAll comm 0x5d1f7392a0c0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0x64d31cd9129b6291 - Init START

myles-System-Product-Name:11860:11892 [1] NCCL INFO Bootstrap timings total 0.000900 (create 0.000051, send 0.000168, recv 0.000419, ring 0.000062, delay 0.000000)

myles-System-Product-Name:11860:11893 [2] NCCL INFO Bootstrap timings total 0.001019 (create 0.000048, send 0.000149, recv 0.000293, ring 0.000105, delay 0.000000)

myles-System-Product-Name:11860:11891 [0] NCCL INFO Bootstrap timings total 0.001061 (create 0.000040, send 0.000136, recv 0.000531, ring 0.000144, delay 0.000000)

myles-System-Product-Name:11860:11893 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB

myles-System-Product-Name:11860:11893 [2] NCCL INFO NVLS multicast support is not available on dev 2

myles-System-Product-Name:11860:11891 [0] NCCL INFO NVLS multicast support is not available on dev 0

myles-System-Product-Name:11860:11892 [1] NCCL INFO NVLS multicast support is not available on dev 1

myles-System-Product-Name:11860:11893 [2] NCCL INFO comm 0x5d1f7396aa00 rank 2 nRanks 3 nNodes 1 localRanks 3 localRank 2 MNNVL 0

myles-System-Product-Name:11860:11892 [1] NCCL INFO comm 0x5d1f7392a0c0 rank 1 nRanks 3 nNodes 1 localRanks 3 localRank 1 MNNVL 0

myles-System-Product-Name:11860:11893 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1

myles-System-Product-Name:11860:11893 [2] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11860:11891 [0] NCCL INFO comm 0x5d1f738e9820 rank 0 nRanks 3 nNodes 1 localRanks 3 localRank 0 MNNVL 0

myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 00/04 : 0 1 2

myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 01/04 : 0 1 2

myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 02/04 : 0 1 2

myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 03/04 : 0 1 2

myles-System-Product-Name:11860:11892 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0

myles-System-Product-Name:11860:11892 [1] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11860:11891 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1

myles-System-Product-Name:11860:11891 [0] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11860:11898 [0] NCCL INFO [Proxy Service] Device 0 CPU core 41

myles-System-Product-Name:11860:11897 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 78

myles-System-Product-Name:11860:11899 [1] NCCL INFO [Proxy Service] Device 1 CPU core 40

myles-System-Product-Name:11860:11900 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 114

myles-System-Product-Name:11860:11896 [2] NCCL INFO [Proxy Service] Device 2 CPU core 4

myles-System-Product-Name:11860:11901 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 118

myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 00/0 : 2[2] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 01/0 : 2[2] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 02/0 : 2[2] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 03/0 : 2[2] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11860:11891 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11860:11893 [2] NCCL INFO Connected all rings

myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11860:11892 [1] NCCL INFO Connected all rings

myles-System-Product-Name:11860:11891 [0] NCCL INFO Connected all rings

myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11860:11893 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11860:11892 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11860:11891 [0] NCCL INFO Connected all trees

myles-System-Product-Name:11860:11893 [2] NCCL INFO Connected all trees

myles-System-Product-Name:11860:11892 [1] NCCL INFO Connected all trees

myles-System-Product-Name:11860:11902 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 40

myles-System-Product-Name:11860:11903 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 5

myles-System-Product-Name:11860:11891 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512

myles-System-Product-Name:11860:11891 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11860:11904 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 57

myles-System-Product-Name:11860:11893 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512

myles-System-Product-Name:11860:11893 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11860:11892 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512

myles-System-Product-Name:11860:11892 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11860:11891 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576

myles-System-Product-Name:11860:11892 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.

myles-System-Product-Name:11860:11892 [1] NCCL INFO ncclCommInitAll comm 0x5d1f7392a0c0 rank 1 nranks 3 cudaDev 1 nvmlDev 1 busId 2000 commId 0x64d31cd9129b6291 - Init COMPLETE

myles-System-Product-Name:11860:11893 [2] NCCL INFO ncclCommInitAll comm 0x5d1f7396aa00 rank 2 nranks 3 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x64d31cd9129b6291 - Init COMPLETE

myles-System-Product-Name:11860:11892 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)

myles-System-Product-Name:11860:11893 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.03, rest 0.00)

myles-System-Product-Name:11860:11891 [0] NCCL INFO ncclCommInitAll comm 0x5d1f738e9820 rank 0 nranks 3 cudaDev 0 nvmlDev 0 busId 1000 commId 0x64d31cd9129b6291 - Init COMPLETE

myles-System-Product-Name:11860:11891 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 3 total 0.36 (kernels 0.25, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)

#

#                                                              out-of-place                       in-place          

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong

#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

    33554432       8388608     float     sum      -1   1825.5   18.38   24.51      0   1820.7   18.43   24.57      0

myles-System-Product-Name:11860:11860 [0] NCCL INFO comm 0x5d1f738e9820 rank 0 nranks 3 cudaDev 0 busId 1000 - Destroy COMPLETE

myles-System-Product-Name:11860:11860 [2] NCCL INFO comm 0x5d1f7396aa00 rank 2 nranks 3 cudaDev 2 busId 2b000 - Destroy COMPLETE

myles-System-Product-Name:11860:11860 [1] NCCL INFO comm 0x5d1f7392a0c0 rank 1 nranks 3 cudaDev 1 busId 2000 - Destroy COMPLETE

# Out of bounds values : 0 OK

# Avg bus bandwidth    : 24.5402 

#



myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 4

# nThread 1 nGpus 4 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

#

# Using devices

#  Rank  0 Group  0 Pid  11905 on myles-System-Product-Name device  0 [0x01] NVIDIA GeForce RTX 4090

#  Rank  1 Group  0 Pid  11905 on myles-System-Product-Name device  1 [0x02] NVIDIA GeForce RTX 4090

#  Rank  2 Group  0 Pid  11905 on myles-System-Product-Name device  2 [0x2b] NVIDIA GeForce RTX 4090

#  Rank  3 Group  0 Pid  11905 on myles-System-Product-Name device  3 [0x41] NVIDIA GeForce RTX 4090

myles-System-Product-Name:11905:11905 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>

myles-System-Product-Name:11905:11905 [0] NCCL INFO cudaDriverVersion 12060

myles-System-Product-Name:11905:11905 [0] NCCL INFO NCCL version 2.23.4+cuda12.6

myles-System-Product-Name:11905:11941 [2] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.

myles-System-Product-Name:11905:11941 [2] NCCL INFO Failed to open libibverbs.so[.1]

myles-System-Product-Name:11905:11941 [2] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>

myles-System-Product-Name:11905:11941 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.

myles-System-Product-Name:11905:11941 [2] NCCL INFO Using network Socket

myles-System-Product-Name:11905:11939 [0] NCCL INFO Using network Socket

myles-System-Product-Name:11905:11942 [3] NCCL INFO Using network Socket

myles-System-Product-Name:11905:11940 [1] NCCL INFO Using network Socket

myles-System-Product-Name:11905:11939 [0] NCCL INFO ncclCommInitAll comm 0x5feb380d49d0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x753b93d1e4296bfc - Init START

myles-System-Product-Name:11905:11940 [1] NCCL INFO ncclCommInitAll comm 0x5feb38115ef0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 2000 commId 0x753b93d1e4296bfc - Init START

myles-System-Product-Name:11905:11941 [2] NCCL INFO ncclCommInitAll comm 0x5feb381574b0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x753b93d1e4296bfc - Init START

myles-System-Product-Name:11905:11942 [3] NCCL INFO ncclCommInitAll comm 0x5feb38198a70 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 41000 commId 0x753b93d1e4296bfc - Init START

myles-System-Product-Name:11905:11940 [1] NCCL INFO Bootstrap timings total 0.001192 (create 0.000050, send 0.000150, recv 0.000544, ring 0.000327, delay 0.000000)

myles-System-Product-Name:11905:11941 [2] NCCL INFO Bootstrap timings total 0.001156 (create 0.000049, send 0.000164, recv 0.000599, ring 0.000143, delay 0.000000)

myles-System-Product-Name:11905:11939 [0] NCCL INFO Bootstrap timings total 0.001247 (create 0.000056, send 0.000168, recv 0.000387, ring 0.000103, delay 0.000000)

myles-System-Product-Name:11905:11942 [3] NCCL INFO Bootstrap timings total 0.001135 (create 0.000055, send 0.000164, recv 0.000632, ring 0.000090, delay 0.000000)

myles-System-Product-Name:11905:11940 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB

myles-System-Product-Name:11905:11940 [1] NCCL INFO NVLS multicast support is not available on dev 1

myles-System-Product-Name:11905:11942 [3] NCCL INFO NVLS multicast support is not available on dev 3

myles-System-Product-Name:11905:11941 [2] NCCL INFO NVLS multicast support is not available on dev 2

myles-System-Product-Name:11905:11939 [0] NCCL INFO NVLS multicast support is not available on dev 0

myles-System-Product-Name:11905:11940 [1] NCCL INFO comm 0x5feb38115ef0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0

myles-System-Product-Name:11905:11942 [3] NCCL INFO comm 0x5feb38198a70 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0

myles-System-Product-Name:11905:11940 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0

myles-System-Product-Name:11905:11940 [1] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11905:11941 [2] NCCL INFO comm 0x5feb381574b0 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0

myles-System-Product-Name:11905:11941 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1

myles-System-Product-Name:11905:11941 [2] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11905:11942 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2

myles-System-Product-Name:11905:11942 [3] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11905:11939 [0] NCCL INFO comm 0x5feb380d49d0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0

myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 00/04 : 0 1 2 3

myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 01/04 : 0 1 2 3

myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 02/04 : 0 1 2 3

myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 03/04 : 0 1 2 3

myles-System-Product-Name:11905:11939 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1

myles-System-Product-Name:11905:11939 [0] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11905:11945 [1] NCCL INFO [Proxy Service] Device 1 CPU core 118

myles-System-Product-Name:11905:11948 [0] NCCL INFO [Proxy Service] Device 0 CPU core 76

myles-System-Product-Name:11905:11950 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 27

myles-System-Product-Name:11905:11946 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 125

myles-System-Product-Name:11905:11947 [2] NCCL INFO [Proxy Service] Device 2 CPU core 10

myles-System-Product-Name:11905:11949 [3] NCCL INFO [Proxy Service] Device 3 CPU core 85

myles-System-Product-Name:11905:11951 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 32

myles-System-Product-Name:11905:11952 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 47

myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11905:11939 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11905:11940 [1] NCCL INFO Connected all rings

myles-System-Product-Name:11905:11939 [0] NCCL INFO Connected all rings

myles-System-Product-Name:11905:11942 [3] NCCL INFO Connected all rings

myles-System-Product-Name:11905:11941 [2] NCCL INFO Connected all rings

myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11905:11942 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11905:11940 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11905:11941 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11905:11939 [0] NCCL INFO Connected all trees

myles-System-Product-Name:11905:11940 [1] NCCL INFO Connected all trees

myles-System-Product-Name:11905:11942 [3] NCCL INFO Connected all trees

myles-System-Product-Name:11905:11941 [2] NCCL INFO Connected all trees

myles-System-Product-Name:11905:11953 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 53

myles-System-Product-Name:11905:11940 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512

myles-System-Product-Name:11905:11940 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11905:11954 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 75

myles-System-Product-Name:11905:11955 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 76

myles-System-Product-Name:11905:11939 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512

myles-System-Product-Name:11905:11939 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11905:11956 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 84

myles-System-Product-Name:11905:11939 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576

myles-System-Product-Name:11905:11942 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512

myles-System-Product-Name:11905:11942 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11905:11941 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512

myles-System-Product-Name:11905:11941 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11905:11940 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.

myles-System-Product-Name:11905:11940 [1] NCCL INFO ncclCommInitAll comm 0x5feb38115ef0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 2000 commId 0x753b93d1e4296bfc - Init COMPLETE

myles-System-Product-Name:11905:11940 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 4 total 0.45 (kernels 0.32, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.04, rest 0.01)

myles-System-Product-Name:11905:11942 [3] NCCL INFO ncclCommInitAll comm 0x5feb38198a70 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 41000 commId 0x753b93d1e4296bfc - Init COMPLETE

myles-System-Product-Name:11905:11942 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 4 total 0.45 (kernels 0.31, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.05, rest 0.00)

myles-System-Product-Name:11905:11941 [2] NCCL INFO ncclCommInitAll comm 0x5feb381574b0 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x753b93d1e4296bfc - Init COMPLETE

myles-System-Product-Name:11905:11939 [0] NCCL INFO ncclCommInitAll comm 0x5feb380d49d0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 1000 commId 0x753b93d1e4296bfc - Init COMPLETE

myles-System-Product-Name:11905:11939 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 4 total 0.45 (kernels 0.31, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.05, rest 0.00)

myles-System-Product-Name:11905:11941 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 4 total 0.45 (kernels 0.31, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.05, rest 0.00)

#

#                                                              out-of-place                       in-place          

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong

#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

    33554432       8388608     float     sum      -1   2037.7   16.47   24.70      0   2063.9   16.26   24.39      0

myles-System-Product-Name:11905:11905 [0] NCCL INFO comm 0x5feb380d49d0 rank 0 nranks 4 cudaDev 0 busId 1000 - Destroy COMPLETE

myles-System-Product-Name:11905:11905 [3] NCCL INFO comm 0x5feb38198a70 rank 3 nranks 4 cudaDev 3 busId 41000 - Destroy COMPLETE

myles-System-Product-Name:11905:11905 [2] NCCL INFO comm 0x5feb381574b0 rank 2 nranks 4 cudaDev 2 busId 2b000 - Destroy COMPLETE

myles-System-Product-Name:11905:11905 [1] NCCL INFO comm 0x5feb38115ef0 rank 1 nranks 4 cudaDev 1 busId 2000 - Destroy COMPLETE

# Out of bounds values : 0 OK

# Avg bus bandwidth    : 24.543 

#
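
To compare runs like these across GPU counts without copying numbers by hand, the result row can be pulled out of the log. This is a rough sketch that assumes the 13-column all_reduce_perf layout shown above (size, count, type, redop, root, then time/algbw/busbw/#wrong for out-of-place and in-place). `parse_result_row` is a hypothetical helper, not something shipped with nccl-tests.

```python
def parse_result_row(line):
    """Parse one all_reduce_perf result row into a dict, or return None.

    Assumes the 13-column layout from the log above. Lines starting with
    '#' and NCCL INFO lines are skipped by the field-count and int checks.
    """
    fields = line.split()
    if len(fields) != 13 or line.lstrip().startswith("#"):
        return None
    try:
        size = int(fields[0])
        count = int(fields[1])
        oop_time, oop_algbw, oop_busbw = (float(x) for x in fields[5:8])
        ip_time, ip_algbw, ip_busbw = (float(x) for x in fields[9:12])
    except ValueError:
        return None
    return {
        "size": size,
        "count": count,
        "type": fields[2],
        "redop": fields[3],
        "out_of_place": {"time_us": oop_time, "algbw": oop_algbw, "busbw": oop_busbw},
        "in_place": {"time_us": ip_time, "algbw": ip_algbw, "busbw": ip_busbw},
    }


# The 4-GPU row from the log above.
row = parse_result_row(
    "    33554432       8388608     float     sum      -1"
    "   2037.7   16.47   24.70      0   2063.9   16.26   24.39      0"
)
# Averaging the two busbw columns approximates the "Avg bus bandwidth" line.
avg_busbw = (row["out_of_place"]["busbw"] + row["in_place"]["busbw"]) / 2
print(f"avg busbw ~ {avg_busbw:.2f} GB/s")
```

The average computed this way differs slightly from the reported 24.543 because the columns in the log are already rounded to two decimals.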



myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 5

# nThread 1 nGpus 5 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

#

# Using devices

#  Rank  0 Group  0 Pid  11960 on myles-System-Product-Name device  0 [0x01] NVIDIA GeForce RTX 4090

#  Rank  1 Group  0 Pid  11960 on myles-System-Product-Name device  1 [0x02] NVIDIA GeForce RTX 4090

#  Rank  2 Group  0 Pid  11960 on myles-System-Product-Name device  2 [0x2b] NVIDIA GeForce RTX 4090

#  Rank  3 Group  0 Pid  11960 on myles-System-Product-Name device  3 [0x41] NVIDIA GeForce RTX 4090

#  Rank  4 Group  0 Pid  11960 on myles-System-Product-Name device  4 [0x42] NVIDIA GeForce RTX 4090

myles-System-Product-Name:11960:11960 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>

myles-System-Product-Name:11960:11960 [0] NCCL INFO cudaDriverVersion 12060

myles-System-Product-Name:11960:11960 [0] NCCL INFO NCCL version 2.23.4+cuda12.6

myles-System-Product-Name:11960:11995 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.

myles-System-Product-Name:11960:11995 [0] NCCL INFO Failed to open libibverbs.so[.1]

myles-System-Product-Name:11960:11995 [0] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>

myles-System-Product-Name:11960:11995 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.

myles-System-Product-Name:11960:11995 [0] NCCL INFO Using network Socket

myles-System-Product-Name:11960:11998 [3] NCCL INFO Using network Socket

myles-System-Product-Name:11960:11996 [1] NCCL INFO Using network Socket

myles-System-Product-Name:11960:11997 [2] NCCL INFO Using network Socket

myles-System-Product-Name:11960:11999 [4] NCCL INFO Using network Socket

myles-System-Product-Name:11960:11997 [2] NCCL INFO ncclCommInitAll comm 0x558ba9221be0 rank 2 nranks 5 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x668cd3327447b9bd - Init START

myles-System-Product-Name:11960:11998 [3] NCCL INFO ncclCommInitAll comm 0x558ba9263e20 rank 3 nranks 5 cudaDev 3 nvmlDev 3 busId 41000 commId 0x668cd3327447b9bd - Init START

myles-System-Product-Name:11960:11995 [0] NCCL INFO ncclCommInitAll comm 0x558ba919d800 rank 0 nranks 5 cudaDev 0 nvmlDev 0 busId 1000 commId 0x668cd3327447b9bd - Init START

myles-System-Product-Name:11960:11996 [1] NCCL INFO ncclCommInitAll comm 0x558ba91df9a0 rank 1 nranks 5 cudaDev 1 nvmlDev 1 busId 2000 commId 0x668cd3327447b9bd - Init START

myles-System-Product-Name:11960:11999 [4] NCCL INFO ncclCommInitAll comm 0x558ba92a6060 rank 4 nranks 5 cudaDev 4 nvmlDev 4 busId 42000 commId 0x668cd3327447b9bd - Init START

myles-System-Product-Name:11960:11997 [2] NCCL INFO Bootstrap timings total 0.001402 (create 0.000057, send 0.000163, recv 0.000489, ring 0.000305, delay 0.000000)

myles-System-Product-Name:11960:11998 [3] NCCL INFO Bootstrap timings total 0.001350 (create 0.000054, send 0.000167, recv 0.000764, ring 0.000161, delay 0.000000)

myles-System-Product-Name:11960:11999 [4] NCCL INFO Bootstrap timings total 0.001252 (create 0.000056, send 0.000167, recv 0.000707, ring 0.000123, delay 0.000000)

myles-System-Product-Name:11960:11996 [1] NCCL INFO Bootstrap timings total 0.001282 (create 0.000054, send 0.000162, recv 0.000612, ring 0.000265, delay 0.000000)

myles-System-Product-Name:11960:11995 [0] NCCL INFO Bootstrap timings total 0.001315 (create 0.000047, send 0.000164, recv 0.000581, ring 0.000144, delay 0.000000)

myles-System-Product-Name:11960:11997 [2] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB

myles-System-Product-Name:11960:11997 [2] NCCL INFO NVLS multicast support is not available on dev 2

myles-System-Product-Name:11960:11995 [0] NCCL INFO NVLS multicast support is not available on dev 0

myles-System-Product-Name:11960:11998 [3] NCCL INFO NVLS multicast support is not available on dev 3

myles-System-Product-Name:11960:11999 [4] NCCL INFO NVLS multicast support is not available on dev 4

myles-System-Product-Name:11960:11996 [1] NCCL INFO NVLS multicast support is not available on dev 1

myles-System-Product-Name:11960:11997 [2] NCCL INFO comm 0x558ba9221be0 rank 2 nRanks 5 nNodes 1 localRanks 5 localRank 2 MNNVL 0

myles-System-Product-Name:11960:11996 [1] NCCL INFO comm 0x558ba91df9a0 rank 1 nRanks 5 nNodes 1 localRanks 5 localRank 1 MNNVL 0

myles-System-Product-Name:11960:11999 [4] NCCL INFO comm 0x558ba92a6060 rank 4 nRanks 5 nNodes 1 localRanks 5 localRank 4 MNNVL 0

myles-System-Product-Name:11960:11996 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0

myles-System-Product-Name:11960:11998 [3] NCCL INFO comm 0x558ba9263e20 rank 3 nRanks 5 nNodes 1 localRanks 5 localRank 3 MNNVL 0

myles-System-Product-Name:11960:11995 [0] NCCL INFO comm 0x558ba919d800 rank 0 nRanks 5 nNodes 1 localRanks 5 localRank 0 MNNVL 0

myles-System-Product-Name:11960:11998 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2

myles-System-Product-Name:11960:11998 [3] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11960:11999 [4] NCCL INFO Trees [0] -1/-1/-1->4->3 [1] -1/-1/-1->4->3 [2] -1/-1/-1->4->3 [3] -1/-1/-1->4->3

myles-System-Product-Name:11960:11999 [4] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11960:11996 [1] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11960:11997 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1

myles-System-Product-Name:11960:11997 [2] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4

myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 01/04 : 0 1 2 3 4

myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4

myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 03/04 : 0 1 2 3 4

myles-System-Product-Name:11960:11995 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1

myles-System-Product-Name:11960:11995 [0] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:11960:12002 [3] NCCL INFO [Proxy Service] Device 3 CPU core 4

myles-System-Product-Name:11960:12009 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 32

myles-System-Product-Name:11960:12010 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 39

myles-System-Product-Name:11960:12005 [1] NCCL INFO [Proxy Service] Device 1 CPU core 84

myles-System-Product-Name:11960:12006 [4] NCCL INFO [Proxy Service] Device 4 CPU core 85

myles-System-Product-Name:11960:12007 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 9

myles-System-Product-Name:11960:12004 [0] NCCL INFO [Proxy Service] Device 0 CPU core 12

myles-System-Product-Name:11960:12003 [2] NCCL INFO [Proxy Service] Device 2 CPU core 79

myles-System-Product-Name:11960:12011 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 47

myles-System-Product-Name:11960:12008 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 24

myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 00/0 : 4[4] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 01/0 : 4[4] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 02/0 : 4[4] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:11960:11995 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 03/0 : 4[4] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:11960:11997 [2] NCCL INFO Connected all rings

myles-System-Product-Name:11960:11995 [0] NCCL INFO Connected all rings

myles-System-Product-Name:11960:11999 [4] NCCL INFO Connected all rings

myles-System-Product-Name:11960:11998 [3] NCCL INFO Connected all rings

myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11960:11996 [1] NCCL INFO Connected all rings

myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11960:11999 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11960:11997 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:11960:11998 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:11960:11996 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:11960:11999 [4] NCCL INFO Connected all trees

myles-System-Product-Name:11960:11998 [3] NCCL INFO Connected all trees

myles-System-Product-Name:11960:11995 [0] NCCL INFO Connected all trees

myles-System-Product-Name:11960:11997 [2] NCCL INFO Connected all trees

myles-System-Product-Name:11960:11996 [1] NCCL INFO Connected all trees

myles-System-Product-Name:11960:12012 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 15

myles-System-Product-Name:11960:12013 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 84

myles-System-Product-Name:11960:12014 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 3

myles-System-Product-Name:11960:12015 [4] NCCL INFO [Proxy Progress] Device 4 CPU core 28

myles-System-Product-Name:11960:12016 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 81

myles-System-Product-Name:11960:11998 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512

myles-System-Product-Name:11960:11998 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11960:11997 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512

myles-System-Product-Name:11960:11997 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11960:11995 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512

myles-System-Product-Name:11960:11995 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11960:11999 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512

myles-System-Product-Name:11960:11999 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11960:11996 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512

myles-System-Product-Name:11960:11996 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:11960:11995 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576

myles-System-Product-Name:11960:11998 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.

myles-System-Product-Name:11960:11998 [3] NCCL INFO ncclCommInitAll comm 0x558ba9263e20 rank 3 nranks 5 cudaDev 3 nvmlDev 3 busId 41000 commId 0x668cd3327447b9bd - Init COMPLETE

myles-System-Product-Name:11960:11998 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 5 total 0.53 (kernels 0.38, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)

myles-System-Product-Name:11960:11996 [1] NCCL INFO ncclCommInitAll comm 0x558ba91df9a0 rank 1 nranks 5 cudaDev 1 nvmlDev 1 busId 2000 commId 0x668cd3327447b9bd - Init COMPLETE

myles-System-Product-Name:11960:11997 [2] NCCL INFO ncclCommInitAll comm 0x558ba9221be0 rank 2 nranks 5 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x668cd3327447b9bd - Init COMPLETE

myles-System-Product-Name:11960:11999 [4] NCCL INFO ncclCommInitAll comm 0x558ba92a6060 rank 4 nranks 5 cudaDev 4 nvmlDev 4 busId 42000 commId 0x668cd3327447b9bd - Init COMPLETE

myles-System-Product-Name:11960:11996 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 5 total 0.53 (kernels 0.38, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)

myles-System-Product-Name:11960:11999 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 5 total 0.53 (kernels 0.38, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)

myles-System-Product-Name:11960:11995 [0] NCCL INFO ncclCommInitAll comm 0x558ba919d800 rank 0 nranks 5 cudaDev 0 nvmlDev 0 busId 1000 commId 0x668cd3327447b9bd - Init COMPLETE

myles-System-Product-Name:11960:11995 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 5 total 0.53 (kernels 0.37, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)

myles-System-Product-Name:11960:11997 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 5 total 0.53 (kernels 0.38, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.03, graphs 0.00, connections 0.06, rest 0.00)

#

#                                                              out-of-place                       in-place          

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong

#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

    33554432       8388608     float     sum      -1   2174.8   15.43   24.69      0   2170.2   15.46   24.74      0

myles-System-Product-Name:11960:11960 [0] NCCL INFO comm 0x558ba919d800 rank 0 nranks 5 cudaDev 0 busId 1000 - Destroy COMPLETE

myles-System-Product-Name:11960:11960 [4] NCCL INFO comm 0x558ba92a6060 rank 4 nranks 5 cudaDev 4 busId 42000 - Destroy COMPLETE

myles-System-Product-Name:11960:11960 [3] NCCL INFO comm 0x558ba9263e20 rank 3 nranks 5 cudaDev 3 busId 41000 - Destroy COMPLETE

myles-System-Product-Name:11960:11960 [2] NCCL INFO comm 0x558ba9221be0 rank 2 nranks 5 cudaDev 2 busId 2b000 - Destroy COMPLETE

myles-System-Product-Name:11960:11960 [1] NCCL INFO comm 0x558ba91df9a0 rank 1 nranks 5 cudaDev 1 busId 2000 - Destroy COMPLETE

# Out of bounds values : 0 OK

# Avg bus bandwidth    : 24.7118 

#
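For anyone reading the result row above: in nccl-tests, `algbw` is the message size divided by the elapsed time, and for all-reduce `busbw` is `algbw` scaled by 2(n-1)/n, where n is the number of ranks. A minimal sketch to sanity-check the 5-GPU row; the helper name `allreduce_busbw` is made up for illustration and is not part of nccl-tests.

```python
def allreduce_busbw(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for an all-reduce measurement.

    Assumes the nccl-tests convention: 1 GB = 1e9 bytes, and the
    all-reduce bus-bandwidth factor 2 * (n - 1) / n.
    """
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# Values copied from the 5-GPU out-of-place row in the log above.
algbw, busbw = allreduce_busbw(33554432, 2174.8, 5)
print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
# prints: algbw=15.43 GB/s busbw=24.69 GB/s
```

Both figures agree with the `algbw` and `busbw` columns reported by `all_reduce_perf`, so the numbers in the table are internally consistent.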



myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 6

# nThread 1 nGpus 6 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

#

# Using devices

#  Rank  0 Group  0 Pid  12019 on myles-System-Product-Name device  0 [0x01] NVIDIA GeForce RTX 4090

#  Rank  1 Group  0 Pid  12019 on myles-System-Product-Name device  1 [0x02] NVIDIA GeForce RTX 4090

#  Rank  2 Group  0 Pid  12019 on myles-System-Product-Name device  2 [0x2b] NVIDIA GeForce RTX 4090

#  Rank  3 Group  0 Pid  12019 on myles-System-Product-Name device  3 [0x41] NVIDIA GeForce RTX 4090

#  Rank  4 Group  0 Pid  12019 on myles-System-Product-Name device  4 [0x42] NVIDIA GeForce RTX 4090

#  Rank  5 Group  0 Pid  12019 on myles-System-Product-Name device  5 [0x61] NVIDIA GeForce RTX 4090

myles-System-Product-Name:12019:12019 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>

myles-System-Product-Name:12019:12019 [0] NCCL INFO cudaDriverVersion 12060

myles-System-Product-Name:12019:12019 [0] NCCL INFO NCCL version 2.23.4+cuda12.6

myles-System-Product-Name:12019:12056 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.

myles-System-Product-Name:12019:12056 [0] NCCL INFO Failed to open libibverbs.so[.1]

myles-System-Product-Name:12019:12056 [0] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>

myles-System-Product-Name:12019:12056 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.

myles-System-Product-Name:12019:12056 [0] NCCL INFO Using network Socket

myles-System-Product-Name:12019:12057 [1] NCCL INFO Using network Socket

myles-System-Product-Name:12019:12061 [5] NCCL INFO Using network Socket

myles-System-Product-Name:12019:12060 [4] NCCL INFO Using network Socket

myles-System-Product-Name:12019:12059 [3] NCCL INFO Using network Socket

myles-System-Product-Name:12019:12058 [2] NCCL INFO Using network Socket

myles-System-Product-Name:12019:12060 [4] NCCL INFO ncclCommInitAll comm 0x63acc7870060 rank 4 nranks 6 cudaDev 4 nvmlDev 4 busId 42000 commId 0xbc91574eaba2750d - Init START

myles-System-Product-Name:12019:12058 [2] NCCL INFO ncclCommInitAll comm 0x63acc77ea2e0 rank 2 nranks 6 cudaDev 2 nvmlDev 2 busId 2b000 commId 0xbc91574eaba2750d - Init START

myles-System-Product-Name:12019:12057 [1] NCCL INFO ncclCommInitAll comm 0x63acc77a7420 rank 1 nranks 6 cudaDev 1 nvmlDev 1 busId 2000 commId 0xbc91574eaba2750d - Init START

myles-System-Product-Name:12019:12056 [0] NCCL INFO ncclCommInitAll comm 0x63acc7764600 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbc91574eaba2750d - Init START

myles-System-Product-Name:12019:12061 [5] NCCL INFO ncclCommInitAll comm 0x63acc78b2f20 rank 5 nranks 6 cudaDev 5 nvmlDev 5 busId 61000 commId 0xbc91574eaba2750d - Init START

myles-System-Product-Name:12019:12059 [3] NCCL INFO ncclCommInitAll comm 0x63acc782d1a0 rank 3 nranks 6 cudaDev 3 nvmlDev 3 busId 41000 commId 0xbc91574eaba2750d - Init START

myles-System-Product-Name:12019:12056 [0] NCCL INFO Bootstrap timings total 0.001487 (create 0.000049, send 0.000171, recv 0.000441, ring 0.000413, delay 0.000000)

myles-System-Product-Name:12019:12057 [1] NCCL INFO Bootstrap timings total 0.001549 (create 0.000063, send 0.000173, recv 0.000346, ring 0.000623, delay 0.000000)

myles-System-Product-Name:12019:12061 [5] NCCL INFO Bootstrap timings total 0.001454 (create 0.000057, send 0.000170, recv 0.000636, ring 0.000397, delay 0.000000)

myles-System-Product-Name:12019:12058 [2] NCCL INFO Bootstrap timings total 0.001582 (create 0.000060, send 0.000179, recv 0.000907, ring 0.000228, delay 0.000000)

myles-System-Product-Name:12019:12059 [3] NCCL INFO Bootstrap timings total 0.001435 (create 0.000058, send 0.000176, recv 0.000865, ring 0.000142, delay 0.000000)

myles-System-Product-Name:12019:12060 [4] NCCL INFO Bootstrap timings total 0.001640 (create 0.000060, send 0.000176, recv 0.000699, ring 0.000186, delay 0.000001)

myles-System-Product-Name:12019:12060 [4] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB

myles-System-Product-Name:12019:12060 [4] NCCL INFO NVLS multicast support is not available on dev 4

myles-System-Product-Name:12019:12056 [0] NCCL INFO NVLS multicast support is not available on dev 0

myles-System-Product-Name:12019:12059 [3] NCCL INFO NVLS multicast support is not available on dev 3

myles-System-Product-Name:12019:12061 [5] NCCL INFO NVLS multicast support is not available on dev 5

myles-System-Product-Name:12019:12057 [1] NCCL INFO NVLS multicast support is not available on dev 1

myles-System-Product-Name:12019:12058 [2] NCCL INFO NVLS multicast support is not available on dev 2

myles-System-Product-Name:12019:12056 [0] NCCL INFO comm 0x63acc7764600 rank 0 nRanks 6 nNodes 1 localRanks 6 localRank 0 MNNVL 0

myles-System-Product-Name:12019:12060 [4] NCCL INFO comm 0x63acc7870060 rank 4 nRanks 6 nNodes 1 localRanks 6 localRank 4 MNNVL 0

myles-System-Product-Name:12019:12058 [2] NCCL INFO comm 0x63acc77ea2e0 rank 2 nRanks 6 nNodes 1 localRanks 6 localRank 2 MNNVL 0

myles-System-Product-Name:12019:12060 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3

myles-System-Product-Name:12019:12060 [4] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5

myles-System-Product-Name:12019:12058 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1

myles-System-Product-Name:12019:12058 [2] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12019:12061 [5] NCCL INFO comm 0x63acc78b2f20 rank 5 nRanks 6 nNodes 1 localRanks 6 localRank 5 MNNVL 0

myles-System-Product-Name:12019:12057 [1] NCCL INFO comm 0x63acc77a7420 rank 1 nRanks 6 nNodes 1 localRanks 6 localRank 1 MNNVL 0

myles-System-Product-Name:12019:12059 [3] NCCL INFO comm 0x63acc782d1a0 rank 3 nRanks 6 nNodes 1 localRanks 6 localRank 3 MNNVL 0

myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 01/04 : 0 1 2 3 4 5

myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4 5

myles-System-Product-Name:12019:12059 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2

myles-System-Product-Name:12019:12061 [5] NCCL INFO Trees [0] -1/-1/-1->5->4 [1] -1/-1/-1->5->4 [2] -1/-1/-1->5->4 [3] -1/-1/-1->5->4

myles-System-Product-Name:12019:12061 [5] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12019:12059 [3] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12019:12057 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0

myles-System-Product-Name:12019:12057 [1] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 03/04 : 0 1 2 3 4 5

myles-System-Product-Name:12019:12056 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1

myles-System-Product-Name:12019:12056 [0] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12019:12066 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 30

myles-System-Product-Name:12019:12065 [2] NCCL INFO [Proxy Service] Device 2 CPU core 61

myles-System-Product-Name:12019:12064 [4] NCCL INFO [Proxy Service] Device 4 CPU core 25

myles-System-Product-Name:12019:12071 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 44

myles-System-Product-Name:12019:12074 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 118

myles-System-Product-Name:12019:12069 [0] NCCL INFO [Proxy Service] Device 0 CPU core 115

myles-System-Product-Name:12019:12068 [5] NCCL INFO [Proxy Service] Device 5 CPU core 48

myles-System-Product-Name:12019:12072 [1] NCCL INFO [Proxy Service] Device 1 CPU core 19

myles-System-Product-Name:12019:12067 [3] NCCL INFO [Proxy Service] Device 3 CPU core 32

myles-System-Product-Name:12019:12073 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 9

myles-System-Product-Name:12019:12070 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 103

myles-System-Product-Name:12019:12075 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 4

myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12019:12056 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 00/0 : 5[5] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 01/0 : 5[5] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 02/0 : 5[5] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 03/0 : 5[5] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12019:12058 [2] NCCL INFO Connected all rings

myles-System-Product-Name:12019:12059 [3] NCCL INFO Connected all rings

myles-System-Product-Name:12019:12060 [4] NCCL INFO Connected all rings

myles-System-Product-Name:12019:12057 [1] NCCL INFO Connected all rings

myles-System-Product-Name:12019:12061 [5] NCCL INFO Connected all rings

myles-System-Product-Name:12019:12056 [0] NCCL INFO Connected all rings

myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12019:12061 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12019:12058 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12019:12059 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12019:12060 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12019:12057 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12019:12061 [5] NCCL INFO Connected all trees

myles-System-Product-Name:12019:12060 [4] NCCL INFO Connected all trees

myles-System-Product-Name:12019:12059 [3] NCCL INFO Connected all trees

myles-System-Product-Name:12019:12056 [0] NCCL INFO Connected all trees

myles-System-Product-Name:12019:12058 [2] NCCL INFO Connected all trees

myles-System-Product-Name:12019:12057 [1] NCCL INFO Connected all trees

myles-System-Product-Name:12019:12076 [5] NCCL INFO [Proxy Progress] Device 5 CPU core 25

myles-System-Product-Name:12019:12077 [4] NCCL INFO [Proxy Progress] Device 4 CPU core 32

myles-System-Product-Name:12019:12078 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 115

myles-System-Product-Name:12019:12079 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 124

myles-System-Product-Name:12019:12080 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 42

myles-System-Product-Name:12019:12081 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 5

myles-System-Product-Name:12019:12061 [5] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512

myles-System-Product-Name:12019:12061 [5] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12019:12060 [4] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512

myles-System-Product-Name:12019:12060 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12019:12058 [2] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512

myles-System-Product-Name:12019:12058 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12019:12057 [1] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512

myles-System-Product-Name:12019:12057 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12019:12056 [0] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512

myles-System-Product-Name:12019:12056 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12019:12059 [3] NCCL INFO threadThresholds 8/8/64 | 48/8/64 | 512 | 512

myles-System-Product-Name:12019:12059 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12019:12056 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576

myles-System-Product-Name:12019:12060 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.

myles-System-Product-Name:12019:12060 [4] NCCL INFO ncclCommInitAll comm 0x63acc7870060 rank 4 nranks 6 cudaDev 4 nvmlDev 4 busId 42000 commId 0xbc91574eaba2750d - Init COMPLETE

myles-System-Product-Name:12019:12060 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 6 total 0.62 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.04, graphs 0.00, connections 0.07, rest 0.00)

myles-System-Product-Name:12019:12057 [1] NCCL INFO ncclCommInitAll comm 0x63acc77a7420 rank 1 nranks 6 cudaDev 1 nvmlDev 1 busId 2000 commId 0xbc91574eaba2750d - Init COMPLETE

myles-System-Product-Name:12019:12061 [5] NCCL INFO ncclCommInitAll comm 0x63acc78b2f20 rank 5 nranks 6 cudaDev 5 nvmlDev 5 busId 61000 commId 0xbc91574eaba2750d - Init COMPLETE

myles-System-Product-Name:12019:12056 [0] NCCL INFO ncclCommInitAll comm 0x63acc7764600 rank 0 nranks 6 cudaDev 0 nvmlDev 0 busId 1000 commId 0xbc91574eaba2750d - Init COMPLETE

myles-System-Product-Name:12019:12059 [3] NCCL INFO ncclCommInitAll comm 0x63acc782d1a0 rank 3 nranks 6 cudaDev 3 nvmlDev 3 busId 41000 commId 0xbc91574eaba2750d - Init COMPLETE

myles-System-Product-Name:12019:12056 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 6 total 0.62 (kernels 0.43, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.04, graphs 0.00, connections 0.07, rest 0.00)

myles-System-Product-Name:12019:12061 [5] NCCL INFO Init timings - ncclCommInitAll: rank 5 nranks 6 total 0.62 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.01, connections 0.07, rest 0.00)

myles-System-Product-Name:12019:12058 [2] NCCL INFO ncclCommInitAll comm 0x63acc77ea2e0 rank 2 nranks 6 cudaDev 2 nvmlDev 2 busId 2b000 commId 0xbc91574eaba2750d - Init COMPLETE

myles-System-Product-Name:12019:12057 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 6 total 0.62 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.01, connections 0.07, rest 0.00)

myles-System-Product-Name:12019:12058 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 6 total 0.62 (kernels 0.45, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.05, graphs 0.01, connections 0.07, rest 0.00)

myles-System-Product-Name:12019:12059 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 6 total 0.62 (kernels 0.44, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.04, graphs 0.01, connections 0.07, rest 0.00)

#

#                                                              out-of-place                       in-place          

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong

#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

    33554432       8388608     float     sum      -1   2274.8   14.75   24.58      0   2279.9   14.72   24.53      0

myles-System-Product-Name:12019:12019 [0] NCCL INFO comm 0x63acc7764600 rank 0 nranks 6 cudaDev 0 busId 1000 - Destroy COMPLETE

myles-System-Product-Name:12019:12019 [5] NCCL INFO comm 0x63acc78b2f20 rank 5 nranks 6 cudaDev 5 busId 61000 - Destroy COMPLETE

myles-System-Product-Name:12019:12019 [4] NCCL INFO comm 0x63acc7870060 rank 4 nranks 6 cudaDev 4 busId 42000 - Destroy COMPLETE

myles-System-Product-Name:12019:12019 [3] NCCL INFO comm 0x63acc782d1a0 rank 3 nranks 6 cudaDev 3 busId 41000 - Destroy COMPLETE

myles-System-Product-Name:12019:12019 [2] NCCL INFO comm 0x63acc77ea2e0 rank 2 nranks 6 cudaDev 2 busId 2b000 - Destroy COMPLETE

myles-System-Product-Name:12019:12019 [1] NCCL INFO comm 0x63acc77a7420 rank 1 nranks 6 cudaDev 1 busId 2000 - Destroy COMPLETE

# Out of bounds values : 0 OK

# Avg bus bandwidth    : 24.557 

#
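The `Trees` lines in the log above can be decoded to see what topology NCCL picked. As far as I understand the format, each `[c] d0/d1/d2->rank->up` entry lists up to three child ranks, the rank itself, and its parent, with `-1` meaning "none". A rough sketch under that assumption; `parse_trees` is a hypothetical helper, not an NCCL API.

```python
import re

# One entry per channel: "[c] d0/d1/d2->rank->up" (assumed layout, -1 = none).
TREE_RE = re.compile(r"\[(\d+)\] (-?\d+)/(-?\d+)/(-?\d+)->(-?\d+)->(-?\d+)")


def parse_trees(line):
    """Extract per-channel tree info from an NCCL INFO 'Trees' log line."""
    entries = []
    for ch, d0, d1, d2, rank, up in TREE_RE.findall(line):
        children = [int(d) for d in (d0, d1, d2) if int(d) != -1]
        parent = None if int(up) == -1 else int(up)
        entries.append(
            {"channel": int(ch), "rank": int(rank),
             "children": children, "parent": parent}
        )
    return entries


# Rank 4's line from the 6-GPU run above.
line = ("NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 "
        "[2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3")
trees = parse_trees(line)
print(trees[0])
# prints: {'channel': 0, 'rank': 4, 'children': [5], 'parent': 3}
```

Read this way, every rank in the 6-GPU run has at most one child and one parent on all four channels, i.e. the "tree" is a simple chain 0 -> 1 -> 2 -> 3 -> 4 -> 5.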



myles@myles-System-Product-Name:~/nccl-tests/build$ NCCL_DEBUG=INFO NCCL_P2P_LEVEL=PHB ./all_reduce_perf -g 7

# nThread 1 nGpus 7 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0

#

# Using devices

#  Rank  0 Group  0 Pid  12085 on myles-System-Product-Name device  0 [0x01] NVIDIA GeForce RTX 4090

#  Rank  1 Group  0 Pid  12085 on myles-System-Product-Name device  1 [0x02] NVIDIA GeForce RTX 4090

#  Rank  2 Group  0 Pid  12085 on myles-System-Product-Name device  2 [0x2b] NVIDIA GeForce RTX 4090

#  Rank  3 Group  0 Pid  12085 on myles-System-Product-Name device  3 [0x41] NVIDIA GeForce RTX 4090

#  Rank  4 Group  0 Pid  12085 on myles-System-Product-Name device  4 [0x42] NVIDIA GeForce RTX 4090

#  Rank  5 Group  0 Pid  12085 on myles-System-Product-Name device  5 [0x61] NVIDIA GeForce RTX 4090

#  Rank  6 Group  0 Pid  12085 on myles-System-Product-Name device  6 [0x62] NVIDIA GeForce RTX 4090

myles-System-Product-Name:12085:12085 [0] NCCL INFO Bootstrap : Using enp37s0f0:192.168.1.32<0>

myles-System-Product-Name:12085:12085 [0] NCCL INFO cudaDriverVersion 12060

myles-System-Product-Name:12085:12085 [0] NCCL INFO NCCL version 2.23.4+cuda12.6

myles-System-Product-Name:12085:12128 [4] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.

myles-System-Product-Name:12085:12128 [4] NCCL INFO Failed to open libibverbs.so[.1]

myles-System-Product-Name:12085:12128 [4] NCCL INFO NET/Socket : Using [0]enp37s0f0:192.168.1.32<0> [1]enp37s0f1:192.168.1.47<0>

myles-System-Product-Name:12085:12128 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.

myles-System-Product-Name:12085:12128 [4] NCCL INFO Using network Socket

myles-System-Product-Name:12085:12129 [5] NCCL INFO Using network Socket

myles-System-Product-Name:12085:12124 [0] NCCL INFO Using network Socket

myles-System-Product-Name:12085:12126 [2] NCCL INFO Using network Socket

myles-System-Product-Name:12085:12127 [3] NCCL INFO Using network Socket

myles-System-Product-Name:12085:12125 [1] NCCL INFO Using network Socket

myles-System-Product-Name:12085:12130 [6] NCCL INFO Using network Socket

myles-System-Product-Name:12085:12129 [5] NCCL INFO ncclCommInitAll comm 0x6074a4688650 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 61000 commId 0x278945f61c095e7c - Init START

myles-System-Product-Name:12085:12124 [0] NCCL INFO ncclCommInitAll comm 0x6074a4535eb0 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 1000 commId 0x278945f61c095e7c - Init START

myles-System-Product-Name:12085:12128 [4] NCCL INFO ncclCommInitAll comm 0x6074a4644b10 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 42000 commId 0x278945f61c095e7c - Init START

myles-System-Product-Name:12085:12130 [6] NCCL INFO ncclCommInitAll comm 0x6074a46cc190 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId 62000 commId 0x278945f61c095e7c - Init START

myles-System-Product-Name:12085:12125 [1] NCCL INFO ncclCommInitAll comm 0x6074a4579950 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2000 commId 0x278945f61c095e7c - Init START

myles-System-Product-Name:12085:12126 [2] NCCL INFO ncclCommInitAll comm 0x6074a45bd490 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x278945f61c095e7c - Init START

myles-System-Product-Name:12085:12127 [3] NCCL INFO ncclCommInitAll comm 0x6074a4600fd0 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 41000 commId 0x278945f61c095e7c - Init START

myles-System-Product-Name:12085:12129 [5] NCCL INFO Bootstrap timings total 0.001611 (create 0.000054, send 0.000183, recv 0.000613, ring 0.000557, delay 0.000000)

myles-System-Product-Name:12085:12128 [4] NCCL INFO Bootstrap timings total 0.001527 (create 0.000050, send 0.000184, recv 0.000371, ring 0.000196, delay 0.000000)

myles-System-Product-Name:12085:12130 [6] NCCL INFO Bootstrap timings total 0.001496 (create 0.000057, send 0.000187, recv 0.000555, ring 0.000491, delay 0.000000)

myles-System-Product-Name:12085:12124 [0] NCCL INFO Bootstrap timings total 0.001573 (create 0.000057, send 0.000184, recv 0.000725, ring 0.000396, delay 0.000000)

myles-System-Product-Name:12085:12126 [2] NCCL INFO Bootstrap timings total 0.001412 (create 0.000060, send 0.000148, recv 0.000775, ring 0.000209, delay 0.000000)

myles-System-Product-Name:12085:12125 [1] NCCL INFO Bootstrap timings total 0.001473 (create 0.000054, send 0.000159, recv 0.000745, ring 0.000299, delay 0.000000)

myles-System-Product-Name:12085:12127 [3] NCCL INFO Bootstrap timings total 0.001377 (create 0.000060, send 0.000170, recv 0.000776, ring 0.000176, delay 0.000000)

myles-System-Product-Name:12085:12125 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to PHB

myles-System-Product-Name:12085:12125 [1] NCCL INFO NVLS multicast support is not available on dev 1

myles-System-Product-Name:12085:12130 [6] NCCL INFO NVLS multicast support is not available on dev 6

myles-System-Product-Name:12085:12128 [4] NCCL INFO NVLS multicast support is not available on dev 4

myles-System-Product-Name:12085:12129 [5] NCCL INFO NVLS multicast support is not available on dev 5

myles-System-Product-Name:12085:12124 [0] NCCL INFO NVLS multicast support is not available on dev 0

myles-System-Product-Name:12085:12127 [3] NCCL INFO NVLS multicast support is not available on dev 3

myles-System-Product-Name:12085:12126 [2] NCCL INFO NVLS multicast support is not available on dev 2

myles-System-Product-Name:12085:12125 [1] NCCL INFO comm 0x6074a4579950 rank 1 nRanks 7 nNodes 1 localRanks 7 localRank 1 MNNVL 0

myles-System-Product-Name:12085:12126 [2] NCCL INFO comm 0x6074a45bd490 rank 2 nRanks 7 nNodes 1 localRanks 7 localRank 2 MNNVL 0

myles-System-Product-Name:12085:12129 [5] NCCL INFO comm 0x6074a4688650 rank 5 nRanks 7 nNodes 1 localRanks 7 localRank 5 MNNVL 0

myles-System-Product-Name:12085:12125 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0

myles-System-Product-Name:12085:12127 [3] NCCL INFO comm 0x6074a4600fd0 rank 3 nRanks 7 nNodes 1 localRanks 7 localRank 3 MNNVL 0

myles-System-Product-Name:12085:12126 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1

myles-System-Product-Name:12085:12130 [6] NCCL INFO comm 0x6074a46cc190 rank 6 nRanks 7 nNodes 1 localRanks 7 localRank 6 MNNVL 0

myles-System-Product-Name:12085:12127 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2

myles-System-Product-Name:12085:12130 [6] NCCL INFO Trees [0] -1/-1/-1->6->5 [1] -1/-1/-1->6->5 [2] -1/-1/-1->6->5 [3] -1/-1/-1->6->5

myles-System-Product-Name:12085:12129 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4

myles-System-Product-Name:12085:12128 [4] NCCL INFO comm 0x6074a4644b10 rank 4 nRanks 7 nNodes 1 localRanks 7 localRank 4 MNNVL 0

myles-System-Product-Name:12085:12124 [0] NCCL INFO comm 0x6074a4535eb0 rank 0 nRanks 7 nNodes 1 localRanks 7 localRank 0 MNNVL 0

myles-System-Product-Name:12085:12125 [1] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12085:12127 [3] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12085:12130 [6] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12085:12129 [5] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12085:12128 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3

myles-System-Product-Name:12085:12128 [4] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12085:12126 [2] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4 5 6

myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 01/04 : 0 1 2 3 4 5 6

myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4 5 6

myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 03/04 : 0 1 2 3 4 5 6

myles-System-Product-Name:12085:12124 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1

myles-System-Product-Name:12085:12124 [0] NCCL INFO P2P Chunksize set to 131072

myles-System-Product-Name:12085:12133 [1] NCCL INFO [Proxy Service] Device 1 CPU core 117

myles-System-Product-Name:12085:12136 [5] NCCL INFO [Proxy Service] Device 5 CPU core 85

myles-System-Product-Name:12085:12139 [3] NCCL INFO [Proxy Service] Device 3 CPU core 28

myles-System-Product-Name:12085:12137 [0] NCCL INFO [Proxy Service] Device 0 CPU core 91

myles-System-Product-Name:12085:12134 [6] NCCL INFO [Proxy Service] Device 6 CPU core 107

myles-System-Product-Name:12085:12138 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 100

myles-System-Product-Name:12085:12144 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 52

myles-System-Product-Name:12085:12135 [4] NCCL INFO [Proxy Service] Device 4 CPU core 20

myles-System-Product-Name:12085:12141 [5] NCCL INFO [Proxy Service UDS] Device 5 CPU core 81

myles-System-Product-Name:12085:12143 [2] NCCL INFO [Proxy Service] Device 2 CPU core 105

myles-System-Product-Name:12085:12140 [6] NCCL INFO [Proxy Service UDS] Device 6 CPU core 47

myles-System-Product-Name:12085:12142 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 67

myles-System-Product-Name:12085:12145 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 91

myles-System-Product-Name:12085:12146 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 29

myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/direct pointer

myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 00/0 : 6[6] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 01/0 : 6[6] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/direct pointer

myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/direct pointer

myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 02/0 : 6[6] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 03/0 : 6[6] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/direct pointer

myles-System-Product-Name:12085:12124 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12085:12130 [6] NCCL INFO Connected all rings

myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12085:12128 [4] NCCL INFO Connected all rings

myles-System-Product-Name:12085:12129 [5] NCCL INFO Connected all rings

myles-System-Product-Name:12085:12124 [0] NCCL INFO Connected all rings

myles-System-Product-Name:12085:12125 [1] NCCL INFO Connected all rings

myles-System-Product-Name:12085:12127 [3] NCCL INFO Connected all rings

myles-System-Product-Name:12085:12126 [2] NCCL INFO Connected all rings

myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12085:12130 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/direct pointer

myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12085:12128 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/direct pointer

myles-System-Product-Name:12085:12129 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/direct pointer

myles-System-Product-Name:12085:12126 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer

myles-System-Product-Name:12085:12125 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer

myles-System-Product-Name:12085:12127 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer

myles-System-Product-Name:12085:12129 [5] NCCL INFO Connected all trees

myles-System-Product-Name:12085:12125 [1] NCCL INFO Connected all trees

myles-System-Product-Name:12085:12124 [0] NCCL INFO Connected all trees

myles-System-Product-Name:12085:12126 [2] NCCL INFO Connected all trees

myles-System-Product-Name:12085:12128 [4] NCCL INFO Connected all trees

myles-System-Product-Name:12085:12130 [6] NCCL INFO Connected all trees

myles-System-Product-Name:12085:12127 [3] NCCL INFO Connected all trees

myles-System-Product-Name:12085:12148 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 106

myles-System-Product-Name:12085:12153 [4] NCCL INFO [Proxy Progress] Device 4 CPU core 30

myles-System-Product-Name:12085:12147 [6] NCCL INFO [Proxy Progress] Device 6 CPU core 86

myles-System-Product-Name:12085:12150 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 8

myles-System-Product-Name:12085:12149 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 90

myles-System-Product-Name:12085:12151 [5] NCCL INFO [Proxy Progress] Device 5 CPU core 119

myles-System-Product-Name:12085:12152 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 28

myles-System-Product-Name:12085:12127 [3] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512

myles-System-Product-Name:12085:12127 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12085:12130 [6] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512

myles-System-Product-Name:12085:12130 [6] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12085:12129 [5] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512

myles-System-Product-Name:12085:12129 [5] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12085:12125 [1] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512

myles-System-Product-Name:12085:12125 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12085:12128 [4] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512

myles-System-Product-Name:12085:12128 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12085:12126 [2] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512

myles-System-Product-Name:12085:12126 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12085:12124 [0] NCCL INFO threadThresholds 8/8/64 | 56/8/64 | 512 | 512

myles-System-Product-Name:12085:12124 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer

myles-System-Product-Name:12085:12124 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576

myles-System-Product-Name:12085:12128 [4] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.

myles-System-Product-Name:12085:12128 [4] NCCL INFO ncclCommInitAll comm 0x6074a4644b10 rank 4 nranks 7 cudaDev 4 nvmlDev 4 busId 42000 commId 0x278945f61c095e7c - Init COMPLETE

myles-System-Product-Name:12085:12128 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 7 total 0.72 (kernels 0.50, alloc 0.06, bootstrap 0.00, allgathers 0.02, topo 0.05, graphs 0.01, connections 0.08, rest 0.00)

myles-System-Product-Name:12085:12130 [6] NCCL INFO ncclCommInitAll comm 0x6074a46cc190 rank 6 nranks 7 cudaDev 6 nvmlDev 6 busId 62000 commId 0x278945f61c095e7c - Init COMPLETE

myles-System-Product-Name:12085:12130 [6] NCCL INFO Init timings - ncclCommInitAll: rank 6 nranks 7 total 0.72 (kernels 0.52, alloc 0.05, bootstrap 0.00, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.08, rest 0.00)

myles-System-Product-Name:12085:12125 [1] NCCL INFO ncclCommInitAll comm 0x6074a4579950 rank 1 nranks 7 cudaDev 1 nvmlDev 1 busId 2000 commId 0x278945f61c095e7c - Init COMPLETE

myles-System-Product-Name:12085:12125 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 7 total 0.74 (kernels 0.52, alloc 0.05, bootstrap 0.00, allgathers 0.02, topo 0.05, graphs 0.01, connections 0.08, rest 0.02)

myles-System-Product-Name:12085:12127 [3] NCCL INFO ncclCommInitAll comm 0x6074a4600fd0 rank 3 nranks 7 cudaDev 3 nvmlDev 3 busId 41000 commId 0x278945f61c095e7c - Init COMPLETE

myles-System-Product-Name:12085:12127 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 7 total 0.74 (kernels 0.52, alloc 0.05, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.01, connections 0.08, rest 0.02)

myles-System-Product-Name:12085:12129 [5] NCCL INFO ncclCommInitAll comm 0x6074a4688650 rank 5 nranks 7 cudaDev 5 nvmlDev 5 busId 61000 commId 0x278945f61c095e7c - Init COMPLETE

myles-System-Product-Name:12085:12129 [5] NCCL INFO Init timings - ncclCommInitAll: rank 5 nranks 7 total 0.74 (kernels 0.51, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.08, rest 0.02)

myles-System-Product-Name:12085:12124 [0] NCCL INFO ncclCommInitAll comm 0x6074a4535eb0 rank 0 nranks 7 cudaDev 0 nvmlDev 0 busId 1000 commId 0x278945f61c095e7c - Init COMPLETE

myles-System-Product-Name:12085:12124 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 7 total 0.74 (kernels 0.51, alloc 0.06, bootstrap 0.00, allgathers 0.01, topo 0.05, graphs 0.01, connections 0.08, rest 0.02)

myles-System-Product-Name:12085:12126 [2] NCCL INFO ncclCommInitAll comm 0x6074a45bd490 rank 2 nranks 7 cudaDev 2 nvmlDev 2 busId 2b000 commId 0x278945f61c095e7c - Init COMPLETE

myles-System-Product-Name:12085:12126 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 7 total 0.74 (kernels 0.51, alloc 0.06, bootstrap 0.00, allgathers 0.00, topo 0.06, graphs 0.01, connections 0.08, rest 0.02)

#

#                                                              out-of-place                       in-place          

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong

#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

    33554432       8388608     float     sum      -1   2343.2   14.32   24.55      0   2338.6   14.35   24.60      0

myles-System-Product-Name:12085:12085 [0] NCCL INFO comm 0x6074a4535eb0 rank 0 nranks 7 cudaDev 0 busId 1000 - Destroy COMPLETE

myles-System-Product-Name:12085:12085 [6] NCCL INFO comm 0x6074a46cc190 rank 6 nranks 7 cudaDev 6 busId 62000 - Destroy COMPLETE

myles-System-Product-Name:12085:12085 [5] NCCL INFO comm 0x6074a4688650 rank 5 nranks 7 cudaDev 5 busId 61000 - Destroy COMPLETE

myles-System-Product-Name:12085:12085 [4] NCCL INFO comm 0x6074a4644b10 rank 4 nranks 7 cudaDev 4 busId 42000 - Destroy COMPLETE

myles-System-Product-Name:12085:12085 [3] NCCL INFO comm 0x6074a4600fd0 rank 3 nranks 7 cudaDev 3 busId 41000 - Destroy COMPLETE

myles-System-Product-Name:12085:12085 [2] NCCL INFO comm 0x6074a45bd490 rank 2 nranks 7 cudaDev 2 busId 2b000 - Destroy COMPLETE

myles-System-Product-Name:12085:12085 [1] NCCL INFO comm 0x6074a4579950 rank 1 nranks 7 cudaDev 1 busId 2000 - Destroy COMPLETE

# Out of bounds values : 0 OK

# Avg bus ```

@ZP-AlwaysWin
Author

ZP-AlwaysWin commented Oct 23, 2024

NCCL_P2P_LEVEL

I have 8 PCIe 16x 4.0 GPU cards, but as long as P2P capability is enabled, the data transfer speed drastically decreases when using more than two cards. I tested with NCCL_P2P_LEVEL=anything. In one case, there was no improvement, and in another, the performance dropped significantly. I can share my test results with you. @mylesgoose

@mylesgoose

NCCL_P2P_LEVEL

I have 8 PCIe 16x 4.0 GPU cards, but as long as P2P capability is enabled, the data transfer speed drastically decreases when using more than two cards. I tested with NCCL_P2P_LEVEL=anything. In one case, there was no improvement, and in another, the performance dropped significantly. If you're Chinese, you can add me on WeChat at wxid_8704547045612, and I can share my test results with you. @mylesgoose

Hello. I am not Chinese and don't have WeChat, but I think it is also helpful to resolve the issue here.
Can you run this command:
NCCL_DEBUG=INFO ./all_reduce_perf -g 3

Also, which motherboard are you using? I don't know of many motherboards that run 8 cards in PCIe x16 slots at full x16 speed. For example, a physical x16 slot can still run at x8 or x4 speed when the lanes are bifurcated. My theory is that when NCCL runs, it checks how many lanes each device has, and if they don't all have the same lane count it defaults to the CPU. If you run this command, you can see whether NCCL is routing transfers through the CPU or over P2P:

NCCL_DEBUG=INFO ./all_reduce_perf -g 3

The relevant lines look like this: 1[1] -> 0[0] via P2P/direct pointer (P2P), or like this: 1[1] -> 2[2] via SHM/direct/direct (shared memory through the CPU).
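A minimal sketch of how that check could be automated, assuming the output of the NCCL_DEBUG=INFO run has been saved to a text file. The function names and the regular expression are illustrative and only cover connection lines shaped like the ones quoted above; other NCCL versions may format them differently.

```python
import re
from collections import Counter

# Matches NCCL INFO connection lines such as:
#   "... NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer"
# Ring-order lines like "Channel 00/04 : 0 1 2 3" have no brackets and are skipped.
CONNECTION_RE = re.compile(
    r"Channel\s+(\d+)/\d+\s*:\s*(\d+)\[\w+\]\s*->\s*(\d+)\[\w+\]\s+via\s+(\w+)"
)


def count_transports(log_text):
    """Count channel connections per transport name (P2P, SHM, NET, ...)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CONNECTION_RE.search(line)
        if match:
            counts[match.group(4)] += 1
    return counts


def non_p2p_pairs(log_text):
    """Return sorted (src_rank, dst_rank, transport) triples that do not use P2P."""
    pairs = set()
    for line in log_text.splitlines():
        match = CONNECTION_RE.search(line)
        if match and match.group(4) != "P2P":
            pairs.add((int(match.group(2)), int(match.group(3)), match.group(4)))
    return sorted(pairs)
```

For example, `non_p2p_pairs(open("nccl.log").read())` (the file name is hypothetical) returns an empty list when every connection in the log goes over P2P.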

To make it work on your system: if SHM is showing, you will have to modify the NCCL source code, which is available on GitHub, so that it always uses P2P instead of the CPU. From my test above you can see there is no issue running P2P across many RTX 4090 GPUs with the tinygrad driver, provided the cards all have the same bandwidth. An x16 card running at PCIe x4 or x8 speeds does not work; if all devices run at x16 speeds, it works.

I think you may be confused about the bandwidth of your cards. Each is obviously an x16 card, but the motherboard must be seriously good to have 128 PCIe lanes available to the CPU, excluding other devices. Even if the transfer happens via P2P, there must be enough balanced lanes to each GPU. If you look at https://tinygrad.org/#tinybox you can see he is selling a computer with 8 GPUs in it. Why does it need such a powerful motherboard with 2x AMD GENOA? This is why: even with 128 lanes of PCIe 4.0 per CPU, a single AMD 3995x Threadripper can only handle 8 cards if nothing else is using the other lanes. You are asking NCCL to use 128 lanes, 16 per GPU, for PCIe transfers. This will fail and fall back to the CPU if you don't have a CPU with enough lanes. Some lanes may also be used for other things, like NVMe drives, each using 4.

So I believe your issue is that you are either mixing 8 devices, some of them x16 cards to which the CPU allocates fewer lanes, or you don't have enough lanes to serve that many devices at one time. I have shown that the driver on CUDA 12.6 with NVIDIA 560 works with 7 GPUs at full P2P, provided the GPUs are all on x16 links with a supported CPU. Even though CPU RAM is not used for the transfer, the CPU still mediates the process, and I believe each lane must still be active. Would it work if we took 8 PCIe 4.0 x16 GPUs and bifurcated them all onto x8 links? Maybe we have to try this. Moral of the story: this is a hardware issue or an NCCL issue, not a driver issue.
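The lane budget described above, and whether each card has really negotiated an x16 link, can be checked with a short script. This is a sketch under assumptions: it expects the text produced by `nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current,pcie.link.width.max --format=csv,noheader`, and the helper names are made up for illustration.

```python
import csv
import io


def lanes_required(num_gpus, lanes_per_gpu=16):
    """Total PCIe lanes needed if every GPU gets a full-width link."""
    return num_gpus * lanes_per_gpu


def parse_link_report(csv_text):
    """Parse 'index, gen, width, max_width' rows from nvidia-smi CSV output."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not fields:
            continue  # skip blank lines
        index, gen, width, max_width = (int(f.strip()) for f in fields)
        rows.append(
            {"index": index, "gen": gen, "width": width, "max_width": max_width}
        )
    return rows


def narrow_links(rows):
    """Return the GPUs whose current link width is below the width they support."""
    return [row for row in rows if row["width"] < row["max_width"]]
```

With 8 GPUs, `lanes_required(8)` gives the 128 lanes mentioned above. Any entry returned by `narrow_links` is a card running below its supported width. The reported link generation may drop while a GPU is idle, so it is safer to take the reading while a test is running.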

@mylesgoose

mylesgoose commented Oct 23, 2024

Also, I noticed your bus bandwidth here:

# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.85887
# 

Running on only 2 cards, this proves you're not using x16 lanes; it should be at least 24 GB/s with x16 lanes.
@ZP-AlwaysWin
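For reference, here is a small sketch of how the busbw column relates to algbw, assuming the figures come from all_reduce_perf, where nccl-tests scales algbw by 2(n-1)/n. Other collectives use different factors. The second helper estimates the theoretical one-direction PCIe rate from the per-lane transfer rate and 128b/130b encoding. Both function names are illustrative.

```python
def allreduce_busbw(algbw_gb_s, num_ranks):
    """Bus bandwidth reported by all_reduce_perf: algbw * 2 * (n - 1) / n."""
    return algbw_gb_s * 2 * (num_ranks - 1) / num_ranks


def pcie_one_way_gb_s(gt_per_s, lanes, encoding=128 / 130):
    """Theoretical one-direction PCIe throughput in GB/s before protocol overhead."""
    return gt_per_s * lanes * encoding / 8


# The 7-GPU row in the log above reports algbw 14.32 GB/s and busbw 24.55 GB/s.
seven_gpu_busbw = allreduce_busbw(14.32, 7)

# PCIe 4.0 runs at 16 GT/s per lane; an x16 link tops out near 31.5 GB/s.
gen4_x16 = pcie_one_way_gb_s(16, 16)
```

With 2 ranks the all-reduce factor is 1, so busbw equals algbw and the figure can be compared directly against the link's theoretical rate.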

@ZP-AlwaysWin
Author

also i notice your bus bandwidth here

# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.85887
# 

running only on 2 cards proves your not using 16x lanes this should be 24gb per second at least with 16x lanes @ZP-AlwaysWin

p2p.txt
@mylesgoose Here is my test log file, could you take a look at it? I think my testing is fine, and it is also 16 lanes.

@ZP-AlwaysWin
Author

I’m using an Intel SPR motherboard with all 8 GPUs directly connected to the CPU, using PCIe 4.0 x16. Sapphire Rapids has enough lanes for more than 8 PCIe 4.0 x16 links.

@mylesgoose

@ZP-AlwaysWin Looks good now. What did you change? I see the bandwidth has gone up to 10, above the 3 where it was, and the log shows it is using P2P.

@mylesgoose

I’m using an Intel SPR motherboard with all 8 GPUs directly connected to the CPU, using PCIe 4.0 x16. Sapphire Rapids has more than 8 PCIe 4.0 x16 lanes.

Is it this one: 1 PCIe 5.0 x8, 2 PCIe 5.0 x16, 4 PCIe 5.0 x8 MCIO?

@ZP-AlwaysWin
Author

@ZP-AlwaysWin looks good now.. what did ya change? See bandwidth gone up to 10 above 3 where it was.and shows is using p2p

I haven’t changed anything, but my test results still didn’t meet expectations, so I’ve decided to temporarily abandon this P2P capability.

@xiaobuding-cx

@ZP-AlwaysWin Hello, I encountered the same issue as you: when specifying more than 2 cards, the performance of all-to-all is very poor. Have you resolved this issue now? Could you provide us with some suggestions, thank you.

@xiaobuding-cx

@mylesgoose Hi, any new findings regarding this type of issue? thank you!

@mylesgoose

@xiaobuding-cx I have been experimenting with 8 GPUs on a dual-CPU system. I have not tried these modifications yet, but this user reports promising results (screenshot: Screenshot_20241226_080252_Brave).
What he changed is listed here:

P2P driver version 560 for CUDA 12.6: https://github.com/aikitoria/open-gpu-kernel-modules

P2P bandwidth result: https://pastebin.com/x37LLh1q

The settings to change are xGMI Link Width (from Auto/Dynamic to Manual x16) and xGMI Link Speed (from Auto to 25Gbps) and IOMMU to Disabled.
