560.35.03 p2p #22
@mylesgoose I'm able to install the driver fine, but I can't seem to replicate your simpleP2P results. I'm running 4x RTX4090's in Ubuntu 24.04 LTS Server. Used
Output from nvidia-smi topo:
Simplep2p:
Any thoughts? |
I was able to get a step further, but simpleP2P fails:
|
@henriklied what do you mean by replicating my results? Are you able to get P2P working at all, or is the speed slow? Can you provide more information?
I found a simpler way to install it on Ubuntu 24.04 or 24.10. First install the correct open NVIDIA driver with apt and check that it runs fine (say, the 560 one). The stock driver works well even under Wayland, whereas the P2P one only really works well under X11; with Wayland I get glitches in GNOME, Nautilus, the system manager and so on.
Next, clone the source, making sure you pick the branch that matches the driver you have, and compile it (C++14 etc.) against your running kernel. Then search your system for nvidia.ko files. You should find two sets: the apt-installed ones somewhere under /usr/lib/modules/$(uname -r), and your P2P ones in your source folder. Copy the original module files from the nvidia 560 folder to a backup location, then copy in your modified ones under the same file names. Then reboot, or unload the modules and reload them along with your display manager.
I set up a script that does this: when I need P2P, or when I just want the standard driver back, I run the script and it replaces the modules and reloads them in about 30 seconds. This way apt thinks it still has the original driver and does not try to update it all the time, and you know for certain that you replaced the driver with your modified modules (check the timestamps).
Before you reboot, if you are running a display manager, make sure the session is X11/Xorg (for example Xfce, or KDE on SDDM); on my system at least there is an issue with Wayland. I think X11 works because it does not check for secure RAM, and Wayland does not like the VRAM being mapped globally.
Next, you can test simply with the nvidia-smi P2P status command (nvidia-smi topo -p2p rw). Have a look at the issues section on geohot's page; there are some discussions that show the simple nvidia-smi commands to check P2P. If P2P is working, then perhaps you mean your speeds? Those are related to the NCCL exports and setting the P2P level to SYS or PHB or whatever it is. 🤔 You likely have two sets of the same modules trying to load. |
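To see which copies of nvidia.ko exist on a system (the apt-installed ones and the rebuilt ones), a search along these lines may help. This is only a minimal sketch: `list_nvidia_ko` is a hypothetical helper name, and the directories in the usage line are assumptions based on the paths mentioned in this thread.

```shell
#!/bin/sh
# list_nvidia_ko DIR...: print every file named nvidia.ko (or nvidia.ko.<suffix>,
# such as a .bak backup or a compressed module) found under the given directories.
# Hypothetical helper; it only lists files and changes nothing.
list_nvidia_ko() {
    find "$@" -type f \( -name 'nvidia.ko' -o -name 'nvidia.ko.*' \) 2>/dev/null | sort
}

# Usage (paths are assumptions, adjust to your system):
#   list_nvidia_ko "/usr/lib/modules/$(uname -r)" /var/lib/dkms "$HOME/open-gpu-kernel-modules" | xargs ls -l
```

Piping the result through `ls -l` shows sizes and timestamps, which is how the packaged and the rebuilt modules are told apart later in the thread.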
Thanks for the quick reply, @mylesgoose! I think the issue is that the machine has not enabled large BAR support in the BIOS, so I will attempt to enable this when I get physical access to the machine in a couple of days. My error seems to correspond with this issue. So I will try that first. |
Looks to me like you're pretty close. It must be that large BAR thing, as it now says P2P enabled. I don't know why, but it did not show your messages above in full beforehand. 😕 |
Was able to enable large BAR, but now I'm getting a different error. Any thoughts, @mylesgoose ?
|
@henriklied I think there was an update to this repo that fixed that issue. You must have used the older one. Which repo or files did you use to install? 1ca8b01 |
@henriklied how did you get on? |
@mylesgoose Thanks. I managed to get the 560 branch, and it seems to be working well. One strange thing, though: 1->2 is OK, but 2->1 gets half the bandwidth with P2P. It's OK without P2P. Device: 0, NVIDIA RTX 6000 Ada Generation, pciBusID: 21, pciDeviceID: 0, pciDomainID:0 Unidirectional P2P=Disabled Bandwidth Matrix (GB/s) |
@ewhacc that's a weird one. 😳 Are all your cards identical? What happens if you run the test again directly after the first run, if the card is on a different port, or if you switch cards? Exporting NCCL_DEBUG=INFO might help figure it out. |
@mylesgoose Yeah, it's weird. One card is not identical, but the problem is not there. Anyway, I'm going to test again with only the 4090s. Also, changing the slot has an effect, as you said. |
@mylesgoose It works great after removing the non-4090 card. Strangely, that card had affected P2P between the other two 4090s. Bidirectional P2P=Disabled Bandwidth Matrix (GB/s) Mixing cards seems to mess up P2P. I have experienced it even between a 6000 Ada and an A6000. |
@ewhacc glad its working for ya bud. |
@henriklied Did you fix the problem? Now, I got the same problem. |
Can you provide more information, like what program you were using and which driver is loaded? Has apt updated your driver? What does nvidia-smi topo -p2p rw return? @ewhacc how did you install the driver? I think that .deb file has an issue; you have to compile from source. Let me know. |
I will provide in full the steps I use to install. I have set up a 4-GPU system to test.
sudo apt update
sudo apt upgrade
I searched Google for the NVIDIA CUDA Toolkit 12.6.2 and used the network installer to add the NVIDIA repo to apt. I don't install the toolkit yet. I update apt.
I search apt for nvidia-open.
I now see that it's available.
I type sudo apt install nvidia-open-560 --install-suggests. During the install it tells you the locations it compiles the kernel modules to. In my case that is /usr/lib/modules/6.8.0-48-generic/updates/dkms/, and I can confirm there are 5 files there: nvidia.ko nvidia-drm.ko nvidia-modeset.ko nvidia-peermem.ko nvidia-uvm.ko. A system-wide search for nvidia.ko finds a second file with the same name in /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module. Both files are 13.7 MB. That step also installs gcc 11, build-essential, etc. It also means that apt sees your driver as installed and will not try to update or replace it all the time, and it removes any other driver versions.
sudo apt autoremove
sudo apt install git
git clone -b 560.35.03-p2p https://github.com/mylesgoose/open-gpu-kernel-modules.git
cd open-gpu-kernel-modules
make modules -j$(nproc)
At this point the compilation fails with errors.
So I type
sudo apt install gcc-12 g++-12
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 60
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 60
sudo update-alternatives --config gcc
sudo update-alternatives --config g++
make clean
Then make modules again as above.
That time it compiled without errors.
I noted the module locations above when we installed the open driver with apt from the NVIDIA repo. I backed up the original modules (the .ko files) by renaming them to .ko.bak,
for example:
# Backup the original modules in DKMS installation directory
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia.ko.bak
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-drm.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-drm.ko.bak
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-modeset.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-modeset.ko.bak
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-peermem.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-peermem.ko.bak
sudo mv /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-uvm.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/nvidia-uvm.ko.bak
# Backup the original modules in DKMS source directory
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia.ko.bak
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-drm.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-drm.ko.bak
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-modeset.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-modeset.ko.bak
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-peermem.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-peermem.ko.bak
sudo mv /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-uvm.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/nvidia-uvm.ko.bak
# Copy the new modules from the open-gpu folder to DKMS installation directory
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-drm.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-modeset.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-peermem.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-uvm.ko /usr/lib/modules/6.8.0-48-generic/updates/dkms/
# Copy the new modules from the open-gpu folder to DKMS source directory
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-drm.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-modeset.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-peermem.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
sudo cp /home/myles/open-gpu-kernel-modules/kernel-open/nvidia-uvm.ko /var/lib/dkms/nvidia/560.35.03/6.8.0-48-generic/x86_64/module/
I copied the respective compiled modules we just made to the two locations provided by the apt installer.
I confirmed that the modules I copied from our P2P folder exist in those two locations, with a size of 26 MB each for nvidia.ko.
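The backup-and-copy commands above repeat the same two steps for five modules and two directories, so they can be folded into a loop. The following is a minimal sketch, not the script mentioned earlier in the thread: `swap_modules` is a hypothetical helper name, and unlike the plain `mv` commands it keeps an existing `.ko.bak` instead of overwriting it, so that a second run does not replace the backup of the original driver with a patched module.

```shell
#!/bin/sh
# swap_modules SRC_DIR DST_DIR
# For each NVIDIA module: rename the original in DST_DIR to <name>.ko.bak
# (only if no backup exists yet), then copy the rebuilt module from SRC_DIR.
# Hypothetical helper; the real DKMS directories need root (run via sudo).
swap_modules() {
    src=$1
    dst=$2
    # Refuse to touch anything unless all five rebuilt modules are present.
    for m in nvidia nvidia-drm nvidia-modeset nvidia-peermem nvidia-uvm; do
        if [ ! -f "$src/$m.ko" ]; then
            echo "missing $src/$m.ko" >&2
            return 1
        fi
    done
    for m in nvidia nvidia-drm nvidia-modeset nvidia-peermem nvidia-uvm; do
        if [ -f "$dst/$m.ko" ] && [ ! -e "$dst/$m.ko.bak" ]; then
            mv "$dst/$m.ko" "$dst/$m.ko.bak" || return 1
        fi
        cp "$src/$m.ko" "$dst/$m.ko" || return 1
    done
}
```

With the paths used above, it would be called once per destination, for example with `/home/myles/open-gpu-kernel-modules/kernel-open` as the source and `/usr/lib/modules/6.8.0-48-generic/updates/dkms` as the destination, from a root shell.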
I reboot and type this in the terminal:
nvidia-smi topo -p2p rw
GPU0 GPU1 GPU2 GPU3
GPU0 X OK OK OK
GPU1 OK X OK OK
GPU2 OK OK X OK
GPU3 OK OK OK X
Legend:
X = Self
OK = Status Ok
CNS = Chipset not supported
GNS = GPU not supported
TNS = Topology not supported
NS = Not supported
U = Unknown
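The matrix above can also be checked mechanically, which is handy in a script that swaps drivers. This is a minimal sketch that assumes the output layout shown above (a header row of GPU names, then one row per GPU); `p2p_all_ok` is a hypothetical helper name, not an nvidia-smi feature.

```shell
#!/bin/sh
# p2p_all_ok: read `nvidia-smi topo -p2p rw` output on stdin and succeed only
# if every cell of the GPU matrix is either "X" (self) or "OK".
# Hypothetical helper; assumes the layout shown above.
p2p_all_ok() {
    awk '
        # Matrix rows start with "GPU<n>" followed by status cells. The header
        # row is skipped because its second field is also a GPU name.
        NF >= 2 && $1 ~ /^GPU[0-9]+$/ && $2 !~ /^GPU[0-9]+$/ {
            rows++
            for (i = 2; i <= NF; i++)
                if ($i != "X" && $i != "OK") bad++
        }
        END {
            if (rows > 0 && bad == 0) exit 0
            exit 1
        }
    '
}

# Usage (on a machine with the driver loaded):
#   nvidia-smi topo -p2p rw | p2p_all_ok && echo "P2P OK on all pairs"
```

Any other status from the legend (CNS, GNS, TNS, NS, U) in any cell makes the function return non-zero.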
and then
sudo apt-get -y install cuda-toolkit-12-6
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
sudo apt install libgl1-mesa-dev libglu1-mesa-dev
source ~/.bashrc
sudo apt install libnccl2 libnccl-dev
sudo apt-get install cmake
sudo apt-get install freeglut3-dev
sudo apt-get install libfreeimage-dev
sudo apt-get install openmpi-bin openmpi-common libopenmpi-dev
wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo tee /etc/apt/trusted.gpg.d/lunarg.asc
sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-1.3.296-jammy.list https://packages.lunarg.com/vulkan/1.3.296/lunarg-vulkan-1.3.296-jammy.list
sudo apt update
sudo apt install vulkan-sdk
make -j$(nproc) |
'/home/myles/cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P'
[/home/myles/cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU3) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU3) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 5.88GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 0: val = nan, ref = 0.000000
Verification error @ element 1: val = nan, ref = 4.000000
Verification error @ element 2: val = nan, ref = 8.000000
Verification error @ element 3: val = nan, ref = 12.000000
Verification error @ element 4: val = nan, ref = 16.000000
Verification error @ element 5: val = nan, ref = 20.000000
Verification error @ element 6: val = nan, ref = 24.000000
Verification error @ element 7: val = nan, ref = 28.000000
Verification error @ element 8: val = nan, ref = 32.000000
Verification error @ element 9: val = nan, ref = 36.000000
Verification error @ element 10: val = nan, ref = 40.000000
Verification error @ element 11: val = nan, ref = 44.000000
Disabling peer access...
Shutting down...
Test failed!
myles@myles-MC62-G40-00:~/cuda-samples$ dmesg | grep -e DMAR -e IOMMU
dmesg: read kernel buffer failed: Operation not permitted
myles@myles-MC62-G40-00:~/cuda-samples$ sudo dmesg | grep -e DMAR -e IOMMU
[sudo] password for myles:
[ 0.829972] pci 0000:60:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.850474] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.864277] pci 0000:20:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.876239] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[ 0.897945] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[ 0.897960] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[ 0.897974] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[ 0.897988] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
[ 2238.543680] AMD-Vi: IOMMU Event log restarting
[ 2238.551089] AMD-Vi: IOMMU Event log restarting
[ 2238.558182] AMD-Vi: IOMMU Event log restarting
[ 2238.566681] AMD-Vi: IOMMU Event log restarting
[ 2238.573512] AMD-Vi: IOMMU Event log restarting
[ 2238.581588] AMD-Vi: IOMMU Event log restarting
[ 2238.590563] AMD-Vi: IOMMU Event log restarting
[ 2238.596884] AMD-Vi: IOMMU Event log restarting
[ 2238.604090] AMD-Vi: IOMMU Event log restarting
[ 2238.611923] AMD-Vi: IOMMU Event log restarting
myles@myles-MC62-G40-00:~/cuda-samples$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-6.8.0-48-generic root=UUID=67e829f9-2373-4f35-9e25-73171b053f04 ro quiet splash vt.handoff=7
myles@myles-MC62-G40-00:~/cuda-samples$ ls /sys/kernel/iommu_groups
0 11 14 17 2 22 25 28 30 33 36 39 41 44 47 5 52 55 58 60 63 66 69 71 8
1 12 15 18 20 23 26 29 31 34 37 4 42 45 48 50 53 56 59 61 64 67 7 72 9
10 13 16 19 21 24 27 3 32 35 38 40 43 46 49 51 54 57 6 62 65 68 70 73
git clone https://github.com/NVIDIA/nccl.git |
As you can see, the test above fails because IOMMU was enabled. So I rebooted, disabled IOMMU in the BIOS, and ran the test again. Checking GPU(s) for support of peer to peer memory access...
|
@ewhacc @henriklied I reproduced your errors above and then fixed them by following above procedure. |
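A quick way to tell whether the IOMMU is still active after changing the BIOS setting is to count the entries in /sys/kernel/iommu_groups, the directory listed above. This is a rough sketch under the assumption that a populated directory means the IOMMU is active; `iommu_group_count` and `warn_if_iommu` are hypothetical helper names.

```shell
#!/bin/sh
# iommu_group_count [DIR]: print how many IOMMU groups exist.
# DIR defaults to the sysfs path listed above; it is a parameter only so the
# function can be tried against another directory.
iommu_group_count() {
    dir=${1:-/sys/kernel/iommu_groups}
    if [ -d "$dir" ]; then
        ls -1 "$dir" | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# warn_if_iommu [DIR]: report whether any groups exist.
warn_if_iommu() {
    n=$(iommu_group_count "$1")
    if [ "$n" -gt 0 ]; then
        echo "IOMMU active ($n groups)"
    else
        echo "IOMMU inactive"
    fi
}
```

Running `warn_if_iommu` with no argument before simpleP2P gives an early hint of the failure mode shown above, where peer access is reported as available but verification returns nan.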
git clone https://github.com/NVIDIA/nccl-tests.git
NCCL_P2P_LEVEL=SYS NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 6742 on myles-MC62-G40-00 device 0 [0x02] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 6742 on myles-MC62-G40-00 device 1 [0x41] NVIDIA GeForce RTX 4090
# Rank 2 Group 0 Pid 6742 on myles-MC62-G40-00 device 2 [0x42] NVIDIA GeForce RTX 4090
# Rank 3 Group 0 Pid 6742 on myles-MC62-G40-00 device 3 [0x61] NVIDIA GeForce RTX 4090
myles-MC62-G40-00:6742:6742 [0] NCCL INFO Bootstrap : Using enp100s0:192.168.1.80<0>
myles-MC62-G40-00:6742:6742 [0] NCCL INFO cudaDriverVersion 12060
myles-MC62-G40-00:6742:6742 [0] NCCL INFO NCCL version 2.23.4+cuda12.6
myles-MC62-G40-00:6742:6766 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
myles-MC62-G40-00:6742:6766 [0] NCCL INFO NET/IB : No device found.
myles-MC62-G40-00:6742:6766 [0] NCCL INFO NET/Socket : Using [0]enp100s0:192.168.1.80<0>
myles-MC62-G40-00:6742:6766 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Using network Socket
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Using network Socket
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Using network Socket
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Using network Socket
myles-MC62-G40-00:6742:6768 [2] NCCL INFO ncclCommInitAll comm 0x5b9898450a10 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 42000 commId 0x3b8ea5f57b417d28 - Init START
myles-MC62-G40-00:6742:6767 [1] NCCL INFO ncclCommInitAll comm 0x5b989840f4b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0x3b8ea5f57b417d28 - Init START
myles-MC62-G40-00:6742:6766 [0] NCCL INFO ncclCommInitAll comm 0x5b98983cdff0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 2000 commId 0x3b8ea5f57b417d28 - Init START
myles-MC62-G40-00:6742:6769 [3] NCCL INFO ncclCommInitAll comm 0x5b9898491f70 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 61000 commId 0x3b8ea5f57b417d28 - Init START
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Bootstrap timings total 0.001024 (create 0.000054, send 0.000170, recv 0.000572, ring 0.000102, delay 0.000000)
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Bootstrap timings total 0.001077 (create 0.000059, send 0.000183, recv 0.000488, ring 0.000135, delay 0.000000)
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Bootstrap timings total 0.001053 (create 0.000038, send 0.000133, recv 0.000436, ring 0.000116, delay 0.000000)
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Bootstrap timings total 0.001120 (create 0.000064, send 0.000199, recv 0.000582, ring 0.000085, delay 0.000001)
myles-MC62-G40-00:6742:6769 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to SYS
myles-MC62-G40-00:6742:6769 [3] NCCL INFO NVLS multicast support is not available on dev 3
myles-MC62-G40-00:6742:6766 [0] NCCL INFO NVLS multicast support is not available on dev 0
myles-MC62-G40-00:6742:6767 [1] NCCL INFO NVLS multicast support is not available on dev 1
myles-MC62-G40-00:6742:6768 [2] NCCL INFO NVLS multicast support is not available on dev 2
myles-MC62-G40-00:6742:6766 [0] NCCL INFO comm 0x5b98983cdff0 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
myles-MC62-G40-00:6742:6769 [3] NCCL INFO comm 0x5b9898491f70 rank 3 nRanks 4 nNodes 1 localRanks 4 localRank 3 MNNVL 0
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Channel 00/02 : 0 1 2 3
myles-MC62-G40-00:6742:6767 [1] NCCL INFO comm 0x5b989840f4b0 rank 1 nRanks 4 nNodes 1 localRanks 4 localRank 1 MNNVL 0
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Channel 01/02 : 0 1 2 3
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
myles-MC62-G40-00:6742:6766 [0] NCCL INFO P2P Chunksize set to 131072
myles-MC62-G40-00:6742:6768 [2] NCCL INFO comm 0x5b9898450a10 rank 2 nRanks 4 nNodes 1 localRanks 4 localRank 2 MNNVL 0
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
myles-MC62-G40-00:6742:6768 [2] NCCL INFO P2P Chunksize set to 131072
myles-MC62-G40-00:6742:6769 [3] NCCL INFO P2P Chunksize set to 131072
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
myles-MC62-G40-00:6742:6767 [1] NCCL INFO P2P Chunksize set to 131072
myles-MC62-G40-00:6742:6770 [0] NCCL INFO [Proxy Service] Device 0 CPU core 32
myles-MC62-G40-00:6742:6777 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 111
myles-MC62-G40-00:6742:6775 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 89
myles-MC62-G40-00:6742:6774 [1] NCCL INFO [Proxy Service] Device 1 CPU core 55
myles-MC62-G40-00:6742:6771 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 107
myles-MC62-G40-00:6742:6773 [2] NCCL INFO [Proxy Service] Device 2 CPU core 16
myles-MC62-G40-00:6742:6772 [3] NCCL INFO [Proxy Service] Device 3 CPU core 65
myles-MC62-G40-00:6742:6776 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 79
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/direct pointer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/direct pointer
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Connected all rings
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Connected all rings
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Connected all rings
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Connected all rings
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Connected all trees
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Connected all trees
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Connected all trees
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Connected all trees
myles-MC62-G40-00:6742:6778 [1] NCCL INFO [Proxy Progress] Device 1 CPU core 116
myles-MC62-G40-00:6742:6780 [2] NCCL INFO [Proxy Progress] Device 2 CPU core 16
myles-MC62-G40-00:6742:6781 [3] NCCL INFO [Proxy Progress] Device 3 CPU core 65
myles-MC62-G40-00:6742:6779 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 35
myles-MC62-G40-00:6742:6767 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-MC62-G40-00:6742:6767 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-MC62-G40-00:6742:6766 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
myles-MC62-G40-00:6742:6769 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-MC62-G40-00:6742:6769 [3] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
myles-MC62-G40-00:6742:6768 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
myles-MC62-G40-00:6742:6768 [2] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
myles-MC62-G40-00:6742:6766 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
myles-MC62-G40-00:6742:6769 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
myles-MC62-G40-00:6742:6769 [3] NCCL INFO ncclCommInitAll comm 0x5b9898491f70 rank 3 nranks 4 cudaDev 3 nvmlDev 3 busId 61000 commId 0x3b8ea5f57b417d28 - Init COMPLETE
myles-MC62-G40-00:6742:6769 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 4 total 0.40 (kernels 0.32, alloc 0.03, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
myles-MC62-G40-00:6742:6767 [1] NCCL INFO ncclCommInitAll comm 0x5b989840f4b0 rank 1 nranks 4 cudaDev 1 nvmlDev 1 busId 41000 commId 0x3b8ea5f57b417d28 - Init COMPLETE
myles-MC62-G40-00:6742:6766 [0] NCCL INFO ncclCommInitAll comm 0x5b98983cdff0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 2000 commId 0x3b8ea5f57b417d28 - Init COMPLETE
myles-MC62-G40-00:6742:6766 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 4 total 0.40 (kernels 0.31, alloc 0.04, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
myles-MC62-G40-00:6742:6767 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 4 total 0.40 (kernels 0.31, alloc 0.03, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.02, rest 0.00)
myles-MC62-G40-00:6742:6768 [2] NCCL INFO ncclCommInitAll comm 0x5b9898450a10 rank 2 nranks 4 cudaDev 2 nvmlDev 2 busId 42000 commId 0x3b8ea5f57b417d28 - Init COMPLETE
myles-MC62-G40-00:6742:6768 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 4 total 0.40 (kernels 0.31, alloc 0.03, bootstrap 0.00, allgathers 0.00, topo 0.02, graphs 0.00, connections 0.03, rest 0.00)
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 16.71 0.00 0.00 0 15.98 0.00 0.00 0
16 4 float sum -1 16.60 0.00 0.00 0 16.31 0.00 0.00 0
32 8 float sum -1 16.13 0.00 0.00 0 16.44 0.00 0.00 0
64 16 float sum -1 16.21 0.00 0.01 0 16.28 0.00 0.01 0
128 32 float sum -1 16.52 0.01 0.01 0 17.47 0.01 0.01 0
256 64 float sum -1 17.69 0.01 0.02 0 17.78 0.01 0.02 0
512 128 float sum -1 17.77 0.03 0.04 0 17.69 0.03 0.04 0
1024 256 float sum -1 17.42 0.06 0.09 0 17.47 0.06 0.09 0
2048 512 float sum -1 27.93 0.07 0.11 0 17.66 0.12 0.17 0
4096 1024 float sum -1 18.02 0.23 0.34 0 17.87 0.23 0.34 0
8192 2048 float sum -1 18.00 0.46 0.68 0 18.00 0.46 0.68 0
16384 4096 float sum -1 17.70 0.93 1.39 0 17.73 0.92 1.39 0
32768 8192 float sum -1 18.78 1.74 2.62 0 18.95 1.73 2.59 0
65536 16384 float sum -1 25.80 2.54 3.81 0 25.91 2.53 3.79 0
131072 32768 float sum -1 42.44 3.09 4.63 0 42.26 3.10 4.65 0
262144 65536 float sum -1 66.25 3.96 5.94 0 66.16 3.96 5.94 0
524288 131072 float sum -1 94.11 5.57 8.36 0 94.16 5.57 8.35 0
1048576 262144 float sum -1 152.7 6.87 10.30 0 157.6 6.65 9.98 0
2097152 524288 float sum -1 271.8 7.72 11.58 0 275.3 7.62 11.43 0
4194304 1048576 float sum -1 510.6 8.21 12.32 0 510.9 8.21 12.31 0
8388608 2097152 float sum -1 1000.9 8.38 12.57 0 995.7 8.42 12.64 0
16777216 4194304 float sum -1 1993.2 8.42 12.63 0 1979.1 8.48 12.72 0
33554432 8388608 float sum -1 3966.7 8.46 12.69 0 3942.4 8.51 12.77 0
67108864 16777216 float sum -1 7888.2 8.51 12.76 0 7865.1 8.53 12.80 0
134217728 33554432 float sum -1 15705 8.55 12.82 0 15684 8.56 12.84 0
myles-MC62-G40-00:6742:6742 [0] NCCL INFO comm 0x5b98983cdff0 rank 0 nranks 4 cudaDev 0 busId 2000 - Destroy COMPLETE
myles-MC62-G40-00:6742:6742 [3] NCCL INFO comm 0x5b9898491f70 rank 3 nranks 4 cudaDev 3 busId 61000 - Destroy COMPLETE
myles-MC62-G40-00:6742:6742 [2] NCCL INFO comm 0x5b9898450a10 rank 2 nranks 4 cudaDev 2 busId 42000 - Destroy COMPLETE
myles-MC62-G40-00:6742:6742 [1] NCCL INFO comm 0x5b989840f4b0 rank 1 nranks 4 cudaDev 1 busId 41000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth : 5.02577
# |
./build/all_reduce_perf -g 2
# nThread 1 nGpus 2 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 8411 on myles-MC62-G40-00 device 0 [0x41] NVIDIA GeForce RTX 4090
# Rank 1 Group 0 Pid 8411 on myles-MC62-G40-00 device 1 [0x42] NVIDIA GeForce RTX 4090
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
33554432 8388608 float sum -1 1390.4 24.13 24.13 0 1371.8 24.46 24.46 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 24.2967
#
myles@myles-MC62-G40-00:~/nccl-tests$
|
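For anyone reading the all_reduce_perf tables above, the algbw and busbw columns can be reproduced by hand. This is a minimal sketch, assuming the usual nccl-tests definitions: algbw is message size divided by time, and all-reduce busbw is algbw scaled by 2(n-1)/n for n ranks. The function names are made up for illustration and are not part of nccl-tests.

```python
def algbw_gbs(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    # bytes per microsecond equals MB/s * 1e0, i.e. 1e6 B/s; divide by 1e3 for GB/s
    return size_bytes / time_us / 1e3


def allreduce_busbw_gbs(algbw: float, n_ranks: int) -> float:
    """Bus bandwidth for all-reduce: algbw scaled by 2*(n-1)/n (assumed definition)."""
    return algbw * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Last row of the 4-GPU run: 134217728 bytes in 15705 us
    a4 = algbw_gbs(134217728, 15705)
    print(round(a4, 2), round(allreduce_busbw_gbs(a4, 4), 2))  # 8.55 12.82

    # The 2-GPU run: 33554432 bytes in 1390.4 us
    a2 = algbw_gbs(33554432, 1390.4)
    print(round(a2, 2), round(allreduce_busbw_gbs(a2, 2), 2))  # 24.13 24.13
```

With two ranks the scale factor is 1, which is why algbw and busbw are identical in the `-g 2` run, while with four ranks busbw is 1.5x algbw.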
./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 42, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 912.14 11.56
1 11.41 903.18
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 914.28 26.33
1 26.34 933.39
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 919.18 11.42
1 11.44 915.08
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 919.34 51.92
1 52.01 913.00
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 1.43 10.43
1 10.50 1.38
CPU 0 1
0 2.37 6.78
1 6.72 2.30
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 1.44 0.96
1 0.98 1.39
CPU 0 1
0 2.31 1.99
1 2.05 2.27
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
myles@myles-MC62-G40-00:~/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest$ |
@mylesgoose Omg, thanks a ton! My setup and methods are about the same, except I'm trying Ubuntu 24.04 now. |
@mylesgoose Yeah, iommu is on. That's why I succeeded with 550 before, but now fail with both 550 & 560. $ sudo dmesg | grep -e DMAR -e IOMMU |
@ewhacc IOMMU should be off. How are you installing it: do you download the source and compile it? It also works with Ubuntu 24.04 and 24.10, but those versions have switched from X11 and gdm3 to Wayland, and I think Wayland does some security checks on the GPU memory, so it does not render windows correctly, if at all. If you're using Ubuntu Server you won't notice, but on desktop you need to install sddm as the display manager and use one of the other desktop environments (I think it's called proton3 or something), an X11 desktop environment. Anything from GNOME does not work correctly on Wayland. I don't know why. |
The output from your sudo dmesg | grep -e DMAR -e IOMMU command confirms that IOMMU is enabled in your BIOS.
|
@mylesgoose I have disabled iommu in grub. I will disable iommu in bios too tomorrow. Everything works perfectly! p2pBandwidthLatencyTest, simpleP2P, nvbandwidth, nccl-tests. Yes, I downloaded the source and compiled it.
|
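A quick way to confirm that a grub-level IOMMU change took effect is to look at the running kernel's command line. This is only a sketch: the thread does not say which parameter was used, so the code assumes one of the standard kernel parameters `iommu=off`, `intel_iommu=off`, or `amd_iommu=off`, and the function name is hypothetical. It does not check the BIOS setting.

```python
from pathlib import Path

# Kernel parameters that switch the IOMMU off (assumption: one of these was set in grub)
IOMMU_OFF_FLAGS = {"iommu=off", "intel_iommu=off", "amd_iommu=off"}


def iommu_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if the kernel command line contains an IOMMU-off parameter."""
    return any(token in IOMMU_OFF_FLAGS for token in cmdline.split())


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # only present on Linux
    if proc_cmdline.exists():
        print(iommu_disabled_on_cmdline(proc_cmdline.read_text()))
```

If this prints False after editing grub, the usual causes are forgetting to regenerate the grub config or not rebooting; the `dmesg` grep quoted earlier remains the check for whether the IOMMU actually initialised.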
I have already done it. Thank you for the explanation. |
@ewhacc mate, so glad you got it working. Well done. It's also a good idea to keep the original modules from apt, because you can make a script that simply replaces the p2p modules with the apt ones and reboots. Then things that use secure RAM, like Steam games or Wayland, can function as normal. You end up with one script that copies files from the backup to the original location and one script that installs the p2p ones when needed for machine learning. Can you edit your last message to put |
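The two-script idea above could look roughly like this. It is a sketch, not the script used in this thread: the function names and the `nvidia*.ko*` pattern are assumptions, the directories are passed in by the caller because the real locations depend on the kernel and driver package, and it only swaps files whose names already match (a stock module shipped compressed would not match an uncompressed build). Unloading and reloading the modules, or rebooting, is left out.

```python
import shutil
from pathlib import Path


def backup_modules(live_dir: Path, backup_dir: Path, pattern: str = "nvidia*.ko*") -> list[str]:
    """Copy the currently installed module files into backup_dir."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for src in sorted(live_dir.glob(pattern)):
        shutil.copy2(src, backup_dir / src.name)  # copy2 keeps timestamps
        saved.append(src.name)
    return saved


def swap_modules(source_dir: Path, live_dir: Path, pattern: str = "nvidia*.ko*") -> list[str]:
    """Overwrite same-named module files in live_dir with the ones from source_dir."""
    copied = []
    for src in sorted(source_dir.glob(pattern)):
        dst = live_dir / src.name
        if not dst.exists():
            continue  # only replace files that already exist under the same name
        shutil.copy2(src, dst)
        copied.append(src.name)
    return copied
```

Calling `backup_modules(live, backup)` once, then `swap_modules(p2p_build, live)` to switch to the p2p build and `swap_modules(backup, live)` to switch back, mirrors the two scripts described. Because `copy2` preserves modification times, the timestamps in the live directory show which build is currently installed.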
@mylesgoose I didn't keep the original modules. This is only for LLM training. I don't have X either. :) I noticed a small bandwidth drop after disabling iommu in grub. I will check again after disabling iommu in bios too. 52.43 -> 50.88 (GB/s) |
Added support for the 560.35.03 Nvidia driver