One-direction bandwidth test fails with GPUDirect #289
Comments
Seems like an issue we encountered.
Thanks, I have noted this post and tried to find the corresponding setting in my BIOS (Z690 mainboard). I found one 4GB MMIO setting. By default it is linked to Resize BAR, and I can only disable it if I also disable Resize BAR. I tried that, but it did not help: the direction with the issue still does not work, while the other direction does. I hope someone else can share their solution or give some insights. Thanks anyway.
That kind of problem is usually related to PCIe ACS P2P forwarding/redirection being enabled in the PCIe switch downstream ports.
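To check the ACS suggestion above, one option is to look at the `ACSCtl` lines in `lspci -vvv` output. The helper below is only a sketch: the function name `acs_enabled_devices` is made up for illustration, and it assumes the usual `lspci` layout where each device header starts at column 0 and the capability lines are indented.

```shell
# Sketch: list PCI devices whose ACSCtl line shows forwarding/redirect bits set.
# Assumption: input is `lspci -vvv` text, with device headers at column 0
# and indented "ACSCtl:" capability lines below each header.
acs_enabled_devices() {
  awk '
    /^[0-9a-fA-F]/ { dev = $1 }   # remember the address of the current device
    /ACSCtl:/ && /(SrcValid|ReqRedir|CmpltRedir|UpstreamFwd)[+]/ { print dev }
  '
}

# Typical use on a real machine (needs root to read the capabilities):
#   sudo lspci -vvv | acs_enabled_devices
```

Any bridge or switch port printed by this helper still has ACS bits enabled. NVIDIA's NCCL troubleshooting notes describe clearing them with `setpci -s <BDF> ECAP_ACS+0x6.w=0000`; treat that as something to verify on your own platform, since the BIOS may re-enable ACS at boot.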
Thanks, I think I have disabled all of the related functions in the BIOS, but I still cannot get both directions to work, so I gave up on my tiny-nodes plan and installed one node with 4 GPUs instead. Thanks for helping me out.
Hello,
I am testing 2 P100 GPUs in 2 nodes with 2 cx555 NICs.
The test succeeds in one direction only and fails in the other.
Success
./ib_write_bw --use_cuda=0 -a 10.10.10.11
./ib_write_bw -d mlx5_0 --use_cuda=0 -a
Fail
./ib_write_bw --use_cuda=0 -a
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to exchange data between server and clients
./ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.10.10.10
Completion with error at client
Failed status 4: wr_id 0 syndrom 0x51
scnt=128, ccnt=0
Failed to complete run_iter_bw function successfully
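As a starting point for reading the failure above, the numeric status can be mapped to a name. This is a minimal sketch that assumes the number after `Failed status` is an `ibv_wc_status` value from libibverbs; the helper names `extract_wc_status` and `wc_status_name` are hypothetical, and only the first part of the enum is listed.

```shell
# Sketch: decode the numeric completion status from a perftest failure line.
# Assumption: the number after "Failed status" is an ibv_wc_status value
# as defined in <infiniband/verbs.h>.

# Pull the status number out of a log line read from stdin.
extract_wc_status() {
  sed -n 's/.*Failed status \([0-9][0-9]*\):.*/\1/p'
}

# Map an ibv_wc_status number to its enum name (first 14 values only).
wc_status_name() {
  case "$1" in
    0)  echo IBV_WC_SUCCESS ;;
    1)  echo IBV_WC_LOC_LEN_ERR ;;
    2)  echo IBV_WC_LOC_QP_OP_ERR ;;
    3)  echo IBV_WC_LOC_EEC_OP_ERR ;;
    4)  echo IBV_WC_LOC_PROT_ERR ;;
    5)  echo IBV_WC_WR_FLUSH_ERR ;;
    6)  echo IBV_WC_MW_BIND_ERR ;;
    7)  echo IBV_WC_BAD_RESP_ERR ;;
    8)  echo IBV_WC_LOC_ACCESS_ERR ;;
    9)  echo IBV_WC_REM_INV_REQ_ERR ;;
    10) echo IBV_WC_REM_ACCESS_ERR ;;
    11) echo IBV_WC_REM_OP_ERR ;;
    12) echo IBV_WC_RETRY_EXC_ERR ;;
    13) echo IBV_WC_RNR_RETRY_EXC_ERR ;;
    *)  echo "unknown ($1)" ;;
  esac
}

line=' Failed status 4: wr_id 0 syndrom 0x51'
code=$(printf '%s\n' "$line" | extract_wc_status)
echo "status $code = $(wc_status_name "$code")"
```

Under that assumption, status 4 is a local protection error reported by the side that prints it, which suggests the adapter on that node could not access the registered buffer. With `--use_cuda` that buffer is GPU memory, so it may be worth confirming on the failing node that the peer-memory module is loaded, for example with `lsmod | grep -e nvidia_peermem -e nv_peer_mem`.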
Bandwidth tests between the two cx555 NICs themselves work well.
Driver and Kernel:
Both cx555 NICs use the same driver and firmware.
Both P100s use the same driver but have different VBIOS versions.
I am not using the NVIDIA open-source kernel module, since the P100 is not supported by it, but I do not think the kernel module is the problem; otherwise, why would one direction still work?
For IOMMU
10.10.10.11
sudo dmesg | grep -i dmar
[ 0.173076] DMAR: IOMMU disabled
sudo dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
[ 0.173010] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
[ 0.173076] DMAR: IOMMU disabled
[ 2.245922] iommu: Default domain type: Translated
[ 2.245922] iommu: DMA domain TLB invalidation policy: lazy mode
10.10.10.10
sudo dmesg | grep -i dmar
No output
sudo dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
[ 0.030879] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
[ 1.861879] iommu: Default domain type: Translated
[ 1.861879] iommu: DMA domain TLB invalidation policy: lazy mode
I have turned the IOMMU off on the kernel command line of both nodes, but the outputs are different.
What could be the cause of this issue, and how can I dig deeper to find the cause and a solution?
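One likely reason the two outputs differ: `DMAR` is the ACPI table name used by Intel VT-d, so a node booted with `amd_iommu=off` would not be expected to print any `dmar` lines (AMD's IOMMU messages are normally tagged `AMD-Vi`). A vendor-neutral way to compare the nodes is to check whether the kernel created any IOMMU groups. The sketch below assumes the standard sysfs path `/sys/kernel/iommu_groups`; the function name `iommu_active` is made up for illustration.

```shell
# Sketch: report whether the kernel has any IOMMU groups, which works the
# same way on Intel and AMD hosts.
# Assumption: sysfs is mounted and groups appear under /sys/kernel/iommu_groups.
iommu_active() {
  iommu_dir="${1:-/sys/kernel/iommu_groups}"
  # Active means the directory exists and contains at least one group.
  [ -d "$iommu_dir" ] && [ -n "$(ls -A "$iommu_dir" 2>/dev/null)" ]
}

if iommu_active; then
  echo "IOMMU groups present: an IOMMU is active"
else
  echo "no IOMMU groups: IOMMU appears disabled"
fi
```

Running this on both nodes gives a like-for-like answer regardless of which vendor-specific messages show up in `dmesg`.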
Thanks