UCX ignores exclusively setting TCP devices when RoCE is available. #10049
Comments
Hi @bertiethorpe Therefore we suggest:
@bertiethorpe can you pls run with UCX_NET_DEVICES=eth0 and also add
Some more information:
Are these expected? I would expect the mlx device to be paired with eth1, because they're on the same NIC.
Can you pls configure OpenMPI with
So that seems to have done the trick. Now getting the latency I expected.
Where can you see this in the logs? Forgive my ignorance, but I can't actually see that the btl openib component is available at all. Was it removed in v4.1.x?
This is all I see. |
Sorry, I meant the btl/tcp component, not btl/openib.
Describe the bug
Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.
I'm running a 2-node IMB_MPI PingPong to benchmark RoCE against regular TCP Ethernet.
Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected.
Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, at only a slightly longer latency.
As per the docs, when setting UCX_NET_DEVICES to one of the TCP devices I should expect TCP-like latencies of ~15us, but I am seeing closer to RoCE performance, with latencies of ~2.1us.
Stranger still, the latency when specifically targeting mlx5_0:1 or all is different (lower, ~1.6us), so it looks like the fallback is not all when setting to eth0 etc.
Is this behaviour determined somewhere else or accounted for in some way?
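To make the comparison above concrete, here is a small sketch of the arithmetic using only the three latencies quoted in this report. The function name closest_profile and the log-scale distance are illustrative assumptions, not anything UCX or IMB provides.

```python
import math

# Latencies quoted in this report, in microseconds.
EXPECTED_TCP_US = 15.0   # what the reporter expects from plain TCP
OBSERVED_RDMA_US = 1.6   # UCX_NET_DEVICES=all or mlx5_0:1
OBSERVED_ETH_US = 2.1    # UCX_NET_DEVICES=eth0 or eth1

def closest_profile(latency_us, profiles):
    """Return the profile name whose latency is nearest on a log scale.

    Hypothetical helper for illustration only.
    """
    return min(
        profiles,
        key=lambda name: abs(math.log(latency_us / profiles[name])),
    )

profiles = {"rdma": OBSERVED_RDMA_US, "tcp": EXPECTED_TCP_US}

# Which reference latency does the eth0/eth1 run resemble?
print(closest_profile(OBSERVED_ETH_US, profiles))
# Slowdown of the eth0/eth1 run relative to the RDMA run.
print(round(OBSERVED_ETH_US / OBSERVED_RDMA_US, 2))
# How far the eth0/eth1 run is from the expected TCP latency.
print(round(EXPECTED_TCP_US / OBSERVED_ETH_US, 1))
```

The eth0/eth1 result is roughly 1.3x the RDMA latency but about 7x faster than the expected TCP figure, which is why it looks like an RDMA-capable path is still in use.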
Steps to Reproduce
Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --without-mad --without-ze
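The exact launch command is not given in this report. As a rough sketch, the runs being compared could be generated like this, assuming an Open MPI mpirun (which forwards environment variables with -x) and the IMB-MPI1 binary from the Intel MPI Benchmarks. The helper name pingpong_command is hypothetical, and host selection for the two nodes is left out because it is not stated.

```python
import shlex

def pingpong_command(net_devices, np=2, benchmark="IMB-MPI1"):
    """Build an mpirun command line for one UCX_NET_DEVICES setting.

    Assumes Open MPI's mpirun; hostfile or host list options are omitted
    because the report does not say how the two nodes were selected.
    """
    return [
        "mpirun", "-np", str(np),
        "-x", f"UCX_NET_DEVICES={net_devices}",
        benchmark, "PingPong",
    ]

# Print (do not run) one command per device setting mentioned in the report.
for dev in ["all", "mlx5_0:1", "eth0", "eth1"]:
    print(shlex.join(pingpong_command(dev)))
```

Printing the commands rather than executing them keeps the sketch runnable on a machine without MPI installed; the lines can be pasted into a shell on the cluster.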
Setup and versions
ibstat or ibv_devinfo -vv command
Additional information (depending on the issue)
Logs:
eth0.txt
mlxlog.txt