v4.1.5 UCX_NET_DEVICES not selecting TCP devices correctly #12785
Comments
@bertiethorpe I can't reproduce the described behavior with ompi and ucx built from sources (see below). What am I missing?
@bertiethorpe can you pls increase the verbosity of OpenMPI, by adding |
Details of the problem
Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.
I'm running a 2-node IMB_MPI PingPong to benchmark RoCE against regular TCP ethernet.
Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected.
Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, at only a slightly longer latency.
HW information from the ibstat or ibv_devinfo -vv command:
How ompi is configured, from ompi_info | grep Configure:
Following the advice from here, this is apparently due to a higher priority of OpenMPI's btl/openib component, but I don't think that can be the cause if OpenMPI is configured with --without-verbs and openib is not listed when searching ompi_info | grep btl.
As suggested in the UCX issue, adding -mca pml_ucx_tls any -mca pml_ucx_devices any to my mpirun has fixed this problem, but I was wondering what in the MCA precisely causes this behaviour.
Here's my batch script:
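For illustration only (this is not the batch script from the report), the workaround described above can be sketched as a small dry-run helper that prints the mpirun command line instead of running it. The function name build_cmd, the process count, the --map-by node placement, and the ./IMB-MPI1 binary path are assumptions; only UCX_NET_DEVICES and the two pml_ucx_* MCA parameters come from the report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) an mpirun command line that
# asks UCX to use the given network device. The binary path, process count,
# and mapping policy are placeholders, not values from the report.
build_cmd() {
    local netdev="$1"   # e.g. eth0 for TCP, mlx5_0:1 for RoCE
    printf '%s\n' "mpirun -np 2 --map-by node -x UCX_NET_DEVICES=${netdev} -mca pml ucx -mca pml_ucx_tls any -mca pml_ucx_devices any ./IMB-MPI1 PingPong"
}

# Dry run: show the command for the TCP case.
build_cmd eth0
```

Running ompi_info --param pml ucx --level 9 on the affected build should list the values of pml_ucx_tls and pml_ucx_devices in effect, which may help narrow down why the device selection was being overridden.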