
UCX ignores exclusively setting TCP devices when RoCE is available. #10049

Open

bertiethorpe opened this issue Aug 6, 2024 · 8 comments

@bertiethorpe
Describe the bug

Setting UCX_NET_DEVICES to target only TCP devices when RoCE is available seems to be ignored in favour of some fallback.

I'm running a 2-node IMB-MPI1 PingPong to benchmark RoCE against regular TCP Ethernet.

Setting UCX_NET_DEVICES=all or mlx5_0:1 gives the optimal performance and uses RDMA as expected.
Setting UCX_NET_DEVICES=eth0, eth1, or anything else still appears to use RoCE, at only slightly higher latency.

As per the docs, when setting UCX_NET_DEVICES to one of the TCP devices I should expect TCP-like latencies of ~15us, but instead I'm seeing something much closer to RoCE performance, with latencies of ~2.1us.

Stranger still, the latency when specifically targeting mlx5_0:1 or all is different (lower, ~1.6us), so it looks like the fallback is not all when setting eth0 etc.

Is this behaviour determined somewhere else or accounted for in some way?

Steps to Reproduce

  • Batch Script:
#!/usr/bin/env bash

#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.out
#SBATCH --exclusive
#SBATCH --partition=standard

module load gnu12 openmpi4 imb

export UCX_NET_DEVICES=mlx5_0:1

echo SLURM_JOB_NODELIST: $SLURM_JOB_NODELIST
echo SLURM_JOB_ID: $SLURM_JOB_ID
echo UCX_NET_DEVICES: $UCX_NET_DEVICES

export UCX_LOG_LEVEL=data
#srun --mpi=pmi2 IMB-MPI1 pingpong # doesn't work in ohpc v2.1
mpirun IMB-MPI1 pingpong -iter_policy off
  • UCX version 1.17.0
  • Git branch '', revision 7bb2722
    Configured with: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-optimizations --disable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-go --without-java --enable-cma --without-cuda --without-gdrcopy --with-verbs --with-knem --with-rdmacm --without-rocm --with-xpmem --without-fuse3 --without-ugni --without-mad --without-ze
  • Any UCX environment variables used
    • See logs

Setup and versions

  • OS version (e.g Linux distro)
    • Rocky Linux release 9.4 (Blue Onyx)
  • Driver version:
    • rdma-core-2404mlnx51-1.2404066.x86_64
    • MLNX_OFED_LINUX-24.04-0.6.6.0
  • HW information from ibstat or ibv_devinfo -vv command
    •  transport:                      InfiniBand (0)
       fw_ver:                         20.36.1010
       node_guid:                      fa16:3eff:fe4f:f5e9
       sys_image_guid:                 0c42:a103:0003:5d82
       vendor_id:                      0x02c9
       vendor_part_id:                 4124
       hw_ver:                         0x0
       board_id:                       MT_0000000224
       phys_port_cnt:                  1
               port:   1
                       state:                  PORT_ACTIVE (4)
                       max_mtu:                4096 (5)
                       active_mtu:             1024 (3)
                       sm_lid:                 0
                       port_lid:               0
                       port_lmc:               0x00
                       link_layer:             Ethernet
      
      

Additional information (depending on the issue)

  • OpenMPI version 4.1.5

Logs: eth0.txt and mlxlog.txt (attached)

@gleon99
Contributor

gleon99 commented Aug 6, 2024

Hi @bertiethorpe
In the attached eth0.txt log file there is no evidence of UCX connection establishment, and the environment variable UCX_NET_DEVICES is not propagated to the config parser, unlike in the mlxlog.txt file.

Therefore we suggest:

  1. Please double-check the command line for both cases and ensure UCX is used.
  2. Run ucx_info -e -u t -P inter with various UCX_NET_DEVICES and check whether the devices used are the ones you expect (see the sketch after this list).
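
For example, a minimal way to compare lane selection across devices is a loop like the following (illustrative only; the device names are the ones discussed in this issue):

for dev in all mlx5_0:1 eth0 eth1; do
    echo "== UCX_NET_DEVICES=$dev =="
    # print only the selected transport lanes for each device setting
    UCX_NET_DEVICES=$dev ucx_info -e -u t -P inter | grep 'lane\['
done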

@yosefe
Contributor

yosefe commented Aug 6, 2024

@bertiethorpe can you pls run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?
Also, what were the configure flags for OpenMPI?
It seems OpenMPI is not using the UCX component when UCX_NET_DEVICES=eth0, due to the higher priority of OpenMPI's btl/openib component, which also uses RDMA.
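
A sketch of that invocation, assuming the same IMB-MPI1 run as in the batch script above:

export UCX_NET_DEVICES=eth0
# force the UCX PML and enable verbose PML selection/initialization output
mpirun -mca pml ucx -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 \
       IMB-MPI1 pingpong -iter_policy off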

@bertiethorpe
Author

Some more information:

  • This is all virtualised

Run ucx_info -e -u t -P inter with various UCX_NET_DEVICES and check whether the used devices are the ones you expect.

ucx_info -e -u t -P inter
#
# UCP endpoint 
#
#               peer: <no debug data>
#                 lane[0]:  8:rc_mlx5/mlx5_0:1.0 md[4]      -> md[4]/ib/sysdev[255] rma_bw#0 am am_bw#0
#                 lane[1]:  3:tcp/eth1.0 md[1]              -> md[1]/tcp/sysdev[255] rma_bw#1 wireup
#
#                tag_send: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..227..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..227..<egr/bcopy>..263060..<rndv>..(inf)
#
#                  rma_bw: mds [1] [4] #
#                     rma: mds rndv_rkey_size 19
#
UCX_NET_DEVICES=eth0 ucx_info -e -u t -P inter
#
# UCP endpoint 
#
#               peer: <no debug data>
#                 lane[0]:  1:tcp/eth0.0 md[1]              -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
#                tag_send: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..20424..<rndv>..(inf)
#
#                  rma_bw: mds [1] #
#                     rma: mds rndv_rkey_size 10
#
UCX_NET_DEVICES=eth1 ucx_info -e -u t -P inter
#
# UCP endpoint 
#
#               peer: <no debug data>
#                 lane[0]:  1:tcp/eth1.0 md[1]              -> md[1]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
#
#                tag_send: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
#            tag_send_nbr: 0..<egr/short>..8185..<egr/bcopy>..262144..<rndv>..(inf)
#           tag_send_sync: 0..<egr/short>..8185..<egr/zcopy>..19505..<rndv>..(inf)
#
#                  rma_bw: mds [1] #
#                     rma: mds rndv_rkey_size 10
#

Are these expected? I would expect the mlx device to go with eth1, because they're on the same NIC.

@bertiethorpe
Author

bertiethorpe commented Aug 6, 2024

can you pls run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?

ucxlog.txt

@yosefe
Contributor

yosefe commented Aug 6, 2024

can you pls run with UCX_NET_DEVICES=eth0 and also add -mca pml_base_verbose 99 -mca pml_ucx_verbose 99 -mca pml ucx to mpirun?

ucxlog.txt

Can you pls configure OpenMPI with --with-platform=contrib/platform/mellanox/optimized?
It will force UCX to be used with TCP transports as well.
Alternatively, you can add -mca pml_ucx_tls any -mca pml_ucx_devices any to mpirun.
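
A rough sketch of both options; the remaining configure flags from the original build are elided here, and -mca pml ucx is carried over from the earlier suggestion:

# Option 1: rebuild OpenMPI with the Mellanox platform file, so that UCX
# is selected also for TCP-only transports
./configure --with-platform=contrib/platform/mellanox/optimized ...

# Option 2: no rebuild; force the UCX PML and let it accept any transport/device
export UCX_NET_DEVICES=eth0
mpirun -mca pml ucx -mca pml_ucx_tls any -mca pml_ucx_devices any \
       IMB-MPI1 pingpong -iter_policy off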

@bertiethorpe
Author

ucxlog2.txt

So that seems to have done the trick. I'm now getting the latency I expected.

@bertiethorpe
Author

It seems OpenMPI is not using the UCX component when UCX_NET_DEVICES=eth0, due to the higher priority of OpenMPI's btl/openib component, which also uses RDMA.

Where can you see this in the logs? Forgive my ignorance, but I can't actually see that the btl openib component is available at all. Was it removed in v4.1.x?

ompi_info |  grep btl
                 MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)

This is all I see.

@yosefe
Contributor

yosefe commented Nov 10, 2024

It seems OpenMPI is not using the UCX component when UCX_NET_DEVICES=eth0, due to the higher priority of OpenMPI's btl/openib component, which also uses RDMA.

Where can you see this in the logs? Forgive my ignorance, but I can't actually see that the btl openib component is available at all. Was it removed in v4.1.x?

ompi_info |  grep btl
                 MCA btl: ofi (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: self (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: tcp (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                 MCA btl: vader (MCA v2.1.0, API v3.1.0, Component v4.1.5)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v4.1.5)

This is all I see.

Sorry, I meant the btl/tcp component, not btl/openib.
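
For what it's worth, one way to list the PML components and their relative selection priorities (a generic OpenMPI 4.x query, not specific to this issue) is:

# show priority parameters for all PML components (e.g. pml_ucx_priority, pml_ob1_priority)
ompi_info --param pml all --level 9 | grep -i priority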
