Error in mlx5dv_create_qp in the DC transport #5749
Comments
@lyu this is an MLNX_OFED issue. Can you pls try MLNX_OFED 5.1-2 or higher?
@yosefe Sorry can't do, I don't have admin access. |
Similar issues here with TX2 and MLNX_OFED_LINUX-5.1-0.6.6.0 I just built UCX |
@yosefe it reproduces on the course setup in HUJI, which doesn't have MOFED (I can give you access to it, it's 4 hosts with CX3). Looks like UCX incorrectly detects DC is supported on the device when in fact it isn't, and fails the entire worker when DCT creation fails. This is still relevant to the current master branch. BTW - a simpler workaround (without rebuilding UCX) would be setting UCX_TLS (this is what I do on that setup until this is resolved). |
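The UCX_TLS workaround mentioned above could look roughly like the sketch below. The transport list `rc,ud,sm,self` is an assumption (any allow-list that leaves out `dc` should have the same effect), and `run_without_dc` is a hypothetical helper name, not part of UCX.

```shell
# Transport allow-list that leaves out dc.
# Assumption: rc, ud, sm and self are sufficient on this host; adjust as needed.
UCX_TLS_NO_DC="rc,ud,sm,self"

# Hypothetical helper: run any command with UCX restricted to that list.
run_without_dc() {
    env UCX_TLS="$UCX_TLS_NO_DC" "$@"
}

# Show the value a child process would see.
run_without_dc sh -c 'echo "UCX_TLS=$UCX_TLS"'
```

An application or benchmark would then be started as `run_without_dc <command>`. With an MPI launcher the variable may also need to be forwarded to remote ranks explicitly (for example with Open MPI's `-x UCX_TLS`).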
@alex--m can you pls upload the output of "ucx_info -dvb" command, with UCX_LOG_LEVEL=data? |
|
Also, this ibstat output may be helpful:
|
@alex--m I see there is a Connect-IB card (which supports DC), and the DC iface creation does not report any error.
@yosefe this looks problematic (and also OMPI crashes during MPI_Init() on worker creation following a similar DCT failure):
|
@alex--m it seems this could be an issue with scatter-to-CQE initialization on DC.
Kernel 5.4.81, no MOFED, rdma-core version unclear (no privileges to check, ofed_info not installed). Any ideas how to query it? P.S. I can help you or a member of your team connect to those servers, if it helps.
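One unprivileged way to guess the rdma-core version is to look at the file name of the installed libibverbs. This is a sketch under the assumption that the library comes from upstream rdma-core, where the real file is typically named like `libibverbs.so.1.<abi>.<package version>`; vendor builds may use a different naming scheme. Both function names are hypothetical.

```shell
# Print the path of libibverbs.so.1 as recorded in the dynamic loader cache.
find_libibverbs() {
    ldconfig_bin=$(command -v ldconfig || echo /sbin/ldconfig)
    "$ldconfig_bin" -p 2>/dev/null | awk '/libibverbs\.so\.1 /{print $NF; exit}'
}

# Extract the trailing package version from a name such as
# libibverbs.so.1.5.22.4 (assumed upstream rdma-core naming).
ibverbs_pkg_version() {
    basename "$1" | sed -n 's/^libibverbs\.so\.[0-9]*\.[0-9]*\.\(.*\)$/\1/p'
}

lib=$(find_libibverbs)
if [ -n "$lib" ]; then
    # Resolve symlinks so the fully versioned file name is visible.
    ibverbs_pkg_version "$(readlink -f "$lib")"
fi
```

If nothing is printed, the library is either not in the loader cache or does not follow the assumed naming.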
Yes can u pls send access info by mail? |
I don't have access to 'rpm' executable (it's a netboot image). I'll send info shortly. |
I have the same problems, let me know if further details are required. UCX version: 1.11.2 Error:
lsb_release -a |
could you please check dmesg for syndrome - this is how-to do it |
Thanks, but I don't have sudo access, so changing the dynamic debug isn't going to work? |
@Artemy-Mellanox I got help from our sys-admin, here is the attached output: Output of dmesg
|
was there some problem with enabling dynamic debug? |
I am not sure :) I'll ask again. |
Hi here is the "sys-admin". aarch64 without ofed: okay Hope it helps..... Best regard, Sebastian ofed-x86-64.txt |
@Artemy-Mellanox any update here? |
I have an update......(when not using OFED), then upgrading the mellanox-firmware solves the problem.
Best regards, Sebastian |
Maybe you can attach from old kernel fw config: |
Hi okay...here you go... - installation is without ofed and a fresh
|
Can you please run failing ucx_info with strace |
Yes....attached. Thanks :) |
Can you please send output of |
here it is:
|
Can you please verify that in this case it's still same libibverbs version and not some other, probably installed by OFED |
This machine is and was OFED-free. Just compiled a fresh openucx on this box....
|
please try |
|
It looks like this libibverbs version has a bug. Could you please update it? Version 22.4-2 or later should include the fix.
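A quick way to compare an installed version against the 22.4 threshold is sketched below. `version_ge` is a hypothetical helper, it relies on `sort -V` (available in GNU coreutils), and it compares only the upstream version number, not the package release suffix such as `-2`.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on version-aware sorting (sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Example with a placeholder value; on an RPM-based system the real one
# could be taken from the output of: rpm -q libibverbs
installed="22.4"
if version_ge "$installed" "22.4"; then
    echo "libibverbs $installed is at or above 22.4"
else
    echo "libibverbs $installed is older than 22.4"
fi
```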
Yes.... thanks a lot...confirmed.... using the updating the infiniband-packages which are coming with Scientific-Linux-7.9 are
Case closed.... (at least from our side...) Thanks again + Best regards, Sebastian |
Describe the bug
ucx_info and ucx_perftest report:
dc_mlx5.c:329 UCX ERROR mlx5dv_create_qp(mlx5_0:1, DCI): failed: Invalid argument
Steps to Reproduce
UCX version:
UCT version=1.10.0 revision c7add93
UCX build config:
--prefix=$PREFIX --enable-debug --enable-assertions --enable-params-check --enable-frame-pointer --enable-backtrace-detail
Setup and versions
lsb_release -a:
ofed_info -s: MLNX_OFED_LINUX-5.1-0.6.6.0
rpm -q rdma-core: rdma-core-51mlnx1-1.51066.aarch64
rpm -q libibverbs: libibverbs-51mlnx1-1.51066.aarch64
Additional information (depending on the issue)
For ucx_info -d, this happens when it tries to print info about the dc_mlx5 transport.
For ucx_perftest, it happens when running any UCP test without any environment variable set.
All issues go away if I add --without-dc to the configure script.
This doesn't happen with UCX 1.9.0, where the dc transport is enabled and works correctly.
This also doesn't happen when built against MLNX_OFED_LINUX-4.5-1.0.1.0 on another ThunderX2 machine, but it looks like dc is automatically disabled there.
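The rebuild without DC support could be scripted roughly as follows. The flags are the ones from the build config above plus `--without-dc`; the default install prefix and the `-j4` job count are assumptions, and the script only acts when run from a UCX source tree that already has a generated `configure` script.

```shell
# Install prefix: hypothetical default, override by setting PREFIX.
PREFIX=${PREFIX:-$HOME/ucx-install}

# Build flags from the report, with DC support disabled.
UCX_CONFIGURE_FLAGS="--prefix=$PREFIX --enable-debug --enable-assertions \
--enable-params-check --enable-frame-pointer --enable-backtrace-detail \
--without-dc"

if [ -x ./configure ]; then
    # Unquoted on purpose so the flags are split into separate arguments
    # (this assumes PREFIX contains no spaces).
    ./configure $UCX_CONFIGURE_FLAGS && make -j4 && make install
else
    echo "No ./configure here; run this from the UCX source tree." >&2
fi
```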