TL/MLX5: various optimizations #1012

samnordmann · 2024-08-21T11:32:29Z

What

This PR contains various optimizations for TL/MLX5/a2a. In order of importance/relevance:

- support rectangular blocks
- other configurations in how we post the WQEs:
  - iterate across nodes before blocks when posting the WQEs
  - reuse dm chunks
  - send blocks by batch
- knomial fan-in for the internode sync
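
For readers unfamiliar with the knomial fan-in mentioned in the last item, here is a minimal sketch of how parent and child ranks can be derived in a k-nomial tree rooted at rank 0. It is not taken from this PR's code: the function names `knomial_parent` and `knomial_children` are hypothetical, and the actual implementation in TL/MLX5 may order or root the tree differently.

```c
/* Hypothetical helpers illustrating a k-nomial tree rooted at rank 0.
 * Not part of UCC; names and layout are assumptions for illustration. */

/* Parent of `rank`: clear the lowest non-zero base-`radix` digit.
 * Returns -1 for the root. */
int knomial_parent(int rank, int radix)
{
    int dist = 1;

    if (rank == 0) {
        return -1;
    }
    /* Skip trailing zero digits of rank in base radix. */
    while ((rank / dist) % radix == 0) {
        dist *= radix;
    }
    return rank - ((rank / dist) % radix) * dist;
}

/* Writes the children of `rank` into `children` and returns their count.
 * A rank owns children only at digit positions below its lowest
 * non-zero digit (every position for the root). */
int knomial_children(int rank, int size, int radix, int *children)
{
    int n = 0;
    int dist, j;

    for (dist = 1; dist < size; dist *= radix) {
        if ((rank / dist) % radix != 0) {
            break; /* reached the digit that links rank to its parent */
        }
        for (j = 1; j < radix; j++) {
            int child = rank + j * dist;
            if (child < size) {
                children[n++] = child;
            }
        }
    }
    return n;
}
```

In a fan-in, each rank would wait for a signal from every child and then signal its parent, so the root learns that all ranks have arrived after roughly log_radix(size) steps instead of size - 1.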

We might want to merge this PR as is, or split it into several smaller ones. Either way, this branch is at least a working version that can be used as is for performance experiments.

TODO:

One important optimization that is yet to be implemented is to support using several NICs. So far, our algorithm only uses one NIC.

cc @lappazos @x41lakazam

MamziB · 2024-09-25T18:50:57Z

src/components/tl/mlx5/alltoall/alltoall_coll.c

+    ucc_tl_mlx5_alltoall_t *a2a       = team->a2a;
+    int                     node_size = a2a->node.sbgp->group_size;
+    int                     net_size  = a2a->net.sbgp->group_size;
+    int op_msgsize = node_size * a2a->max_msg_size * UCC_TL_TEAM_SIZE(team) *


here git-clang-format reverts to this form, which I also find more readable

MamziB · 2024-09-25T18:51:27Z

src/components/tl/mlx5/alltoall/alltoall_coll.c

+    int          block_msgsize = block_h * block_w * task->alltoall.msg_size;
+    ucc_status_t status        = UCC_OK;
+    int          node_grid_w   = node_size / block_w;
+    int node_nbr_blocks        = (node_size * node_size) / (block_h * block_w);


here git-clang-format reverts to this form, which I also find more readable
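
As a worked example of the block arithmetic in the hunk above, the sketch below reproduces the two derived quantities for rectangular blocks. The struct `block_grid_t` and the function `block_grid` are hypothetical names introduced only for illustration, and the sketch assumes that `block_w` and `block_h` both divide `node_size` evenly.

```c
/* Hypothetical helper, not part of UCC: mirrors the expressions for
 * node_grid_w and node_nbr_blocks shown in the diff hunk. */
typedef struct {
    int grid_w;     /* number of blocks per row of the node_size x node_size grid */
    int nbr_blocks; /* total number of blocks covering the grid */
} block_grid_t;

block_grid_t block_grid(int node_size, int block_h, int block_w)
{
    block_grid_t g;

    /* Assumes block_w and block_h divide node_size evenly. */
    g.grid_w     = node_size / block_w;
    g.nbr_blocks = (node_size * node_size) / (block_h * block_w);
    return g;
}
```

With `node_size = 8`, `block_h = 2` and `block_w = 4`, this gives 2 blocks per row and 8 blocks in total; square blocks are the special case `block_h == block_w`.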

src/components/tl/mlx5/tl_mlx5.c

src/components/tl/mlx5/tl_mlx5_dm.c

swx-jenkins3 · 2024-12-07T04:21:43Z

Can one of the admins verify this patch?

TL/MLX5: add npolls cfg for FANIN
TL/MLX5: knomial fanin
TL/MLX5: add prints and profile events
TL/MLX5: remove debug prints

tiny bit more robust print blocks dimensions fully working configurable batch_size, serialization, and pollings

clean

clean and working TL/MLX5: add more config for block dimensions force longer by default

lintrunner cleaning

janjust requested review from MamziB and janjust September 19, 2024 15:23