Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: various optimizations #1012

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

samnordmann
Copy link
Collaborator

@samnordmann samnordmann commented Aug 21, 2024

What

This PR contains various optimizations for TL/MLX5/a2a. In order of importance/relevance:

  1. support rectangular blocks
  2. other configurations in how we post the WQEs:
    • iterate across nodes before blocks when posting the WQEs
    • reuse dm chunks
    • send blocks by batch
  3. knomial fan-in for the internode sync

We might want to merge this PR as is, or to divide it into several smaller ones. But this branch is at least a pointer for a working version, that can be used as is for performance experimentation.

TODO:

One important optimization that is yet to be implemented is to support using several NICs. So far, our algorithm only uses one NIC.

cc @lappazos @x41lakazam

@janjust janjust requested review from MamziB and janjust September 19, 2024 15:23
ucc_tl_mlx5_alltoall_t *a2a = team->a2a;
int node_size = a2a->node.sbgp->group_size;
int net_size = a2a->net.sbgp->group_size;
int op_msgsize = node_size * a2a->max_msg_size * UCC_TL_TEAM_SIZE(team) *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here git-clang-format reverts to this form, which I also find more readable

int block_msgsize = block_h * block_w * task->alltoall.msg_size;
ucc_status_t status = UCC_OK;
int node_grid_w = node_size / block_w;
int node_nbr_blocks = (node_size * node_size) / (block_h * block_w);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

align

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here git-clang-format revert to this form, which I also find more readable

@swx-jenkins3
Copy link

Can one of the admins verify this patch?

TL/MLX5: add npolls cfg for FANIN

TL/MLX5: knomial fanin

TL/MLX5: add prints and profile events

TL/MLX5: remove debug prints
@samnordmann samnordmann force-pushed the tl/mlx5/pr/main_clean_history branch from 32d0718 to 43dd1d7 Compare December 18, 2024 16:48
tiny bit more robust

print blocks dimensions

fully working configurable batch_size, serialization, and pollings
clean and working

TL/MLX5: add more config for block dimensions

force longer by default
@samnordmann samnordmann force-pushed the tl/mlx5/pr/main_clean_history branch from 43dd1d7 to e12410c Compare December 30, 2024 18:27
@samnordmann samnordmann force-pushed the tl/mlx5/pr/main_clean_history branch from e12410c to 8312300 Compare December 30, 2024 19:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants