Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: adding mcast allgather staging based algo #994

Merged

Conversation

MamziB
Copy link
Collaborator

@MamziB MamziB commented Jun 27, 2024

What

add the algorithm for mcast-based allgather

Why

scalability and performance improvement over sw based allgather

How ?

Realizing the Allgather operations as N (team-size) concurrent Bcast operations (every process becomes the root). We use the one-sided design that was merged before for its reliability.

@MamziB MamziB requested review from samnordmann and Sergei-Lebedev and removed request for samnordmann June 27, 2024 18:54
@MamziB MamziB self-assigned this Jun 27, 2024
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't have time to review everything yet, but here is a first quick round of review. I see several natural ways to make the PR smaller -- more than 1k lines is a large pr, so please try to break it down whenever possible.

src/components/tl/mlx5/tl_mlx5.c Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5_coll.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5_coll.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast.h Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_team.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5.h Outdated Show resolved Hide resolved
@samnordmann
Copy link
Collaborator

The CI has relevant issues

@MamziB
Copy link
Collaborator Author

MamziB commented Jul 1, 2024

Hi @samnordmann Thanks for the constructive comments. I have addressed those and added a new commit (second commit). Please take a look and let me know your thoughts.

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for addressing the comment! Here is another round

src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
@MamziB
Copy link
Collaborator Author

MamziB commented Jul 3, 2024

Thanks @samnordmann for today's comments. I have added a new separate commit. Please take a look. Also, I have yet to cut something from this PR to make it shorter.

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for addressing the comments

@MamziB
Copy link
Collaborator Author

MamziB commented Jul 3, 2024

#994 (comment)
@samnordmann I would rather bail out gracefully, so let me throw an error if mcast req is NULL. Please look at the last commit.

@MamziB MamziB force-pushed the mamzi/mcast-allgather-3-part-1 branch from 711f798 to 61b99b6 Compare July 3, 2024 22:05
src/components/tl/mlx5/tl_mlx5_coll.h Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5.c Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
@MamziB MamziB force-pushed the mamzi/mcast-allgather-3-part-1 branch from 13e6845 to 75576ee Compare July 16, 2024 18:23
@MamziB
Copy link
Collaborator Author

MamziB commented Jul 16, 2024

Thank you Sergey and Sam. All the comments have been resolved.

@manjugv manjugv requested a review from wfaderhold21 July 16, 2024 18:46
@MamziB
Copy link
Collaborator Author

MamziB commented Aug 20, 2024

@ Sergei-Lebedev can you also look at this PR and let me know if you have more comments? Thank you, Sergey!

@MamziB
Copy link
Collaborator Author

MamziB commented Sep 16, 2024

@Sergei-Lebedev Can you please let me know if you have more comments on this? So that we can continue implementing the rest of the design. This is a blocker for future PRs. Thank you!

@manjugv
Copy link
Contributor

manjugv commented Sep 16, 2024

Ping @wfaderhold21 . Can you please review?

@janjust
Copy link
Collaborator

janjust commented Sep 30, 2024

@wfaderhold21 Can we get this PR reviewed?

src/components/tl/mlx5/tl_mlx5_coll.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/tl_mlx5_coll.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
Copy link
Collaborator

@wfaderhold21 wfaderhold21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Minor comments...

src/components/tl/mlx5/mcast/tl_mlx5_mcast_coll.c Outdated Show resolved Hide resolved
src/components/tl/mlx5/mcast/tl_mlx5_mcast_allgather.c Outdated Show resolved Hide resolved
@MamziB
Copy link
Collaborator Author

MamziB commented Oct 21, 2024

Thanks @wfaderhold21 and @Sergei-Lebedev for the comments. I push the changes.

@MamziB MamziB force-pushed the mamzi/mcast-allgather-3-part-1 branch 3 times, most recently from e7a7bcd to 39f3069 Compare October 22, 2024 17:00
@MamziB MamziB force-pushed the mamzi/mcast-allgather-3-part-1 branch from 39f3069 to 30749be Compare October 22, 2024 17:49
@MamziB MamziB force-pushed the mamzi/mcast-allgather-3-part-1 branch 3 times, most recently from edb71ae to c584457 Compare November 5, 2024 18:09
@MamziB MamziB force-pushed the mamzi/mcast-allgather-3-part-1 branch from c584457 to 867f828 Compare November 6, 2024 22:08
@MamziB
Copy link
Collaborator Author

MamziB commented Nov 6, 2024

Hi @Sergei-Lebedev Thank you for your constructive comments. I have addressed all of them. Thank you.

@MamziB MamziB force-pushed the mamzi/mcast-allgather-3-part-1 branch from 867f828 to 6e85c33 Compare November 6, 2024 22:12
@Sergei-Lebedev
Copy link
Contributor

Hi @Sergei-Lebedev Thank you for your constructive comments. I have addressed all of them. Thank you.

Thanks @MamziB

@Sergei-Lebedev Sergei-Lebedev merged commit 4c5ac3e into openucx:master Nov 7, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants