Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: fix memtype in bcast reliability #1022

Merged
merged 1 commit into from
Oct 24, 2024

Conversation

MamziB
Copy link
Collaborator

@MamziB MamziB commented Sep 19, 2024

TL/MLX5: fix memtype in bcast reliability

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand it is a fix, but I am surprised we didn't catch that bug before. It seems that we are not testing mcast bcast cuda memory at all, is it correct? If so, can you please add a test?

@MamziB
Copy link
Collaborator Author

MamziB commented Sep 20, 2024

@samnordmann you are correct, it was a shortcoming from my side. Reliablity protocol kicks when a packet is dropped and it does not happen in normal scenarios.
Previously, I tried forcing the dropped packet in the HOST version for testing purposes, but not for GPU. Since we saw a dropped packet on the ROCK system, where MTT runs, we saw this error, and here is a fix for that.

@samnordmann
Copy link
Collaborator

@samnordmann you are correct, it was a shortcoming from my side. Reliablity protocol kicks when a packet is dropped and it does not happen in normal scenarios. Previously, I tried forcing the dropped packet in the HOST version for testing purposes, but not for GPU. Since we saw a dropped packet on the ROCK system, where MTT runs, we saw this error, and here is a fix for that.

Can you please add a test, covering both Host and CUDA memtype, checking that the reliability is not boggy?

@MamziB
Copy link
Collaborator Author

MamziB commented Sep 23, 2024

@samnordmann OK I will open a separate PR soon

@janjust
Copy link
Collaborator

janjust commented Sep 23, 2024

@MamziB no need for a separate PR, just push the commit here

@MamziB
Copy link
Collaborator Author

MamziB commented Sep 23, 2024

@janjust I have added the gtest before but it fails the github ucc tests for some unknown reasons. I rather open a new PR so that it does not affect this PR as I would like this PR to go in ASAP. I will open a PR today for that gtest changes.

@janjust
Copy link
Collaborator

janjust commented Sep 23, 2024

I don't think this PR will go in until the test is in, open the other PR, but then this PR will be predicated on the test going in.

@MamziB
Copy link
Collaborator Author

MamziB commented Sep 23, 2024

I opened #1023 for testing mcast

@janjust
Copy link
Collaborator

janjust commented Sep 23, 2024

great, thanks -now the hard part, figuring out why the other is failing

@janjust
Copy link
Collaborator

janjust commented Sep 23, 2024

@samnordmann @Sergei-Lebedev
#1023

Here is the test

@samnordmann
Copy link
Collaborator

@samnordmann @Sergei-Lebedev #1023

Here is the test

Thank you, but shouldn't we also test the reliability protocol in case of dropped packets? The present PR fixes a bug that we cannot catch nor reproduce, so how can we ensure that it is effectively fixed?

I don't want to block the PR if there are some time considerations though, so let me know what you guys think

@MamziB
Copy link
Collaborator Author

MamziB commented Sep 23, 2024

@samnordmann Sure, I will update the other PR to force the bcast to have dropped packets

@MamziB MamziB requested a review from samnordmann September 26, 2024 18:03
@MamziB MamziB force-pushed the mamzi/reliability-memtype-fix branch 2 times, most recently from ff20b59 to e1979b1 Compare September 28, 2024 22:46
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@MamziB MamziB requested a review from janjust October 7, 2024 17:02
@MamziB
Copy link
Collaborator Author

MamziB commented Oct 9, 2024

@janjust can you please let me know if you have comments on this?

@janjust
Copy link
Collaborator

janjust commented Oct 9, 2024

@MamziB not specifically on this PR, but I'm still curious why the other PR, the test is not failing, I haven't investigated it yet.

@MamziB
Copy link
Collaborator Author

MamziB commented Oct 9, 2024

@janjust thanks Tommy. Since I am preparing a PR which is gonna be based on this PR, can we merge this PR so that we can proceed?

@janjust janjust enabled auto-merge October 16, 2024 20:56
auto-merge was automatically disabled October 21, 2024 18:53

Head branch was pushed to by a user without write access

@MamziB MamziB force-pushed the mamzi/reliability-memtype-fix branch 2 times, most recently from 3e87fd9 to e81d9a8 Compare October 21, 2024 19:00
@MamziB MamziB force-pushed the mamzi/reliability-memtype-fix branch from e81d9a8 to d53725a Compare October 21, 2024 19:03
@MamziB
Copy link
Collaborator Author

MamziB commented Oct 21, 2024

@Sergei-Lebedev i rebased it

@janjust janjust merged commit 070eb64 into openucx:master Oct 24, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants