Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TL/MLX5: add gtest for mcast #1023

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

MamziB
Copy link
Collaborator

@MamziB MamziB commented Sep 23, 2024

TL/MLX5: add gtest for mcast

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think those tests are run... On both Azure and github CI, tl/mlx5/bcast seem to have trouble initializing. Take a look at the logs

test/gtest/coll/test_bcast.cc Outdated Show resolved Hide resolved
test/gtest/coll/test_bcast.cc Show resolved Hide resolved
@MamziB MamziB force-pushed the mamzi/mcast-test branch 2 times, most recently from 1b5a340 to e5d263d Compare September 25, 2024 17:47
@MamziB
Copy link
Collaborator Author

MamziB commented Sep 25, 2024

@samnordmann thanks Sam for the comments. I updated the PR

@samnordmann
Copy link
Collaborator

I still think that those tests are not run. Am I missing something?

@MamziB
Copy link
Collaborator Author

MamziB commented Sep 26, 2024

So we have multiple algorithms in this file for Bcast, I added MCAST as new algorithm to the list. Not sure what else I should do.

@samnordmann
Copy link
Collaborator

Can you please re-trigger the CI?

@MamziB
Copy link
Collaborator Author

MamziB commented Oct 1, 2024

@samnordmann done

@nsarka
Copy link
Collaborator

nsarka commented Oct 3, 2024

This is probably not a change for this PR, but should we update the gtest to show which tests are run instead of just a number?

For example,

diff --git a/test/gtest/coll/test_bcast.cc b/test/gtest/coll/test_bcast.cc
index 6d80816..faf4318 100644
--- a/test/gtest/coll/test_bcast.cc
+++ b/test/gtest/coll/test_bcast.cc
@@ -276,7 +276,20 @@ ucc_job_env_t two_step_env = {{"UCC_CL_HIER_TUNE", "bcast:@2step:0-inf:inf"},
                               {"UCC_CLS", "all"}};
 ucc_job_env_t dbt_env      = {{"UCC_TL_UCP_TUNE", "bcast:@dbt:0-inf:inf"},
                               {"UCC_CLS", "basic"}};
-INSTANTIATE_TEST_CASE_P(
+
+struct PrintToStringParamNameEdited {
+  static bool remove_non_alphanumeric(char c) {
+    return !isalnum(c) && c != '_';
+  }
+  template <class Param_2>
+  std::string operator()(const testing::TestParamInfo<Param_2>& info) const {
+    std::string s = ::testing::PrintToString(info.param);
+    s.erase(std::remove_if(s.begin(), s.end(), remove_non_alphanumeric), s.end());
+    return s;
+  }
+};
+
+INSTANTIATE_TEST_SUITE_P(
     , test_bcast_alg,
     ::testing::Combine(
 #ifdef HAVE_CUDA
@@ -287,4 +300,4 @@ INSTANTIATE_TEST_CASE_P(
 #endif
         ::testing::Values(two_step_env, dbt_env), //env
         ::testing::Values(8, 65536), // count
-        ::testing::Values(15,16))); // n_procs
+        ::testing::Values(15,16)), PrintToStringParamNameEdited()); // n_procs

This will make it so instead of the test run looking like:

[2024-10-01T18:20:04.150Z] [----------] 72 tests from test_bcast_alg
[2024-10-01T18:20:04.150Z] [ RUN      ] test_bcast_alg.0

it will look like this:

[----------] 24 tests from test_bcast_alg
[ RUN      ] test_bcast_alg.0UCC_CL_HIER_TUNEbcast2step0infinfUCC_CLSall815

for every test

Copy link
Collaborator

@janjust janjust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test in on itself works as intended, the question is is there a fallback in the infra, hece why we're not seeing a failure.
@B-a-S , Where can we access these machines? We need to verify if we're actually running MCAST enabled test suite.

@MamziB MamziB requested a review from samnordmann October 21, 2024 17:27
@samnordmann
Copy link
Collaborator

The test in on itself works as intended, the question is is there a fallback in the infra, hece why we're not seeing a failure. @B-a-S , Where can we access these machines? We need to verify if we're actually running MCAST enabled test suite.

I thought the conclusion was that the CI on this PR should fail and reveal the bug fixed in #1022. Is there any update on this? Do we want to merge this PR regardless?

@janjust
Copy link
Collaborator

janjust commented Oct 22, 2024

Hey Sam, you are correct this PR should in fact fail, and it gives a false negative. The false negative seems to be that we're not running MCAST and instead are falling back or just ignoring the TL_MLX5. I need to dig further and haven't had a chance to look at it. But outside of the CI the test indeed fails.
I think we should merge this test, however. Just keep in mind about the general CI of any MCAST tests. I'll follow up on this.
But as is, I think it's ok to merge.

@janjust
Copy link
Collaborator

janjust commented Nov 14, 2024

@MamziB rebase please

@swx-jenkins3
Copy link

Can one of the admins verify this patch?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants