Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call init/destroy_process_group once. #3143

Merged
merged 7 commits into from
Oct 15, 2024
Merged

Call init/destroy_process_group once. #3143

merged 7 commits into from
Oct 15, 2024

Conversation

wujingyue
Copy link
Collaborator

Fixes #3129.

This is done by using fixtures.

It's unclear why this avoids #3129. However, this appears to be the best practice according to https://pytorch.org/docs/stable/distributed.html#shutdown, and is more efficient.

@@ -34,7 +34,7 @@ def barrier(self):
self._communicator.barrier()


@pytest.fixture
@pytest.fixture(scope="session")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because setup_process_group, has module scope, uses mpi_test, mpi_test must have module or session scope.

@wujingyue
Copy link
Collaborator Author

!build

@wujingyue wujingyue requested a review from samnordmann October 9, 2024 05:38
Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not using Communicator for those tests? The behavior of this test differs from all the other tests, which is confusing and add significant technical debt. The present present pr is an example of the technical debt incurred by managing another pg.

Using Communicator would reuse and exercise our infrastructure, have a cheaper overhead, and would uniformize the behavior of this test with the others.

For example, with the way the test is implemented today: we need OMPI to be installed, mpirun to be the launcher, cannot run it on a multi-node system, cannot set the port, etc.

tests/python/test_transformer_engine.py Outdated Show resolved Hide resolved
@@ -23,6 +23,21 @@ class ComputeType(Enum):
BACKWARD = auto()


@pytest.fixture(scope="module")
def setup_process_group(mpi_test) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move to mpi_fixtures.py?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do that if I have to duplicate the code. This test is an exception -- normally we want nvFuser to set up process groups.

@@ -32,13 +47,10 @@ class ComputeType(Enum):
@pytest.mark.mpi
@pytest.mark.parametrize(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure but it might work to specify scope=function here and remove the scope arguments from the two fixtures

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand your suggestion. AFAIK, scope is an attribute of a fixture, not of a test.

tests/python/test_transformer_engine.py Outdated Show resolved Hide resolved
tests/python/test_transformer_engine.py Outdated Show resolved Hide resolved
@wujingyue
Copy link
Collaborator Author

wujingyue commented Oct 9, 2024

Why not using Communicator for those tests?

Good question!

test_transformer_engine.py in particular is an exception. It serves as a performance baseline and doesn't need to depend on nvFuser (and thus Communicator). That lack of dependency helped me once when I gave this as a repro to the TE team. So, strictly speaking, test_transformer_engine.py doesn't even need to be in the nvFuser repo. I put it here because it's the repo we've been collaborating on and it's easy to refer to the baseline when it's in the same repo.

If by "those tests" you mean other Python distributed tests, that's a good idea. I'm running into some problems with mpi4py that can be solved by this.

@wujingyue
Copy link
Collaborator Author

!build

1 similar comment
@wujingyue
Copy link
Collaborator Author

!build

@wujingyue wujingyue requested a review from samnordmann October 10, 2024 18:29
@wujingyue
Copy link
Collaborator Author

I'm too busy with #2199 and the Communicator change will have to come after that. Nonetheless, this PR alone fixes a bug and restores the performance baseline, so I recommend we merge this and clean up the mpi_test fixture after October.

@wujingyue wujingyue requested a review from cowanmeg October 14, 2024 23:13
@samnordmann
Copy link
Collaborator

If by "those tests" you mean other Python distributed tests, that's a good idea. I'm running into some problems with mpi4py that can be solved by this.

I am actually talking about both this tests and other tests, but I agree (and didn't realized) that it is probably less important for the present tests.

Copy link
Collaborator

@samnordmann samnordmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delay!

@wujingyue wujingyue merged commit 23e9dcf into main Oct 15, 2024
45 checks passed
@wujingyue wujingyue deleted the wjy/bug3129 branch October 15, 2024 17:00
@wujingyue wujingyue added the bug Something isn't working label Nov 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_transformer_engine:test_transformer_layer[backward] hangs.
2 participants