Call init/destroy_process_group once. #3143
@@ -34,7 +34,7 @@ def barrier(self):
         self._communicator.barrier()
 
 
-@pytest.fixture
+@pytest.fixture(scope="session")
Because setup_process_group, which has module scope, uses mpi_test, mpi_test must have module or session scope.
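A minimal sketch of the scoping rule mentioned above, for illustration only. The fixture bodies are placeholders (the real mpi_test and setup_process_group do actual MPI and process-group work), and test_uses_process_group is a hypothetical test name.

```python
import pytest


@pytest.fixture(scope="session")
def mpi_test():
    # Placeholder for the real fixture, which wraps MPI state.
    yield object()


@pytest.fixture(scope="module")
def setup_process_group(mpi_test):
    # A module-scoped fixture may only request fixtures of module scope or
    # broader. If mpi_test kept the default function scope, pytest would
    # fail with a ScopeMismatch error when setting up this fixture.
    yield


def test_uses_process_group(setup_process_group):
    # The fixture is set up once per module, not once per test.
    pass
```

Dropping scope="session" from mpi_test in this sketch should reproduce the ScopeMismatch error.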
!build
Why not use Communicator for those tests? The behavior of this test differs from all the other tests, which is confusing and adds significant technical debt. The present PR is an example of the technical debt incurred by managing another process group.
Using Communicator would reuse and exercise our infrastructure, have lower overhead, and make this test behave consistently with the others.
For example, with the way the test is implemented today, we need OMPI to be installed and mpirun to be the launcher, and we cannot run it on a multi-node system, cannot set the port, etc.
@@ -23,6 +23,21 @@ class ComputeType(Enum):
     BACKWARD = auto()
 
 
+@pytest.fixture(scope="module")
+def setup_process_group(mpi_test) -> None:
Maybe move to mpi_fixtures.py?
I'll do that if I have to duplicate the code. This test is an exception -- normally we want nvFuser to set up process groups.
@@ -32,13 +47,10 @@ class ComputeType(Enum):
 @pytest.mark.mpi
 @pytest.mark.parametrize(
I am not sure, but it might work to specify scope=function here and remove the scope arguments from the two fixtures.
I'm not sure I understand your suggestion. AFAIK, scope is an attribute of a fixture, not of a test.
Good question! test_transformer_engine.py in particular is an exception. It serves as a performance baseline and doesn't need to depend on nvFuser (and thus
If by "those tests" you mean other Python distributed tests, that's a good idea. I'm running into some problems with mpi4py that can be solved by this.
!build
I'm too busy with #2199 and the Communicator change will have to come after that. Nonetheless, this PR alone fixes a bug and restores the performance baseline, so I recommend we merge this and clean up the mpi_test fixture after October.
I am actually talking about both this test and the other tests, but I agree (and hadn't realized) that it is probably less important for the present tests.
Sorry for the delay!
Fixes #3129.
This is done by using fixtures.
It's unclear why this avoids #3129. However, this appears to be the best practice according to https://pytorch.org/docs/stable/distributed.html#shutdown, and is more efficient.
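As a rough illustration of the fixture-based approach described above, the sketch below initializes and destroys a process group once per test module. It is not the code from this PR. The helper rank_and_world_size, the "gloo" backend, the address tcp://localhost:29500, and the test name are all assumptions made for the example. The environment variables are the ones Open MPI's mpirun exports.

```python
import os

import pytest


def rank_and_world_size(env=os.environ):
    # Hypothetical helper: read the rank and world size that Open MPI's
    # mpirun exports, falling back to a single-process run.
    rank = int(env.get("OMPI_COMM_WORLD_RANK", "0"))
    world_size = int(env.get("OMPI_COMM_WORLD_SIZE", "1"))
    return rank, world_size


@pytest.fixture(scope="module")
def setup_process_group():
    # Imported lazily so the module can be collected without PyTorch.
    import torch.distributed as dist

    rank, world_size = rank_and_world_size()
    dist.init_process_group(
        backend="gloo",  # assumption; a GPU test would more likely use "nccl"
        init_method="tcp://localhost:29500",  # assumed address and port
        rank=rank,
        world_size=world_size,
    )
    yield
    # Runs once, after the last test in the module has finished.
    dist.destroy_process_group()


def test_all_reduce(setup_process_group):
    import torch
    import torch.distributed as dist

    t = torch.ones(1)
    dist.all_reduce(t)  # defaults to a sum across ranks
    assert t.item() == dist.get_world_size()
```

With module scope, every test in the file shares one process group, so init_process_group and destroy_process_group each run once instead of once per parametrized test case.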