
Large error when using matmul in the distributed matmul test. #2460

Closed
wujingyue opened this issue Jun 26, 2024 · 3 comments
Labels
Multi-GPU · Testing (e.g. improving test infra and test coverage) · Triage

Comments

@wujingyue
Collaborator

> @wujingyue - I added the MLP test with the aten matmul. Note that the tolerance is bumped up a bit to pass validation.

Validation error in output 0 (linear1) on line 583 in file /tests/cpp/test_multidevice_matmul.cpp.
Detected abs error of: 0.122498
absolute tolerance was set to 0.005
and relative tolerance set to 5e-05

Validation error in output 2 (linear2) on line 583 in file tests/cpp/test_multidevice_matmul.cpp.
Detected abs error of: 4.08847
absolute tolerance was set to 2
and relative tolerance set to 0.02

Originally posted by @cowanmeg in #2360 (comment)
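For context, an absolute/relative tolerance pair is typically combined allclose-style: an element passes when the absolute error is at most `atol + rtol * |expected|`. This is a sketch of that convention, not necessarily the exact check nvFuser's validator performs; `within_tolerance` is a hypothetical helper:

```python
def within_tolerance(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """Allclose-style check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)

# The reported linear1 failure: abs error 0.122498 against atol=0.005, rtol=5e-05.
# Even an expected value of ~100 only raises the allowed error to 0.01,
# so an abs error of 0.122498 fails by an order of magnitude.
print(within_tolerance(100.122498, 100.0, atol=0.005, rtol=5e-05))  # False
```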

@wujingyue
Collaborator Author

wujingyue commented Jun 26, 2024

To reproduce the error, check out `wjy/error` (see 7eb2f43 for the change) and run `_bn && mpirun -np 2 bin/test_multidevice --gtest_filter=DistributedMatmulTest.MLP_Layer*`.

You'll see that `use_aten_matmul==true` leads to the following error, while `use_aten_matmul==false` passes within 5e-3.

Validation error in output 0 on line 583 in file /opt/pytorch/nvfuser/tests/cpp/test_multidevice_matmul.cpp.
  Detected abs error of: 0.122498
    absolute tolerance was set to 0.005
    and relative tolerance set to 5e-05

Note that `Detected abs error of: 0.122498` is not the max absolute error; the max is at least 4. This motivates a side feature request: print the max absolute error instead of the first one detected.
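The requested diagnostic can be sketched as follows. This is illustrative Python rather than nvFuser's actual C++ validator, and `max_abs_error` is a hypothetical helper:

```python
def max_abs_error(actual, expected):
    """Report the maximum absolute error over all elements,
    rather than the error of the first element that exceeds tolerance."""
    return max(abs(a - e) for a, e in zip(actual, expected))

# Made-up data mirroring the issue: the first out-of-tolerance element has
# an abs error of ~0.12, but the worst element is off by 4.
out = [0.9, 2.122498, 7.0]
ref = [1.0, 2.0, 3.0]
print(max_abs_error(out, ref))  # 4.0
```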

@wujingyue
Collaborator Author

cc @Priya2698

@wujingyue
Collaborator Author

The error has been much lower since #2905 and is less of a concern at the moment.
