XCCL process group for PyTorch #1
Conversation
You missed the xccl registration in https://github.com/Chao1Han/pytorch/blob/master/torch/distributed/distributed_c10d.py
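For reference, out-of-tree backends are usually hooked into distributed_c10d.py through Backend.register_backend; below is a minimal sketch of what an xccl registration could look like. The creator function and the ProcessGroupXCCL import path are illustrative assumptions, not the PR's actual code.

```python
# Illustrative sketch only: how a backend is typically registered with
# torch.distributed. The real PR wires xccl into distributed_c10d.py directly,
# so the exact hook-up may differ.
import torch.distributed as dist

def _create_xccl_process_group(store, rank, world_size, timeout):
    # Hypothetical binding; the actual ProcessGroupXCCL constructor and module
    # path may differ.
    from torch._C._distributed_c10d import ProcessGroupXCCL
    return ProcessGroupXCCL(store, rank, world_size)

dist.Backend.register_backend(
    "xccl",
    _create_xccl_process_group,
    devices=["xpu"],  # associate the backend with XPU devices
)
```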
There are some record-function calls in ProcessGroupNCCL. Could you check whether those are needed on XPU? We need to make sure the basic profiler can work with XPU distributed.
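As a quick sanity check for the profiler question, one could run a single collective under torch.profiler and confirm that a c10d record_function event shows up. A rough sketch follows; the event naming (something like "xccl:all_reduce", mirroring NCCL's "nccl:all_reduce") is an assumption.

```python
# Rough check that the profiler sees XPU collectives. Assumes an initialized
# "xccl" process group and a visible XPU device.
import torch
import torch.distributed as dist
from torch.profiler import profile, ProfilerActivity

def check_profiler_records_collective() -> None:
    t = torch.ones(16, device="xpu")
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        dist.all_reduce(t)
    # Look for the collective's record_function entry in the table, e.g. an
    # "xccl:all_reduce"-style row (the exact name depends on the
    # RECORD_PARAM_COMMS wiring being discussed above).
    print(prof.key_averages().table(sort_by="cpu_time_total"))
```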
@gujinghui @guangyey @zhangxiaoli73 please help review again.
TORCH_CHECK(false, "ProcessGroupXCCL::broadcast not implemented");
}

c10::intrusive_ptr<Work> allreduce_sparse(
We don't plan to support sparse, so could we skip this op in your ops registration for XPU?
removed
LGTM. Next step is to add UTs?
Yes, I will create a simple UT in test/distributed/test_c10d_xccl.py since only allreduce is enabled for now.
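One possible minimal shape for that test (a sketch only; the real UT will presumably build on the MultiProcessTestCase helpers used by the other c10d tests, and it assumes the "xccl" backend string and XPU devices are available):

```python
# Sketch for test/distributed/test_c10d_xccl.py: each rank runs one allreduce
# and checks the summed result.
import os
import torch
import torch.distributed as dist

def _run_allreduce(rank: int, world_size: int) -> None:
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("xccl", rank=rank, world_size=world_size)
    t = torch.full((4,), float(rank + 1), device=f"xpu:{rank}")
    dist.all_reduce(t)  # default reduce op is SUM
    expected = float(sum(range(1, world_size + 1)))
    assert torch.equal(t.cpu(), torch.full((4,), expected))
    dist.destroy_process_group()
```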
target_compile_definitions(torch_xpu PUBLIC USE_C10D_XCCL)
set_source_files_properties(
  ${TORCH_SRC_DIR}/csrc/distributed/c10d/ProcessGroupXCCL.cpp
  PROPERTIES COMPILE_DEFINITIONS "CCL_ENABLE_ZE;CCL_ENABLE_SYCL")
Those flags are just a workaround and will be removed after the oneCCL gcc build issue is fixed (targeting basekit 2025.0).
@Chao1Han Please make sure your PR includes: 1) non-MPI launcher support: pass the correct env options to oneCCL; 2) error handling: abort the work on timeout (this may need to call some oneCCL finalize API).
Please add a UT and also refine some implementation details (see my comments). @Chao1Han
blockingWait_ = getCvarBool(TORCH_XCCL_BLOCKING_WAIT, false);
init();

{
You need to check whether the launcher is IMPI or not. Only set CCL_LOCAL_RANK/CCL_LOCAL_SIZE for a non-MPI launcher.
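Conceptually, the requested check amounts to the following (shown in Python for brevity; the actual logic belongs in the C++ ProcessGroupXCCL setup, and using PMI_* variables to detect an MPI launch is an assumption):

```python
# Only derive CCL_LOCAL_RANK / CCL_LOCAL_SIZE when the job was NOT started by
# an MPI launcher such as IMPI; MPI launchers already give oneCCL what it needs.
import os

def maybe_set_oneccl_local_env() -> None:
    launched_by_mpi = "PMI_RANK" in os.environ or "PMI_SIZE" in os.environ  # assumed IMPI signal
    if launched_by_mpi:
        return
    if "LOCAL_RANK" in os.environ:  # torchrun-style (non-MPI) launcher
        os.environ.setdefault("CCL_LOCAL_RANK", os.environ["LOCAL_RANK"])
        os.environ.setdefault("CCL_LOCAL_SIZE", os.environ.get("LOCAL_WORLD_SIZE", "1"))
```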
std::lock_guard<std::mutex> lock(kvs_mutex);
if (kvs)
  return kvs;
std::string storeKey = "ccl_kvs";
Nit: use "xccl_kvs" as the key name here.
Summary: Quite a few times, we see the NCCL PG abort taking too long. There's no easy way to measure this, so let's add a counter to measure this across the stack. This will help us measure how much time the NCCL abort takes. Test Plan: Unit tests Reviewed By: c-p-i-o Differential Revision: D62675010 Pull Request resolved: pytorch#136067 Approved by: https://github.com/fduwjj
Pull Request resolved: pytorch#136143 Approved by: https://github.com/ezyang
…ch#136219) Changes in this PR: - Monkey-patching `F.scaled_dot_product_attention` with a lambda seems to not work in some cases. This PR avoids using a lambda. - Running `fullgraph=True` and `fullgraph=False` in the same unit test seems to cause the two cases to interfere with each other and causes error. This PR splits them into two separate unit tests. - The checks in the unit tests might not work with compile cache. This PR turns off the cache in order to have a more predictable compile behavior to do unit test on. Test commands: - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_True` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_False` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True` - `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_False` Pull Request resolved: pytorch#136219 Approved by: https://github.com/yifuwang
…31763) Fixes #ISSUE_NUMBER Recent change from PR#123487 used torch.cuda.Stream directly and this causes failure for other backends. This PR will generalize the stream handling for all backends like cuda/hpu/xpu Pull Request resolved: pytorch#131763 Approved by: https://github.com/yanboliang, https://github.com/yf225
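For illustration, one device-agnostic pattern for the stream handling described above (not necessarily the exact change in pytorch#131763) is to resolve the backend's device module instead of hard-coding torch.cuda:

```python
import torch

def make_stream(device: torch.device):
    # Works for cuda/xpu/hpu as long as the backend's device module exposes a
    # Stream class, instead of calling torch.cuda.Stream directly.
    mod = torch.get_device_module(device.type)
    return mod.Stream(device=device)
```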
After pytorch#135706 `getGatherScatterScalarType` returns exactly the same results as `scalarToMetalTypeString` , so delete the function and call `scalarToMetalTypeString` Pull Request resolved: pytorch#136295 Approved by: https://github.com/kit1980
[reverseSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reversesquareroot(with:name:)?changes=__8&language=objc) were deprecated in favor of [reciprocalSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reciprocalsquareroot(_:name:)?changes=__8&language=objc) Without it, following warnings are generated if compiled on recently released MacOS Sequoia: ``` /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:720:35: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations] 720 | rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil]; | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ | reciprocalSquareRootWithTensor /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:341:10: note: in instantiation of function template specialization 'at::native::batch_norm_backward_mps(const Tensor &, const Tensor &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, bool, double, std::array<bool, 3>)::(anonymous class)::operator()<MPSGraph *, CachedGraph *>' requested here 341 | decltype(std::declval<_Fp>()(std::declval<_Args>()...)) | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:351:19: note: while substituting deduced template arguments into function template '__invoke' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _Args = <MPSGraph *, CachedGraph *>] 351 | static decltype(std::__invoke(std::declval<_XFp>(), std::declval<_XArgs>()...)) __try_call(int); | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:357:28: note: while substituting deduced template arguments into function template '__try_call' [with _XFp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _XArgs = (no value)] 357 | using _Result = decltype(__try_call<_Fp, _Args...>(0)); | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:27:32: note: in instantiation of template class 'std::__invokable_r<void, (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, MPSGraph *, CachedGraph *>' requested here 27 | __expand_to_true<__enable_if_t<_Pred::value>...> __and_helper(int); | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:38:39: note: while substituting explicitly-specified template arguments into function template '__and_helper' 38 | using _And _LIBCPP_NODEBUG = decltype(std::__and_helper<_Pred...>(0)); | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:828:20: note: (skipping 1 context in backtrace; use -ftemplate-backtrace-limit=0 to see all) 828 | bool = _And< _IsNotSame<__remove_cvref_t<_Fp>, function>, __invokable<_Fp, _ArgTypes...> >::value> | ^ 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:841:49: note: in instantiation of default argument for '__callable<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &>' required here 841 | using _EnableIfLValueCallable = __enable_if_t<__callable<_Fp&>::value>; | ^~~~~~~~~~~~~~~~ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:851:32: note: in instantiation of template type alias '_EnableIfLValueCallable' requested here 851 | template <class _Fp, class = _EnableIfLValueCallable<_Fp>> | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:852:25: note: in instantiation of default argument for 'function<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68)>' required here 852 | _LIBCPP_HIDE_FROM_ABI function(_Fp); | ^~~~~~~~~~~~~ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68: note: while substituting deduced template arguments into function template 'function' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68), $1 = (no value)] 623 | auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) { | ^ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:24: note: while substituting deduced template arguments into function template 'LookUpOrCreateCachedGraph' [with T = CachedGraph] 623 | auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) { | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here 123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor | ^ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:745:37: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations] 745 | rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil]; | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ | reciprocalSquareRootWithTensor /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here 123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor | ^ 2 warnings generated. ``` Pull Request resolved: pytorch#136292 Approved by: https://github.com/kit1980
Summary: att Test Plan: python test/test_quantization.py -k TestObserver Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D62909821](https://our.internmc.facebook.com/intern/diff/D62909821) Pull Request resolved: pytorch#136238 Approved by: https://github.com/tarun292
Fix crash that was happening in pytorch#128095, because we were trying to extract a constant incorrectly. Pull Request resolved: pytorch#136010 Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/jansel
Fixes pytorch#127049 There's already a meta func in `meta_registrations.py` for `add_` and `sub_` methods. I added a second meta function for error checking, i.e `int.add/sub_(float)` and `bool.add/sub_(other types)` . Also the corresponding test with Dynamo passes, removed `@xfailIfTorchDynamo`. Pull Request resolved: pytorch#135864 Approved by: https://github.com/williamwen42
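For context, a small illustration of the eager-mode behavior that the added meta function is meant to reproduce under meta/fake tensors (error text paraphrased in the comments):

```python
# In-place ops may not promote the output dtype, so int.add_(float) and
# bool.add_(wider dtypes) raise in eager mode; the new meta function mirrors this.
import torch

try:
    torch.tensor(1).add_(1.5)        # int64 tensor += float scalar
except RuntimeError as e:
    print(e)  # e.g. "result type Float can't be cast to the desired output type Long"

try:
    torch.tensor(True).add_(1)       # bool tensor += int scalar
except RuntimeError as e:
    print(e)  # similar cast error for the Bool output
```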
Fixes pytorch#128551 As this has been failing in trunk for a while and there is no owner yet to fix it properly. Pull Request resolved: pytorch#136252 Approved by: https://github.com/andrewkho
…ytorch#134568) When tensor folding occurs during matmul operation returned tensor is a view. This can cause issues when matmul is used inside a custom function and such view is then returned as output. Then it cannot be modified inplace and causes errors. It can be especially problematic when after such function inplace allreduce is performed. Issue is resolved when unsafe_view is returned from matmul instead. This solution aligns matmul decomposition with eager implementation in such a way that a non view tensor is returned. Test included in this PR reproduces the issue. Pull Request resolved: pytorch#134568 Approved by: https://github.com/zou3519
…ations (pytorch#136288) Summary: To facilitate PSS-2 upgrade, this uses `ndt.NDArray` instead of `nd.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `nd.ndarray` -- a noop. In Numpy-1.24, `ndt.NDArray` a proper generic type, and without this change uses of `nd.ndarray` generate this Pyre type error: ```counterexample Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters. ``` Test Plan: Sandcastle plus visual inspection Differential Revision: D62977370 Pull Request resolved: pytorch#136288 Approved by: https://github.com/kit1980
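A one-line illustration of the annotation style described above (the function is made up; the commit uses the aliases `nd`/`ndt`, while standard `np`/`npt` are used here):

```python
# Illustrative only: NDArray is a proper generic alias (NumPy >= 1.21), while a
# bare np.ndarray annotation triggers the Pyre error quoted in the commit message.
import numpy as np
import numpy.typing as npt

def normalize(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    return x / np.linalg.norm(x)
```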
This reverts commit 083c914. Reverted pytorch#136213 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in rocm signals ([comment](pytorch#136213 (comment)))
``` CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer ``` Differential Revision: [D62964658](https://our.internmc.facebook.com/intern/diff/D62964658) Pull Request resolved: pytorch#136237 Approved by: https://github.com/weifengpy
…6142) Summary: X-link: pytorch/benchmark#2454 This adds structured logging overhead at a per compile basis to compilation metrics. To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table. Implementation notes: - If there's times we call trace_structured without a compile id, the time won't be measured. Not really a good way around that today given the compile id framework of compilation metrics. Strobelight is still the best way to measure on a per job basis. - We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number *in* compilation metrics, since there's no way to measure it before we do it(unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured_logging compilation metrics itself is small. - I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though. Test Plan: Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table: https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq Note that the sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 = 6%, which seems reasonable as the overhead for a small compilation like this. You can also look at samples for a more detailed log of this. Reviewed By: oulgen Differential Revision: D62643611 Pull Request resolved: pytorch#136142 Approved by: https://github.com/bobrenjc93
Signed-off-by: Bob Ren <[email protected]> I was chatting with @jamesjwu about strategies to learn the code and he suggested adding types to some files. This stack of PRs adds types to _sympy/functions.py Pull Request resolved: pytorch#136205 Approved by: https://github.com/Skylion007, https://github.com/jamesjwu
Pull Request resolved: pytorch#136245 Approved by: https://github.com/malfet
…s.h (pytorch#134546) When the size value equals one, the tensor stride value needs to be skipped in the comparison. @ezyang Pull Request resolved: pytorch#134546 Approved by: https://github.com/janeyx99
…nter (pytorch#136182) Summary: Add a third mode where we only print kernel names without dumping any intermediate actual tensor value info. It can be helpful in quickly identifying the troublesome kernels in CUDA IMA issues. thanks ColinPeppler and henrylhtsang for this "feature request". Test Plan: The output can look like this if set the `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3`: {F1871629091} Differential Revision: D62791371 Pull Request resolved: pytorch#136182 Approved by: https://github.com/henrylhtsang
…#136175) Pull Request resolved: pytorch#136175 Approved by: https://github.com/ezyang
Follows pytorch#134537 Pull Request resolved: pytorch#134829 Approved by: https://github.com/ezyang
…rch#136266) This fixes a subset of issues for dynamic shapes + DTensor. It's pretty easy to run into other issues - it's likely that we need pytorch#125941 to land for DTensor + dynamic shapes to work more generally. I ended up writing a test that had dynamic shape inputs but not dynamic shape outputs in order to properly test this fix Pull Request resolved: pytorch#136266 Approved by: https://github.com/ezyang, https://github.com/yf225
…a 12.6 (pytorch#136321) To prepare for future cuda updates. Pull Request resolved: pytorch#136321 Approved by: https://github.com/Skylion007, https://github.com/eqy
Summary: This logs all operations when tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if they would never be printed which is very wasteful for detailed tracing. This now gates them with an if statement to achieve the same behavior with no overhead Test Plan: ``` TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo" ``` ``` I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500. I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500). I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500. I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500). I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500 I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646 I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646 I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646 I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646 I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646 I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646 I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646 I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646 ``` Differential Revision: D62924454 Pull Request resolved: pytorch#136320 Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
…ers for a better debugging experience (pytorch#135913) Summary: The change involves passing the expired timers to the log_debug_info_for_expired_timers function after to_json() has been applied . This change is made to provide a better debugging experience for the user. Test Plan: unit tests Reviewed By: gag1jain Differential Revision: D62408767 Pull Request resolved: pytorch#135913 Approved by: https://github.com/gag1jain
…ytorch#139659) ### Motivation Today, watchdog only reports that it found a collective timeout: ``` [rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out. ``` While this is nice, it is hard to associate the error with user's program or library stack. ### This PR This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior. The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users. ### Demo [stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09). ``` TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py ``` `TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder. Output: ``` [rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 #2 bar from /data/users/kw2501/sync_async/repro.py:15 #3 foo from /data/users/kw2501/sync_async/repro.py:24 #4 main from /data/users/kw2501/sync_async/repro.py:34 #5 <module> from /data/users/kw2501/sync_async/repro.py:40 [rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 #2 baz from /data/users/kw2501/sync_async/repro.py:20 #3 foo from /data/users/kw2501/sync_async/repro.py:26 #4 main from /data/users/kw2501/sync_async/repro.py:34 #5 <module> from /data/users/kw2501/sync_async/repro.py:40 ``` From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks divert. Pull Request resolved: pytorch#139659 Approved by: https://github.com/wconstab, https://github.com/fduwjj
See pytorch#140725 (comment) Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]` ``` (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12 frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84 frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40 frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100 frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92 frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040 frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200 frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104 frame #8: 0x0000000100fccbe4 Python`run_mod + 168 frame #9: 0x0000000100fcb518 Python`pyrun_file + 164 frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256 frame pytorch#11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80 frame pytorch#12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164 frame pytorch#13: 0x0000000100ff1ce4 Python`pymain_run_file + 72 frame pytorch#14: 0x0000000100ff0f74 Python`Py_RunMain + 988 frame pytorch#15: 0x0000000100ff1564 Python`pymain_main + 304 frame pytorch#16: 0x0000000100ff1604 Python`Py_BytesMain + 40 frame pytorch#17: 0x000000019f630274 dyld`start + 2840 ``` Pull Request resolved: pytorch#141296 Approved by: https://github.com/huydhn