
internal review #6

Open: wants to merge 33 commits into main

Conversation

Chao1Han (Owner):

Fixes #ISSUE_NUMBER

auto currentTimepoint = std::chrono::steady_clock::now();
auto timeElapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
currentTimepoint - workStartTime_);
std::chrono::milliseconds opTimeout = std::chrono::milliseconds(60000);
Collaborator:
where do you use it?
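
For context, a computation like this is usually consumed by a blocking wait that compares the elapsed time against the op timeout. A minimal sketch, assuming a hypothetical `checkTimeout` helper on the work object (names and the fixed 60s timeout are illustrative, not the PR's actual code):

```
void WorkXCCL::checkTimeout() {
  auto currentTimepoint = std::chrono::steady_clock::now();
  auto timeElapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
      currentTimepoint - workStartTime_);
  // Assumed fixed 60s timeout, matching the snippet above.
  std::chrono::milliseconds opTimeout = std::chrono::milliseconds(60000);
  if (timeElapsed >= opTimeout) {
    std::string exceptionMsg = c10::str(
        "Work ran for ", timeElapsed.count(), " milliseconds before timing out.");
    TORCH_CHECK(false, exceptionMsg);
  }
}
```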

@@ -67,17 +67,15 @@ ccl::reduction getXcclReduceOp(const ReduceOp& reduceOp, at::Tensor& input) {
return xcclOps.at(reduceOp);
} catch (const std::out_of_range&) {
switch (reduceOp) {
Collaborator:
No need to switch
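
A minimal sketch of what dropping the switch could look like, assuming `xcclOps` maps `ReduceOp` to `ccl::reduction` (illustrative only, not the PR's code):

```
ccl::reduction getXcclReduceOp(const ReduceOp& reduceOp, at::Tensor& input) {
  try {
    return xcclOps.at(reduceOp);
  } catch (const std::out_of_range&) {
    // Report the unsupported op directly instead of switching over every enum value.
    C10_THROW_ERROR(ValueError, "Unsupported ReduceOp for XCCL");
  }
}
```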

: Work(rank, opType, "profilingTitle", inputs),
device_(device),
workStartTime_(std::chrono::steady_clock::now()) {
unsigned char enable_timing = 0;
Collaborator:
If you always set it as 0, then we don't need to keep it, right?

Owner Author:
Yes. Defining this variable serves as a form of annotation: it tells reviewers and users that 0 is the enable_timing state, which is meaningful.
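
If the flag is kept purely as documentation, a named constant may express that intent more directly. A sketch, assuming the flag feeds an `at::xpu::XPUEvent` (the event type and the `xcclEndEvent_` member name are assumptions for illustration):

```
// Timing is intentionally disabled for XCCL end events for now.
constexpr unsigned char kEnableTiming = 0;
xcclEndEvent_ = std::make_shared<at::xpu::XPUEvent>(kEnableTiming);
```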

"Work ran for ",
timeElapsed.count(),
" milliseconds before timing out.");
TORCH_CHECK(false, exceptionMsg)
Collaborator:
TORCH_CHECK(false, exceptionMsg);
abort();

} // namespace

static std::mutex xcclCommDevIdxMapMutex;
static std::unordered_map<std::shared_ptr<xcclComm_t>, int> xcclCommDevIdxMap;
Collaborator:
Those static variables are not used in your code. Please check.
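
For reference, the analogous maps in the NCCL backend are populated when a communicator is created so it can be looked up (and aborted) later. A sketch of how these would be used if they are kept (assumed, not the PR's code):

```
{
  // Register the freshly created communicator against its device index.
  std::lock_guard<std::mutex> lock(xcclCommDevIdxMapMutex);
  xcclCommDevIdxMap.emplace(XCCLComm, device.index());
}
```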

blockingWait_ = getCvarBool(TORCH_XCCL_BLOCKING_WAIT, false);
init();

// Intel oneCCL requires passing CCL_LOCAL_RANK and CCL_LOCAL_SIZE for non-MPI
Collaborator:
Please add more comments explaining why we use LOCAL_RANK and LOCAL_WORLD_SIZE.
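
A sketch of the kind of explanation and code the comment is asking for (assumed, not verbatim from the PR): oneCCL derives process placement from MPI, so with a non-MPI launcher the local rank and local size must be passed explicitly via CCL_LOCAL_RANK / CCL_LOCAL_SIZE; LOCAL_RANK and LOCAL_WORLD_SIZE are what torchrun already exports per process.

```
#include <cstdlib>

// oneCCL needs CCL_LOCAL_RANK / CCL_LOCAL_SIZE when not launched via MPI,
// so forward the values that torchrun already provides (POSIX setenv).
if (const char* localRank = std::getenv("LOCAL_RANK")) {
  setenv("CCL_LOCAL_RANK", localRank, /*overwrite=*/1);
}
if (const char* localWorldSize = std::getenv("LOCAL_WORLD_SIZE")) {
  setenv("CCL_LOCAL_SIZE", localWorldSize, /*overwrite=*/1);
}
```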

std::shared_ptr<xcclComm_t> ProcessGroupXCCL::getXCCLComm(
const std::string& deviceKey,
at::Device& device) {
if (deviceKey.empty()) {
Collaborator:
C10_THROW_ERROR_WITH
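
If the intent is the c10 error-macro path, a minimal sketch might read as follows (assumed; the exact macro the reviewer means is unclear, so this uses `C10_THROW_ERROR` with `DistBackendError` as in the NCCL backend):

```
if (deviceKey.empty()) {
  C10_THROW_ERROR(
      DistBackendError,
      "Not able to create/get the XCCL Communicator since the devices are empty");
}
```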

devXCCLCommMap_.emplace(deviceKey, XCCLComm);
}

xcclStreamsMap_.emplace(deviceKey, std::move(stream));
Collaborator:
So xcclEventsMap is not needed?

Owner Author:
I'll restore it.

PreProcess pre,
PostProcess post,
OpType opType) {
using traits = function_traits<Fn>;
Collaborator:
Which collective actually requires the attribute?

Owner Author:
Yes, allgather hits a build error without it.

for (const auto i : c10::irange(inputs.size())) {
c10::xpu::XPUCachingAllocator::recordStream(
inputs[i].storage().data_ptr(), stream);
fn(inputs[i], outputs[i], attr, *comm, stream);
Collaborator:
Add a comment explaining the output recordStream call.
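
A sketch of the requested comment plus the matching output-side call (assumed; it mirrors the input-side `recordStream` call already in the snippet):

```
// Record the output's storage on the communication stream so the caching
// allocator does not reuse or free this memory until the collective that
// was enqueued on `stream` has finished.
c10::xpu::XPUCachingAllocator::recordStream(
    outputs[i].storage().data_ptr(), stream);
```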

false, "ProcessGroupXCCL::WorkXCCL::isSuccess not implemented");
}

void abort() override {
Collaborator:
abort here?


bool isCompleted() override;

bool isSuccess() const override {
Collaborator:
remove

case ReduceOp::BXOR:
C10_THROW_ERROR(ValueError, "Cannot use ReduceOp.BXOR with NCCL");
C10_THROW_ERROR(
Collaborator:
don't change NCCL now.


c10::impl::VirtualGuardImpl impl(device.type());
c10::Stream stream = impl.getStream(device);
sycl::queue& q = c10::xpu::XPUStream(stream).queue();
Collaborator:
It's a big bug to use the current stream as the communication stream.
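
A minimal sketch of the usual fix, assuming an XPU stream pool analogous to the CUDA one (illustrative, not the PR's code): take a dedicated stream for communication and cache it per device key instead of reusing the caller's current stream.

```
// Dedicated communication stream, kept alive in xcclStreamsMap_.
auto xcclStream = c10::xpu::getStreamFromPool(
    /*isHighPriority=*/false, device.index());
xcclStreamsMap_.emplace(deviceKey, xcclStream);
sycl::queue& q = xcclStream.queue();
```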

int rank,
OpType opType,
const std::optional<std::vector<at::Tensor>>& inputs)
: Work(rank, opType, "profilingTitle", inputs),
Collaborator:
This needs to change.
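
A sketch of the likely fix (assumed): forward the real `profilingTitle` argument rather than hard-coding the string literal.

```
WorkXCCL(
    at::Device& device,
    int rank,
    OpType opType,
    const char* profilingTitle = nullptr,
    const std::optional<std::vector<at::Tensor>>& inputs = std::nullopt)
    : Work(rank, opType, profilingTitle, inputs),
      device_(device),
      workStartTime_(std::chrono::steady_clock::now()) {}
```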

@@ -126,6 +131,13 @@ class TORCH_API ProcessGroup : public torch::CustomClassHolder {
return backendType_;
};

inline bool backendSupportsSequenceNumbers(BackendType backendType) {
if (backendType == BackendType::GLOO || backendType == BackendType::NCCL ||
backendType == BackendType::XCCL || backendType == BackendType::UCC)
Collaborator:
Are you sure we need to support this sequence number?

Owner Author:
The sequence number is used by RECORD_PARAM_COMMS, so we need it.

@@ -180,7 +181,8 @@ def skip_if_lt_x_gpu(x):
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
if torch.cuda.is_available() and torch.cuda.device_count() >= x:
if (torch.cuda.is_available() and torch.cuda.device_count() >= x) or \
Collaborator:
Don't use an `if` for the accelerator-related check.

Chao1Han pushed a commit that referenced this pull request Dec 16, 2024
See pytorch#140725 (comment)
Running `torch.mps.synchronize()` after a Metal kernel resulted in an infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]`
```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12
    frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84
    frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40
    frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100
    frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92
    frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040
    frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200
    frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104
    frame #8: 0x0000000100fccbe4 Python`run_mod + 168
    frame #9: 0x0000000100fcb518 Python`pyrun_file + 164
    frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256
    frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80
    frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164
    frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72
    frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988
    frame #15: 0x0000000100ff1564 Python`pymain_main + 304
    frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40
    frame #17: 0x000000019f630274 dyld`start + 2840
```

Pull Request resolved: pytorch#141296
Approved by: https://github.com/huydhn