
Add distributed backend (XCCL) #1105

Open · Chao1Han wants to merge 21 commits into main
Conversation

@Chao1Han (Contributor) commented Nov 20, 2024

Motivation:

As described in the Intel distributed support RFC (pytorch/pytorch#141741), this PR integrates the Intel GPU distributed backend into PyTorch via torch-xpu-ops.

Design:

USE_XCCL is set to ON by default; users can manually set it to OFF to disable XCCL compilation. The oneCCL installation is first searched for in /opt/intel/oneapi/ccl/latest; if it is not found, the CCL_ROOT variable set by the user after sourcing oneCCL is used instead. The USE_C10D_XCCL variable is intended to align with the environment variables of the other distributed backends.
The oneCCL library is linked into torch_xpu, in line with the other distributed backends.


include(${CMAKE_ROOT}/Modules/FindPackageHandleStandardArgs.cmake)

set(XCCL_ROOT $ENV{CCL_ROOT})


How do you get CCL_ROOT? I don't think you can assume it will be set after sourcing oneCCL.

@Chao1Han (Contributor, Author) replied:

It is set automatically after sourcing the oneCCL environment, and as far as I remember, oneCCL updates do not affect this flag.

"Not able to create/get "
"XCCL Communicator since the devices are empty ");
{
// todo: why do we need mutex here?


I think we followed the same code logic as NCCL, right?
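
For context, below is a minimal self-contained sketch (standard C++ only; the type and member names are illustrative, not the PR's actual code) of the NCCL-style pattern being referred to: a communicator is created lazily per device key and cached in a map, and the mutex serializes that lookup/creation because collectives may be issued from multiple threads.

#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

struct XcclComm {}; // stand-in for the real ccl::communicator wrapper

class CommCache {
 public:
  // Return the cached communicator for a device key, creating it on first use.
  std::shared_ptr<XcclComm> getOrCreate(const std::string& deviceKey) {
    std::lock_guard<std::mutex> lock(mutex_); // guard concurrent create/get
    auto& comm = comms_[deviceKey];
    if (!comm) {
      comm = std::make_shared<XcclComm>(); // expensive init happens exactly once
    }
    return comm;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, std::shared_ptr<XcclComm>> comms_;
};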

m.impl("recv_any_source_", recv_any_source_XPU);
m.impl("reduce_", reduce_XPU);
m.impl("broadcast_", broadcast_XPU);
m.impl("allreduce_", allreduce_XPU);


In this PR, we only have allreduce implemented. Then should we only register allreduce here?
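
To illustrate that suggestion, a minimal sketch of registering only the implemented collective. It assumes, as for other backends, that these registrations live in a TORCH_LIBRARY_IMPL(c10d, XPU, m) block, and that allreduce_XPU is declared elsewhere in this PR with the schema the dispatcher expects for c10d::allreduce_ (not reproduced here).

#include <torch/library.h>

TORCH_LIBRARY_IMPL(c10d, XPU, m) {
  // Register only the collective implemented in this PR.
  m.impl("allreduce_", allreduce_XPU);
  // broadcast_, reduce_, recv_any_source_, ... can be added here as they land.
}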

@Chao1Han Chao1Han changed the title [WIP] Add distributed backend (XCCL) Add distributed backend (XCCL) Dec 13, 2024
bool is_reduction_op = false) {
  TORCH_CHECK(
      !(isFloat8Type(type) && is_reduction_op),
      "Float8 dtypes are not currently supported for XCCL reductions");


For non-reduction collective, please add mapping from FP8 to ccl data type.
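
A hedged sketch of one way to address this: keep rejecting FP8 for reductions, but map it to a raw 8-bit oneCCL type so non-reduction collectives such as allgather can transport it. The function name toXcclDataType and the choice of ccl::datatype::uint8 for FP8 transport are illustrative assumptions, not the PR's final code.

#include <map>
#include <ATen/ATen.h>
#include <oneapi/ccl.hpp>

ccl::datatype toXcclDataType(at::ScalarType type, bool is_reduction_op = false) {
  if (isFloat8Type(type)) {
    // Reductions on FP8 stay unsupported; other collectives move raw bytes.
    TORCH_CHECK(
        !is_reduction_op,
        "Float8 dtypes are not currently supported for XCCL reductions");
    return ccl::datatype::uint8; // bit-exact transport only, no arithmetic
  }
  static const std::map<at::ScalarType, ccl::datatype> kMap = {
      {at::kFloat, ccl::datatype::float32},
      {at::kDouble, ccl::datatype::float64},
      {at::kBFloat16, ccl::datatype::bfloat16},
      {at::kBool, ccl::datatype::uint8}, // bool is exchanged as one byte
  };
  auto it = kMap.find(type);
  TORCH_CHECK(it != kMap.end(), "Unsupported dtype for XCCL: ", type);
  return it->second;
}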

{at::kDouble, ccl::datatype::float64},
{at::kBFloat16, ccl::datatype::bfloat16},
{at::kBool, ccl::datatype::uint8},
// use for allgather


Please refine the description.

@gujinghui (Contributor):
LGTM

@zhangxiaoli73:
@EikanWang Could you please help review this PR?

@EikanWang (Contributor) left a comment:


Why do we not reuse PyTorch test cases?


include(${CMAKE_ROOT}/Modules/FindPackageHandleStandardArgs.cmake)

set(XCCL_ROOT $ENV{ONEAPI_ROOT}/ccl/latest)

Does it mean that ONEAPI_ROOT is a must-have environment variable?

No, we only require that oneCCL is sourced for building, so CCL_ROOT is the must-have environment variable. Updated in the latest revision.

@Chao1Han (Contributor, Author) replied:

> Why do we not reuse PyTorch test cases?

This PR only implements allreduce, so it adds a simple test case. In the long term, once all operations are implemented, we will have one or two test files to validate the basic operations. Other tests, such as FSDP and DTensor, will directly reuse PyTorch's unit tests.

@EikanWang (Contributor) replied:

Then I suggest reusing the PyTorch test cases and disabling the ones that are not applicable. Please check with Daisy.
