Failed to lower a set between bDID and b. #3488 (Open)

wujingyue opened this issue Nov 27, 2024 · 6 comments

wujingyue (Collaborator):
(I ran into this issue incidentally but haven't tried to reduce the repros or identify the reasons.)

Symptoms

Below are two minimal repros. Both run the following definition but with different parallelizations. The first test shards y but not x or z, and the second test shards x and z but not y.

x: [D]
y: [1]
z = add(x, y): [D]
diff --git a/tests/cpp/test_multidevice_sharding.cpp b/tests/cpp/test_multidevice_sharding.cpp
index 1e1ff2ea..4c5ba235 100644
--- a/tests/cpp/test_multidevice_sharding.cpp
+++ b/tests/cpp/test_multidevice_sharding.cpp
@@ -413,4 +413,60 @@ TEST_P(MultiDeviceBroadcastTest, Expanded) {

 INSTANTIATE_TEST_SUITE_P(, MultiDeviceBroadcastTest, testing::Bool());

+TEST_F(MultiDeviceTest, AddWithBroadcast_BroadcastIsParallelized) {
+  const auto num_devices = communicator_->size();
+  const auto mesh = DeviceMesh::createForNumDevices(num_devices);
+
+  auto fusion = std::make_unique<Fusion>();
+  FusionGuard fg(fusion.get());
+
+  TensorView* x = makeContigConcreteTensor({num_devices});
+  x->setDeviceMesh(mesh);
+  TensorView* y = makeContigConcreteTensor({1});
+  y->setDeviceMesh(mesh);
+  TensorView* z = add(x, y);
+
+  fusion->addInput(x);
+  fusion->addInput(y);
+  fusion->addOutput(z);
+
+  y->axis(0)->parallelize(ParallelType::DIDx);
+
+  std::vector<c10::IValue> in_tensors(
+      {at::randn({num_devices}, tensor_options),
+       at::randn({1}, tensor_options)});
+
+  FusionExecutorCache executor_cache(std::move(fusion));
+  auto out_tensors = executor_cache.runFusionWithInputs(in_tensors);
+  testValidate(
+      executor_cache.fusion(), out_tensors, in_tensors, __LINE__, __FILE__);
+}
+
+TEST_F(MultiDeviceTest, AddWithBroadcast_BroadcastIsNotParallelized) {
+  const auto num_devices = communicator_->size();
+  const auto mesh = DeviceMesh::createForNumDevices(num_devices);
+
+  auto fusion = std::make_unique<Fusion>();
+  FusionGuard fg(fusion.get());
+
+  TensorView* x = makeContigConcreteTensor({num_devices});
+  x->setDeviceMesh(mesh);
+  TensorView* y = makeContigConcreteTensor({1});
+  y->setDeviceMesh(mesh);
+  TensorView* z = add(x, y);
+
+  fusion->addInput(x);
+  fusion->addInput(y);
+  fusion->addOutput(z);
+
+  x->axis(0)->parallelize(ParallelType::DIDx);
+  z->axis(0)->parallelize(ParallelType::DIDx);
+
+  std::vector<c10::IValue> in_tensors(
+      {at::randn({1}, tensor_options), at::randn({1}, tensor_options)});
+
+  FusionExecutorCache executor_cache(std::move(fusion));
+  executor_cache.runFusionWithInputs(in_tensors);
+}
+
 } // namespace nvfuser

Both tests fail to execute and throw errors like:

C++ exception with description " INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/multidevice/communication.cpp":72, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. all buffers must have the same number of elements

Reasons for failure

Currently, isResharding doesn't map bDID and b. As a result, after InsertReshardingsPass, a set with two different shardings is inserted between y and z. This set is lowered to either an Allgather or a Scatter, neither of which can execute: the Allgather tries to concatenate D input tensors of shape [1] into an output tensor of shape [1], and the Scatter tries to split an input tensor of shape [1] across D devices.
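
For concreteness, here is a minimal standalone C++ sketch, not NVFuser's actual check in communication.cpp, of the element-count mismatch described above, assuming D = 2 devices:

// Sketch only: models the buffer-size arithmetic behind the failing Allgather.
#include <cassert>
#include <cstddef>

int main() {
  const std::size_t D = 2;                 // assumed number of devices
  const std::size_t per_device_numel = 1;  // each device holds y as a [1] tensor
  const std::size_t output_numel = 1;      // the set's output is also shape [1]
  // Allgather semantics: the output must hold one input-sized chunk per device,
  // i.e. D * per_device_numel elements. Here 1 != D, which is exactly the
  // "all buffers must have the same number of elements" failure.
  assert(output_numel == D * per_device_numel);  // fires for any D > 1
  return 0;
}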

Failed attempts

My first reaction was to let isResharding ignore the DID on broadcast dimensions.

diff --git a/csrc/multidevice/utils.cpp b/csrc/multidevice/utils.cpp
index c1943fed..18c51e08 100644
--- a/csrc/multidevice/utils.cpp
+++ b/csrc/multidevice/utils.cpp
@@ -278,8 +278,12 @@ bool haveDifferentShardings(
         return true;
       }

-      if (a == nullptr || b == nullptr) {
-        return false;
+      if (a == nullptr) {
+        return b->isBroadcast();
+      }
+
+      if (b == nullptr) {
+        return a->isBroadcast();
       }

       // Going between bDIDx{1} and iDIDx{N} doesn't trigger resharding, but

This avoided the set, and therefore the communication. However, the first test then failed with a different error:

C++ exception with description " INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/multidevice/utils.cpp":144, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Found multiple loop IterDomains with the same parallel type (deviceIdx.x): bdeviceIdx.x22{1}, bdeviceIdx.x23{1}, bdeviceIdx.x21{128}

This is because the pointwise scheduler

  1. picked z as the reference TensorView, which isn't sharded,
  2. loop-split z on intra-GPU parallel types, and
  3. replayed the same splits on the inputs, which produced multiple DIDx dimensions in y's loop domain (see the sketch after this list).
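
Below is a standalone C++ sketch, not NVFuser code, of how replaying z's splits on y's bDIDx{1} axis yields three deviceIdx.x loop IterDomains. It assumes, as the transform dump later in this thread shows for T1_g_float, that every IterDomain produced by these splits keeps the deviceIdx.x parallel type.

// Sketch only: models the split replay on y's DIDx-parallelized broadcast axis.
#include <cstdio>

struct Dim {
  long extent;
  bool broadcast;
  const char* parallel;  // e.g. "deviceIdx.x"
};

// ceilDiv-style split: outer gets ceil(extent / factor), inner gets factor.
// In this sketch both outputs inherit the broadcast flag and parallel type.
static void split(const Dim& in, long factor, Dim& outer, Dim& inner) {
  outer = {(in.extent + factor - 1) / factor, in.broadcast, in.parallel};
  inner = {factor, in.broadcast, in.parallel};
}

int main() {
  Dim y_axis{1, true, "deviceIdx.x"};  // y's bDIDx{1} axis
  Dim o1, i1, o2, i2;
  split(y_axis, 128, o1, i1);  // replay of "split by 128" -> bDIDx{1}, bDIDx{128}
  split(o1, 1, o2, i2);        // replay of "split by 1"   -> bDIDx{1}, bDIDx{1}
  // Three loop IterDomains now share the deviceIdx.x parallel type, which is
  // what the "Found multiple loop IterDomains with the same parallel type"
  // assert complains about.
  std::printf("loop domain: b%s{%ld}, b%s{%ld}, b%s{%ld}\n",
              o2.parallel, o2.extent, i2.parallel, i2.extent,
              i1.parallel, i1.extent);
  return 0;
}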

Potential solutions

  1. Let schedulers support inconsistent DIDs on broadcast dimensions.
  2. Disallow b(DID) in favor of b.
  3. Fix the lowering of a set between b(DID) and b, e.g., lower it to an alias operation instead.
wujingyue (Collaborator, Author):

cc @naoyam, @cowanmeg and @samnordmann

naoyam (Collaborator) commented Nov 27, 2024:

  • picked z as the reference TensorView, which isn't sharded,
  • loop-split z on intra-GPU parallel types,
  • replayed the same splits on the inputs, which produced multiple DIDx dimensions in y's loop domain.

Can you show the actual scheduling results of these tensors?

wujingyue (Collaborator, Author):

Can you show the actual scheduling results of these tensors?

%kernel {
T3_l_float[ iblockIdx.x26{1}, iUS27{1}, ithreadIdx.x25{128} ] ca_pos( 2 ) (DeviceMesh{0 1})
   = Set( T0_g_float[ iS30{1}, iS31{1}, iS29{128} ] (DeviceMesh{0 1}), cache_op=Streaming )
T4_l_float[ bblockIdx.x18{1}, bUS19{1}, bthreadIdx.x17{128} ] (DeviceMesh{0 1})
   = Set( T1_g_float[ bdeviceIdx.x22{1}, bdeviceIdx.x23{1}, bdeviceIdx.x21{128} ] (DeviceMesh{0 1}), cache_op=AllLevels )
T5_l_float[ iblockIdx.x14{1}, iUS15{1}, ithreadIdx.x13{128} ] ca_pos( 3 ) produce_pos( 2 ) (DeviceMesh{0 1})
   = T3_l_float[ iblockIdx.x26{1}, iUS27{1}, ithreadIdx.x25{128} ] ca_pos( 2 ) (DeviceMesh{0 1})
   + T4_l_float[ bblockIdx.x18{1}, bUS19{1}, bthreadIdx.x17{128} ] (DeviceMesh{0 1});
T2_g_float[ iblockIdx.x10{1}, iUS11{1}, ithreadIdx.x9{128} ] ca_pos( 2 ) produce_pos( 3 ) (DeviceMesh{0 1})
   = Set( T5_l_float[ iblockIdx.x14{1}, iUS15{1}, ithreadIdx.x13{128} ] ca_pos( 3 ) produce_pos( 2 ) (DeviceMesh{0 1}), cache_op=Streaming )

TransformPrinter :
T0_g_float[ iS30{1}, iS31{1}, iS29{128} ] (DeviceMesh{0 1})
 logical domain : (iS0{2})
 contiguity: t
  Split: iS0{2} by factor 128 -> iS28{1}, iS29{128}
  Split: iS28{1} by factor 1 -> iS30{1}, iS31{1}
 loop domain : (iS30{1}, iS31{1}, iS29{128})
T3_l_float[ iblockIdx.x26{1}, iUS27{1}, ithreadIdx.x25{128} ] ca_pos( 2 ) (DeviceMesh{0 1})
 logical domain : (iS5{2})
 contiguity: t
  Split: iS5{2} by factor 128 -> iS24{1}, ithreadIdx.x25{128}
  Split: iS24{1} by factor 1 -> iblockIdx.x26{1}, iUS27{1}
 loop domain : (iblockIdx.x26{1}, iUS27{1}, ithreadIdx.x25{128})
T1_g_float[ bdeviceIdx.x22{1}, bdeviceIdx.x23{1}, bdeviceIdx.x21{128} ] (DeviceMesh{0 1})
 logical domain : (bdeviceIdx.x1{1})
 contiguity: n
  Split: bdeviceIdx.x1{1} by factor 128 -> bdeviceIdx.x20{1}, bdeviceIdx.x21{128}
  Split: bdeviceIdx.x20{1} by factor 1 -> bdeviceIdx.x22{1}, bdeviceIdx.x23{1}
 loop domain : (bdeviceIdx.x22{1}, bdeviceIdx.x23{1}, bdeviceIdx.x21{128})
T4_l_float[ bblockIdx.x18{1}, bUS19{1}, bthreadIdx.x17{128} ] (DeviceMesh{0 1})
 logical domain : (bdeviceIdx.x6{1})
 contiguity: n
  Split: bdeviceIdx.x6{1} by factor 128 -> bdeviceIdx.x16{1}, bthreadIdx.x17{128}
  Split: bdeviceIdx.x16{1} by factor 1 -> bblockIdx.x18{1}, bUS19{1}
 loop domain : (bblockIdx.x18{1}, bUS19{1}, bthreadIdx.x17{128})
T5_l_float[ iblockIdx.x14{1}, iUS15{1}, ithreadIdx.x13{128} ] ca_pos( 3 ) produce_pos( 2 ) (DeviceMesh{0 1})
 logical domain : (iS2{2})
 contiguity: t
  Split: iS2{2} by factor 128 -> iS12{1}, ithreadIdx.x13{128}
  Split: iS12{1} by factor 1 -> iblockIdx.x14{1}, iUS15{1}
 loop domain : (iblockIdx.x14{1}, iUS15{1}, ithreadIdx.x13{128})
T2_g_float[ iblockIdx.x10{1}, iUS11{1}, ithreadIdx.x9{128} ] ca_pos( 2 ) produce_pos( 3 ) (DeviceMesh{0 1})
 logical domain : (iS7{2})
 contiguity: t
  Split: iS7{2} by factor 128 -> iS8{1}, ithreadIdx.x9{128}
  Split: iS8{1} by factor 1 -> iblockIdx.x10{1}, iUS11{1}
 loop domain : (iblockIdx.x10{1}, iUS11{1}, ithreadIdx.x9{128})
} // %kernel

See T1_g_float, which corresponds to y.

naoyam (Collaborator) commented Dec 6, 2024:

Thanks. Does the second pattern work fine with the change to haveDifferentShardings?

For the first pattern, how is the parallelization interpreted? Since z is not sharded, is the add operation executed by all devices?

wujingyue (Collaborator, Author):

For the first pattern, how is the parallelization interpreted? Since z is not sharded, is the add operation executed by all devices?

That's right. I forgot where this pattern actually came up, probably in dropout, which is replicated until sequence parallelism is enabled.

wujingyue (Collaborator, Author):

Does the second pattern work fine with the change to haveDifferentShardings?

Yes. I guess this is because the pointwise scheduler picked z as the reference TV, which has a DID in it, so the schedule it proposes skips the DID.

wujingyue added a commit that referenced this issue Dec 7, 2024
a single-device ReduceScatter