
Fix bug in scatter #2245

Merged: merged 2 commits into from May 15, 2024
Conversation

@samnordmann (Collaborator) commented May 14, 2024

Fixes a subtle bug, exposed by #2168

@samnordmann samnordmann requested review from wujingyue and cowanmeg May 14, 2024 18:47
@samnordmann (Collaborator, Author) commented May 14, 2024

I find this bug really counter-intuitive!

I thought it was already checked here that the buffers were contiguous...

@samnordmann (Collaborator, Author)

!build --dist

@cowanmeg (Collaborator)

> I find this bug really counter-intuitive!
>
> I thought it was already checked here that the buffers were contiguous...

Huh...I don't know how this didn't come up earlier...

@cowanmeg (Collaborator)

We have assumed that the input tensor is contiguous when lowering comms (we probably should have added this contiguous() call before), but our tests all have contiguous aten inputs, so I'm not certain where the non-contiguity got introduced...
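For illustration, here is a minimal PyTorch sketch of the guard such a fix adds; the helper name is hypothetical, not the PR's actual code:

```python
import torch

# Hypothetical helper illustrating the guard a comm lowering would add.
def ensure_contiguous_for_comm(t: torch.Tensor) -> torch.Tensor:
    # torch.Tensor.contiguous() returns the tensor itself when it is
    # already contiguous, and otherwise copies into a dense buffer,
    # which is what collectives such as scatter expect.
    return t.contiguous()
```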

@samnordmann samnordmann requested a review from wujingyue May 14, 2024 20:09
@wujingyue (Collaborator)

> We have assumed that the input tensor is contiguous when lowering comms (we probably should have added this contiguous() call before), but our tests all have contiguous aten inputs, so I'm not certain where the non-contiguity got introduced...

This is exposed by #2168, but the root cause, I believe, is that insertReshardings doesn't set the allocation domain properly. IIRC, you and @jjsjann123 noticed this potential problem in another PR, but we never got a chance to fix it properly.

Below is the fusion IR for the two stages. Although a Set.Permute is inserted, the output of the first stage doesn't have an allocation domain, so Set.Permute is allowed to produce a non-contiguous tensor (in this case, strides=[3, 12, 1]) for speed.

Inputs:
  T0_g[ iS0{3}, iS1{4}, iS2{3}, iS3{5} ] (DeviceMesh{1, }), float
Outputs:
  T4_g[ iS15{4}, iS14{3}, iS16{3} ] (DeviceMesh{1, }), float

%kernel_math {
T1_l[ iS4{3}, iS5{4}, iS6{3}, rS7{5} ] (DeviceMesh{1, })
   = reduction( T0_g[ iS0{3}, iS1{4}, iS2{3}, iS3{5} ] (DeviceMesh{1, }), op = add, initial value = float(0), allreduce = false )
T4_g[ iS15{4}, iS14{3}, iS16{3} ] (DeviceMesh{1, })
   = Set.Permute( T1_l[ iS4{3}, iS5{4}, iS6{3}, rS7{5} ] (DeviceMesh{1, }), cache_op=Streaming )
}

sizes:   [4, 3, 3]
strides: [3, 12, 1]
Inputs:
  T5_g[ ideviceIdx.x17{4}, iS18{3}, iS19{3} ] (DeviceMesh{0, 1, 2, 3, }), float
Outputs:
  T3_g[ iS11{3}, ideviceIdx.x12{4}, iS13{3} ] (DeviceMesh{0, 1, 2, 3, }), float

%kernel_math {
T6_l[ iS21{3}, ideviceIdx.x20{4}, iS22{3} ] (DeviceMesh{0, 1, 2, 3, })
   = Set.Permute( T5_g[ ideviceIdx.x17{4}, iS18{3}, iS19{3} ] (DeviceMesh{0, 1, 2, 3, }), cache_op=Streaming )
T3_g[ iS11{3}, ideviceIdx.x12{4}, iS13{3} ] (DeviceMesh{0, 1, 2, 3, })
   = T6_l[ iS21{3}, ideviceIdx.x20{4}, iS22{3} ] (DeviceMesh{0, 1, 2, 3, })
   + T6_l[ iS21{3}, ideviceIdx.x20{4}, iS22{3} ] (DeviceMesh{0, 1, 2, 3, });
}
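
The non-contiguous strides in the dump can be reproduced directly; here is a minimal sketch in plain PyTorch (assuming nothing beyond stock torch) showing how permuting the first stage's contiguous [3, 4, 3] output into [4, 3, 3] yields exactly the strides [3, 12, 1] printed above:

```python
import torch

# Shape of the first stage's output after the reduction (T1 above).
t1 = torch.randn(3, 4, 3)          # contiguous, strides (12, 3, 1)

# With no allocation domain pinned on the output, Set.Permute is free to
# return a strided view rather than a dense copy; permute() does the same.
t4 = t1.permute(1, 0, 2)           # shape [4, 3, 3]

assert t4.stride() == (3, 12, 1)   # the strides printed above
assert not t4.is_contiguous()      # what the scatter lowering tripped over
```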

I think our options are:

  1. Merge this PR, which now sounds like a band-aid because it leaves performance on the table. In the above example, doing reduction+permute in one kernel is likely faster than in two kernels.
  2. Fix insertResharding to set allocation domains properly. This sounds like the right fix.
  3. Revert Allocation order refactor #2168, which isn't the root cause but is unfortunately the trigger.

I suspect option 2 will take a while, and it's a bad idea to leave CI broken, so I'll cross that out. Option 3 is the safest because #2168 could have triggered other failure cases that we are not aware of just yet. Option 1 is suboptimal, but as long as we (👀 @cowanmeg) fix the root cause soon, we should be fine. Wdyt?

@jjsjann123 (Collaborator)

Oops. sorry about that. 😛

@samnordmann ignore the thunder tests; there's something going on with transformer_engine.

@wujingyue wujingyue merged commit 1a7c6f6 into NVIDIA:main May 15, 2024
34 of 37 checks passed
@cowanmeg (Collaborator)

Yes, I agree option 2 is the correct fix. Plus, we need to set the allocation domain for DID parallelism on the leaf domain to work. Hopefully, I can find some time soon to work on this!

cowanmeg added a commit that referenced this pull request May 30, 2024
Sets allocation domain of sharded tensors during the pass
`propagateShardingsAndSetAllocationDomain`.
The two passes are merged in an attempt to reduce the number of passes over
all expressions in the fusion.

Allocation domain is set to the tv's leaf domain. Since presegmentation
passes and scheduling occur after the sharding passes, the leaf domain
is identical to the rfactor domain. Once DID parallelization of the leaf
domain is allowed, the leaf and rfactor domains will no longer be the same.

This will avoid issues such as
#2245 (comment) and
allow turning on the `AllocationDomainPass` presegmentation pass for
distributed matmul tests.
protonu pushed a commit that referenced this pull request May 30, 2024 (same commit message as above).
4 participants