Add TT, TN, NT, NN tests for HopperMultipleMatmulScheduler #3310
I created TN test with |
d8bc1a6 to 7c8f375
LGTM, although there is a fair amount of code duplication that parametrization could reduce. As a side note, we could also test the allocation domain here.
auto tv0 = makeContigConcreteTensor({-1, -1, 1}, dtype); // A [M, K, b]
auto tv1 = makeContigConcreteTensor({1, -1, -1}, dtype); // B [b, K, N]
In this case the MmaOp input order is MKN, and the output is reordered to MNK by a root->logical reordering. In the TN case there is no such reordering because the logical order of the inputs is already MNK. Note that the inputs can also have allocation domains set. Maybe we could parametrize over all the combinations, i.e. the orders of the allocation and logical domains of the inputs?
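One way to picture the suggested parametrization is to enumerate every combination of operand layout and allocation order up front. The sketch below is plain C++ and does not use nvFuser's API: `Layout`, `AllocOrder`, `MatmulCase`, `allCases`, and `caseName` are hypothetical names introduced only for illustration, and it assumes each 2D operand has just two possible allocation orders (matching the logical order, or swapped).

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for test parameters; not nvFuser types.
enum class Layout { TT, TN, NT, NN };
enum class AllocOrder { Logical, Swapped };

struct MatmulCase {
  Layout layout;      // logical ordering of the two operands
  AllocOrder a_alloc; // allocation order of operand A
  AllocOrder b_alloc; // allocation order of operand B
};

// Cartesian product: 4 layouts x 2 orders for A x 2 orders for B.
std::vector<MatmulCase> allCases() {
  std::vector<MatmulCase> cases;
  for (Layout layout : {Layout::TT, Layout::TN, Layout::NT, Layout::NN}) {
    for (AllocOrder a : {AllocOrder::Logical, AllocOrder::Swapped}) {
      for (AllocOrder b : {AllocOrder::Logical, AllocOrder::Swapped}) {
        cases.push_back({layout, a, b});
      }
    }
  }
  return cases;
}

// Builds a readable name for a case, e.g. to label a parametrized test.
std::string caseName(const MatmulCase& c) {
  static const char* const kLayoutNames[] = {"TT", "TN", "NT", "NN"};
  std::string name = kLayoutNames[static_cast<int>(c.layout)];
  name += (c.a_alloc == AllocOrder::Logical) ? "_Alogical" : "_Aswapped";
  name += (c.b_alloc == AllocOrder::Logical) ? "_Blogical" : "_Bswapped";
  return name;
}
```

A list like this could be fed to a value-parametrized test so that a single test body covers every combination instead of one hand-written test per layout.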
> Maybe we could parametrize all the combinations i.e. the orders of the allocation and logical domains of the inputs?

When the allocation and logical domains differ, would the input operands no longer be contiguous, concrete tensors?
They can be contiguous, concrete, and have permuted allocation domain. Contiguity is with respect to allocation domain, so e.g. a tensor of logical shape [5, 7] and stride [7, 1] is contiguous, but so is one with logical shape [5, 7] and stride [1, 5]. The latter would correspond to having a swapped allocation domain in nvFuser.
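The two stride examples above can be checked with a small standalone function. This is a minimal sketch, not nvFuser code: `isContiguous` and its `alloc_order` parameter are hypothetical, with `alloc_order` listing the logical dimension indices from outermost to innermost as a stand-in for an allocation domain.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if `strides` describe a dense layout of `shape` when the
// logical dimensions are laid out in `alloc_order` (outermost first).
// `alloc_order` is a hypothetical stand-in for an allocation domain.
bool isContiguous(
    const std::vector<int64_t>& shape,
    const std::vector<int64_t>& strides,
    const std::vector<std::size_t>& alloc_order) {
  assert(shape.size() == strides.size());
  assert(shape.size() == alloc_order.size());
  int64_t expected = 1; // stride the innermost dimension must have
  // Walk from the innermost allocated dimension to the outermost.
  for (auto it = alloc_order.rbegin(); it != alloc_order.rend(); ++it) {
    const std::size_t dim = *it;
    if (shape[dim] == 1) {
      continue; // the stride of a size-1 dimension does not matter
    }
    if (strides[dim] != expected) {
      return false;
    }
    expected *= shape[dim];
  }
  return true;
}
```

With this definition, shape `[5, 7]` with stride `[7, 1]` is contiguous for `alloc_order = {0, 1}`, and shape `[5, 7]` with stride `[1, 5]` is contiguous for the swapped order `{1, 0}` but not for `{0, 1}`.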
!test
This PR modifies `schedulePrologues` to use TMA loads to move mma operands to shared memory. Stacked on #3324 and #3310.

## Details
1. Input operands are loaded into shared memory via `CpAsyncBulkTensorTile` LoadStoreOp.
2. Replace `LdMatrix` operation with basic set.
3. Modified `scheduleOperandSmemStores` to apply swizzling to avoid bank conflicts.
4. Refactor `swizzleSharedMemory` by moving the analysis component to a separate function named `analyzeSwizzleSharedMemory`.
5. Create `tmaSwizzleSharedMemory` function that uses `analyzeSwizzleSharedMemory` and then finds the appropriate tma swizzle format.
6. Disable loop rotation. There is an issue with tma loads and circular buffering. Not sure if loop rotation is required for hopper matmul.
7. Expect hopper matmul tests to give incorrect results.
This PR creates four tests for the `HopperMultiMatmulScheduler`. Each test covers a different matmul layout (TT, TN, NT, and NN), where the input arguments are already broadcast.