
Add TT, TN, NT, NN tests for HopperMultipleMatmulScheduler #3310

Merged: 5 commits merged into main on Nov 3, 2024

Conversation

@rdspring1 (Collaborator) commented Oct 30, 2024

This PR creates four tests for the HopperMultipleMatmulScheduler. Each test covers a different matmul layout (TT, TN, NT, NN) where the input arguments are already broadcasted.
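For context, a minimal sketch of how one such layout (TN, as an illustration) could set up already-broadcasted operands with the existing test helpers (`makeContigConcreteTensor`, `fusedMultiplySum`); the fusion setup, `dtype`, shapes, and assertions here are assumptions and may differ from the tests added in this PR:

```cpp
// Hypothetical TN-layout setup; -1 marks a symbolic extent, 1 a broadcast
// dimension, and the last axis (K) is contracted.
auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());

auto tv0 = makeContigConcreteTensor({-1, 1, -1}, dtype); // A [M, b, K]
auto tv1 = makeContigConcreteTensor({1, -1, -1}, dtype); // B [b, N, K]
fusion->addInput(tv0);
fusion->addInput(tv1);

auto tv2 = fusedMultiplySum(tv0, tv1, {-1}); // contract over K
fusion->addOutput(tv2);
```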

@rdspring1 rdspring1 marked this pull request as ready for review October 30, 2024 05:26
@rdspring1 rdspring1 changed the title from "Add single test for HopperMultipleMatmulScheduler" to "Add TN and NT tests for HopperMultipleMatmulScheduler" Oct 30, 2024
@rdspring1 (Collaborator, Author) commented:

I created a TN test with MNK ordering, added custom MatmulParams, and kept the original NT tests because two tests are better than one.
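For illustration only, a sketch of what custom MatmulParams for a Hopper test might look like; the field names follow nvFuser's MatmulParams, but the specific macro and tile sizes below are placeholder assumptions, not the values used in this PR:

```cpp
// Hypothetical MatmulParams setup for a Hopper scheduler test.
MatmulParams mparams;
mparams.mma_macro = MmaMacro::Hopper_64_64_16;           // wgmma instruction shape (assumed)
mparams.tile_sizes.cta_tile = GemmTile{128, 128, 16};    // CTA tile (M, N, K) (assumed)
mparams.tile_sizes.warp_tile = GemmTile{64, 64, 16};     // warp-group tile (M, N, K) (assumed)
```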

@rdspring1 rdspring1 requested a review from jacobhinkle October 30, 2024 18:44
@rdspring1 rdspring1 force-pushed the hopper_matmul_tests branch from d8bc1a6 to 7c8f375 on November 1, 2024 02:54
@rdspring1 rdspring1 changed the base branch from main to ampere_guard November 1, 2024 02:58
@rdspring1 rdspring1 changed the title from "Add TN and NT tests for HopperMultipleMatmulScheduler" to "Add TT, TN, NT, NN tests for HopperMultipleMatmulScheduler" Nov 1, 2024
@jacobhinkle (Collaborator) left a comment


LGTM, although there is a fair amount of code duplication that could be reduced with parametrization. Also, just a note that we could also test allocation domains here.

Comment on lines +3086 to +3087
auto tv0 = makeContigConcreteTensor({-1, -1, 1}, dtype); // A [M, K, b]
auto tv1 = makeContigConcreteTensor({1, -1, -1}, dtype); // B [b, K, N]
Collaborator comment:

In this case the MmaOp input order is MKN, and the output gets reordered with a root->logical reordering to MNK. In the TN case there's no such reordering because the logical order of inputs is MNK. Note that we can also have allocation domains set on the inputs. Maybe we could parametrize all the combinations i.e. the orders of the allocation and logical domains of the inputs?

@rdspring1 (Collaborator, Author) replied:

> Maybe we could parametrize all the combinations i.e. the orders of the allocation and logical domains of the inputs?

When the allocation and logical domains are different, would the input operands no longer be contiguous, concrete tensors?

@jacobhinkle (Collaborator) commented Nov 2, 2024

They can be contiguous, concrete, and have permuted allocation domain. Contiguity is with respect to allocation domain, so e.g. a tensor of logical shape [5, 7] and stride [7, 1] is contiguous, but so is one with logical shape [5, 7] and stride [1, 5]. The latter would correspond to having a swapped allocation domain in nvFuser.
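As a concrete illustration of that point, a hypothetical sketch using ATen tensors; the `setAllocationDomain` usage shown in the comment is an assumed way to model the swapped allocation domain in nvFuser, not code from this PR:

```cpp
// Both tensors are contiguous with respect to their allocation (stride) order.
at::Tensor row_major = at::randn({5, 7});      // logical [5, 7], strides [7, 1]
at::Tensor col_major = at::randn({7, 5}).t();  // logical [5, 7], strides [1, 5]

// In nvFuser, the second case could be modeled by a TensorView whose allocation
// domain is the logical domain with the two axes swapped (assumed usage):
// tv1->setAllocationDomain({tv1->axis(1), tv1->axis(0)}, /*new_contiguity=*/true);
```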

Base automatically changed from ampere_guard to main November 1, 2024 16:28
@rdspring1 (Collaborator, Author) commented:

!test

@rdspring1 rdspring1 merged commit 7086d52 into main Nov 3, 2024
47 checks passed
@rdspring1 rdspring1 deleted the hopper_matmul_tests branch November 3, 2024 17:26
rdspring1 added a commit that referenced this pull request Nov 8, 2024
This PR modifies `schedulePrologues` to use TMA loads to move mma operands to shared memory. Stacked on #3324 and #3310.

## Details
1. Input operands are loaded into shared memory via the `CpAsyncBulkTensorTile` LoadStoreOp (see the sketch after this list).
2. Replace the `LdMatrix` operation with a basic set.
3. Modify `scheduleOperandSmemStores` to apply swizzling to avoid bank conflicts.
4. Refactor `swizzleSharedMemory` by moving the analysis component to a separate function named `analyzeSwizzleSharedMemory`.
5. Create a `tmaSwizzleSharedMemory` function that uses `analyzeSwizzleSharedMemory` and then finds the appropriate TMA swizzle format.
6. Disable loop rotation. There is an issue with TMA loads and circular buffering, and it is unclear whether loop rotation is required for Hopper matmul.
7. Hopper matmul tests are expected to give incorrect results at this stage.
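A minimal sketch of the TMA-load pattern described in item 1; `cacheAfter` with `LoadStoreOpType::CpAsyncBulkTensorTile` follows nvFuser's TensorView API, but the surrounding details are assumptions rather than the PR's exact scheduling code:

```cpp
// Hypothetical sketch: stage an mma operand in shared memory with a TMA load.
// tv0 is assumed to be a fusion input operand of the matmul.
TensorView* a_smem = tv0->cacheAfter(LoadStoreOpType::CpAsyncBulkTensorTile);
a_smem->setMemoryType(MemoryType::Shared);
// The shared-memory layout would then be swizzled to avoid bank conflicts,
// e.g. via the tmaSwizzleSharedMemory analysis described in items 3-5.
```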