Fix double-scheduling of dc in HSS hopper matmuls #3590

jacobhinkle · 2024-12-13T18:42:13Z

When there is no epilogue (not even a cast), the output of the original MmaOp can itself be a Fusion output. In that case the cached output, which we often call dc, is actually an mma_result. Currently this causes us to schedule that tensor twice, once in scheduleMmaResults and again in scheduleEpilogue, which leads to an esoteric error (see included test). This PR simply skips scheduling those tensors again if they are already known to be mma results.
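A minimal sketch of the skip described above, not the actual nvFuser change. `Tensor` and `epilogueTensorsToSchedule` are hypothetical stand-ins invented for illustration; only the names `dc`, `tvs_to_schedule`, and "mma results" come from this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a tensor in the fusion; this is NOT nvFuser's
// TensorView, just enough structure to illustrate the filtering.
struct Tensor {
  std::string name;
};

// Hypothetical helper: returns the cached outputs that still need epilogue
// scheduling, i.e. those that were not already scheduled as mma results.
inline std::vector<const Tensor*> epilogueTensorsToSchedule(
    const std::vector<const Tensor*>& cached_outputs,
    const std::vector<const Tensor*>& mma_results) {
  std::vector<const Tensor*> tvs_to_schedule;
  for (const Tensor* dc : cached_outputs) {
    assert(dc != nullptr);
    const bool is_mma_result =
        std::find(mma_results.begin(), mma_results.end(), dc) !=
        mma_results.end();
    if (is_mma_result) {
      // Already handled when the mma results were scheduled; scheduling it
      // a second time is the double-scheduling this PR avoids.
      continue;
    }
    tvs_to_schedule.push_back(dc);
  }
  return tvs_to_schedule;
}
```

With no epilogue, `dc` is the same tensor as the mma result, so the returned list is empty and nothing is scheduled twice.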

jacobhinkle · 2024-12-13T18:42:21Z

!test

jacobhinkle · 2024-12-13T18:44:54Z

We would have hit this eventually but the HSS tests are still guarded against Hopper. I'm posting this now to unblock some internal heuristics work.

jacobhinkle · 2024-12-13T19:02:36Z

tests/cpp/test_matmul_scheduler.cpp

+  // TODO: Currently we use stmatrix whenever this is true. We cannot do that
+  // when the dtype is not 16 bits.


@protonu we need to handle all possible dtypes in the epilogue.

protonu · 2024-12-13T19:39:20Z

csrc/scheduler/hopper_multi_matmul.cpp

+        // not casting back to half-precision in the output
+        tvs_to_schedule.push_back(dc);
+      }
+


Are we still using stmatrix if the output is fp32 and there wasn't a cast?

Yes, if you enable use_smem_epilogue in the included test we hit an error in scheduleStMatrixForMmaOutput.

So we need an if/else here that checks the dtype of d_smem and schedules with vectorized stores instead if it is not 16-bit.

Fuser/csrc/scheduler/hopper_multi_matmul.cpp

Lines 584 to 588 in 230f633

MmaInputSmemSwizzle swizzle = mma_utils::tmaSwizzleSharedMemory(d_smem);

// Schedule shared memory cache; Output from StMatrix
mma_utils::scheduleStMatrixForMmaOutput(
    d_smem, swizzle, stmatrix_tile_m, stmatrix_tile_n);
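The if/else suggested above could look roughly like the following sketch. It shows only the decision, not the scheduling itself. `SmemStoreStrategy` and `chooseSmemStoreStrategy` are hypothetical names, and passing the element size in bytes is an assumption about how the dtype of d_smem would be inspected.

```cpp
#include <cstddef>

// Hypothetical labels for the two store strategies discussed in this thread.
enum class SmemStoreStrategy { StMatrix, VectorizedStore };

// Hypothetical helper: per the TODO in this PR, stmatrix can only be used
// when the dtype is 16 bits, so any other element size falls back to
// vectorized stores.
constexpr SmemStoreStrategy chooseSmemStoreStrategy(
    std::size_t dtype_size_bytes) {
  return dtype_size_bytes == 2 ? SmemStoreStrategy::StMatrix
                               : SmemStoreStrategy::VectorizedStore;
}
```

Under this sketch, a half-precision d_smem (2 bytes) keeps the existing scheduleStMatrixForMmaOutput path, while an fp32 d_smem (4 bytes) would take the vectorized-store branch.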

Fix double-scheduling of dc in HSS hopper matmuls

fac2938

jacobhinkle added the Matmuls label Dec 13, 2024

jacobhinkle requested a review from rdspring1 December 13, 2024 18:42

rdspring1 approved these changes Dec 13, 2024

rdspring1 mentioned this pull request Dec 13, 2024

Schedule epilogue (for Hopper Matmul) by propagation backward from output - smem epilogue not supported. #3580

Merged

jacobhinkle commented Dec 13, 2024

protonu reviewed Dec 13, 2024

jacobhinkle merged commit cbd628f into main Dec 13, 2024
38 of 39 checks passed

jacobhinkle deleted the fix_hopper_hss_epilogue branch December 13, 2024 21:20
