Add reduction_unroll_factor to autotuning script #3487

Merged
merged 9 commits into from
Dec 13, 2024

rdspring1
Collaborator

This PR renames unroll_factor to iteration_unroll_factor and adds reduction_unroll_factor. reduction_unroll_factor applies an unroll factor on top of the vectorization factor for the inner reduction domain.

@rdspring1 rdspring1 added the Autotune Generate heuristics through machine learning models. label Nov 27, 2024
@rdspring1 rdspring1 requested a review from liqiangxl November 27, 2024 01:48
@rdspring1 rdspring1 force-pushed the autotune_inner_reduction_2d branch from 7817368 to e7ffb29 Compare December 1, 2024 17:30
@rdspring1 rdspring1 force-pushed the autotune_inner_reduction_2d_update branch from 15bc05e to fdcf6a5 Compare December 1, 2024 17:31
)

# number of reduction elements not handled by a CTA
remaining_reduction = ceil_div(
num_reductions,
- (scheduler_config.bdimx * scheduler_config.vectorize_factor),
+ (scheduler_config.bdimx * vectorize_factor * reduction_unroll_factor),
Collaborator

Should be ceil_div(ceil_div(num_reductions/vectorize_factor, bdimx), reduction_unroll_factor)
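As a rough illustration of how the two forms compare, here is a minimal Python sketch. It assumes ceil_div is plain integer ceiling division and that num_reductions/vectorize_factor is an integer (floor) division; the names remaining_flat and remaining_nested are hypothetical and do not come from the autotuning script. Under these assumptions the two forms agree whenever vectorize_factor divides num_reductions and can differ otherwise.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive b (assumed helper)."""
    return (a + b - 1) // b


def remaining_flat(num_reductions: int, bdimx: int,
                   vectorize_factor: int, reduction_unroll_factor: int) -> int:
    # Form used in the diff: one ceil_div over the product of all factors.
    return ceil_div(num_reductions,
                    bdimx * vectorize_factor * reduction_unroll_factor)


def remaining_nested(num_reductions: int, bdimx: int,
                     vectorize_factor: int, reduction_unroll_factor: int) -> int:
    # Form suggested in the review: divide by the vectorization factor
    # first (assumed to be floor division), then apply ceil_div per level.
    after_vectorize = num_reductions // vectorize_factor
    after_bdimx = ceil_div(after_vectorize, bdimx)
    return ceil_div(after_bdimx, reduction_unroll_factor)


# vectorize_factor divides num_reductions: both forms give the same count.
print(remaining_flat(4096, 128, 4, 2), remaining_nested(4096, 128, 4, 2))

# vectorize_factor does not divide num_reductions: the counts can differ.
print(remaining_flat(10, 1, 4, 1), remaining_nested(10, 1, 4, 1))
```

The nested form makes the order of the splits explicit (vectorize, then bdimx, then unroll), which may be why it was suggested.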

)

- if unroll_factor == 1 and remaining_reduction > 1:
+ if iteration_unroll_factor == 1 and remaining_reduction > 1:
Collaborator

This looks strange to me. Why is grdim = remaining_reduction? We can do a serial reduction instead of a grid reduction.

Collaborator

nvFuser's default heuristic does:

  // When iteration dim is small, may have unused SMs, to increase SM usage
  // needs to shift from block reduction to grid reduction.
  int64_t grdim = 1;
  while (godim * grdim * 2 <= sm_count && getInnerRemainder() / grdim >= 2) {
    grdim *= 2;
  }
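For readers following along in the Python autotuning script, the loop above can be sketched as follows. This is only an illustrative port: it assumes getInnerRemainder() returns a fixed value for the duration of the loop, and the function name grid_reduction_factor is hypothetical.

```python
def grid_reduction_factor(godim: int, inner_remainder: int, sm_count: int) -> int:
    """Double grdim while SMs would otherwise sit idle and at least two
    reduction chunks remain per grid slice.

    inner_remainder stands in for getInnerRemainder(), assumed constant here.
    """
    grdim = 1
    while godim * grdim * 2 <= sm_count and inner_remainder // grdim >= 2:
        grdim *= 2
    return grdim


# Small iteration dimension on a 132-SM device: the grid reduction grows.
print(grid_reduction_factor(godim=8, inner_remainder=64, sm_count=132))

# Iteration dimension already exceeds the SM count: stay with block reduction.
print(grid_reduction_factor(godim=200, inner_remainder=64, sm_count=132))
```

In this form grdim is always a power of two and is bounded by both the SM count and the remaining reduction work, unlike setting grdim directly to the remainder.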

Collaborator Author

@rdspring1 rdspring1 Dec 13, 2024

From inner2dReductionHeuristic, I see this:

  // Cross grid reduction if we haven't hit our target blocks, and we have many
  // reduction elements.
  if ((godim < target_blocks && remainder_in_reduction >= 0) ||
      (remainder_in_reduction >= kEight)) {
    grdim = remainder_in_reduction;
  }

  // Try to do some cleanup of ragged waves on device
  { do_something }

  // Grid reductions do not support unrolling iteration dimension, revert if
  // set. Recalculate godim.
  { do_something }
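The first condition in that snippet can be mirrored in Python roughly as below. This is a sketch under stated assumptions: kEight is taken to be 8, grdim is assumed to start at 1, and the function name choose_grdim is hypothetical. The ragged-wave cleanup and the iteration-unroll revert steps are left out because the snippet elides them.

```python
K_EIGHT = 8  # assumed value of kEight in the C++ heuristic


def choose_grdim(godim: int, target_blocks: int, remainder_in_reduction: int) -> int:
    """Pick the grid-reduction extent following the quoted condition."""
    grdim = 1  # assumed default: block reduction only
    if (godim < target_blocks and remainder_in_reduction >= 0) or (
        remainder_in_reduction >= K_EIGHT
    ):
        # Spread all remaining reduction work across the grid.
        grdim = remainder_in_reduction
    return grdim


# Too few blocks to fill the device: use a grid reduction over the remainder.
print(choose_grdim(godim=4, target_blocks=132, remainder_in_reduction=3))

# Enough blocks and a small remainder: keep grdim at 1.
print(choose_grdim(godim=200, target_blocks=132, remainder_in_reduction=3))
```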

Collaborator

Another approach is to add a search parameter, is_block_reduction: if it is true, we use only block reduction; if it is false, we use grid reduction.
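One way this suggestion could look in a search-space generator is sketched below. Everything here is hypothetical: the ReductionConfig dataclass, grdim_for, generate_configs, and the default factor values are illustrative names and numbers, not the autotuning script's actual API. The sketch skips combinations that pair a grid reduction with iteration unrolling, since the heuristic quoted earlier notes that grid reductions do not support unrolling the iteration dimension.

```python
import itertools
from dataclasses import dataclass
from typing import Iterator, Sequence


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division for positive b (assumed helper)."""
    return (a + b - 1) // b


@dataclass(frozen=True)
class ReductionConfig:  # hypothetical container for one search point
    bdimx: int
    vectorize_factor: int
    reduction_unroll_factor: int
    iteration_unroll_factor: int
    is_block_reduction: bool


def grdim_for(config: ReductionConfig, num_reductions: int) -> int:
    """Grid-reduction extent implied by a configuration."""
    if config.is_block_reduction:
        # Leftover reduction elements are handled serially inside the CTA.
        return 1
    per_cta = (config.bdimx * config.vectorize_factor
               * config.reduction_unroll_factor)
    return ceil_div(num_reductions, per_cta)


def generate_configs(
    bdimx_values: Sequence[int],
    iteration_unroll_values: Sequence[int],
    vectorize_factor: int = 4,
    reduction_unroll_factor: int = 2,
) -> Iterator[ReductionConfig]:
    """Enumerate search points, with is_block_reduction as an extra axis."""
    for bdimx, iter_unroll, is_block in itertools.product(
        bdimx_values, iteration_unroll_values, (True, False)
    ):
        if not is_block and iter_unroll > 1:
            continue  # grid reduction cannot unroll the iteration dimension
        yield ReductionConfig(bdimx, vectorize_factor,
                              reduction_unroll_factor, iter_unroll, is_block)


configs = list(generate_configs((128, 256), (1, 2)))
print(len(configs))
```

Making the choice an explicit search parameter lets the tuner compare block and grid reduction directly instead of deriving grdim from the remainder.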

Base automatically changed from autotune_inner_reduction_2d to main December 11, 2024 19:43
Collaborator

@liqiangxl liqiangxl left a comment

LGTM.

@rdspring1
Collaborator Author

!build

@rdspring1 rdspring1 merged commit dc96e06 into main Dec 13, 2024
17 checks passed
@rdspring1 rdspring1 deleted the autotune_inner_reduction_2d_update branch December 13, 2024 20:13