Commit
only do grid split when needed (#2965)
**Issue:** In the inner-outer persistent scheduler, the last step performs an outer reduction, and the inner dim is parallelized by `vectorization`, `bdimx`, and `gdimy`. The current main branch always does three splits using `vectorization`, `bdimx`, and `gdimy`. However, the last split is not needed if `vectorization * bdimx * gdimy >= inner dim`. For example:

```
T0
 logical domain : (iS264{gridDim.y}, iS265{i1})
 contiguity: t t
  Split: iS265{i1} by factor 4
  Split: iS997{( ceilDiv(i1, 4) )} by factor blockDim.x
  Split: iS999{( ceilDiv(( ceilDiv(i1, 4) ), blockDim.x) )} by factor gridDim.y
```

The last split is redundant if `4 * blockDim.x * gridDim.y >= i1`.

**Fix:** Only do the last (grid) split when `vectorization * bdimx * gdimy < inner dim`.

**Influence:** Removing this extra split saves one loop in the generated code. Performance improves in some cases and regresses in others; all changes are within 10%. See [dashboard](http://nv/ekP).

---------

Co-authored-by: jjsjann123 <[email protected]>
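
The split condition described in the fix can be sketched roughly as follows. This is an illustrative Python model only, not the scheduler's actual C++ code: `ceil_div` and `plan_inner_dim_splits` are hypothetical names, and the sketch assumes the three parallelization factors are known integers at scheduling time.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, mirroring ceilDiv in the IR dump."""
    return (a + b - 1) // b


def plan_inner_dim_splits(inner_dim: int, vectorization: int,
                          bdimx: int, gdimy: int) -> list[tuple[str, int]]:
    """Return the (name, factor) splits applied to the inner dim.

    Hypothetical helper: the first two splits are always applied; the
    grid split is added only when the three factors together do not
    already cover the inner dim.
    """
    splits = [("vectorization", vectorization), ("bdimx", bdimx)]
    if vectorization * bdimx * gdimy < inner_dim:
        splits.append(("gdimy", gdimy))
    return splits


# 4 * 128 * 2 == 1024 covers the inner dim, so the grid split is skipped.
covered = plan_inner_dim_splits(1024, vectorization=4, bdimx=128, gdimy=2)

# 4 * 128 * 2 == 1024 < 4096, so the grid split is still required.
not_covered = plan_inner_dim_splits(4096, vectorization=4, bdimx=128, gdimy=2)
```

In the covered case, the extent left after the first two splits is `ceil_div(ceil_div(1024, 4), 128) == 2`, which already fits within `gdimy`, so a further split would only add a loop of extent 1.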