
pr #12

Merged
merged 118 commits into from Sep 20, 2023

Conversation

1proprogrammerchant
Owner

No description provided.

alexander-zinoviev and others added 30 commits August 19, 2023 22:21
Disable tf32 when running on sm75 and below.
Fix the pattern match that compares against the generated ptx when running on sm75.
`if _unwrap_if_constexpr(cond)` then entering `node.body` is wrong when
`cond` is a tensor, since we cannot statically evaluate a dynamic tensor's
value.

The right way to solve the problem is probably:

1. visit the ast of IfExp (do not build IRs)
2. get the type of the last statement
3. initialize the return value and assign it to livein
4. call visit_If
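A minimal sketch of the intended dispatch, with `constexpr` and `tensor` as hypothetical stand-ins for the real Triton classes (the string return values are placeholders for the actual IR-building steps):

```python
class constexpr:
    """Stand-in for a value known at compile time."""
    def __init__(self, value):
        self.value = value

class tensor:
    """Stand-in for a runtime tensor whose value the compiler cannot see."""

def _unwrap_if_constexpr(cond):
    # Return the Python value only when it is statically known;
    # otherwise return the object unchanged.
    return cond.value if isinstance(cond, constexpr) else cond

def lower_if(cond):
    unwrapped = _unwrap_if_constexpr(cond)
    if isinstance(unwrapped, tensor):
        return "visit_If"   # dynamic condition: emit a real runtime branch
    # static condition: fold the branch at compile time
    return "fold_then" if unwrapped else "fold_else"
```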
Simplify the code by using inline asm to implement globaltimer and smid
instead of relying on the bc file.
For the warp-specialized persistent kernel, the instruction sequences for
the warp groups are:
```
// warp group 0
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait EB[idx];
        W0;
        mbarrier.arrive FB[idx];
        idx++;
```
```
// warp group 1
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait FB[idx];
        R0;
        mbarrier.arrive EB[idx];
        idx++;
```
then this would form a sequence of morally-strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if the GEMM K is smaller than the K tile shape, then the
num_inner_loop_steps of the persistent kernel is 0. The buffer id and
mbarrier id will always be 0 in this case, and execution may form a
W0 -> W1 -> R0 -> R1 order, which contradicts the atomicity rule:
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
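The degenerate case falls directly out of the index computation in the pseudocode above. A small sketch (hypothetical helper, not Triton code) of each wave's starting buffer/mbarrier index:

```python
# Each wave's starting index, per `idx = wave * num_inner_loop_steps` above.
# When num_inner_loop_steps is 0, every wave starts at index 0, so all
# waves wait on and arrive at the same mbarrier, losing the W -> R pairing.
def wave_start_indices(num_waves, num_inner_loop_steps):
    return [wave * num_inner_loop_steps for wave in range(num_waves)]
```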
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split TMA store block into multiple slices of size 64x64.
3. Distribute the TMA store to all the warps.
4. Fix some naming issues.
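As a data-level illustration of step 1 (NumPy on the CPU, not the actual PTX conversion), packing two f32 values into one 32-bit word of two f16 halves looks like:

```python
import numpy as np

def pack_2xf32_to_2xf16(a: np.ndarray) -> np.ndarray:
    """Convert pairs of f32 values to f16, then pack each pair into one u32."""
    assert a.dtype == np.float32 and a.size % 2 == 0
    halves = np.ascontiguousarray(a.astype(np.float16))  # 2 x f32 -> 2 x f16
    return halves.view(np.uint32)  # each u32 word now holds two f16 lanes

packed = pack_2xf32_to_2xf16(np.array([1.0, 2.0], dtype=np.float32))
```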
This PR makes the following changes to AOT kernels:

- Allow the client to generate AOT kernels with different sets of
constexprs and meta-parameters. Each combination of a constexpr set and
meta-parameters is referred to as an "algo". Within an algo the client can
still give different hints about integer arguments.
- Add an API `int ${kernel_name}_get_num_algos()` that returns the total
number of algos.
- Add an algo_id parameter to the generated kernel so the client can
select an algo.
- Remove gX, gY, and gZ from the kernel parameter list. The launch grid
usually differs between algos, and the client should not need to care
about how to compute the launch grid for each algo. Instead, the client
passes the expressions that compute gX, gY, and gZ to compile.py (when
the AOT kernels are generated). These expressions may only use kernel
parameters or constant values.
- Change the testing flow: we first build the kernels into a shared
library libkernel.so, then build the client test.c code and link it
against libkernel.so. This is closer to a typical AOT kernel usage flow.
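The resulting client-side flow can be modeled roughly as follows. Every name and value here (`get_num_algos`, `launch`, the block sizes, the grid expressions) is a hypothetical illustration of the described API, not the generated C interface:

```python
# Hypothetical model of the generated AOT interface: each "algo" is one
# (constexpr set, meta-parameters) combination plus a grid expression
# that depends only on kernel parameters.
ALGOS = [
    {"BLOCK": 64,  "num_warps": 4, "grid_x": lambda n: (n + 63) // 64},
    {"BLOCK": 128, "num_warps": 8, "grid_x": lambda n: (n + 127) // 128},
]

def get_num_algos() -> int:
    return len(ALGOS)

def launch(algo_id: int, n: int):
    algo = ALGOS[algo_id]
    # The client never computes the grid itself; the grid expression was
    # baked in at compile time from kernel parameters only.
    return algo["grid_x"](n), algo["BLOCK"], algo["num_warps"]
```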
Added Triton Conference Registration details.
Added recording link and minutes for meeting.
… device backend in JITFunction (#2130)

The default values used by JITFunction for num_warps and num_stages are
coupled to the Nvidia GPU architecture. We should use proper default
values based on the device backend the kernel is compiled for.
1. Add two functions to return the default num_warps and num_stages for
the specific device backend.
2. JITFunction uses the proper default num_warps and num_stages based on
the specific device backend.
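A sketch of the lookup this implies; the backend names and the numbers are illustrative assumptions, not Triton's actual defaults:

```python
# Per-backend compilation defaults instead of hard-coded NVIDIA values.
# Values below are placeholders for illustration only.
DEFAULTS = {
    "cuda": {"num_warps": 4, "num_stages": 3},
    "hip":  {"num_warps": 4, "num_stages": 2},
}

def get_default_num_warps(backend: str) -> int:
    return DEFAULTS[backend]["num_warps"]

def get_default_num_stages(backend: str) -> int:
    return DEFAULTS[backend]["num_stages"]
```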

Co-authored-by: Wang Weihan <[email protected]>
Added speaker details
Revise the logic that assigns the MMA layout for serialized dots. This is
a heuristic rule for FlashAttention.
BinFan and others added 29 commits September 13, 2023 12:52
… correct swizzling code (#2180)

fix bug #1937

Co-authored-by: Philippe Tillet <[email protected]>
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attribute, though not used yet.

This fixes a few problems that were preventing me from using the lld linker.
Triton conf registration closed.
Change the dot to allow taking an initial accumulator, and add a flag
that allows the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
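The precision trade-off is easy to demonstrate on the CPU with NumPy half precision; this illustrates lower-precision accumulation in general, not Hopper's fp8 path specifically:

```python
import numpy as np

# fp16 has a 10-bit mantissa, so at magnitude 2048 its spacing (ulp) is 2:
# an increment of 1 added to a half-precision accumulator is rounded away
# (round-half-to-even), while a float32 accumulator keeps it exactly.
acc16 = np.float16(2048) + np.float16(1)   # lost: rounds back to 2048
acc32 = np.float32(2048) + np.float32(1)   # kept: 2049
```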
Move the optimization to remove phi of struct later in the optimization
pipeline to avoid interfering with CFG optimization.
Otherwise, these files show up in `git status` under
python/triton/third_party/cuda/bin/.
On my machine, when I try to `pip install cmake` outside a virtualenv,
it gets mad at me and tells me to use apt.  Which doesn't quite work for
some reason.  Anyway maybe this is simple to Python people, but perhaps
worth mentioning.  Especially because we have `.venv` in gitignore
already.
reverts #2310 as recent changes to Triton-IR have broken third-party backends
Previously, for matmul, if the inputs were int8 the output was also int8.
This commit fixes the overflow problem by using an int32 output.
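The overflow can be reproduced with NumPy, where an int8 matmul likewise keeps an int8 result and wraps:

```python
import numpy as np

# int8 x int8 matmul keeps an int8 result and wraps: 100 * 100 = 10000,
# and 10000 mod 256 = 16, which is what the int8 output holds.
a = np.array([[100]], dtype=np.int8)
wrapped = a @ a                                      # int8 output: wraps to 16
widened = a.astype(np.int32) @ a.astype(np.int32)    # int32 output: 10000
```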
#2296
This is a new interpreter mode that shares semantic analysis with the
JIT'ed code path and that the Triton core team is committed to maintaining.
Some duplicate functions on `scf.for` have been removed in
llvm/llvm-project#66512. This PR works with and without
llvm/llvm-project#66512.
…here's a single warp on the axis (#2330)

1. On the axis, use `getAxisNumWarpsWithUniqueData` instead of the raw
number of warps, to avoid communication among warps that handle the same
piece of data.
2. When there's a single warp on the axis, use warp intrinsics for
communication and skip shared memory.

A follow-up PR is needed for code cleanup.
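A CPU model of the single-warp case, sketching a butterfly (xor) shuffle reduction across an assumed 32-lane warp; this models the communication pattern, not the actual generated intrinsics:

```python
# Butterfly (xor) shuffle reduction: in log2(width) rounds, each lane adds
# the value from lane (i ^ offset), as a shfl.bfly-style intrinsic would
# provide, so no shared memory round trip is needed.
def warp_reduce_sum(lane_values):
    vals = list(lane_values)   # vals[i] models lane i's register
    width = len(vals)          # warp size, assumed a power of two
    offset = width // 2
    while offset > 0:
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset //= 2
    return vals                # every lane ends with the full sum
```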

Improve patterns that sink broadcasts to reduce arithmetic density, and
also hoist converts above expand_dims to do less work.

This addresses comments in #2274.
This was regressed by #2185 because we didn't realise the CUDA_CHECK
macro could make Python calls (similar to what led to #2225). I think
the PyErr_Occurred check got removed in that PR because there was missing
error handling before the call to _launch, so it looked like it was just
in the wrong place.

It looks like there are also potentially a couple places in cuda.c that
can return with error set, e.g. getDeviceProperties, memAlloc,
memcpyHtoD, memFree, tensorMapEncodeTiled etc, but those are all
pre-existing and not affected by recent changes.
llvm/llvm-project#66754 extends the `LoopLikeOpInterface`: the signature
of `getLoopBody` has changed. `ForOp::getRegion` can be used instead.

This change works with and without llvm/llvm-project#66754.
Google uses clang-format at LLVM HEAD.  clang-format's formatting is not
stable, so we want to minimize the difference between the pre-commit
clang-format and HEAD to minimize differences with Google's formatter.

In practice, it appears that there are no relevant changes to the
formatting, so this is a nop.  🤷

Tested by running `pre-commit run --all-files`.
I tested these locally, seems to work for me.
@1proprogrammerchant 1proprogrammerchant merged commit 879a916 into 1proprogrammerchant:main Sep 20, 2023