pr #12 (Merged)
1proprogrammerchant merged 118 commits into 1proprogrammerchant:main from triton-lang:main on Sep 20, 2023
Conversation
Disable tf32 when running on sm75 and below. Fix the pattern match used to check the generated PTX when running on sm75.
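For context, a minimal test-side sketch of such a capability gate; the `capability` tuple and `ptx` string are assumed inputs, not the actual test harness:

```python
# Hypothetical check: tf32 MMA only exists on sm80 and newer, so on
# sm75 and below the generated PTX must not contain tf32 instructions.
def check_tf32_usage(ptx: str, capability: tuple) -> None:
    if capability >= (8, 0):
        assert "tf32" in ptx      # Ampere and newer: expect tf32 mma
    else:
        assert "tf32" not in ptx  # sm75 and below: tf32 is unavailable
```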
Entering `node.body` when `if _unwrap_if_constexpr(cond)` holds is wrong when `cond` is a tensor, since we cannot statically evaluate a dynamic tensor's value. The right way to solve the problem is probably to: 1. visit the AST of the IfExp (without building IRs); 2. get the type of the last statement; 3. initialize the return value and assign it to the live-in; 4. call `visit_If`. A minimal sketch of the dispatch follows.
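The sketch assumes a hypothetical `_is_tensor` helper alongside the `_unwrap_if_constexpr` and `visit_If` names from the description:

```python
# Sketch: fold the branch only when the condition is static; for a dynamic
# tensor condition, fall through to visit_If, which builds IR for both arms.
def visit_IfExp(self, node):
    cond = self.visit(node.test)
    if _is_tensor(cond):             # dynamic: value unknown at compile time
        return self.visit_If(node)   # build IR instead of folding
    if _unwrap_if_constexpr(cond):   # static: safe to evaluate now
        return self.visit(node.body)
    return self.visit(node.orelse)
```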
Simplify the code by using inline asm to implement `globaltimer` and `smid` instead of relying on the bitcode (.bc) file.
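For illustration, a sketch of what the inline-asm versions can look like with `tl.inline_asm_elementwise`; the exact constraint strings are assumptions based on current Triton releases:

```python
import triton
import triton.language as tl

@triton.jit
def globaltimer():
    # Read the 64-bit %globaltimer register via inline PTX (no .bc file).
    return tl.inline_asm_elementwise(
        "mov.u64 $0, %globaltimer;", "=l", [],
        dtype=tl.int64, is_pure=False, pack=1)

@triton.jit
def smid():
    # Read the id of the SM the current block is running on.
    return tl.inline_asm_elementwise(
        "mov.u32 $0, %smid;", "=r", [],
        dtype=tl.int32, is_pure=False, pack=1)
```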
For the warp-specialized persistent kernel, the instruction sequences for the warp groups are:

```
// warp group 0
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait EB[idx];
        W0;
        mbarrier.arrive FB[idx];
        idx++;
```

```
// warp group 1
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait FB[idx];
        R0;
        mbarrier.arrive EB[idx];
        idx++;
```

This forms a sequence of morally-strong relations W0 -> R0 -> W1 -> R1 in causality order. But if GEMM K is smaller than the K tile shape, then `num_inner_loop_steps` of the persistent kernel is 0, and the buffer id and mbarrier id will always be 0. This may form the order W0 -> W1 -> R0 -> R1, which contradicts atomicity: "If a read R precedes an overlapping write W in causality order, then R cannot read from W."
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split the TMA store block into multiple 64x64 slices.
3. Distribute the TMA store across all warps.
4. Fix some naming issues.
This PR makes the following changes to AOT kernels:
- Allow the client to generate AOT kernels with different sets of constexprs and meta-parameters. Each combination of a constexpr set and meta-parameters is referred to as an "algo". Within an algo, the client can still give different hints about integer arguments.
- Add an API `int ${kernel_name}_get_num_algos()` that returns the total number of algos.
- Add an `algo_id` parameter to the generated kernel so the client can select an algo.
- Remove gX, gY, and gZ from the kernel parameter list, since the launch grid usually differs between algos and the client should not need to care about how to compute it for each algo. Instead, the client passes the expressions for computing gX, gY, and gZ to compile.py when the AOT kernels are generated; the expressions may only use kernel parameters or constant values.
- Change the testing flow: we first build the kernels into a shared library, libkernel.so, then build the client test.c code and link it against libkernel.so. This is closer to a typical AOT kernel usage flow (see the sketch after this list).
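As a hedged illustration of the resulting client flow, a Python ctypes sketch; the kernel name `matmul` and the generated symbols are assumptions, not the actual generated header:

```python
import ctypes

# Load the AOT kernels built into a shared library (per the testing flow above).
lib = ctypes.CDLL("./libkernel.so")  # hypothetical library path

num_algos = lib.matmul_get_num_algos()  # total number of "algos"
algo_id = 0                             # select one, e.g. by offline autotuning
# Note: no gX/gY/gZ arguments anywhere; the grid expressions handed to
# compile.py are baked into the generated launcher, so the client never
# computes the launch grid itself.
```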
Charts shown at the 08/23 community meetup.
Added Triton Conference Registration details.
Added recording link and minutes for meeting.
… device backend in JITFunction (#2130) The default values JITFunction uses for num_warps and num_stages are coupled to the Nvidia GPU architecture. We should use proper default values based on the device backend the kernel is compiled for. 1. Add two functions that return the default num_warps and num_stages for a specific device backend. 2. Make JITFunction use the proper default num_warps and num_stages based on the specific device backend. Co-authored-by: Wang Weihan <[email protected]>
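A minimal sketch of the two helpers, with purely illustrative backend names and values; the real defaults come from each device backend:

```python
# Hypothetical per-backend defaults; only the "cuda" values reflect the
# historical hard-coded behavior, the rest is illustrative.
_DEFAULTS = {
    "cuda": {"num_warps": 4, "num_stages": 3},
    "xpu":  {"num_warps": 4, "num_stages": 2},
}

def get_default_num_warps(backend: str) -> int:
    return _DEFAULTS.get(backend, _DEFAULTS["cuda"])["num_warps"]

def get_default_num_stages(backend: str) -> int:
    return _DEFAULTS.get(backend, _DEFAULTS["cuda"])["num_stages"]
```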
Added speaker details
Revise the logic that assigns the MMA layout for serialized dots. This is a heuristic rule for FlashAttention.
…2114) Co-authored-by: Keren Zhou <[email protected]>
… correct swizzling code (#2180) Fixes bug #1937. Co-authored-by: Philippe Tillet <[email protected]>
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attributes, though they are not used yet.
Triton Conference registration is closed.
Change dot to allow taking an initial accumulator, and add a flag that allows the compiler to accumulate in a lower precision than the output type. On Hopper this flag is on by default, which allows accumulating in lower precision. This only affects the Hopper fp8 dot.
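A hedged sketch of the resulting kernel pattern; `tl.dot`'s `acc` keyword exists in current Triton, but the exact name of the lower-precision-accumulation flag is not reproduced here:

```python
import triton
import triton.language as tl

@triton.jit
def fused_dot(a, b, acc):
    # Old pattern: acc = tl.dot(a, b) + acc
    # New pattern: pass the running accumulator directly into dot, letting
    # the compiler pick the accumulation precision (relaxed for Hopper fp8).
    return tl.dot(a, b, acc=acc)
```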
Move the optimization that removes phi-of-struct to later in the optimization pipeline, to avoid interfering with CFG optimization.
Otherwise, these files show up in `git status` under python/triton/third_party/cuda/bin/.
On my machine, when I try to `pip install cmake` outside a virtualenv, pip complains and tells me to use apt, which doesn't quite work for some reason. Maybe this is simple to Python people, but it seems worth mentioning, especially because we already have `.venv` in .gitignore.
Reverts #2310, as recent changes to Triton-IR have broken third-party backends.
Previously in matmul, if the inputs were int8, the output was also int8. This commit fixes the overflow problem by producing int32 output. #2296
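For illustration, a hedged sketch using `tl.dot`'s `out_dtype` argument (present in current Triton) to keep int8 products in an int32 accumulator:

```python
import triton
import triton.language as tl

@triton.jit
def int8_dot(a, b):
    # int8 x int8 products easily overflow an int8 result;
    # accumulate and return in int32 instead.
    return tl.dot(a, b, out_dtype=tl.int32)
```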
This is a new interpreter mode that shares semantic analysis with the JIT'ed code path, and that the Triton core team is committed to maintaining.
Some duplicate functions on `scf.for` were removed in llvm/llvm-project#66512. This PR works both with and without llvm/llvm-project#66512.
…here's a single warp on the axis (#2330) 1. On the axis, use `getAxisNumWarpsWithUniqueData` instead of the raw number of warps, to avoid communication among warps that handle the same piece of data. 2. When there's a single warp on the axis, use warp intrinsics for communication and skip shared memory. A follow-up PR is needed for code cleanup.
Improve patterns that sink broadcast to reduce arithmetic density, and also hoist convert above expand_dims to do less work. This addresses comments in #2274.
This was regressed by #2185 because we didn't realise the CUDA_CHECK macro could make Python calls (similar to what led to #2225). I think the `PyErr_Occurred` check got removed in that PR because there was missing error handling before the call to `_launch`, so it looked like it was just in the wrong place. There are also potentially a couple of places in cuda.c that can return with the error set, e.g. getDeviceProperties, memAlloc, memcpyHtoD, memFree, tensorMapEncodeTiled, etc., but those are all pre-existing and not affected by recent changes.
llvm/llvm-project#66754 extends the `LoopLikeOpInterface`: the signature of `getLoopBody` has changed, and `ForOp::getRegion` can be used instead. This change works both with and without llvm/llvm-project#66754.
Google uses clang-format at LLVM HEAD. clang-format's output is not stable across versions, so we want to keep the pre-commit clang-format close to HEAD to minimize differences with Google's formatter. In practice, there appear to be no relevant changes to the formatting, so this is a nop. 🤷 Tested by running `pre-commit run --all-files`.
I tested these locally; they seem to work for me.
Co-authored-by: dongdongl <[email protected]>