
pr #12

Merged
merged 118 commits into from Sep 20, 2023

Conversation

1proprogrammerchant
Owner

No description provided.

alexander-zinoviev and others added 30 commits August 19, 2023 22:21
Disable tf32 when running on sm75 and below.
Fix the pattern match that compares against the generated ptx when running on sm75.
`if _unwrap_if_constexpr(cond)` then entering `node.body` is wrong when
`cond` is a tensor, since we cannot statically evaluate a dynamic tensor's
value.

The right way to solve the problem is probably:

1. visit the ast of IfExp (do not build IRs)
2. get the type of the last statement
3. initialize the return value and assign it to livein
4. call visit_If
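A minimal sketch of the intended dispatch, with `constexpr` and `tensor` as hypothetical stand-ins for the real Triton classes (the string return values are placeholders for the actual IR-building steps):

```python
class constexpr:
    """Stand-in for a value known at compile time."""
    def __init__(self, value):
        self.value = value

class tensor:
    """Stand-in for a runtime tensor whose value the compiler cannot see."""

def _unwrap_if_constexpr(cond):
    # Return the Python value only when it is statically known;
    # otherwise return the object unchanged.
    return cond.value if isinstance(cond, constexpr) else cond

def lower_if(cond):
    unwrapped = _unwrap_if_constexpr(cond)
    if isinstance(unwrapped, tensor):
        return "visit_If"   # dynamic condition: emit a real runtime branch
    # static condition: fold the branch at compile time
    return "fold_then" if unwrapped else "fold_else"
```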
Simplify the code by using inline asm to implement globaltimer and smid
instead of relying on the bc file.
For the warp-specialized persistent kernel, the instruction sequences for
the warp groups are:
```
// warp group 0
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait EB[idx];
        W0;
        mbarrier.arrive FB[idx];
        idx++;
```
```
// warp group 1
for wave in 0..num_waves:
    idx = wave * num_inner_loop_steps;
    for k_tile_idx in 0..num_k_tiles:
        mbarrier.wait FB[idx];
        R0;
        mbarrier.arrive EB[idx];
        idx++;
```
then this would form a sequence of morally-strong relations W0 -> R0 ->
W1 -> R1 in causality order.
But if the GEMM K is smaller than the K tile shape, then the
num_inner_loop_steps of the persistent kernel is 0. The buffer id and
mbarrier id will always be 0 in this case, and execution may form a
W0 -> W1 -> R0 -> R1 order, which contradicts the atomicity rule:
"If a read R precedes an overlapping write W in causality order, then R
cannot read from W."
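The degenerate case falls directly out of the index computation in the pseudocode above. A small sketch (hypothetical helper, not Triton code) of each wave's starting buffer/mbarrier index:

```python
# Each wave's starting index, per `idx = wave * num_inner_loop_steps` above.
# When num_inner_loop_steps is 0, every wave starts at index 0, so all
# waves wait on and arrive at the same mbarrier, losing the W -> R pairing.
def wave_start_indices(num_waves, num_inner_loop_steps):
    return [wave * num_inner_loop_steps for wave in range(num_waves)]
```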
1. Optimize the conversion and packing for 2xf32 -> 2xf16.
2. Split TMA store block into multiple slices of size 64x64.
3. Distribute the TMA store to all the warps.
4. Fix some naming issues.
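As a data-level illustration of step 1 (NumPy on the CPU, not the actual PTX conversion), packing two f32 values into one 32-bit word of two f16 halves looks like:

```python
import numpy as np

def pack_2xf32_to_2xf16(a: np.ndarray) -> np.ndarray:
    """Convert pairs of f32 values to f16, then pack each pair into one u32."""
    assert a.dtype == np.float32 and a.size % 2 == 0
    halves = np.ascontiguousarray(a.astype(np.float16))  # 2 x f32 -> 2 x f16
    return halves.view(np.uint32)  # each u32 word now holds two f16 lanes

packed = pack_2xf32_to_2xf16(np.array([1.0, 2.0], dtype=np.float32))
```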
This PR makes the following changes to AOT kernels:

- Allow the client to generate AOT kernels with different sets of
constexprs and meta-parameters. Each combination of a constexpr set and
meta-parameters is referred to as an "algo". Within an algo the client can
still give different hints about integer arguments.
- Add an API `int ${kernel_name}_get_num_algos()` that returns the total
number of algos.
- Add an algo_id parameter to the generated kernel so the client can
select an algo.
- Remove gX, gY, and gZ from the kernel parameter list. The launch grid
usually differs between algos, and the client should not need to care
about how to compute the launch grid for each algo. Instead, the client
passes the expressions that compute gX, gY, and gZ to compile.py (when
the AOT kernels are generated). These expressions may only use kernel
parameters or constant values.
- Change the testing flow: we first build the kernels into a shared
library libkernel.so, then build the client test.c code and link it
against libkernel.so. This is closer to a typical AOT kernel usage flow.
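The resulting client-side flow can be modeled roughly as follows. Every name and value here (`get_num_algos`, `launch`, the block sizes, the grid expressions) is a hypothetical illustration of the described API, not the generated C interface:

```python
# Hypothetical model of the generated AOT interface: each "algo" is one
# (constexpr set, meta-parameters) combination plus a grid expression
# that depends only on kernel parameters.
ALGOS = [
    {"BLOCK": 64,  "num_warps": 4, "grid_x": lambda n: (n + 63) // 64},
    {"BLOCK": 128, "num_warps": 8, "grid_x": lambda n: (n + 127) // 128},
]

def get_num_algos() -> int:
    return len(ALGOS)

def launch(algo_id: int, n: int):
    algo = ALGOS[algo_id]
    # The client never computes the grid itself; the grid expression was
    # baked in at compile time from kernel parameters only.
    return algo["grid_x"](n), algo["BLOCK"], algo["num_warps"]
```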
Added Triton Conference Registration details.
Added recording link and minutes for meeting.
… device backend in JITFunction (#2130)

The default values used by JITFunction for num_warps and num_stages are
coupled to the Nvidia GPU architecture. We should use proper default
values based on the device backend the kernel is compiled for.
1. Add two functions to return the default num_warps and num_stages for
the specific device backend.
2. JITFunction uses the proper default num_warps and num_stages based on
the specific device backend.
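A sketch of the lookup this implies; the backend names and the numbers are illustrative assumptions, not Triton's actual defaults:

```python
# Per-backend compilation defaults instead of hard-coded NVIDIA values.
# Values below are placeholders for illustration only.
DEFAULTS = {
    "cuda": {"num_warps": 4, "num_stages": 3},
    "hip":  {"num_warps": 4, "num_stages": 2},
}

def get_default_num_warps(backend: str) -> int:
    return DEFAULTS[backend]["num_warps"]

def get_default_num_stages(backend: str) -> int:
    return DEFAULTS[backend]["num_stages"]
```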

Co-authored-by: Wang Weihan <[email protected]>
Added speaker details
Revise the logic that assigns the MMA layout for serialized dots. This is
a heuristic rule for FlashAttention.
BinFan and others added 29 commits September 13, 2023 12:52
… correct swizzling code (#2180)

fix bug #1937

Co-authored-by: Philippe Tillet <[email protected]>
- Support memory space for pointers (e.g., `!tt.ptr<f32, 1>`).
- Support parsing function attribute, though not used yet.

This fixes a few problems that were preventing me from using the lld linker.
Triton conf registration closed.
Change the dot to allow taking an initial accumulator, and add a flag
that allows the compiler to accumulate in a lower precision than the
output type.
On Hopper this flag is on by default, which allows accumulating with
lower precision.
This only affects the Hopper fp8 dot.
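The precision trade-off is easy to demonstrate on the CPU with NumPy half precision; this illustrates lower-precision accumulation in general, not Hopper's fp8 path specifically:

```python
import numpy as np

# fp16 has a 10-bit mantissa, so at magnitude 2048 its spacing (ulp) is 2:
# an increment of 1 added to a half-precision accumulator is rounded away
# (round-half-to-even), while a float32 accumulator keeps it exactly.
acc16 = np.float16(2048) + np.float16(1)   # lost: rounds back to 2048
acc32 = np.float32(2048) + np.float32(1)   # kept: 2049
```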
Move the optimization to remove phi of struct later in the optimization
pipeline to avoid interfering with CFG optimization.
Otherwise, these files show up in `git status` under
python/triton/third_party/cuda/bin/.
On my machine, when I try to `pip install cmake` outside a virtualenv,
it gets mad at me and tells me to use apt.  Which doesn't quite work for
some reason.  Anyway maybe this is simple to Python people, but perhaps
worth mentioning.  Especially because we have `.venv` in gitignore
already.
reverts #2310 as recent changes to Triton-IR have broken third-party backends
Previously, for matmul, if the inputs were int8 the output was also int8.
This commit fixes the overflow problem by using an int32 output.
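The overflow can be reproduced with NumPy, where an int8 matmul likewise keeps an int8 result and wraps:

```python
import numpy as np

# int8 x int8 matmul keeps an int8 result and wraps: 100 * 100 = 10000,
# and 10000 mod 256 = 16, which is what the int8 output holds.
a = np.array([[100]], dtype=np.int8)
wrapped = a @ a                                      # int8 output: wraps to 16
widened = a.astype(np.int32) @ a.astype(np.int32)    # int32 output: 10000
```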
#2296
This is a new interpreter mode that shares semantic analysis with the
JIT'ed code path and that the Triton core team is committed to maintaining.
Some duplicate functions on `scf.for` have been removed in
llvm/llvm-project#66512. This PR works with and without
llvm/llvm-project#66512.
…here's a single warp on the axis (#2330)

1. On the axis, use `getAxisNumWarpsWithUniqueData` instead of the raw
number of warps, to avoid communication among warps that handle the same
piece of data.
2. When there's a single warp on the axis, use warp intrinsics for
communication and skip shared memory.

A follow-up PR is needed for code cleanup.
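A CPU model of the single-warp case, sketching a butterfly (xor) shuffle reduction across an assumed 32-lane warp; this models the communication pattern, not the actual generated intrinsics:

```python
# Butterfly (xor) shuffle reduction: in log2(width) rounds, each lane adds
# the value from lane (i ^ offset), as a shfl.bfly-style intrinsic would
# provide, so no shared memory round trip is needed.
def warp_reduce_sum(lane_values):
    vals = list(lane_values)   # vals[i] models lane i's register
    width = len(vals)          # warp size, assumed a power of two
    offset = width // 2
    while offset > 0:
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset //= 2
    return vals                # every lane ends with the full sum
```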

Improve patterns that sink broadcasts to reduce arithmetic density, and
also hoist converts above expand_dims to do less work.

This addresses comments in #2274.
This was regressed by #2185 because we didn't realise the CUDA_CHECK
macro could make Python calls (similar to what led to #2225). I think
the PyErr_Occurred check got removed in that PR because there was missing
error handling before the call to _launch, so it looked like it was just
in the wrong place.

It looks like there are also potentially a couple places in cuda.c that
can return with error set, e.g. getDeviceProperties, memAlloc,
memcpyHtoD, memFree, tensorMapEncodeTiled etc, but those are all
pre-existing and not affected by recent changes.
llvm/llvm-project#66754 extends the `LoopLikeOpInterface`: the signature
of `getLoopBody` has changed. `ForOp::getRegion` can be used instead.

This change works with and without llvm/llvm-project#66754.
Google uses clang-format at LLVM HEAD.  clang-format's formatting is not
stable, so we want to minimize the difference between the pre-commit
clang-format and HEAD to minimize differences with Google's formatter.

In practice, it appears that there are no relevant changes to the
formatting, so this is a nop.  🤷

Tested by running `pre-commit run --all-files`.
I tested these locally, seems to work for me.
@1proprogrammerchant 1proprogrammerchant merged commit 879a916 into 1proprogrammerchant:main Sep 20, 2023