
pp #8

Merged
merged 34 commits into from
Aug 11, 2023
Conversation

1proprogrammerchant
Owner

No description provided.

goostavz and others added 30 commits August 7, 2023 09:53
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and troubleshooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. A trial is recommended once version 3.0 is released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <[email protected]>
cc @EikanWang. I'm disabling this for now since it broke with the H100
merge, but please feel free to fix the compilation errors and submit a
PR.
Also fixes a bug exposed in convertLayout lowering for float16: we
shouldn't be using cvt.pack.sat.u16.s32 to pack 16-bit values, as it
needs to take a 32-bit register. This also prevented optimization at
the LLVM IR level.
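For illustration, a minimal Python sketch (not Triton's actual lowering code) of packing two 16-bit values into one 32-bit word with plain shifts and ors, the kind of packing that can be expressed and optimized at the LLVM IR level without a saturating convert:

```python
import numpy as np

# Hypothetical illustration: pack two float16 bit patterns into one 32-bit
# word with shifts and ors -- unlike a saturating cvt, this keeps the values
# in 16-bit form and needs no 32-bit source registers.
def pack_two_f16(lo: np.float16, hi: np.float16) -> int:
    lo_bits = int(np.float16(lo).view(np.uint16))
    hi_bits = int(np.float16(hi).view(np.uint16))
    return lo_bits | (hi_bits << 16)

print(hex(pack_two_f16(np.float16(1.0), np.float16(-2.0))))  # 0xc0003c00
```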
Make sure that other threads within the CTA do not operate on the mbarrier
until it is initialized by thread 0.

Co-authored-by: Philippe Tillet <[email protected]>
Use camel case accessors ("getStaticOffsets" etc.) for `ExtractSliceOp`.
This change works with and without the changes from D156857. After
D156857 has landed, only camel case accessors work for ops that
implement the `OffsetSizeAndStrideOpInterface`.

https://reviews.llvm.org/D156857

Co-authored-by: Philippe Tillet <[email protected]>
We are interested in having python wheels for triton built for Linux
arm64 platforms, such as NVIDIA's Grace CPU.

This change is fairly simple, however:
- It requires a linux arm64 build of LLVM to be available (see MR here:
ptillet/triton-llvm-releases#15)
- For now my changes use the LLVM build hosted here:
https://github.com/acollins3/triton-llvm-releases/releases/tag/llvm-17.0.0-c5dede880d17
- The Triton release process will need to be updated to include arm64
wheels. Is this something you have time to work on @ptillet? It would be
difficult for me to update this part without more access permissions.

With these changes, I managed to build a set of python wheels and have
hosted them here for us to use in the meantime:
https://github.com/acollins3/triton/releases/tag/triton-2.1.0-arm64
…r than Q's (#2033)

Implemented this situation with and without the causal mask.
My implementation with the causal mask looks like:
111000
111100
111110
where only the upper-right triangular part is masked.
I added `P_SEQ` as the notation for the extra sequence length of KV.

Co-authored-by: Philippe Tillet <[email protected]>
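A small NumPy sketch of the mask pattern above (not the kernel itself), assuming query i may attend to key j only when j <= i + P_SEQ:

```python
import numpy as np

# Hypothetical sketch of the causal mask when KV is longer than Q.
# P_SEQ is the extra KV length; query i attends to keys j <= i + P_SEQ,
# so only the upper-right triangle of the score matrix is masked out.
def causal_mask(n_q: int, n_kv: int, p_seq: int) -> np.ndarray:
    q_idx = np.arange(n_q)[:, None]
    k_idx = np.arange(n_kv)[None, :]
    return (k_idx <= q_idx + p_seq).astype(int)

print(causal_mask(n_q=3, n_kv=6, p_seq=2))
# [[1 1 1 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 0]]
```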
This allows the AOT client to tune the number of stages for the
generated kernel. Set the default number to 3 to match the Triton
compiler.
…in hopper tests (#2041)

Co-authored-by: goostavz <[email protected]>
Co-authored-by: Philippe Tillet <[email protected]>
Co-authored-by: ben-zhang-609 <[email protected]>
Improve error messaging for block shape and value shape mismatch.
Rename "rocm" -> "hip", to comply with other uses in compiler.py.
…m. (#2068)

No functional changes intended, and it might slightly speed up the
build.

This allows a downstream Bazel build of Triton to avoid building a
number of dialects and passes that Triton doesn't need.
`getScratchSizeInBytes` was assuming that the size of all types in bits
is a multiple of 8. If it is not, it would return 0. This caused a bug
for the boolean (i1) type, where the reduction lowering would attempt to
use shared memory, which was not assigned to the op.

Fix this issue by setting the number of bytes per element to
`ceil(bits / 8)`.
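As a trivial sketch of the fix (in Python rather than the actual C++), the per-element byte count rounds up instead of truncating:

```python
# Illustration of the fix: round bits up to whole bytes instead of
# truncating, so 1-bit (i1) elements count as 1 byte rather than 0.
def bytes_per_elem(bits: int) -> int:
    return (bits + 7) // 8

assert bytes_per_elem(1) == 1   # i1: previously bits // 8 == 0
assert bytes_per_elem(16) == 2  # f16
assert bytes_per_elem(32) == 4  # f32
```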
libtriton.so is pretty large these days and hashing it is slow.
Switching the hash from md5 to sha1 shaves close to 300ms off the time
for me (as well as being a better hash, for whatever that's worth).

As far as I could tell, sha1 is the fastest stable hash in the Python
standard library, even counting things like zlib.crc32.
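A minimal sketch of this kind of file hashing with hashlib (the path and chunked read are illustrative; Triton's actual cache code may differ):

```python
import hashlib

# Minimal sketch: hash a large shared library with sha1 instead of md5.
def file_sha1(path: str) -> str:
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# print(file_sha1("/path/to/libtriton.so"))  # illustrative path
```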
Realised I could do this right after my first PR got merged. This saves
another 100ms.
Remove unnecessary skips. Decompose UTs in
persistent-warp-specialized-gemm into vintage and stylish.
zahimoud and others added 4 commits August 10, 2023 15:52
This was causing the IR to fail verification in the intermediate steps.
Also remove another unnecessary cast.
@1proprogrammerchant 1proprogrammerchant merged commit 9d340b0 into 1proprogrammerchant:main Aug 11, 2023