
pp #8

Merged
merged 34 commits into from
Aug 11, 2023
Conversation

1proprogrammerchant
Owner

No description provided.

goostavz and others added 30 commits August 7, 2023 09:53
The initial code merge of Nvidia Hopper features support. Please be
aware that the code merge is not finished yet and troubleshooting
is still ongoing. The new hardware features (GMMA, TMA, STMATRIX, etc.)
and automatic warp specialization are experimental for now and turned
off by default. A trial is recommended once version 3.0 is released.

The work is contributed by:
ben-zhang-609, bealwang, donproc, qliu93, jsh20, allatit23, LyricZhao,
ivanyinwz, goostavz & yangjunpro
from Nvidia, in cooperation with:
ptillet, Jokeren, ThomasRaoux & zahimoud
from OpenAI.

Co-authored-by: Goostav Zhu <[email protected]>
cc @EikanWang. I'm disabling this for now since it broke with the H100
merge, but please feel free to fix the compilation errors and submit a
PR.
Also fixes a bug exposed in convertLayout lowering for float16: we
shouldn't be using cvt.pack.sat.u16.s32 to pack 16-bit values, as it
needs to take a 32-bit register. This also prevented optimization at
the LLVM IR level.
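For illustration, a minimal Python sketch (not Triton's actual lowering code) of packing two 16-bit values into one 32-bit word with plain shifts and ors, the kind of packing that can be expressed and optimized at the LLVM IR level without a saturating convert:

```python
import numpy as np

# Hypothetical illustration: pack two float16 bit patterns into one 32-bit
# word with shifts and ors -- unlike a saturating cvt, this keeps the values
# in 16-bit form and needs no 32-bit source registers.
def pack_two_f16(lo: np.float16, hi: np.float16) -> int:
    lo_bits = int(np.float16(lo).view(np.uint16))
    hi_bits = int(np.float16(hi).view(np.uint16))
    return lo_bits | (hi_bits << 16)

print(hex(pack_two_f16(np.float16(1.0), np.float16(-2.0))))  # 0xc0003c00
```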
Make sure that other threads within the CTA do not operate on the mbarrier
until it is initialized by thread 0.

Co-authored-by: Philippe Tillet <[email protected]>
Use camel case accessors ("getStaticOffsets" etc.) for `ExtractSliceOp`.
This change works with and without the changes from D156857. After
D156857 has landed, only camel case accessors work for ops that
implement the `OffsetSizeAndStrideOpInterface`.

https://reviews.llvm.org/D156857

Co-authored-by: Philippe Tillet <[email protected]>
We are interested in having python wheels for triton built for Linux
arm64 platforms, such as NVIDIA's Grace CPU.

This change is fairly simple, however:
- It requires a linux arm64 build of LLVM to be available (see MR here:
ptillet/triton-llvm-releases#15)
- For now my changes use the LLVM build hosted here:
https://github.com/acollins3/triton-llvm-releases/releases/tag/llvm-17.0.0-c5dede880d17
- The Triton release process will need to be updated to include arm64
wheels. Is this something you have time to work on @ptillet? It would be
difficult for me to update this part without more access permissions.

With these changes, I managed to build a set of python wheels and have
hosted them here for us to use in the meantime:
https://github.com/acollins3/triton/releases/tag/triton-2.1.0-arm64
…r than Q's (#2033)

Implemented this situation with and without the causal mask.
My implementation with the causal mask looks like:
111000
111100
111110
where only the upper-right triangular part is masked.
I added `P_SEQ` as the notation for the extra sequence length of KV.

Co-authored-by: Philippe Tillet <[email protected]>
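A small NumPy sketch of the mask pattern above (not the kernel itself), assuming query i may attend to key j only when j <= i + P_SEQ:

```python
import numpy as np

# Hypothetical sketch of the causal mask when KV is longer than Q.
# P_SEQ is the extra KV length; query i attends to keys j <= i + P_SEQ,
# so only the upper-right triangle of the score matrix is masked out.
def causal_mask(n_q: int, n_kv: int, p_seq: int) -> np.ndarray:
    q_idx = np.arange(n_q)[:, None]
    k_idx = np.arange(n_kv)[None, :]
    return (k_idx <= q_idx + p_seq).astype(int)

print(causal_mask(n_q=3, n_kv=6, p_seq=2))
# [[1 1 1 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 0]]
```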
This allows the AOT client to tune the number of stages for the
generated kernel. Set the default number to 3 to match the Triton
compiler.
…in hopper tests (#2041)

Co-authored-by: goostavz <[email protected]>
Co-authored-by: Philippe Tillet <[email protected]>
Co-authored-by: ben-zhang-609 <[email protected]>
Improve error messaging for block shape and value shape mismatch.
Rename "rocm" -> "hip", to comply with other uses in compiler.py.
…m. (#2068)

No functional changes intended, and it might slightly speed up the
build.

This allows a downstream Bazel build of Triton to avoid building a
number of dialects and passes that Triton doesn't need.
`getScratchSizeInBytes` was assuming that the size of all types in bits
is a multiple of 8. If it is not, it would return 0. This caused a bug
for the boolean (i1) type, where the reduction lowering would attempt to
use shared memory, which was not assigned to the op.

Fix this issue by setting the number of bytes per element to
`ceil(bits / 8)`.
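As a trivial sketch of the fix (in Python rather than the actual C++), the per-element byte count rounds up instead of truncating:

```python
# Illustration of the fix: round bits up to whole bytes instead of
# truncating, so 1-bit (i1) elements count as 1 byte rather than 0.
def bytes_per_elem(bits: int) -> int:
    return (bits + 7) // 8

assert bytes_per_elem(1) == 1   # i1: previously bits // 8 == 0
assert bytes_per_elem(16) == 2  # f16
assert bytes_per_elem(32) == 4  # f32
```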
libtriton.so is pretty large these days and hashing it is slow.
Switching the hash from md5 to sha1 shaves close to 300ms off the time
for me (as well as being a better hash, for whatever that's worth).

As far as I could tell, sha1 is the fastest stable hash in the Python
standard library, even counting things like zlib.crc32.
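A minimal sketch of this kind of file hashing with hashlib (the path and chunked read are illustrative; Triton's actual cache code may differ):

```python
import hashlib

# Minimal sketch: hash a large shared library with sha1 instead of md5.
def file_sha1(path: str) -> str:
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

# print(file_sha1("/path/to/libtriton.so"))  # illustrative path
```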
Realised I could do this right after my first PR got merged. This saves
another 100ms.
Remove unnecessary skips. Decompose UTs in
persistent-warp-specialized-gemm into vintage and stylish.
zahimoud and others added 4 commits August 10, 2023 15:52
This was causing the IR to fail verification in the intermediate steps.
Also remove another unnecessary cast.
@1proprogrammerchant 1proprogrammerchant merged commit 9d340b0 into 1proprogrammerchant:main Aug 11, 2023