pp #8

Merged: 34 commits, Aug 11, 2023
Commits (34):
f1512bd
Initial code merge of Hopper support (#2036)
goostavz Aug 7, 2023
223c2d3
[CI] disable XPU tests (not compiling) (#2044)
ptillet Aug 7, 2023
54f1ac9
[CI] disable AMD CI (#2045)
ptillet Aug 7, 2023
521cfae
[CI] disabled float32 perf regression tests
ptillet Aug 7, 2023
98523bc
[BACKEND] Support MMA V3 with float16 accumulator (#2049)
ThomasRaoux Aug 7, 2023
30a331e
[FRONTEND] Support jit functions without arguments (#2043)
Jokeren Aug 8, 2023
3ec05fb
[CI] H100 tests always use ENABLE_TMA=1 ENABLE_MMA_V3=1 (#2051)
ptillet Aug 8, 2023
6a1ac65
[FRONTEND] improve error message for type mismatch (#2038)
daemyung Aug 8, 2023
341f5b6
[BACKEND] Add BarrierOp after AllocMBarrierOp when numCTAs == 1 (#2040)
qliu93 Aug 8, 2023
3cec89e
[NFC] Integration fix for upstream D156857 (#2021)
matthias-springer Aug 8, 2023
4ed8381
Linux arm64 support (#2003)
acollins3 Aug 8, 2023
31e79aa
[TESTS] remove get_proper_err, get_variant_golden (#2039)
ben-zhang-609 Aug 8, 2023
f21a053
[TUTORIALS] support flash attention 2 with KV's sequence length longe…
BoxiangW Aug 8, 2023
a76ecd7
add num_stages parameter to aot compile.py (#2000)
BinFan Aug 8, 2023
b525880
[Backend] Fix CTA->warp ordering for MMAv3 and fix dot-chain scripts …
goostavz Aug 8, 2023
11cf334
[hopper][ws] use per-agent thread idx by default (#2054)
allatit23 Aug 8, 2023
658747f
[FRONTEND] remove ptxas from git (#2055)
ptillet Aug 8, 2023
bb47f89
[FRONTEND] improve error message for shape mismatch (#2031)
daemyung Aug 8, 2023
2a95d9b
[Clean]: remove skip for num_ctas > 1 and num_warps == 8 (#2050)
ben-zhang-609 Aug 8, 2023
6dee55c
[HOPPER][WS] fix TMA store hang in ws mode (#2056)
allatit23 Aug 8, 2023
1c45836
[ROCM] fix device_type name (#2061)
binarman Aug 9, 2023
6d98a08
[HOPPER][WS] fix missing WS attrs when lowering to llvm (#2063)
allatit23 Aug 9, 2023
de47bba
[OPTIMIZER] Fix the load and store fallback issue of test_persisten… …
bealwang Aug 9, 2023
8a610f7
[HOPPER][WS] remove numCTAs = 1 check in guard pass (#2066)
allatit23 Aug 9, 2023
a58e6ef
[HOPPER][WS] support tt.reduce as dependent op in guard pass (#2067)
allatit23 Aug 9, 2023
3be74fa
Include only necessary MLIR conversion passes, rather than all of the…
hawkinsp Aug 9, 2023
29bfdb6
[BACKEND] Fix crash in reductions on i1 (#1996)
gflegar Aug 9, 2023
0e11257
[FRONTEND] improve speed of computing version_key (#2071)
hauntsaninja Aug 9, 2023
776b378
[FRONTEND] further improve version_key speed (#2073)
hauntsaninja Aug 9, 2023
d1ce4c4
[TESTS] refactor test-persistent-warp-specialized-gemm UTs (#2075)
bealwang Aug 10, 2023
4d373aa
[BACKEND] Remove HopperHelpers.c and replace with inline ptx and LLVM…
zahimoud Aug 10, 2023
4828f61
[BACKEND] Remove invalid indexCast ops (#2083)
ThomasRaoux Aug 11, 2023
b62b6d6
[FRONTEND] Remove cache key from metadata (#2082)
zahimoud Aug 11, 2023
0f91775
[DOCS] create 08-22-2023.md (#2087)
kshama-msft Aug 11, 2023
22 changes: 19 additions & 3 deletions .github/workflows/integration-tests.yml
@@ -27,7 +27,7 @@ jobs:
run: |
if [ x"${{ github.repository }}" == x"openai/triton" ]; then
echo '::set-output name=matrix-required::[["self-hosted", "A100"], ["self-hosted", "H100"]]'
echo '::set-output name=matrix-optional::[["self-hosted", "gfx908"], ["self-hosted", "arc770"]]'
echo '::set-output name=matrix-optional::[]'
else
echo '::set-output name=matrix-required::["ubuntu-latest"]'
echo '::set-output name=matrix-optional::["ubuntu-latest"]'
@@ -50,6 +50,8 @@ jobs:
if: ${{(matrix.runner[0] == 'self-hosted') && (matrix.runner[1] == 'V100' || matrix.runner[1] == 'A100' || matrix.runner[1] == 'H100')}}
run: |
echo "BACKEND=CUDA" >> "${GITHUB_ENV}"
echo "ENABLE_TMA=0" >> "${GITHUB_ENV}"
echo "ENABLE_MMA_V3=0" >> "${GITHUB_ENV}"

- name: Clear cache
run: |
@@ -79,8 +81,22 @@ jobs:
fi
lit -v "${LIT_TEST_DIR}"

- name: Run python tests on CUDA
if: ${{ env.BACKEND == 'CUDA'}}
- name: Enable MMAV3 and TMA
if: ${{(matrix.runner[0] == 'self-hosted') && (matrix.runner[1] == 'H100')}}
run: |
echo "ENABLE_TMA=1" >> "${GITHUB_ENV}"
echo "ENABLE_MMA_V3=1" >> "${GITHUB_ENV}"

- name: Run python tests on CUDA with ENABLE_TMA=1 and ENABLE_MMA_V3=1
if: ${{ env.BACKEND == 'CUDA' && env.ENABLE_TMA == '1' && env.ENABLE_MMA_V3 == '1'}}
run: |
cd python/test/unit
python3 -m pytest -n 8 --ignore=runtime
# run runtime tests serially to avoid race condition with cache handling.
python3 -m pytest runtime/

- name: Run python tests on CUDA with ENABLE_TMA=0 and ENABLE_MMA_V3=0
if: ${{ env.BACKEND == 'CUDA' && env.ENABLE_TMA == '0' && env.ENABLE_MMA_V3 == '0'}}
run: |
cd python/test/unit
python3 -m pytest -n 8 --ignore=runtime
3 changes: 3 additions & 0 deletions .gitignore
@@ -24,3 +24,6 @@ venv.bak/
# JetBrains project files
.idea
cmake-build-*

# Third-party binaries
ptxas
1 change: 1 addition & 0 deletions CMakeLists.txt
@@ -209,6 +209,7 @@ if(TRITON_BUILD_PYTHON_MODULE)
TritonAnalysis
TritonTransforms
TritonGPUTransforms
TritonNvidiaGPUTransforms
TritonLLVMIR
TritonPTX
TritonHSACO
3 changes: 3 additions & 0 deletions bin/CMakeLists.txt
@@ -9,6 +9,7 @@ target_link_libraries(triton-opt PRIVATE
TritonAnalysis
TritonTransforms
TritonGPUTransforms
TritonNvidiaGPUTransforms
${dialect_libs}
${conversion_libs}
# tests
@@ -29,6 +30,7 @@ target_link_libraries(triton-reduce PRIVATE
TritonAnalysis
TritonTransforms
TritonGPUTransforms
TritonNvidiaGPUTransforms
${dialect_libs}
${conversion_libs}
# tests
@@ -48,6 +50,7 @@ llvm_update_compile_flags(triton-translate)
TritonAnalysis
TritonTransforms
TritonGPUTransforms
TritonNvidiaGPUTransforms
TritonLLVMIR
TritonPTX
TritonHSACO
6 changes: 6 additions & 0 deletions bin/RegisterTritonDialects.h
@@ -1,10 +1,13 @@
#pragma once
#include "triton/Dialect/Triton/IR/Dialect.h"
#include "triton/Dialect/TritonGPU/IR/Dialect.h"
#include "triton/Dialect/TritonNvidiaGPU/IR/Dialect.h"

#include "triton/Dialect/Triton/Transforms/Passes.h"
#include "triton/Dialect/TritonGPU/Transforms/Passes.h"
#include "triton/Dialect/TritonNvidiaGPU/Transforms/Passes.h"

#include "triton/Conversion/NVGPUToLLVM/Passes.h"
#include "triton/Conversion/TritonGPUToLLVM/Passes.h"
#include "triton/Conversion/TritonToTritonGPU/Passes.h"

@@ -23,15 +26,18 @@ inline void registerTritonDialects(mlir::DialectRegistry &registry) {
mlir::registerAllPasses();
mlir::registerTritonPasses();
mlir::registerTritonGPUPasses();
mlir::registerTritonNvidiaGPUPasses();
mlir::test::registerTestAliasPass();
mlir::test::registerTestAlignmentPass();
mlir::test::registerTestAllocationPass();
mlir::test::registerTestMembarPass();
mlir::triton::registerConvertTritonToTritonGPUPass();
mlir::triton::registerConvertTritonGPUToLLVMPass();
mlir::triton::registerConvertNVGPUToLLVMPass();

// TODO: register Triton & TritonGPU passes
registry.insert<mlir::triton::TritonDialect, mlir::cf::ControlFlowDialect,
mlir::triton::nvidia_gpu::TritonNvidiaGPUDialect,
mlir::triton::gpu::TritonGPUDialect, mlir::math::MathDialect,
mlir::arith::ArithDialect, mlir::scf::SCFDialect,
mlir::gpu::GPUDialect>();
8 changes: 6 additions & 2 deletions bin/triton-translate.cpp
@@ -14,6 +14,7 @@
#include "triton/Conversion/TritonToTritonGPU/TritonToTritonGPUPass.h"
#include "triton/Dialect/Triton/IR/Dialect.h"
#include "triton/Dialect/TritonGPU/IR/Dialect.h"
#include "triton/Dialect/TritonNvidiaGPU/IR/Dialect.h"
#include "triton/Target/HSACO/HSACOTranslation.h"
#include "triton/Target/LLVMIR/LLVMIRTranslation.h"
#include "triton/Target/PTX/PTXTranslation.h"
@@ -38,6 +39,7 @@ OwningOpRef<ModuleOp> loadMLIRModule(llvm::StringRef inputFilename,
mlir::DialectRegistry registry;
registry
.insert<TritonDialect, triton::gpu::TritonGPUDialect,
triton::nvidia_gpu::TritonNvidiaGPUDialect,
mlir::math::MathDialect, arith::ArithDialect, scf::SCFDialect>();

context.appendDialectRegistry(registry);
@@ -121,8 +123,10 @@ LogicalResult tritonTranslateMain(int argc, char **argv,
}

llvm::LLVMContext llvmContext;
auto llvmir = translateTritonGPUToLLVMIR(&llvmContext, *module,
SMArch.getValue(), false /*isRocm*/);
mlir::triton::gpu::TMAMetadataTy tmaInfos;
auto llvmir = translateTritonGPUToLLVMIR(
&llvmContext, *module, SMArch.getValue(), tmaInfos, false /*isRocm*/);

if (!llvmir) {
llvm::errs() << "Translate to LLVM IR failed";
}
9 changes: 9 additions & 0 deletions docs/meetups/08-22-2023.md
@@ -0,0 +1,9 @@
#### Agenda:

##### Announcements:
1. Triton conference registration opening soon. Conference on 20th September at the Microsoft Silicon Valley Campus.

##### Items:
1. H100 updates
2. Linalg updates
3. Open discussion
1 change: 1 addition & 0 deletions docs/python-api/triton.language.rst
@@ -192,3 +192,4 @@ Iterators
:nosignatures:

static_range
multiple_of
13 changes: 7 additions & 6 deletions include/triton/Analysis/Allocation.h
@@ -9,6 +9,7 @@

#include "triton/Dialect/Triton/IR/Dialect.h"
#include "triton/Dialect/TritonGPU/IR/Dialect.h"
#include "triton/Dialect/TritonNvidiaGPU/IR/Dialect.h"
#include <atomic>
#include <limits>

@@ -147,17 +148,17 @@ class Allocation {
BufferKind kind;
BufferId id;
size_t size;
size_t alignment;
size_t offset;

bool operator==(const BufferT &other) const { return id == other.id; }
bool operator<(const BufferT &other) const { return id < other.id; }

BufferT() : BufferT(BufferKind::Explicit) {}
BufferT(BufferKind kind)
: kind(kind), id(InvalidBufferId), size(0), offset(0) {}
BufferT(BufferKind kind, size_t size) : BufferT(kind, size, 0) {}
BufferT(BufferKind kind, size_t size, size_t offset)
: kind(kind), id(nextId++), size(size), offset(offset) {}
BufferT() : BufferT(BufferKind::Explicit, 0) {}
BufferT(BufferKind kind, size_t size, size_t alignment = 4,
size_t offset = 0)
: kind(kind), id(nextId++), size(size), alignment(alignment),
offset(offset) {}
};

/// Op -> Scratch Buffer
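The BufferT change above folds the three constructors into one and adds an alignment field with a 4-byte default. A minimal sketch of the new constructor in use, roughly as it might appear inside the allocator's own implementation (BufferT is an internal type; everything here except the constructor signature is an assumption):

// Hypothetical fragment; only the constructor signature comes from the diff.
// A 1 KiB scratch buffer with an explicit 16-byte alignment; the offset
// defaults to 0 and is assigned later when buffers are packed.
BufferT scratch(BufferT::BufferKind::Scratch, /*size=*/1024,
                /*alignment=*/16);
// Default construction now routes through the same constructor:
// kind=Explicit, size=0, alignment=4, offset=0.
BufferT empty;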
2 changes: 1 addition & 1 deletion include/triton/Analysis/AxisInfo.h
@@ -21,7 +21,7 @@ namespace mlir {
/// This lattice value represents known information on the axes of a lattice.
class AxisInfo {
public:
typedef SmallVector<int64_t, 4> DimVectorT;
typedef SmallVector<int64_t> DimVectorT;

public:
/// Default constructor
11 changes: 10 additions & 1 deletion include/triton/Analysis/Utility.h
@@ -3,6 +3,7 @@

#include "mlir/Analysis/DataFlowFramework.h"
#include "mlir/Analysis/SliceAnalysis.h"
#include "triton/Dialect/Triton/IR/Dialect.h"
#include "triton/Dialect/TritonGPU/IR/Dialect.h"
#include <algorithm>
#include <numeric>
@@ -121,7 +122,11 @@ bool isSingleValue(Value value);

bool isMmaToDotShortcut(RankedTensorType &srcTy, RankedTensorType &dstTy);

Type getElementType(Value value);
bool isMmaToMmaShortcut(RankedTensorType &srcTy, RankedTensorType &dstTy);

// TODO: Move utility functions that belong to ConvertLayoutOp to class
// ConvertLayoutOpHelper in the future
bool shouldUseDistSmem(Attribute srcLayout, Attribute dstLayout);

template <typename T_OUT, typename T_IN>
inline SmallVector<T_OUT> convertType(ArrayRef<T_IN> in) {
@@ -324,6 +329,10 @@ template <typename T> class CallGraph {
FuncDataMapT funcMap;
SmallVector<FunctionOpInterface> roots;
};
// Create a basic DataFlowSolver with constant and dead code analysis included.
std::unique_ptr<DataFlowSolver> createDataFlowSolver();

triton::MakeTensorPtrOp getMakeTensorPtrOp(Value v);

} // namespace mlir

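Utility.h now also exposes a solver factory and a block-pointer helper. A hedged sketch of how the two might be combined (the wrapper below is hypothetical; only the two declarations come from the diff):

#include "mlir/IR/BuiltinOps.h"
#include "triton/Analysis/Utility.h"

// Hypothetical wrapper; only createDataFlowSolver() and getMakeTensorPtrOp()
// are taken from the header change above.
void runAnalyses(mlir::ModuleOp module, mlir::Value tensorPtr) {
  // Solver pre-seeded with constant-propagation and dead-code analyses.
  std::unique_ptr<mlir::DataFlowSolver> solver = mlir::createDataFlowSolver();
  if (mlir::failed(solver->initializeAndRun(module)))
    return;
  // Trace a block-pointer value back to its defining tt.make_tensor_ptr.
  auto makeTensorPtr = mlir::getMakeTensorPtrOp(tensorPtr);
  (void)makeTensorPtr;
}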
1 change: 1 addition & 0 deletions include/triton/Conversion/CMakeLists.txt
@@ -1,2 +1,3 @@
add_subdirectory(TritonToTritonGPU)
add_subdirectory(TritonGPUToLLVM)
add_subdirectory(NVGPUToLLVM)
3 changes: 3 additions & 0 deletions include/triton/Conversion/NVGPUToLLVM/CMakeLists.txt
@@ -0,0 +1,3 @@
set(LLVM_TARGET_DEFINITIONS Passes.td)
mlir_tablegen(Passes.h.inc -gen-pass-decls --name NVGPUToLLVM)
add_public_tablegen_target(NVGPUConversionPassIncGen)
19 changes: 19 additions & 0 deletions include/triton/Conversion/NVGPUToLLVM/NVGPUToLLVMPass.h
@@ -0,0 +1,19 @@
#ifndef TRITON_CONVERSION_NVGPU_TO_LLVM_PASS_H
#define TRITON_CONVERSION_NVGPU_TO_LLVM_PASS_H

#include <memory>

namespace mlir {

class ModuleOp;
template <typename T> class OperationPass;

namespace triton {

std::unique_ptr<OperationPass<ModuleOp>> createConvertNVGPUToLLVMPass();

} // namespace triton

} // namespace mlir

#endif
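For orientation, a minimal sketch of the new pass in a pipeline; its placement is an assumption, and only createConvertNVGPUToLLVMPass comes from this header:

#include "llvm/Support/raw_ostream.h"
#include "mlir/IR/BuiltinOps.h"
#include "mlir/Pass/PassManager.h"
#include "triton/Conversion/NVGPUToLLVM/NVGPUToLLVMPass.h"

// Minimal sketch; pipeline placement is an assumption.
void addNVGPULowering(mlir::ModuleOp module) {
  mlir::PassManager pm(module.getContext());
  // Lower nvgpu dialect ops (Hopper-specific primitives) to the LLVM dialect.
  pm.addPass(mlir::triton::createConvertNVGPUToLLVMPass());
  if (mlir::failed(pm.run(module)))
    llvm::errs() << "NVGPU-to-LLVM lowering failed\n";
}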
16 changes: 16 additions & 0 deletions include/triton/Conversion/NVGPUToLLVM/Passes.h
@@ -0,0 +1,16 @@
#ifndef NVGPU_CONVERSION_PASSES_H
#define NVGPU_CONVERSION_PASSES_H

#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "triton/Conversion/NVGPUToLLVM/NVGPUToLLVMPass.h"

namespace mlir {
namespace triton {

#define GEN_PASS_REGISTRATION
#include "triton/Conversion/NVGPUToLLVM/Passes.h.inc"

} // namespace triton
} // namespace mlir

#endif
20 changes: 20 additions & 0 deletions include/triton/Conversion/NVGPUToLLVM/Passes.td
@@ -0,0 +1,20 @@
#ifndef NVGPU_CONVERSION_PASSES
#define NVGPU_CONVERSION_PASSES

include "mlir/Pass/PassBase.td"


def ConvertNVGPUToLLVM : Pass<"convert-nv-gpu-to-llvm", "mlir::ModuleOp"> {
let summary = "Convert NVGPU to LLVM";
let description = [{

}];
let constructor = "mlir::triton::createConvertNVGPUToLLVMPass()";

let dependentDialects = ["mlir::arith::ArithDialect",
"mlir::LLVM::LLVMDialect",
"mlir::NVVM::NVVMDialect",
"mlir::triton::nvgpu::NVGPUDialect"];
}

#endif
6 changes: 6 additions & 0 deletions include/triton/Conversion/TritonGPUToLLVM/PTXAsmFormat.h
@@ -151,6 +151,12 @@ struct PTXBuilder {
// aggressive optimizations that may lead to incorrect results.
Operand *newOperand(StringRef constraint, bool init = false);

// Create a new operand that is tied to a previous operand. In this case the
// asm would be permitted to write to an input register. Instead of providing
// constraint code for this operand, the constraint code of the tied operand
// is used.
Operand *newOperand(unsigned operandIndex);

// Create a constant integer operand.
Operand *newConstantOperand(int64_t v);
// Create a constant operand with explicit code specified.
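The tied-operand overload above lets inline asm write to an input register. A hedged sketch of how it might sit next to the existing constraint-based overload (the surrounding builder calls are assumptions, not taken from this PR):

PTXBuilder builder;
// Operand 0: an output register the asm writes ("=r").
auto *dst = builder.newOperand("=r");
// An input tied to operand 0: it inherits dst's register and constraint
// code, so the asm may read the incoming value and overwrite it in place.
auto *tiedInput = builder.newOperand(/*operandIndex=*/0);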
4 changes: 4 additions & 0 deletions include/triton/Conversion/TritonGPUToLLVM/Passes.td
@@ -19,13 +19,17 @@ def ConvertTritonGPUToLLVM : Pass<"convert-triton-gpu-to-llvm", "mlir::ModuleOp"
"mlir::tensor::TensorDialect",
"mlir::triton::TritonDialect",
"mlir::triton::gpu::TritonGPUDialect",
"mlir::triton::nvidia_gpu::TritonNvidiaGPUDialect",
"mlir::ROCDL::ROCDLDialect",
"mlir::NVVM::NVVMDialect"];

let options = [
Option<"computeCapability", "compute-capability",
"int32_t", /*default*/"80",
"device compute capability">,
Option<"TmaMetadata", "tma-metadata",
"mlir::triton::gpu::TMAMetadataTy*", /*default*/"nullptr",
"tma metadata to the runtime">,
Option<"isROCM", "is-rocm",
"bool", /*default*/"false",
"compile for ROCM-compatible LLVM">,
include/triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h
@@ -3,6 +3,8 @@

#include "mlir/Conversion/LLVMCommon/TypeConverter.h"
#include "mlir/Transforms/DialectConversion.h"
#include "triton/Target/PTX/TmaMetadata.h"

#include <memory>

namespace mlir {
@@ -12,9 +14,10 @@ template <typename T> class OperationPass;

namespace triton {

std::unique_ptr<OperationPass<ModuleOp>>
createConvertTritonGPUToLLVMPass(int computeCapability = 80,
bool isROCM = false);
std::unique_ptr<OperationPass<ModuleOp>> createConvertTritonGPUToLLVMPass(
int computeCapability = 80,
mlir::triton::gpu::TMAMetadataTy *tmaMetadata = nullptr,
bool isROCM = false);

} // namespace triton

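A sketch of a caller adapting to the widened factory; the pass-manager context is assumed, the parameter list comes from the declaration above:

#include "mlir/Pass/PassManager.h"
#include "triton/Conversion/TritonGPUToLLVM/TritonGPUToLLVMPass.h"

// Sketch only. tmaMetadata is filled during lowering with TMA descriptor
// info for the runtime, so it must outlive the pipeline run.
void addTritonGPULowering(mlir::PassManager &pm,
                          mlir::triton::gpu::TMAMetadataTy &tmaMetadata) {
  pm.addPass(mlir::triton::createConvertTritonGPUToLLVMPass(
      /*computeCapability=*/90, &tmaMetadata, /*isROCM=*/false));
}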
1 change: 1 addition & 0 deletions include/triton/Conversion/TritonToTritonGPU/Passes.h
@@ -2,6 +2,7 @@
#define TRITON_CONVERSION_PASSES_H

#include "triton/Conversion/TritonToTritonGPU/TritonToTritonGPUPass.h"
#include "triton/Target/PTX/TmaMetadata.h"

namespace mlir {
namespace triton {
6 changes: 6 additions & 0 deletions include/triton/Conversion/TritonToTritonGPU/Passes.td
@@ -25,6 +25,12 @@ def ConvertTritonToTritonGPU: Pass<"convert-triton-to-tritongpu", "mlir::ModuleO
Option<"threadsPerWarp", "threads-per-warp",
"int32_t", /*default*/"32",
"number of threads per warp">,
Option<"numCTAs", "num-ctas",
"int32_t", /*default*/"1",
"number of ctas in a cga">,
Option<"computeCapability", "compute-capability",
"int32_t", /*default*/"80",
"compute capability">
];
}

include/triton/Conversion/TritonToTritonGPU/TritonToTritonGPUPass.h
@@ -11,6 +11,9 @@ template <typename T> class OperationPass;
namespace triton {

constexpr static char AttrNumWarpsName[] = "triton_gpu.num-warps";
constexpr static char AttrNumCTAsName[] = "triton_gpu.num-ctas";
constexpr static char AttrComputeCapabilityName[] =
"triton_gpu.compute-capability";

constexpr static char AttrNumThreadsPerWarp[] = "triton_gpu.threads-per-warp";

@@ -19,7 +22,8 @@ std::unique_ptr<OperationPass<ModuleOp>> createConvertTritonToTritonGPUPass();

// Create the pass with numWarps set explicitly.
std::unique_ptr<OperationPass<ModuleOp>>
createConvertTritonToTritonGPUPass(int numWarps, int threadsPerWarp = 32);
createConvertTritonToTritonGPUPass(int numWarps, int threadsPerWarp = 32,
int numCTAs = 1, int computeCapability = 80);

} // namespace triton
} // namespace mlir
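And a matching sketch for the extended TritonToTritonGPU factory; the Hopper-flavored values below are illustrative, not taken from this PR:

#include "mlir/Pass/PassManager.h"
#include "triton/Conversion/TritonToTritonGPU/TritonToTritonGPUPass.h"

// Illustrative values; only the two new parameters (numCTAs,
// computeCapability) come from the declaration above.
void addTritonToTritonGPU(mlir::PassManager &pm) {
  pm.addPass(mlir::triton::createConvertTritonToTritonGPUPass(
      /*numWarps=*/4, /*threadsPerWarp=*/32,
      /*numCTAs=*/2, /*computeCapability=*/90));
}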
2 changes: 2 additions & 0 deletions include/triton/Dialect/CMakeLists.txt
@@ -1,2 +1,4 @@
add_subdirectory(Triton)
add_subdirectory(TritonGPU)
add_subdirectory(TritonNvidiaGPU)
add_subdirectory(NVGPU)