Guide for XPU native functions registration using Codegen
torch-xpu-ops
- src
  - ATen
    - native
      - xpu
        - sycl
          - ker1.cpp
          - ker2.cpp
        - op1.cpp
        - op2.cpp
build
- xpu/ATen/              # generated code for XPU
  - RegisterXPU.cpp
  - ops                  # include headers from this folder
    - log_native.h
    - log.h
    - xxxx
- aten/src/ATen          # generated code for CPU/CUDA/MPS...
  - RegisterCPU.cpp
  - RegisterCUDA.cpp
  - RegisterXPU.cpp      # not used currently
  - ops                  # headers here can be included, but rarely
    - log_native.h
    - log.h
    - xxxx
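For orientation, the generated RegisterXPU.cpp is where codegen binds each YAML entry to the XPU dispatch key. A simplified, illustrative sketch of what that registration looks like (the wrapper names here are hypothetical placeholders, not the verbatim generated code):

// Illustrative sketch only: codegen emits the wrapper functions and
// registers them on the XPU dispatch key.
TORCH_LIBRARY_IMPL(aten, XPU, m) {
  m.impl("threshold.out", TORCH_FN(wrapper_XPU_threshold_out_out));
  m.impl("gelu_backward.grad_input", TORCH_FN(wrapper_XPU_gelu_backward_out_grad_input));
}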
- Find the reference declarations in pytorch/aten/src/ATen/native/native_functions.yaml. For example, suppose the kernels to be ported are threshold and gelu. The reference declarations in PyTorch are:
# TODO: namespace threshold in 'nn'
- func: threshold(Tensor self, Scalar threshold, Scalar value) -> Tensor
  device_check: NoCheck   # TensorIterator
  variants: function
  structured_delegate: threshold.out
  dispatch:
    QuantizedCPU: threshold_quantized_cpu

- func: threshold_(Tensor(a!) self, Scalar threshold, Scalar value) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  variants: function
  structured_delegate: threshold.out

- func: threshold.out(Tensor self, Scalar threshold, Scalar value, *, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    CPU, CUDA: threshold_out
    MPS: threshold_out_mps

- func: gelu_backward.grad_input(Tensor grad_output, Tensor self, *, str approximate='none', Tensor(a!) grad_input) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase
  python_module: nn
  dispatch:
    CPU: gelu_backward_out_cpu
    CUDA: gelu_backward_out_cuda
    MPS: gelu_backward_out_mps

- func: gelu_backward(Tensor grad_output, Tensor self, *, str approximate='none') -> Tensor
  structured_delegate: gelu_backward.grad_input
  python_module: nn
  dispatch:
    MkldnnCPU: mkldnn_gelu_backward
    NestedTensorCPU, NestedTensorCUDA: gelu_backwards_nested
  tags: pointwise
- Port them into third_party/torch-xpu-ops/yaml/native/native_functions.yaml: remove the CPU & CUDA dispatch entries and add an XPU dispatch entry. All variants need to be copied; the following shows only part of the declarations.
- func: threshold.out(Tensor self, Scalar threshold, Scalar value, *, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    XPU: threshold_out # If CPU & CUDA share one function, we share it too. The op uses TORCH_IMPL_FUNC or REGISTER_XPU_DISPATCH.

- func: gelu_backward.grad_input(Tensor grad_output, Tensor self, *, str approximate='none', Tensor(a!) grad_input) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase
  python_module: nn
  dispatch:
    XPU: gelu_backward_out_xpu # If CPU & CUDA have separate functions, we name our own. You need to add a definition for this function, or use TORCH_IMPL_FUNC; just align with the CUDA code.
- Modify the ops in torch-xpu-ops/src/ATen/native/xpu/xxx.cpp:
// Step 1:
// a. Include the headers the op requires from `pytorch/aten/src/ATen/native`, like Activation.h. You usually need this when you want to find an `op_stub`.
// b. If you need REGISTER_XPU_DISPATCH like CUDA/CPU, #include <ATen/native/DispatchStub.h>.
#include <ATen/native/Activation.h>
#include <ATen/native/DispatchStub.h>

// Step 2: Optional, include op_native.h.
// Include the generated headers for your op, located at `build/xpu/ATen/ops/xxx_native.h`.
#include <xpu/ATen/ops/gelu_backward_native.h>
#include <xpu/ATen/ops/gelu_native.h>

// Step 3: Optional, include op.h,
// if you need to redispatch, e.g. call at::empty, at::zeros, or even at::threshold.
#include <ATen/ops/gelu.h>

// Step 3.1: Optional, declarations for at::xpu ops. [Updated 7/17]
// If you need to call at::xpu::xxxx, please copy the declaration from RegisterXPU.cpp.
namespace at::xpu {
ops(xxxxx); // declaration copied from RegisterXPU.cpp
}
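// For example, a hypothetical declaration copied from RegisterXPU.cpp
// (the actual signature must match the generated one exactly):
//
//   namespace at::xpu {
//   TORCH_API at::Tensor gelu(const at::Tensor& self, c10::string_view approximate="none");
//   }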
namespace at::native {

// Step 4: Optional, manually add declarations for other variants, as on CPU.
// Sometimes you define op1 but also want to call at::native::op1_other_variants (like at::native::op1_slow).
// Copy the declaration from `ATen/ops/op1_native.h` into your source file.
// Please DO NOT directly `#include <ATen/ops/op1_native.h>`, as that may violate the ODR (One Definition Rule).

// Step 5: Register or define your function in namespace `at::native`.
// Case 1: For TensorIterator-based ops, REGISTER_XPU_DISPATCH is usually used.
// Case 2: For structured operators, TORCH_IMPL_FUNC is usually used. Note that inside the macro TORCH_IMPL_FUNC you should not return any tensor.
// Case 3: For unstructured operators, directly define your functions, e.g. `layer_norm_xpu`.
REGISTER_XPU_DISPATCH(threshold_stub, xpu::threshold_kernel);

TORCH_IMPL_FUNC(gelu_backward_out_xpu)
(const Tensor& /*grad*/,
 const Tensor& /*self*/,
 c10::string_view approximate,
 const Tensor& /*grad_input*/) {
  xpu::gelu_backward_kernel(*this, approximate);
}

TORCH_IMPL_FUNC(gelu_out_xpu)
(const Tensor& /*self*/, c10::string_view approximate, const Tensor& /*result*/) {
  xpu::gelu_kernel(*this, approximate);
}
} // namespace at::native
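For reference, a minimal sketch of the kernel declarations the registrations above expect. The header path and exact signatures here are illustrative assumptions, not the actual torch-xpu-ops headers; align them with the real kernels under torch-xpu-ops/src/ATen/native/xpu/sycl.

// Hypothetical header, e.g. src/ATen/native/xpu/sycl/ActivationKernels.h
#pragma once
#include <ATen/native/TensorIterator.h>
#include <c10/core/Scalar.h>
#include <c10/util/string_view.h>

namespace at::native::xpu {

// Matches threshold_fn in ATen/native/Activation.h, so it can be passed
// to REGISTER_XPU_DISPATCH(threshold_stub, ...).
void threshold_kernel(TensorIteratorBase& iter, const Scalar& threshold, const Scalar& value);

// Called from TORCH_IMPL_FUNC(gelu_backward_out_xpu) with the structured
// op's TensorIteratorBase (*this) plus the extra non-tensor argument.
void gelu_backward_kernel(TensorIteratorBase& grad_input, c10::string_view approximate);

} // namespace at::native::xpu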
- Do not #include <ATen/ATen.h>. This would implicitly pull in a huge number of headers under ops; prefer narrow includes instead (see the sketch below). A WORKAROUND is #include <comm/xpu_aten.h>, but I still recommend not using it, as it is much more like a header for an out-of-tree backend.
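A sketch of the fine-grained include style this implies (the header names are just examples; pick only what your file actually uses):

// Prefer narrow includes over the monolithic <ATen/ATen.h>:
#include <ATen/core/Tensor.h> // the Tensor class itself
#include <ATen/ops/empty.h>   // only the individual ops you call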
- [sycl kernel] Align with the CPU & CUDA function declarations: use const Tensor& for outputs instead of Tensor&, and return void instead of returning Tensor&.
template <bool LogSoftMax>
void host_softmax_backward( // returns void; do not return a tensor
    const Tensor& grad_,
    const Tensor& output_, // use const Tensor&, instead of Tensor&
    int64_t dim_,
    bool half_to_float,
    const Tensor& gI) {
  // ... kernel implementation ...
}
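This matches the structured/out-variant convention: the output tensors are already allocated before the kernel runs, so the kernel only writes into the provided const Tensor& and has nothing meaningful to return.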