Guide for XPU native functions registration using Codegen

Yutao Xu edited this page Oct 15, 2024 · 1 revision

Align codegen in torch-xpu-ops with PyTorch

Code structure change

torch-xpu-ops
  - src
    - ATen
      - native
        - xpu
          - sycl
            - ker1.cpp
            - ker2.cpp
          - op1.cpp
          - op2.cpp

Generated Files location

build
  - xpu/ATen/          # generated code for XPU
    - RegisterXPU.cpp
    - ops/             # include headers from this directory
      - log_native.h
      - log.h
      - xxxx
  - aten/src/ATen/     # generated code for CPU/CUDA/MPS...
    - RegisterCPU.cpp
    - RegisterCUDA.cpp
    - RegisterXPU.cpp  # currently unused
    - ops/             # headers here can be included too, but rarely are
      - log_native.h
      - log.h
      - xxxx

Modify existing kernels

  1. Find the reference declarations in pytorch/aten/src/ATen/native/native_functions.yaml. For example, if the kernels to be ported are threshold and gelu, the reference declarations in PyTorch are:
# TODO: namespace threshold in 'nn'
- func: threshold(Tensor self, Scalar threshold, Scalar value) -> Tensor
  device_check: NoCheck   # TensorIterator
  variants: function
  structured_delegate: threshold.out
  dispatch:
    QuantizedCPU: threshold_quantized_cpu

- func: threshold_(Tensor(a!) self, Scalar threshold, Scalar value) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  variants: function
  structured_delegate: threshold.out

- func: threshold.out(Tensor self, Scalar threshold, Scalar value, *, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    CPU, CUDA: threshold_out
    MPS: threshold_out_mps

- func: gelu_backward.grad_input(Tensor grad_output, Tensor self, *, str approximate='none', Tensor(a!) grad_input) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase
  python_module: nn
  dispatch:
    CPU: gelu_backward_out_cpu
    CUDA: gelu_backward_out_cuda
    MPS: gelu_backward_out_mps

- func: gelu_backward(Tensor grad_output, Tensor self, *, str approximate='none') -> Tensor
  structured_delegate: gelu_backward.grad_input
  python_module: nn
  dispatch:
    MkldnnCPU: mkldnn_gelu_backward
    NestedTensorCPU, NestedTensorCUDA: gelu_backwards_nested
  tags: pointwise
  1. Port them into third_party/torch-xpu-ops/yaml/native/native_functions.yaml: remove the CPU and CUDA dispatch entries and add an XPU dispatch entry. All variants need to be copied; the following shows only part of the declarations.
- func: threshold.out(Tensor self, Scalar threshold, Scalar value, *, Tensor(a!) out) -> Tensor(a!)
  device_check: NoCheck   # TensorIterator
  structured: True
  structured_inherits: TensorIteratorBase
  dispatch:
    XPU: threshold_out  # If CPU and CUDA share one function, XPU shares it too. The op then uses TORCH_IMPL_FUNC or REGISTER_XPU_DISPATCH.

- func: gelu_backward.grad_input(Tensor grad_output, Tensor self, *, str approximate='none', Tensor(a!) grad_input) -> Tensor(a!)
  structured: True
  structured_inherits: TensorIteratorBase
  python_module: nn
  dispatch:
    XPU: gelu_backward_out_xpu # If CPU and CUDA have separate functions, XPU gets its own name. You need to add a definition for this function or use TORCH_IMPL_FUNC; just align with the CUDA code.
  1. Modify ops in torch-xpu-ops/src/ATen/native/xxx.cpp
// Step 1
// a. Include the headers the op requires from `pytorch/aten/src/ATen/native`, such as Activation.h. You usually need this when you are looking for an `op_stub`.
// b. If you need REGISTER_XPU_DISPATCH, as CUDA/CPU do, #include <ATen/native/DispatchStub.h>
#include <ATen/native/Activation.h>
#include <ATen/native/DispatchStub.h>

// Step 2: Optional, include op_native.h
// Include the generated headers for your op, located at `build/xpu/ATen/ops/xxx_native.h`
#include <xpu/ATen/ops/gelu_backward_native.h>
#include <xpu/ATen/ops/gelu_native.h>

// Step 3: Optional, include op.h
// Do this if you need to redispatch, e.g. call at::empty, at::zeros, or even at::threshold.
#include <ATen/ops/gelu.h>

// Step 3.1: Optional, declarations for at::xpu::ops [updated 7/17]
// If you need to call at::xpu::xxxx, copy its declaration from RegisterXPU.cpp.
namespace at::xpu{
   ops(xxxxx); // declaration
}

namespace at::native {

// Step 4: Optional, manually add declarations for other CPU variants.
// Sometimes you define op1 but want to call at::native::op1_other_variants (like at::native::op1_slow).
// In that case, copy the declaration from `ATen/ops/op1_native.h` into your source file.
// Please DO NOT directly `#include <ATen/ops/op1_native.h>`, as it may violate the ODR (One Definition Rule).


// Step 5: Register or define your function in namespace `at::native`
// Case 1: For TensorIterator-based ops, REGISTER_XPU_DISPATCH is usually used.
// Case 2: For structured operators, TORCH_IMPL_FUNC is usually used. Note that the body of the TORCH_IMPL_FUNC macro must not return any tensor.
// Case 3: For unstructured operators, define your functions directly, e.g. define `layer_norm_xpu`.

REGISTER_XPU_DISPATCH(threshold_stub, xpu::threshold_kernel);
TORCH_IMPL_FUNC(gelu_backward_out_xpu)
(const Tensor& /*grad*/,
 const Tensor& /*self*/,
 c10::string_view approximate,
 const Tensor& /*grad_input*/
) {
  xpu::gelu_backward_kernel(*this, approximate);
}

TORCH_IMPL_FUNC(gelu_out_xpu)
(const Tensor& /*self*/, c10::string_view approximate, const Tensor& /*result*/
) {
  xpu::gelu_kernel(*this, approximate);
}

} // namespace at::native
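
To make Case 1 and Case 2 above more concrete, here is a minimal, self-contained sketch of the two registration styles. It does not use PyTorch at all: MiniIter, MiniStub, MiniRegistrar and structured_threshold_out are hypothetical stand-ins invented for this illustration, and the real DispatchStub and TORCH_IMPL_FUNC machinery is considerably more involved. The sketch only assumes the general shape: a stub holds a per-backend function slot that is filled at static-initialization time, and a structured op is a class whose impl() returns void and hands *this to the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for TensorIteratorBase: it owns the operands.
struct MiniIter {
  std::vector<float> input;
  std::vector<float> output;
};

// Hypothetical stand-in for a DispatchStub: one function slot for the XPU backend.
struct MiniStub {
  using Fn = void (*)(MiniIter&, float, float);
  Fn xpu_fn = nullptr;
  void operator()(MiniIter& iter, float threshold, float value) const {
    assert(xpu_fn != nullptr);  // calling an unregistered stub is a bug
    xpu_fn(iter, threshold, value);
  }
};

// Registrar object: its constructor fills the slot during static initialization,
// which is the role this sketch assigns to REGISTER_XPU_DISPATCH.
struct MiniRegistrar {
  MiniRegistrar(MiniStub& stub, MiniStub::Fn fn) { stub.xpu_fn = fn; }
};

MiniStub threshold_stub;

namespace xpu {
// Kernel: keep x where x > threshold, otherwise write value.
void threshold_kernel(MiniIter& iter, float threshold, float value) {
  iter.output.resize(iter.input.size());
  for (std::size_t i = 0; i < iter.input.size(); ++i) {
    const float x = iter.input[i];
    iter.output[i] = x > threshold ? x : value;
  }
}
}  // namespace xpu

// Case 1 analogue: register the kernel against the stub.
static MiniRegistrar register_threshold(threshold_stub, xpu::threshold_kernel);

// Case 2 analogue: the structured op is a class derived from the iterator;
// impl() returns void and forwards *this to the kernel.
struct structured_threshold_out : MiniIter {
  void impl(float threshold, float value) {
    xpu::threshold_kernel(*this, threshold, value);
  }
};
```

In this sketch, calling threshold_stub(iter, 1.0f, 0.0f) reaches xpu::threshold_kernel through the registered slot, while structured_threshold_out::impl writes its result into the object's own output rather than returning it, which is why a TORCH_IMPL_FUNC body has nothing to return.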

Other changes to note

  1. Do not #include <ATen/ATen.h>. It implicitly pulls a very large number of headers into the ops. A workaround is #include <comm/xpu_aten.h>, but I still recommend against it, since it is more of a header for out-of-tree backends.

  2. [sycl kernel] Align with the CPU and CUDA function declarations: take outputs as const Tensor& instead of Tensor&, and return void instead of Tensor&.
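
It may look odd that an output can be passed as a const reference. The sketch below illustrates why this works, under the assumption that a tensor behaves like a cheap handle sharing ownership of its storage, so const applies to the handle and not to the data it points at. MiniTensor and scale_backward are hypothetical names made up for this illustration; they are not part of ATen.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical miniature of a tensor: a handle that shares its storage.
class MiniTensor {
 public:
  explicit MiniTensor(std::size_t n)
      : storage_(std::make_shared<std::vector<float>>(n, 0.0f)) {}

  // const member returning a mutable pointer: the handle is const,
  // the underlying buffer is not.
  float* data() const { return storage_->data(); }
  std::size_t numel() const { return storage_->size(); }

 private:
  std::shared_ptr<std::vector<float>> storage_;
};

// Same signature shape as recommended above: void return type,
// output taken by const reference and filled in place.
void scale_backward(const MiniTensor& grad,
                    const MiniTensor& grad_input,
                    float factor) {
  for (std::size_t i = 0; i < grad.numel(); ++i) {
    grad_input.data()[i] = grad.data()[i] * factor;
  }
}
```

The caller keeps its own handle to grad_input and sees the written values after the call, so returning the tensor adds nothing. The softmax signature from the guide follows: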

template <bool LogSoftMax>
void host_softmax_backward(  // use void, do not return tensor
    const Tensor& grad_,
    const Tensor& output_, // use const&, instead of Tensor&
    int64_t dim_,
    bool half_to_float,
    const Tensor& gI) {