linear_int4_kernel for XPU #1130
base: main
Conversation
Reset to bfdbaf4

Co-authored-by: mengfei25 <[email protected]>
Co-authored-by: LuFengqing <[email protected]>
Co-authored-by: Ratnam Parikh <[email protected]>
Co-authored-by: Feng Yuan <[email protected]>
The biggest question is why we need post-op fusion here. Does PyTorch have it with CUDA?
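For context (not part of the PR), "post-op fusion" means folding the elementwise epilogue (bias add, activation) into the GEMM kernel so the intermediate result does not take an extra round trip through global memory. A minimal unfused reference, with an fp16 matmul standing in for the int4 gemm and made-up shapes:

```python
import torch

# Unfused reference: kernel 1 writes the gemm result to memory, kernel 2 reads
# it back to apply the bias. A fused kernel would apply the bias (and any
# activation) in the gemm epilogue instead. The fp16 matmul is only a stand-in
# for the packed int4 gemm discussed in this PR.
M, K, N = 1, 4096, 4096
a = torch.randn(M, K, dtype=torch.float16)
w = torch.randn(K, N, dtype=torch.float16)
bias = torch.randn(N, dtype=torch.float16)

out = a @ w       # kernel 1: gemm
out = out + bias  # kernel 2: elementwise post op (the fusion candidate)
```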
auto aptr = A;
auto cptr = C + g_n;
if constexpr (std::is_same_v<scalar_t, sycl::half>) {
  sycl::half2 tmpAcc = {0.f, 0.f};
Is it safe to use half as the accumulator type?
Usually the accumulator type for both float16 and bfloat16 is float32.
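As an illustration of the concern (not from the PR's tests), a small PyTorch sketch: rounding the running sum to fp16 at every step, as a sycl::half accumulator would, drifts away from an fp32 accumulation. The sizes and the explicit loop are assumptions for the demo.

```python
import torch

torch.manual_seed(0)
a = torch.randn(2048, dtype=torch.float16)
b = torch.randn(2048, dtype=torch.float16)

# Reference: accumulate the elementwise products in fp32.
ref = (a.float() * b.float()).sum()

# Simulated fp16 accumulator: the running sum is rounded back to fp16 after
# every addition, which is what a half-precision accumulator does in a kernel
# inner loop.
acc = torch.tensor(0.0, dtype=torch.float16)
for x, y in zip(a, b):
    acc = acc + x * y  # stays in fp16 the whole time

print("fp32 accumulator:", ref.item())
print("fp16 accumulator:", acc.float().item())
print("abs difference:  ", (ref - acc.float()).abs().item())
```

The exact error depends on the data, but the gap grows with the reduction length, which is why float32 accumulation is the usual choice for fp16/bf16 GEMMs.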
    *cptr = sum[0] + sum[1];
  }
} else {
  scalar_t tmpAcc = 0.f;
You need to be VERY careful about the accumulator type. A slight difference from CUDA may lead to accuracy errors that are very difficult to debug in a final e2e model, especially in an LLM.
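A hedged sketch of the kind of tolerance check that catches accumulator-type drift before it surfaces in an end-to-end model; the helper name and tolerances are placeholders, not the PR's actual test.

```python
import torch

def check_against_fp32_reference(kernel_out: torch.Tensor,
                                 a: torch.Tensor,
                                 w_dequant: torch.Tensor,
                                 rtol: float = 1e-2,
                                 atol: float = 1e-2) -> None:
    """Compare a low-precision GEMM result against an fp32 reference.

    `w_dequant` is the already-dequantized weight. The tolerances are
    placeholders and should be tuned per dtype and shape.
    """
    ref = a.float() @ w_dequant.float()
    torch.testing.assert_close(kernel_out.float(), ref, rtol=rtol, atol=atol)
```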
for (int ikk = 0; ikk < TileK; ikk += 2) {
  sycl::half2 tmpA = *(sycl::half2*)&aptr[sg_id * TileK + ikk];
  sycl::half2 tmpB = {
      static_cast<int8_t>((tmps8[ikk / 2] & 0x0f) - 8),
      static_cast<int8_t>((tmps8[ikk / 2] >> 4) - 8)};
  tmpAcc += tmpA * tmpB * scale;
Is it possible to do a vectorized load and shift with SYCL? I don't know.
If not, I guess this is the best perf we can get so far; this line should be the major bottleneck.
It depends on IGC auto-vectorization.
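For reference, the per-element dequantization this inner loop performs, written out in PyTorch as an illustrative sketch (the nibble order, group layout, and scale shape are assumptions, not the kernel's exact packing):

```python
import torch

def unpack_int4(packed: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Unpack uint8-packed int4 weights (two nibbles per byte) and dequantize.

    Mirrors the kernel's `(b & 0x0f) - 8` / `(b >> 4) - 8` handling: each
    nibble is shifted from [0, 15] to [-8, 7] and then scaled.
    """
    lo = (packed & 0x0F).to(torch.int8) - 8   # low nibble
    hi = (packed >> 4).to(torch.int8) - 8     # high nibble
    vals = torch.stack([lo, hi], dim=-1).flatten(-2)  # interleave low/high
    return vals.to(scale.dtype) * scale

packed = torch.randint(0, 256, (4,), dtype=torch.uint8)
scale = torch.tensor(0.05, dtype=torch.float16)
print(unpack_int4(packed, scale))
```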
zero_points = torch.Tensor([8]).to(torch.int8).to("xpu")
weight_ba = weight.transpose(0, 1).contiguous()

out_onednn = torch._weight_int4pack_mm_with_scales_and_zeros(
A more general question: where are we placing _weight_int4pack_mm_with_scales_and_zeros? PyTorch does not have this right now.
will be added in pytorch/pytorch#137566
)

# check gemm + bias + gelu
out_onednn_gelu = torch._weight_int4pack_mm_with_scales_and_zeros(
Where was the signature with "tanh" defined?
Does PyTorch have a packed int4 gemm with a post op?
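"tanh" here refers to the tanh approximation of GELU, which PyTorch exposes as F.gelu(x, approximate="tanh"); a fused gelu post op would be expected to match that reference. A small, purely illustrative check of the formula against the built-in:

```python
import math
import torch
import torch.nn.functional as F

x = torch.randn(8, dtype=torch.float32)

# Tanh approximation of GELU:
# 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
tanh_gelu = 0.5 * x * (1.0 + torch.tanh(
    math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

print(torch.allclose(F.gelu(x, approximate="tanh"), tanh_gelu, atol=1e-6))
```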
@liangan1 CC
@sunjiweiswift for the perf benchmarking, please include other configs besides M=1. This would serve as a reference for final decision making. I expect that big M would have worse perf, but that's fine; we still need to know the numbers.
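A sketch of the requested M-sweep (shapes, iteration count, and the timed op are placeholders; the fp16 matmul stands in for the int4 kernel call under test, and torch.xpu.synchronize() is assumed to be available in this build):

```python
import time
import torch

def bench_ms(fn, iters: int = 50) -> float:
    """Average wall-clock time per call in milliseconds."""
    fn()  # warm-up
    torch.xpu.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.xpu.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

K, N = 4096, 4096
w = torch.randn(K, N, dtype=torch.float16, device="xpu")
for M in (1, 4, 16, 64, 256, 1024):
    a = torch.randn(M, K, dtype=torch.float16, device="xpu")
    ms = bench_ms(lambda: a @ w)  # replace with the int4 gemm under test
    print(f"M={M:5d}  {ms:.3f} ms")
```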
#### Bugfix
- [add lazy init for empty_xpu](#1115)
- [nan propagation for soft_shrink](https://github.com/intel/torch-xpu-ops/pull/1116/files#diff-b7cb5876d000db957286c8b0e72badb2b7502402c8955334f1cc21c34c98a5b9)

Co-authored-by: Yu, Guangye <[email protected]>
Co-authored-by: ZhiweiYan-96 <[email protected]>
Branch updated from faa79b7 to 5a08d2e.
Pure SYCL path for int4 gemm.
Benchmark results on PVC-1100: the remaining gaps come from not yet using 2D block loads.
Besides PVC, the kernel achieves:
- 92.7% bandwidth usage on MTL
- 84.7% bandwidth usage on A750
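For context, a rough model of how bandwidth-utilization figures like these are usually derived for a memory-bound int4 GEMM (the traffic accounting, group size, and peak-bandwidth number below are assumptions, not the methodology used for the figures above):

```python
def bandwidth_utilization_pct(M: int, N: int, K: int, group_size: int,
                              time_s: float, peak_gbps: float) -> float:
    """Rough traffic model: int4 weights (K*N/2 bytes), fp16 activations
    (M*K*2), fp16 scales (K*N/group_size*2), fp16 output (M*N*2)."""
    bytes_moved = K * N / 2 + M * K * 2 + K * N / group_size * 2 + M * N * 2
    achieved_gbps = bytes_moved / time_s / 1e9
    return 100.0 * achieved_gbps / peak_gbps

# Made-up example: 4096x4096 int4 weights, group size 128, 9.0 us per call,
# against a placeholder 1 TB/s peak bandwidth.
print(f"{bandwidth_utilization_pct(1, 4096, 4096, 128, 9.0e-6, 1000.0):.1f}% of peak")
```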