rfc: graph: propose to support Grouped-query Attention #2018
Conversation
@gyhintel, thanks for the RFC. Some questions:
Yes, it means that the pattern cannot be used to optimize a framework graph directly. Users will have to map their GQA implementation graph to our pattern. This is the second con of option 2.
In the current PyTorch implementation, no extra action is needed on their side. But if the implementation in the community changes, we will still need to handle the new implementation. This is the second con of option 1.
1. The pattern is less intuitive from the GQA definition.
2. The pattern cannot be used to optimize a framework graph directly. Frameworks
   will have to implement GQA fusion by themselves and leverage this option to
   optimize the fused GQA.
If this turns out to be a serious con, it would be reasonable to add a pass to match the Option 1 subgraph and convert it to the Option 2 subgraph, right?
- If it is a serious con, we need to implement Option 1, adding new ops and new patterns. The pass that matches the Option 1 subgraph and converts it to the Option 2 subgraph would be a backend implementation; we could also implement it in other ways in the backend.
- If the pass can be done on the framework side, we only need to implement Option 2.
We will have to support and match the subgraph in Option 1 once the request pops up. With that, oneDNN will have to support and maintain several different patterns for the same GQA functionality. Maybe that is not an issue: even if we choose Option 1 as the initial step, the pattern may still change in the future, as mentioned in the cons of Option 1.
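To illustrate why such a conversion pass is feasible at all, here is a hedged NumPy sketch. It assumes (the RFC details are not quoted in this thread) that the Option 1 subgraph contains an explicit repeat of the KV heads before a plain SDPA, while the Option 2 subgraph uses a 5D reshape plus size-1 broadcasting; under those assumptions the two formulations are numerically equivalent, so one subgraph can be rewritten into the other. All shapes are illustrative.

```python
import numpy as np

batch, kv_heads, group, seq, head_dim = 2, 2, 4, 5, 8
rng = np.random.default_rng(0)
q = rng.standard_normal((batch, kv_heads * group, seq, head_dim))
k = rng.standard_normal((batch, kv_heads, seq, head_dim))
v = rng.standard_normal((batch, kv_heads, seq, head_dim))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sdpa(q, k, v):
    # Plain scaled dot-product attention; batch dims broadcast via np.matmul.
    return softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])) @ v

# "Option 1"-style graph: explicitly repeat each KV head for its query group,
# then run ordinary 4D SDPA.
out1 = sdpa(q, np.repeat(k, group, axis=1), np.repeat(v, group, axis=1))

# "Option 2"-style graph: reshape Q to 5D and give K/V a size-1 group dim,
# letting standard broadcasting share each KV head across its group.
out2 = sdpa(q.reshape(batch, kv_heads, group, seq, head_dim),
            k[:, :, None], v[:, :, None])
out2 = out2.reshape(batch, kv_heads * group, seq, head_dim)

assert np.allclose(out1, out2)
```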
(see broadcasting in
[ONNX](https://github.com/onnx/onnx/blob/main/docs/Broadcasting.md) and
[NumPy](https://numpy.org/doc/stable/user/basics.broadcasting.html#general-broadcasting-rules)),
but actually it's added to the MatMul operation of cuDNN in order to support
I'd like to add the link here for reference: https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnn-backend-operation-matmul-descriptor .
Added, thanks.
that the new broadcasting rule is only supported by the fused attention.
2. Same as option 2, the pattern still cannot be used to optimize a framework
   graph directly. Frameworks will have to implement GQA fusion by themselves
   and leverage this option to optimize the fused GQA.
Another con here may be that we rely on oneDNN matmul primitive kernels for the reference implementation and for testing in benchdnn, and those kernels do not support the new broadcasting rule. Extending the broadcast semantics on the graph side will also require additional effort for the reference implementation and testing.
Added, thanks.
## GQA in PyTorch

Unlike SDPA, PyTorch does not support GQA as a fused operation. In Huggingface
FYI - the PyTorch PR just got merged this week: pytorch/pytorch#132689
Any clarity at this point whether users are fine with implementing Option 2 on their side, or Option 1 must be implemented instead?
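For context, a rough NumPy sketch of the semantics that PR enables (the exact PyTorch API surface and flag name are not quoted here, so treat the details as assumptions): scaled dot-product attention can accept K/V with fewer heads than Q when GQA is enabled, instead of requiring the caller to repeat the KV heads up front.

```python
import numpy as np

q = np.zeros((1, 8, 4, 16))   # (batch, num_q_heads, seq, head_dim)
k = np.zeros((1, 2, 4, 16))   # only 2 KV heads, shared by 8 query heads
group = q.shape[1] // k.shape[1]

# Caller-side workaround without fused GQA support: repeat each KV head
# for its query group so shapes match plain SDPA expectations.
k_full = np.repeat(k, group, axis=1)
assert k_full.shape == q.shape
```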
@chunyuan-w, @sanchitintel, could you help take a look at this RFC? Thanks!
| Matrix A | Matrix B | Matrix C = A x B |
| -- | -- | -- |
| B1 x 1 x B3 x M x K | B1 x B2 x 1 x M x K | B1 x B2 x B3 x M x N |
Suggested change: the shape of Matrix B should be B1 x B2 x 1 x K x N, not B1 x B2 x 1 x M x K:
| B1 x 1 x B3 x M x K | B1 x B2 x 1 x K x N | B1 x B2 x B3 x M x N |
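A quick check of the corrected row using NumPy, which applies its general broadcasting rules to the batch dimensions of `matmul`. The sizes below are arbitrary placeholders for B1, B2, B3, M, K, N.

```python
import numpy as np

B1, B2, B3, M, K, N = 2, 3, 4, 5, 6, 7
a = np.zeros((B1, 1, B3, M, K))   # Matrix A: size-1 dim at axis 1
b = np.zeros((B1, B2, 1, K, N))   # Matrix B: size-1 dim at axis 2
c = a @ b                         # batch dims broadcast to (B1, B2, B3)
assert c.shape == (B1, B2, B3, M, N)
```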
Fixed, thanks.
Description
This RFC proposes supporting Grouped-query Attention in the oneDNN Graph API.
Link to the rendered document.