
Streamlining of Scaled Dot-Product Attention #901

Draft
iksnagreb wants to merge 17 commits into dev from feature/attention-streamline
Conversation

@iksnagreb (Contributor) commented Sep 30, 2023

Trying ideas and bug fixes for streamlining the scaled dot-product attention operator. Related to issue/discussion #878.

  • Refactor MoveScalarMulPastMatMul for two-input join-node matmuls (see the sketch below this list)
  • Validate that these changes do not break anything or change the behavior in subtle ways
  • Fix Absorb1BitMulIntoMatMul and Absorb1BitMulIntoConv to test for the presence of weight initializers
  • Debug InferShapes failing after FoldTransposeIntoQuantInit
  • Circumvent MoveScalarAddPastMatMul by preferring AbsorbSignBiasIntoMultiThreshold
  • Fix the FoldQuantWeights transformation currently propagating shapes backwards and maybe generating the inverse of the scale factor
  • Fix the AbsorbAddIntoMultiThreshold transformation assuming an input and initializer order which might not always hold
  • Fix (and include?) the MoveLinearPastEltwiseAdd transformation which does not correctly propagate the shapes (seems to be fixed by fixing one of the other issues; it was probably caused by faulty rewiring of the graph in FoldQuantWeights; the transformation seems not to be required anymore, maybe reopen)
  • Suggest Brevitas to change all the quantizers to signed quantizers to be FINN-compatible
  • Suggest Brevitas to change the order of quantizer and transpose of the key matrix to make detecting the pattern easier and treat all three inputs the same
  • Streamlining of scale multiplication through multi-head slicing operations
  • Debug streamlining support for packed input projections
  • Fix RemoveIdentityOps handling of fork-node producers
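For reference, a minimal numpy sketch of the operator being streamlined here (single head, no masking, all names illustrative). Both MatMuls join two dynamic inputs, so neither has a constant weight initializer that the usual streamlining transforms could fold scales or biases into:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # Both MatMuls are two-input "join" nodes: queries/keys and
    # attention weights/values are all dynamic tensors.
    scores = (q @ k.transpose(0, 2, 1)) / np.sqrt(q.shape[-1])  # first MatMul
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)     # Softmax
    return weights @ v                                          # second MatMul

q, k, v = (np.random.rand(1, 8, 16) for _ in range(3))  # (batch, seq, head_dim)
out = scaled_dot_product_attention(q, k, v)              # shape (1, 8, 16)
```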

@iksnagreb (Contributor, Author) commented:
The "InferShapes fails after FoldTransposeIntoQuantInit" problem probably needs to be fixed within QONNX, see PR fastmachinelearning/qonnx#78.

@iksnagreb (Contributor, Author) commented:
In addition to the unit tests included here, I have now rebuilt all finn-examples up to the convert_to_hls step (or rather until step_create_dataflow_partition, which is required for some of the expected outputs). I have also completely rebuilt the bnn-pynq example for the Pynq-Z1 target, as well as the kws example, which seems to be the only one including verification steps. I did not see any errors and the verification steps reported SUCCESS. Is there anything else I could or should do?

iksnagreb marked this pull request as ready for review on October 20, 2023, 12:05
iksnagreb marked this pull request as draft on November 10, 2023, 11:51
@iksnagreb (Contributor, Author) commented:
Reverted this to a draft as we have found more issues while streamlining more realistic dummy models. I will add the issues to the list in #878 and start working on fixes soon.

@iksnagreb (Contributor, Author) commented:
Might depend on this PR at QONNX, fastmachinelearning/qonnx#85, which introduces a new cleanup transformation ensuring input order assumptions.

@iksnagreb (Contributor, Author) commented:
The RemoveIdentityOps handling of fork-node producers (currently failing when the graph contains packed input projections) will be fixed by this PR at QONNX: fastmachinelearning/qonnx#87

@iksnagreb (Contributor, Author) commented:
We need this one as well: fastmachinelearning/qonnx#89

@iksnagreb (Contributor, Author) commented:
We might want to have this one included as well, to support Python-mode execution of the graph at any step for verification purposes: fastmachinelearning/qonnx#92

@iksnagreb (Contributor, Author) commented:
This one is relevant as well to properly streamline attention head splitting/merging: fastmachinelearning/qonnx#107
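For illustration: scale multiplications commute with the pure reshuffling ops used for head splitting/merging (Reshape, Transpose, Slice), which is the property this streamlining relies on. A quick numpy check with illustrative shapes:

```python
import numpy as np

x = np.random.rand(1, 8, 64)   # (batch, seq, embed), e.g. 4 heads of size 16
s = 0.123                      # scalar scale factor in front of the reshuffling

def split_heads(t):
    # Reshape + Transpose as used for multi-head slicing
    return t.reshape(1, 8, 4, 16).transpose(0, 2, 1, 3)

# Moving the scalar Mul past the head-splitting ops does not change the result
assert np.allclose(split_heads(s * x), s * split_heads(x))
```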

Flips the order of AbsorbSignBiasIntoMultiThreshold and
MoveScalarLinearPastInvariants streamlining transforms to prefer
absorbing adds into multi-thresholds instead of propagating them
downwards. This should prevent accumulation of scalar adds in front of
two-input matmuls in scaled dot-product attention operators (they cannot
be moved past the matmul operation in that case).
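As an illustration of why the Add cannot be moved in the two-input case (a minimal numpy check, shapes illustrative): the correction term produced by moving a scalar Add past a MatMul depends on the other input, so it can only be folded into a constant bias when that input is an initializer, which it never is in attention:

```python
import numpy as np

a = np.random.rand(4, 8)
b = np.random.rand(8, 6)
c = 0.5  # scalar Add in front of the MatMul

# (a + c) @ b == a @ b + c * (ones @ b): the extra term depends on b, so it
# only becomes a constant bias if b is a weight initializer.
correction = c * (np.ones_like(a) @ b)
assert np.allclose((a + c) @ b, a @ b + correction)
```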
The MoveScalarMulPastMatMul transformation can now handle matmul
operations with both inputs preceded by a scalar multiplication.

This change is required for streamlining scaled dot-product attention
operations, which are essentially two-input matmuls.
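Scalar multiplications, in contrast, do commute through the MatMul even when both inputs are dynamic, which is presumably the identity the refactored MoveScalarMulPastMatMul relies on (minimal numpy check, names illustrative):

```python
import numpy as np

a = np.random.rand(4, 8)
b = np.random.rand(8, 6)
s1, s2 = 0.25, 3.0  # scalar Muls in front of both MatMul inputs

# Both scalar factors can be merged into a single scalar Mul behind the
# MatMul, regardless of whether the inputs are initializers or dynamic.
assert np.allclose((s1 * a) @ (s2 * b), (s1 * s2) * (a @ b))
```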
Assertions are too restrictive, causing the program to terminate in cases where the streamlining simply encounters nodes to which the transforms are not applicable: just skip those nodes.

Only the two transforms currently affecting the streamlining of scaled
dot-product attention have been changed.

This is pretty much copy and paste of the existing test case, just replacing the MatMul initializer by a second top-level input followed by a scalar Mul.

Folding quantized initializers into add-like nodes did not respect the order of inputs to the add node correctly. This is fixed by testing for one of the two possible orders and selecting the indices accordingly.

Shape inference following the transformation is fixed by deleting the annotations instead of propagating them incorrectly. Deleting the shape annotations should not hurt, as these are redone by running shape inference after each transformation anyway.
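A minimal sketch of the input-ordering handling described here, assuming a QONNX-style ModelWrapper with its get_initializer method (the helper name is made up, the actual code in the PR may differ):

```python
def find_initializer_input(model, node):
    # Return (initializer_input, dynamic_input) of an add-like node,
    # regardless of which order the export generated them in.
    i0, i1 = node.input[0], node.input[1]
    if model.get_initializer(i0) is not None:
        return i0, i1
    if model.get_initializer(i1) is not None:
        return i1, i0
    # No initializer at all: the transformation does not apply to this node
    return None, None
```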
Add is commutative and thus the export does not always generate the initializer as the second input. However, this was always assumed by this transformation, failing via assertion if the inputs were simply ordered differently. The transformation now handles both possible input orderings.

This is required for streamlining packed input projections of multi-head scaled dot-product attention. Adds support for Squeeze and Unsqueeze as well. Skips moving of fork-node producers as this is not handled correctly. However, the same effect can be attained by applying the MoveLinearPastFork transformation first.
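In practice that ordering looks roughly like the following, assuming FINN's streamline.reorder transforms (the exact sequence used in this PR may differ):

```python
from qonnx.core.modelwrapper import ModelWrapper
from finn.transformation.streamline.reorder import (
    MoveLinearPastEltwiseAdd,
    MoveLinearPastFork,
)

model = ModelWrapper("attention_model.onnx")  # illustrative file name
# First duplicate the linear op into every fork branch, then the
# fork-unaware reordering can move it past the eltwise Add.
model = model.transform(MoveLinearPastFork())
model = model.transform(MoveLinearPastEltwiseAdd())
```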
Explicitly rejects absorbing into fork-nodes. Previously, this probably would have failed silently, resulting in a wrong model. Not sure whether this happened in any practically relevant models?

This is probably still rather sketchy, but at least it tries to check the data layout annotation. For now this seems to be enough for getting the thresholds of multi-head attention right, if QONNX properly annotates the 3D layouts.
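A rough sketch of the kind of layout check meant here, assuming QONNX-style data layout annotations (lists such as ["N", "H", "W", "C"]) queried via the ModelWrapper's get_tensor_layout; the helper itself is made up for illustration:

```python
def channel_axis_from_layout(model, tensor_name):
    # Derive the channel axis from the data layout annotation; return None
    # if the layout is unknown so the transformation can skip the node.
    layout = model.get_tensor_layout(tensor_name)
    if layout is None or "C" not in layout:
        return None
    return list(layout).index("C")
```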
iksnagreb force-pushed the feature/attention-streamline branch from 2b638d0 to a4fc498 on April 4, 2024, 08:09