Question about the Tree Attention Mechanism #127

Open
chansonzhang opened this issue Nov 18, 2024 · 0 comments

Suppose the first MEDUSA head generates the top-2 predictions "It is" and "It's", while the second MEDUSA head generates the top-3 predictions "difficult", "a", and "not". This results in a total of 2 × 3 = 6 candidates.

The tree-structured attention mechanism ensures that each token can only attend to its predecessors within the same continuation. For instance, the token "difficult" can only attend to "It is" or "It's", but not to "not" or "a", as they belong to different continuations.
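The attention rule described above can be sketched for the 2 × 3 example as follows. This is only an illustrative sketch, not code from the MEDUSA repository: the names `nodes` and `build_tree_mask` are hypothetical, each phrase such as "It is" is treated as a single token as in the example, and the prompt tokens that every candidate would also attend to are left out.

```python
from itertools import product

head1 = ["It is", "It's"]           # top-2 predictions of MEDUSA head 1
head2 = ["difficult", "a", "not"]   # top-3 predictions of MEDUSA head 2

# Each tree node is (token, parent_index). Head-1 tokens have no parent here;
# every head-2 token is duplicated once per head-1 parent, giving 2 x 3 leaves.
nodes = [(tok, None) for tok in head1]
for parent_idx, tok in product(range(len(head1)), head2):
    nodes.append((tok, parent_idx))

def build_tree_mask(nodes):
    """Return mask[i][j] == True iff node i may attend to node j,
    i.e. j is node i itself or one of its ancestors in the tree."""
    n = len(nodes)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j is not None:      # walk up from the node to the top of its branch
            mask[i][j] = True
            j = nodes[j][1]
    return mask

mask = build_tree_mask(nodes)

# Print one row per node: the tokens that node is allowed to attend to.
for i, (tok, _) in enumerate(nodes):
    visible = [nodes[j][0] for j in range(len(nodes)) if mask[i][j]]
    print(f"{i}: {tok!r} attends to {visible}")
```

In this sketch, the copy of "difficult" under "It is" (node 2) sees only itself and "It is" (node 0), while its siblings "a" and "not" and the whole "It's" branch stay masked out.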

So,

  • "difficult" can attend to "It is".
  • "difficult" is generated by MEDUSA head 2, and "It is" is generated by MEDUSA head 1.
  • Head 1 and head 2 run in parallel.

This means that when head 2 is generating "difficult", head 1 has not necessarily generated "It is" yet. If "It is" does not exist at the moment "difficult" is being generated, how can "difficult" attend to it?
