Question about the Tree Attention Mechanism #127

chansonzhang · 2024-11-18T08:48:03Z

Suppose the first MEDUSA head generates the top-2 predictions "It is" and "It's", while the second MEDUSA head generates the top-3 predictions "difficult", "a", and "not". This results in a total of 2 × 3 = 6 candidates.
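
To make the counting concrete, here is a minimal sketch of how the 6 candidates arise as the Cartesian product of the two heads' top-k lists. The variable names (`head1_topk`, `head2_topk`, `candidates`) are hypothetical and chosen only for illustration; they are not taken from the MEDUSA codebase.

```python
from itertools import product

# Top-k predictions from the example above (illustrative values only).
head1_topk = ["It is", "It's"]           # top-2 from the first head
head2_topk = ["difficult", "a", "not"]   # top-3 from the second head

# Every first-head prediction is paired with every second-head prediction.
candidates = [f"{first} {second}" for first, second in product(head1_topk, head2_topk)]

for candidate in candidates:
    print(candidate)
```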

The tree-structured attention mechanism ensures that each token can only attend to its predecessors within the same continuation. For instance, the token "difficult" can only attend to "It is" or "It's", but not to "not" or "a", as they belong to different continuations.
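
The rule described above can be sketched as a boolean mask over the candidate tree. This is only an illustration under stated assumptions, not the actual MEDUSA implementation: it assumes each second-head token is copied once per first-head parent, it leaves out the shared prompt prefix, and the names `nodes`, `ancestors`, and `mask` are hypothetical.

```python
head1_topk = ["It is", "It's"]
head2_topk = ["difficult", "a", "not"]

# Each node is (token, parent_index). First-head nodes have no parent here
# because the shared prefix is omitted from this sketch.
nodes = [(tok, None) for tok in head1_topk]
for parent_idx in range(len(head1_topk)):
    for tok in head2_topk:
        # Assumption: one copy of each second-head token per parent.
        nodes.append((tok, parent_idx))

def ancestors(idx):
    """Return the indices of all ancestors of node idx."""
    chain = []
    parent = nodes[idx][1]
    while parent is not None:
        chain.append(parent)
        parent = nodes[parent][1]
    return chain

n = len(nodes)
# mask[i][j] is True when node i may attend to node j.
mask = [[False] * n for _ in range(n)]
for i in range(n):
    mask[i][i] = True            # every node attends to itself
    for j in ancestors(i):
        mask[i][j] = True        # and to its own ancestors only

for (tok, _), row in zip(nodes, mask):
    print(f"{tok:>10}", "".join("1" if allowed else "0" for allowed in row))
```

In this layout, index 2 is the copy of "difficult" under "It is" and index 5 is the copy under "It's"; each copy can see its own parent but not the sibling tokens "a" and "not", nor the other first-head prediction.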

So,

"difficult" can attend to "It is".
"difficult" is generated by MEDUSA head 2, and "It is" is generated by MEDUSA head 1.
Head 1 and head 2 run in parallel.

This means that when head 2 generates "difficult", head 1 has not necessarily produced "It is" yet. If "It is" does not yet exist at the moment "difficult" is being generated, how can "difficult" attend to it?
