Passing Attention Masks #24
Comments
Hi! Is there a way to pass an attention mask, like in the transformers library or src_key_padding_mask in nn.Transformer, so that the model doesn't "pay attention" to padding?

Moreover, how do you recommend pooling the output embeddings into a single vector? For example, BERT uses a [CLS] token that aggregates information from the whole sequence. As I understand it, the last vector in the sequence encodes this information (as in RNNs).
Hi! Thanks for your interest in this implementation. Please see the work of the original authors for more information; I'm best placed to answer implementation-specific questions, since I'm not an author. That said, I can have a go at answering your questions. Regarding padding, if your padding is placed after the input tokens, there's no need to mask the retention mechanism itself, since information can only flow forwards anyway. You'll probably want to mask out the losses during training, though. Regarding getting an embedding of an entire sequence, the recurrent state S (for the recurrent representation) and R (for the chunk-wise representation) should share a large amount of mutual information with the preceding tokens, so they may serve as a useful vector (or rather matrix) representation of a sequence. That remains to be further investigated, though.
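To make the loss-masking suggestion concrete, here is a minimal PyTorch sketch, assuming right-padded sequences and a hypothetical `pad_token_id`; none of these names come from this repository:

```python
import torch
import torch.nn.functional as F

# Hypothetical padding id, chosen for illustration only.
pad_token_id = 0

def masked_lm_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size), targets: (batch, seq_len).
    # ignore_index drops padded targets from the cross-entropy average,
    # so padded positions contribute neither loss nor gradient.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_token_id,
    )
```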
Thanks for the answer; it's clear to me now! I'll just take the last non-PAD token!
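For reference, a minimal sketch of that pooling strategy, assuming right-padding and a 0/1 attention mask marking real tokens; the function name and mask convention are illustrative, not part of this repo:

```python
import torch

def last_token_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, d_model), attention_mask: (batch, seq_len) with 1 = real token.
    # With right-padding, the number of real tokens minus one is the index of the last non-PAD token.
    last_idx = attention_mask.long().sum(dim=1) - 1                      # (batch,)
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    return hidden_states[batch_idx, last_idx]                            # (batch, d_model)
```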