- According to the Long Range Arena benchmark and our experiments, low-rank-based attention might be less effective on hierarchically structured data or language modeling tasks, while sparse-based variants do not perform well on classification tasks.
- In fig. 1 (more detailed in fig. 7) we present the approximation error of the attention matrices from a Transformer trained on (i) IMDb review classification, (ii) WikiText-103, and (iii) BigGAN-ImageNet. Sparse and low-rank approximations are complementary: sparse excels when the softmax has low entropy, and low-rank excels when the softmax has high entropy; their errors are negatively correlated. An ideal combination of sparse and low-rank, obtained with Robust PCA (shown in orange), achieves lower error than either alone. (IMO these results are unsurprising: the more diffuse the attention is, the less effective a sparse approximation becomes.)
- We describe a generative model of how the sparse + low-rank structure in attention matrices could arise when the elements of the input sequence form clusters. The model is parametrized by the intra-cluster distance (fig. 2 shows different values) and the inverse softmax temperature β, so that the unnormalized attention matrix is M = exp(β Q Q^T). If β is small, the softmax distribution is diffuse, and we can approximate it with a low-rank matrix. In the middle regime of β, we need the sparse part to cover the intra-cluster attention and the low-rank part to approximate the inter-cluster attention. (A toy simulation of this model is sketched after the list.)
- How to decompose the attention matrices into sparse and low-rank components? The sparse + low-rank matrix structure has been well studied in statistics and signal processing since the late 2000s. Classical Robust PCA gives a polynomial-time algorithm to recover such a structure (a textbook sketch is included after the list). However, Robust PCA is orders of magnitude too slow and requires materializing the full attention matrix, which defeats the main purpose of reducing compute and memory requirements. On the other hand, a straightforward addition of sparse (say, from Reformer) and low-rank (say, from Performer) attention will be inaccurate due to double counting.
- To this end, we present the Scatterbrain algorithm: first construct a low-rank approximation X, then construct a sparse matrix S such that S + X matches A on the support of S, and combine the results (sec. 4.2). So, our approximation is exact for entries on the support of S (which are likely to be large); on the other entries it matches the low-rank part. (A minimal sketch is included after the list.)
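
A toy simulation of the clustered-queries model above (not the paper's code): the cluster count, dimension, noise scale and the top-k / rank-r budgets are illustrative choices. It forms the row-normalized attention from M = exp(β Q Q^T) and compares a sparse (top-k per row) against a low-rank (truncated SVD) approximation for several values of β.

```python
# Toy simulation of the clustered-queries model; all sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_clusters, noise = 256, 32, 8, 0.5

centers = rng.normal(size=(n_clusters, d))
labels = rng.integers(n_clusters, size=n)
Q = centers[labels] + noise * rng.normal(size=(n, d))   # clustered queries

def softmax_rows(scores):
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def topk_error(A, k):   # sparse approximation: keep the k largest entries per row
    S = np.zeros_like(A)
    idx = np.argsort(A, axis=1)[:, -k:]
    np.put_along_axis(S, idx, np.take_along_axis(A, idx, axis=1), axis=1)
    return np.linalg.norm(A - S) / np.linalg.norm(A)

def lowrank_error(A, r):  # low-rank approximation: truncated SVD
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return np.linalg.norm(A - (U[:, :r] * s[:r]) @ Vt[:r]) / np.linalg.norm(A)

for beta in [0.01, 0.1, 1.0]:
    A = softmax_rows(beta * Q @ Q.T)   # row-normalized M = exp(beta * Q Q^T)
    print(f"beta={beta:5}: top-16 error {topk_error(A, 16):.3f}, "
          f"rank-16 error {lowrank_error(A, 16):.3f}")
# Small beta (diffuse softmax) favors the low-rank approximation;
# large beta (peaked softmax) favors the sparse one.
```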
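
For reference, a textbook Robust PCA sketch: principal component pursuit solved by inexact augmented Lagrangian, in the spirit of Candès et al., with the standard λ = 1/sqrt(max(m, n)) and a common step-size heuristic. This is not Scatterbrain and not the paper's code; it illustrates the baseline decomposition M ≈ L + S.

```python
# Principal component pursuit via inexact ALM: decompose M ≈ L (low-rank) + S (sparse).
import numpy as np

def robust_pca(M, lam=None, tol=1e-7, max_iter=500):
    m, n = M.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))       # standard regularization weight
    mu = m * n / (4.0 * np.abs(M).sum())     # common step-size heuristic
    norm_M = np.linalg.norm(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)
    for _ in range(max_iter):
        # low-rank update: singular value thresholding of M - S + Y/mu
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # sparse update: entrywise soft-thresholding of M - L + Y/mu
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # dual update and convergence check
        Y = Y + mu * (M - L - S)
        if np.linalg.norm(M - L - S) / norm_M < tol:
            break
    return L, S
```

Each iteration materializes the full attention matrix and runs an SVD on it, which is exactly the cost that efficient-attention methods are trying to avoid.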
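
Finally, a minimal dense sketch of the Scatterbrain-style approximation from sec. 4.2 (not the authors' implementation): the low-rank part X uses Performer-style positive random features for the softmax kernel, and the sparse part S corrects X so that S + X equals A exactly on a support Ω. For clarity the sketch materializes the full matrices and picks Ω as the top-k exact scores per row; the real algorithm chooses Ω with Reformer-style LSH and never forms the full n × n matrix.

```python
# Dense sketch of sparse + low-rank attention approximation (Scatterbrain idea).
import numpy as np

rng = np.random.default_rng(0)
n, d, m, k = 128, 16, 64, 8          # sequence length, head dim, #features, support size

Q = rng.normal(size=(n, d)) / d**0.25
K = rng.normal(size=(n, d)) / d**0.25
A = np.exp(Q @ K.T)                  # unnormalized attention, exp(q_i . k_j)

# --- low-rank part: positive random features for the softmax kernel ---
W = rng.normal(size=(m, d))          # w ~ N(0, I_d)
def feats(X):
    return np.exp(X @ W.T - 0.5 * (X**2).sum(-1, keepdims=True)) / np.sqrt(m)
X_lr = feats(Q) @ feats(K).T         # unbiased estimate of A, rank <= m

# --- sparse part: correct the low-rank estimate on the support Omega ---
idx = np.argsort(A, axis=1)[:, -k:]  # top-k per row (stand-in for LSH buckets)
S = np.zeros_like(A)
np.put_along_axis(S, idx, np.take_along_axis(A - X_lr, idx, axis=1), axis=1)

A_hat = X_lr + S                     # exact on Omega, low-rank elsewhere
err_lr = np.linalg.norm(A - X_lr) / np.linalg.norm(A)
err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
print(f"low-rank only: {err_lr:.3f}, sparse + low-rank: {err:.3f}")
```

Because S is defined so that S + X matches A on Ω, the combined error can only be lower than that of the low-rank part alone, with the largest gains on the large (peaked) entries that random features estimate poorly.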