sedol1339 committed Jan 1, 2025
1 parent e58b84b commit 7af71d6
Showing 1 changed file (papers.md) with 3 additions and 3 deletions.
@@ -7696,6 +7696,6 @@ McDermott, E. (2018). A Deep Generative Acoustic Model for Compositional Automat
- We then take the 100 values with the highest probability across all layers and dimensions (97/100 are in the upper layers), and for each value analyze the top-50 trigger examples for the corresponding key. For 46/100 values, there is at least one trigger example that agrees with the value’s top prediction (table 2).
- This suggests that memory cells often store information on how to directly predict the output (the distribution of the next word) from the input (patterns in the prefix). However, the lower layers probably do not operate in the same embedding space.
- A typical example triggers hundreds of memories per layer (10%-50% of 4096 dimensions), but the majority of cells remain inactive (fig. 7).
- Fig. 8 shows the fraction of examples where the layer’s prediction differs from the predictions of all of its memories; so the layer-level prediction is typically not the result of a single dominant memory cell. In contrast, in the cases where at least one memory cell does agree with the layer’s prediction, the target token is usually a common stop word in the vocabulary. This suggests that very common patterns in the training data might be "cached" in individual memory cells and do not require compositionality (fig. 1).
- When the residual’s top prediction ends up being the model’s prediction, the FFN usually predicts something different. When the residual’s top prediction changes after incorporating the FFN’s signal, it rarely changes to the FFN’s top prediction; instead, it becomes a "compromise" prediction equal to neither. A possible conjecture is that the FFN acts as an elimination mechanism that "vetoes" the top prediction in the residual. The FFN is sometimes able to tune the residual prediction even in the last layer, changing the top prediction to something different, e.g. "people" -> "same", or "later" -> "earlier".
- We also measure the fraction of examples in each layer where the residual’s top prediction matches the model’s final output. Roughly a third of the model’s predictions are already determined in the bottom few layers. This number grows rapidly from layer 10 onwards, implying that the majority of "hard" decisions occur before the final layer. (IMO this is similar to the idea behind LayerSkip.)
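
The key-value view of the FFN underlying these notes (keys matched against the prefix representation, values inducing a next-token distribution through the output embedding) can be sketched with toy tensors. All sizes and variable names below are illustrative, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, vocab = 16, 64, 100  # toy sizes; the paper's FFNs have 4096 inner dims

W_in = rng.normal(size=(d_ff, d_model))   # rows act as "keys" matched against the input
W_out = rng.normal(size=(d_ff, d_model))  # rows act as "values" (one per memory cell)
E = rng.normal(size=(vocab, d_model))     # output embedding matrix

x = rng.normal(size=d_model)              # hidden state encoding the prefix

# memory coefficients: how strongly each key is triggered by this input
coeffs = np.maximum(W_in @ x, 0.0)        # ReLU keeps many cells inactive

# each value, projected through E, induces a distribution over the next token;
# this is what "a value's top prediction" refers to above
top_pred_per_value = (W_out @ E.T).argmax(axis=1)   # shape (d_ff,)

# the FFN output is the coefficient-weighted sum of the values
ffn_out = coeffs @ W_out                  # shape (d_model,)
ffn_top_pred = int((E @ ffn_out).argmax())
```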
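
The fig. 8 measurement (how often the layer's prediction matches no single memory) can be imitated with random toy predictions; with a realistic vocabulary size, even a few thousand cells rarely hit the layer's top token by chance. Sizes and names are assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)
n_examples, n_cells, vocab = 1000, 4096, 50_000  # toy stand-ins

# top prediction of the layer vs. top prediction of each of its memory cells
layer_pred = rng.integers(0, vocab, size=n_examples)
cell_preds = rng.integers(0, vocab, size=(n_examples, n_cells))

# fraction of examples where the layer's prediction matches NONE of its memories,
# i.e. where the prediction must be composed rather than copied from one cell
frac_composed = (cell_preds != layer_pred[:, None]).all(axis=1).mean()
```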
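
The "compromise" behavior in the residual+FFN bullet is easy to see with concrete numbers: the argmax of a sum of logits can equal neither summand's argmax. A minimal, hypothetical 3-token example:

```python
import numpy as np

# toy logits over a 3-token vocabulary
residual_logits = np.array([5.0, 4.8, 0.0])   # residual favors token 0
ffn_logits      = np.array([-3.0, 0.0, 1.0])  # FFN "vetoes" token 0, favors token 2

combined = residual_logits + ffn_logits       # [2.0, 4.8, 1.0]
# the combined prediction is token 1: neither the residual's nor the FFN's top
assert combined.argmax() == 1
assert combined.argmax() not in (residual_logits.argmax(), ffn_logits.argmax())
```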
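
The last measurement (per-layer fraction of examples whose residual prediction already equals the final output) can be sketched as follows; the toy predictions here "freeze" at a random layer, mimicking early decisions, and everything is synthetic rather than taken from a real model:

```python
import numpy as np

rng = np.random.default_rng(3)
n_layers, n_examples, vocab = 16, 500, 100

# each example's prediction settles on the final token at a random layer
freeze_at = rng.integers(0, n_layers, size=n_examples)
final = rng.integers(0, vocab, size=n_examples)
preds = rng.integers(0, vocab, size=(n_layers, n_examples))
for layer in range(n_layers):
    frozen = freeze_at <= layer
    preds[layer, frozen] = final[frozen]

# per layer: fraction of examples already matching the model's final output
match_frac = (preds == final).mean(axis=1)
```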
