A Collection of Resources on Reverse Engineering Large Models (and the Human Brain...)
Mechanistic Interpretability
Explainer & Glossary from Neel Nanda: "MI/mech int/mech interp/mechanistic interpretability: The field of study of reverse engineering neural networks from the learned weights down to human-interpretable algorithms. Analogous to reverse engineering a compiled program binary back to source code."
Features and Circuits
- Visualizing Representations: Deep Learning and Human Beings (Jan. 16, 2015)
- Feature Visualization (Nov. 7, 2017): "Feature visualization answers questions about what a network or parts of a network are looking for by generating examples. Neural networks are, generally speaking, differentiable with respect to their inputs. If we want to find out what kind of input would cause a certain behavior, whether that’s an internal neuron firing or the final output behavior, we can use derivatives to iteratively tweak the input towards that goal." (See the sketch after this list.)
- Zoom In: An Introduction to Circuits (March 10, 2020): "By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks."
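A minimal sketch of the input-optimization idea from the Feature Visualization article (not code from the article; it assumes torchvision's pretrained GoogLeNet, and the layer and channel choices are arbitrary):

```python
# Feature visualization by gradient ascent on the input: start from noise and
# tweak the image so that one chosen channel of an intermediate layer fires.
import torch
import torchvision

model = torchvision.models.googlenet(weights="DEFAULT").eval()

img = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([img], lr=0.05)

activations = {}
def save_activation(module, inputs, output):
    activations["target"] = output

# Arbitrary choice of layer (inception4a) and channel (97) for illustration.
hook = model.inception4a.register_forward_hook(save_activation)

for step in range(256):
    optimizer.zero_grad()
    model(img)
    # Maximize the mean activation of the chosen channel: minimize its negative.
    loss = -activations["target"][0, 97].mean()
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        img.clamp_(0, 1)  # keep the image in a valid pixel range

hook.remove()
```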
Open Problems in Mechanistic Interpretability
- 200 Concrete Open Problems in Mechanistic Interpretability: Introduction
- The Case for Analysing Toy Language Models
- Looking for Circuits in the Wild
- Interpreting Algorithmic Problems
- Exploring Polysemanticity and Superposition
- Analysing Training Dynamics
- Techniques, Tooling and Automation
- Image Model Interpretability
- Reinforcement Learning
- Studying Learned Features in Language Models
- 200 Concrete Open Problems in Mechanistic Interpretability (Google document)
Residual stream: how to understand a transformer
- Residual Flows for Invertible Generative Modeling
- Transformer Feed-Forward Layers Are Key-Value Memories: treats MLP layers as key-value pairs (see the sketch after this list).
- A Mathematical Framework for Transformer Circuits
- Exploring the Residual Stream of Transformers
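A schematic of the residual-stream picture from A Mathematical Framework for Transformer Circuits, with the MLP written in the key-value style of the paper above. This is an illustrative sketch, not any specific codebase:

```python
# Every attention and MLP block *reads* from the residual stream and *adds*
# its output back in, so later components see the sum of all earlier writes.
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads, d_mlp):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(          # "key-value memory" view: rows of the
            nn.Linear(d_model, d_mlp),     # first matrix act as keys, columns of
            nn.GELU(),                     # the second as values written back
            nn.Linear(d_mlp, d_model),     # into the residual stream
        )

    def forward(self, x):
        normed = self.ln1(x)
        attn_out, _ = self.attn(normed, normed, normed)
        x = x + attn_out               # attention writes into the residual stream
        x = x + self.mlp(self.ln2(x))  # the MLP writes into the residual stream
        return x
```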
Hebbian theory
"Neurons that fire together wire together."(connect to the activations in neural network, 3B1B post)
Hebbian theory is a neuropsychological theory claiming that an increase in synaptic efficacy arises from a presynaptic cell's repeated and persistent stimulation of a postsynaptic cell. It is an attempt to explain synaptic plasticity, the adaptation of brain neurons during the learning process. It was introduced by Donald Hebb in his 1949 book The Organization of Behavior.
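A minimal numpy sketch of a plain Hebbian update (illustrative only; the learning rate, sizes, and firing probabilities are arbitrary):

```python
# The weight between two units grows in proportion to the product of their
# activities, i.e. "neurons that fire together wire together".
import numpy as np

rng = np.random.default_rng(0)
n_pre, n_post, lr = 8, 4, 0.01
W = np.zeros((n_post, n_pre))                         # synaptic weights, post x pre

for _ in range(1000):
    pre = (rng.random(n_pre) < 0.3).astype(float)     # presynaptic firing pattern
    post = (rng.random(n_post) < 0.3).astype(float)   # postsynaptic firing pattern
    W += lr * np.outer(post, pre)                     # Hebbian rule: dW = lr * post * pre
```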
Information, Entropy and KL divergence
Elements of Information Theory by Thomas M. Cover
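For quick reference, minimal implementations of Shannon entropy and KL divergence for discrete distributions (in nats; swap in log2 for bits; function names are mine):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # 0 * log 0 is taken to be 0
    return -np.sum(p * np.log(p))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                       # terms with p_i = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(entropy(p))                      # about 1.04 nats
print(kl_divergence(p, q))             # always >= 0, and 0 only when p == q
```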
Linear Representation Hypothesis
- The Linear Representation Hypothesis and the Geometry of Large Language Models
- Not All Language Model Features Are Linear
- Actually, Othello-GPT Has A Linear Emergent World Representation
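A toy sketch of the linear representation idea: if a binary concept corresponds to a direction in activation space, a simple difference-of-means vector should recover it, and "steering" amounts to adding that direction to activations. Synthetic data, illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 500
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

# Fake activations for examples where the concept is present vs. absent.
base = rng.normal(size=(2 * n, d))
acts_pos = base[:n] + 2.0 * concept_dir
acts_neg = base[n:]

# Difference of means approximately recovers the concept direction.
direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
direction /= np.linalg.norm(direction)
print("cosine with true direction:", float(direction @ concept_dir))

# Steering: push "absent" activations along the recovered direction.
steered = acts_neg + 2.0 * direction
```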
Superposition
- Two kinds of superposition:
  - Bottleneck superposition: used for “storage”.
  - Neuron superposition: more features represented in neuron activation space than there are neurons.
- Softmax Linear Units: "background on how to think about superposition".
- Toy Models of Superposition
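A compact sketch of the Toy Models of Superposition setup: n sparse features are compressed into m < n hidden dimensions by W and reconstructed as ReLU(W^T W x + b). With sparse enough features, more than m features get packed into the hidden space. Hyperparameters below are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

n_features, m_hidden, sparsity = 20, 5, 0.95

W = nn.Parameter(torch.randn(m_hidden, n_features) * 0.1)
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    # Sparse features: each is nonzero with probability (1 - sparsity).
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity)
    h = x @ W.T                     # compress into m_hidden dims (the bottleneck)
    x_hat = torch.relu(h @ W + b)   # reconstruct with the transpose of W
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Columns of W are the learned feature directions; with high sparsity, many of
# the 20 features end up represented in only 5 dimensions (superposition).
```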
Polysemanticity, Monosemanticity and Superposition
- Sparse Autoencoders Find Highly Interpretable Model Directions
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Decomposing The Dark Matter of Sparse Autoencoders: current SAEs fall short of completely explaining model performance, resulting in "dark matter", i.e. unexplained variance in activations.
- Gated SAE: Improving Dictionary Learning with Gated Sparse Autoencoders
- Top-k SAE: using a TopK activation function removes the need for a sparsity penalty.
- Switch SAE: Efficient Dictionary Learning with Switch Sparse Autoencoders
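A minimal sparse-autoencoder sketch tying these together: ReLU encoder, linear decoder, and either an L1 penalty or a TopK activation (as in the Top-k SAE work) for sparsity. Shapes and coefficients are illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict, k=None):
        super().__init__()
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)
        self.k = k  # if set, use TopK sparsity instead of an L1 penalty

    def encode(self, x):
        f = torch.relu(self.enc(x))
        if self.k is not None:
            topk = torch.topk(f, self.k, dim=-1)
            f = torch.zeros_like(f).scatter_(-1, topk.indices, topk.values)
        return f

    def forward(self, x):
        f = self.encode(x)
        return self.dec(f), f

sae = SparseAutoencoder(d_model=512, d_dict=8192, k=None)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(4096, 512)              # stand-in for model activations
x_hat, f = sae(acts)
loss = ((acts - x_hat) ** 2).mean() + 1e-3 * f.abs().sum(dim=-1).mean()
opt.zero_grad()
loss.backward()
opt.step()
```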
Crosscoders
- "Crosscoders produce shared features across layers and even models."
- "We can think of autoencoders and transcoders as special cases of the general family of crosscoders."
Open questions:
- Life cycle of a feature: How do features change over model training? When do they form? Do they form abruptly, or grow gradually? Do their directions drift over training, or do they have a relatively fixed direction from early on?
- If we train a model twice, to what extent do we get the same features?
- As we make a model wider, do we just get more features? Or are they largely the same features, packed less densely? Do some features get thrown away in favor of more useful features available to larger models?
- To what extent do different architectures (e.g., vision transformers vs. conv nets) learn the same features?
- Sparse Crosscoders for Cross-Layer Features and Model Diffing.
- Residual Stream Analysis with Multi-Layer SAEs
Solution for shrinkage: how and why SAEs have a reconstruction gap due to "feature suppression". See Addressing Feature Suppression in SAEs.
Stitching SAEs of different sizes: when you scale up an SAE, the features in the larger SAE fall into two groups: 1) "novel features", which carry new information not in the small SAE, and 2) "reconstruction features", which sparsify information that already exists in the small SAE. You can stitch SAEs by adding the novel features to the smaller SAE (see the sketch below).
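The mechanical part of stitching is just appending the selected decoder directions to the smaller dictionary; how the "novel" features are identified is the substance of the post and is only stubbed out here with placeholder indices:

```python
import torch

def stitch(W_dec_small, W_dec_large, novel_idx):
    """Concatenate the small SAE's decoder directions with the novel
    decoder directions from the large SAE (rows = features)."""
    return torch.cat([W_dec_small, W_dec_large[novel_idx]], dim=0)

W_dec_small = torch.randn(3072, 512)      # (n_features_small, d_model)
W_dec_large = torch.randn(12288, 512)     # (n_features_large, d_model)
novel_idx = torch.tensor([5, 42, 1337])   # placeholder "novel feature" indices
W_dec_stitched = stitch(W_dec_small, W_dec_large, novel_idx)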
SAE evaluation
Auto-explain
- Language models can explain neurons in language models
- Open Source Automated Interpretability for Sparse Autoencoder Features: building and evaluating an open-source pipeline for auto-interpretability (July 30, 2024; Caden Juang, Gonçalo Paulo, Jacob Drori, Nora Belrose)
- Automatically Interpreting Millions of Features in Large Language Models
- Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models: uses a target prompt composed of few-shot demonstrations of string repetition to encourage the LLM to explain its internal representations (see the sketch after this list).
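A very rough Patchscopes-style sketch (assuming a GPT-2-style HuggingFace model; the layers, positions, and repetition prompt are illustrative, not from the paper): read a hidden state from a source prompt and patch it into a few-shot repetition prompt before generating.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

source_prompt = "The Eiffel Tower is in the city of"
target_prompt = "cat -> cat; 1135 -> 1135; hello -> hello; ?"

src_ids = tok(source_prompt, return_tensors="pt").input_ids
tgt_ids = tok(target_prompt, return_tensors="pt").input_ids
src_layer, tgt_layer = 8, 2                 # illustrative source/target layers
src_pos = src_ids.shape[1] - 1              # read the last-token hidden state
tgt_pos = tgt_ids.shape[1] - 1              # patch it over the final "?" token

# 1) Capture the hidden state from the source prompt.
with torch.no_grad():
    src_hidden = model(src_ids, output_hidden_states=True).hidden_states[src_layer][0, src_pos]

# 2) Patch it into the target forward pass via a forward hook.
def patch_hook(module, inputs, output):
    hidden = output[0]
    if hidden.shape[1] > tgt_pos:           # only on the first (full-prompt) pass
        hidden[0, tgt_pos] = src_hidden
    return output

handle = model.transformer.h[tgt_layer].register_forward_hook(patch_hook)
with torch.no_grad():
    out = model.generate(tgt_ids, max_new_tokens=8, do_sample=False)
handle.remove()

# The continuation is the model's "description" of the patched representation.
print(tok.decode(out[0, tgt_ids.shape[1]:]))
```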
Steering evaluation
- SAE Landscape – A collection of useful publications and tools
- Neuronpedia https://www.neuronpedia.org/
Social bias of LLMs
Knowledge conflict of LLMs
Personality of LLMs
Protein Language Models