Awesome-Sparse-Autoencoder

A collection of resources on reverse engineering large models (and the human brain...)

Basic Definitions

Mechanistic Interpretability

Explainer & Glossary from Neel Nanda: "MI/mech int/mech interp/mechanistic interpretability: The field of study of reverse engineering neural networks from the learned weights down to human-interpretable algorithms. Analogous to reverse engineering a compiled program binary back to source code."

Features and Circuits

  • Visualizing Representations: Deep Learning and Human Beings (Jan. 16, 2015)

  • Feature visualization (Nov. 7, 2017) "Feature visualization answers questions about what a network or parts of a network are looking for by generating examples.

    Neural networks are, generally speaking, differentiable with respect to their inputs. If we want to find out what kind of input would cause a certain behavior, whether that’s an internal neuron firing or the final output behavior, we can use derivatives to iteratively tweak the input towards that goal." (A sketch of this appears after this list.)

  • Zoom In: An Introduction to Circuits (March 10, 2020) "By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks."
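
As a concrete illustration of the feature-visualization idea quoted above (using derivatives to iteratively tweak the input), here is a minimal PyTorch sketch; the model, layer, and channel are arbitrary choices for illustration, not from the original post:

```python
# Feature visualization by gradient ascent on the input:
# maximize the activation of one channel of an intermediate layer.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

# Capture the activation of one intermediate layer with a forward hook.
activations = {}
def hook(module, inp, out):
    activations["feat"] = out
model.layer3.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(x)
    # Maximize the mean activation of one (arbitrarily chosen) channel.
    loss = -activations["feat"][0, 42].mean()
    loss.backward()
    optimizer.step()
```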

Open Problems in Mechanistic Interpretability

Hebbian theory

"Neurons that fire together wire together."(connect to the activations in neural network, 3B1B post)

Hebbian theory is a neuropsychological theory claiming that an increase in synaptic efficacy arises from a presynaptic cell's repeated and persistent stimulation of a postsynaptic cell. It is an attempt to explain synaptic plasticity, the adaptation of brain neurons during the learning process. It was introduced by Donald Hebb in his 1949 book The Organization of Behavior.
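
A minimal numerical sketch of the rate-based Hebbian rule (weight change proportional to the product of pre- and postsynaptic activity); the layer sizes and learning rate here are illustrative assumptions:

```python
# "Fire together, wire together": strengthen w_ij in proportion to y_i * x_j.
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01                      # learning rate (illustrative)
W = rng.normal(size=(4, 8))     # postsynaptic x presynaptic weights

x = rng.random(8)               # presynaptic activity
y = W @ x                       # postsynaptic activity

W += eta * np.outer(y, x)       # Hebbian weight update
```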

Information, Entropy and KL divergence

Elements of Information Theory by Thomas M. Cover and Joy A. Thomas
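
For quick reference, the standard discrete definitions of entropy and KL divergence, computed directly in NumPy (natural log, so values are in nats):

```python
# Entropy H(p) and KL divergence D_KL(p || q) for discrete distributions.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q, where=p > 0, out=np.zeros_like(p)))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(entropy(p))           # ≈ 1.04 nats
print(kl_divergence(p, q))  # always ≥ 0, and 0 iff p == q
```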

Linear Representation Hypothesis

Superposition

Polysemanticity, Monosemanticity and Superposition

Sparse Autoencoder
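
For readers new to the topic, a minimal PyTorch sketch of the usual sparse-autoencoder setup (ReLU encoder, linear decoder, L1 sparsity penalty on feature activations); the dimensions and sparsity coefficient are illustrative assumptions:

```python
# Reconstruct model activations through a wide, sparsely active latent layer.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(feats)              # reconstructed activations
        return recon, feats

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                      # a batch of model activations
recon, feats = sae(acts)
l1_coeff = 1e-3                                  # illustrative sparsity weight
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(dim=-1).mean()
loss.backward()
```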

Transcoders

Sparse Cross-coders

“Crosscoders produce shared features across layers and even models.”

"We can think of autoencoders and transcoders as special cases of the general family of crosscoders "

Open questions:

  • Life cycle of features. How do features change over model training? When do they form? Do they form abruptly, or gradually grow? Do their directions drift over training, or have a relatively fixed direction from early on?
  • If we train a model twice, to what extent do we get the same features?
  • As we make a model wider, do we just get more features? Or are they largely the same features, packed less densely? Do some features get thrown away in favor of more useful features available to larger models?
  • To what extent do different architectures (e.g., vision transformers vs. conv nets) learn the same features?

Limitations / improvement

Solution for shrinkage: how and why SAEs have a reconstruction gap due to ‘feature suppression’. See Addressing Feature Suppression in SAEs.

Stitching SAEs of different sizes: when you scale up an SAE, the features in the larger SAE fall into two groups: 1) “novel features” with new information not in the small SAE, and 2) “reconstruction features” that sparsify information that already exists in the small SAE. You can stitch SAEs by adding the novel features to the smaller SAE, as sketched below.
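
A sketch of the stitching step, assuming both SAEs follow the SparseAutoencoder layout above and that `novel_idx` (the indices of the large SAE's novel features) has already been identified by some comparison procedure:

```python
# Append the "novel" features of a larger SAE onto a smaller one by copying
# their encoder rows, encoder biases, and decoder columns.
import torch

def stitch(small_sae, large_sae, novel_idx):
    """Return stitched (W_enc, b_enc, W_dec) = small SAE + novel features."""
    W_enc = torch.cat([small_sae.encoder.weight,
                       large_sae.encoder.weight[novel_idx]], dim=0)
    b_enc = torch.cat([small_sae.encoder.bias,
                       large_sae.encoder.bias[novel_idx]], dim=0)
    W_dec = torch.cat([small_sae.decoder.weight,
                       large_sae.decoder.weight[:, novel_idx]], dim=1)
    # The decoder bias is left as the small SAE's in this sketch; only the
    # feature-indexed weights grow.
    return W_enc, b_enc, W_dec
```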

Train, Auto-explain and Evaluate

SAE evaluation

Auto-explain

Steer evaluation

Resources

Use Dictionary Learning to Interpret Specific Aspects of Large Models

Social bias of LLMs

Knowledge conflict of LLMs

Personality of LLMs

Protein Language Models