Awesome-Sparse-Autoencoder

A collection of resources on reverse engineering large models (and the human brain...)

Basic Definitions

Mechanistic Interpretability

Explainer & Glossary from Neel Nanda: "MI/mech int/mech interp/mechanistic interpretability: The field of study of reverse engineering neural networks from the learned weights down to human-interpretable algorithms. Analogous to reverse engineering a compiled program binary back to source code."

Features and Circuits

  • Visualizing Representations: Deep Learning and Human Beings (Jan. 16, 2015)

  • Feature visualization (Nov. 7, 2017) "Feature visualization answers questions about what a network or parts of a network are looking for by generating examples.

    Neural networks are, generally speaking, differentiable with respect to their inputs. If we want to find out what kind of input would cause a certain behavior, whether that’s an internal neuron firing or the final output behavior, we can use derivatives to iteratively tweak the input towards that goal." (A sketch of this appears after this list.)

  • Zoom In: An Introduction to Circuits (March 10, 2020) "By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks."
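
As a concrete illustration of the feature-visualization idea quoted above (using derivatives to iteratively tweak the input), here is a minimal PyTorch sketch; the model, layer, and channel are arbitrary choices for illustration, not from the original post:

```python
# Feature visualization by gradient ascent on the input:
# maximize the activation of one channel of an intermediate layer.
import torch
import torchvision.models as models

model = models.resnet18(weights="IMAGENET1K_V1").eval()

# Capture the activation of one intermediate layer with a forward hook.
activations = {}
def hook(module, inp, out):
    activations["feat"] = out
model.layer3.register_forward_hook(hook)

x = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([x], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(x)
    # Maximize the mean activation of one (arbitrarily chosen) channel.
    loss = -activations["feat"][0, 42].mean()
    loss.backward()
    optimizer.step()
```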

Open Problems in Mechanistic Interpretability

Hebbian theory

"Neurons that fire together wire together."(connect to the activations in neural network, 3B1B post)

Hebbian theory is a neuropsychological theory claiming that an increase in synaptic efficacy arises from a presynaptic cell's repeated and persistent stimulation of a postsynaptic cell. It is an attempt to explain synaptic plasticity, the adaptation of brain neurons during the learning process. It was introduced by Donald Hebb in his 1949 book The Organization of Behavior.
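
A minimal numerical sketch of the rate-based Hebbian rule (weight change proportional to the product of pre- and postsynaptic activity); the layer sizes and learning rate here are illustrative assumptions:

```python
# "Fire together, wire together": strengthen w_ij in proportion to y_i * x_j.
import numpy as np

rng = np.random.default_rng(0)
eta = 0.01                      # learning rate (illustrative)
W = rng.normal(size=(4, 8))     # postsynaptic x presynaptic weights

x = rng.random(8)               # presynaptic activity
y = W @ x                       # postsynaptic activity

W += eta * np.outer(y, x)       # Hebbian weight update
```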

Information, Entropy and KL divergence

Elements of Information Theory by Thomas M. Cover and Joy A. Thomas
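
For quick reference, the standard discrete definitions of entropy and KL divergence, computed directly in NumPy (natural log, so values are in nats):

```python
# Entropy H(p) and KL divergence D_KL(p || q) for discrete distributions.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p, where=p > 0, out=np.zeros_like(p)))

def kl_divergence(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q, where=p > 0, out=np.zeros_like(p)))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(entropy(p))           # ≈ 1.04 nats
print(kl_divergence(p, q))  # always ≥ 0, and 0 iff p == q
```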

Linear Representation Hypothesis

Superposition

Polysemanticity, Monosemanticity and Superposition

Sparse Autoencoder
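
For readers new to the topic, a minimal PyTorch sketch of the usual sparse-autoencoder setup (ReLU encoder, linear decoder, L1 sparsity penalty on feature activations); the dimensions and sparsity coefficient are illustrative assumptions:

```python
# Reconstruct model activations through a wide, sparsely active latent layer.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_hidden=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(feats)              # reconstructed activations
        return recon, feats

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                      # a batch of model activations
recon, feats = sae(acts)
l1_coeff = 1e-3                                  # illustrative sparsity weight
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(dim=-1).mean()
loss.backward()
```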

Transcoders

Sparse Cross-coders

“Crosscoders produce shared features across layers and even models.”

"We can think of autoencoders and transcoders as special cases of the general family of crosscoders "

Open questions:

  • Life cycle of features. How do features change over model training? When do they form? Do they form abruptly, or gradually grow? Do their directions drift over training, or have a relatively fixed direction from early on?
  • If we train a model twice, to what extent do we get the same features?
  • As we make a model wider, do we just get more features? Or are they largely the same features, packed less densely? Do some features get thrown away in favor of more useful features available to larger models?
  • To what extent do different architectures (e.g., vision transformers vs. conv nets) learn the same features?

Limitations / improvement

Solution for shrinkage: how and why SAEs have a reconstruction gap due to ‘feature suppression’. See Addressing Feature Suppression in SAEs.

Stitching SAEs of different sizes: when you scale up an SAE, the features in the larger SAE fall into two groups: 1) “novel features” with new information not in the small SAE, and 2) “reconstruction features” that sparsify information that already exists in the small SAE. You can stitch SAEs by adding the novel features to the smaller SAE, as sketched below.
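
A sketch of the stitching step, assuming both SAEs follow the SparseAutoencoder layout above and that `novel_idx` (the indices of the large SAE's novel features) has already been identified by some comparison procedure:

```python
# Append the "novel" features of a larger SAE onto a smaller one by copying
# their encoder rows, encoder biases, and decoder columns.
import torch

def stitch(small_sae, large_sae, novel_idx):
    """Return stitched (W_enc, b_enc, W_dec) = small SAE + novel features."""
    W_enc = torch.cat([small_sae.encoder.weight,
                       large_sae.encoder.weight[novel_idx]], dim=0)
    b_enc = torch.cat([small_sae.encoder.bias,
                       large_sae.encoder.bias[novel_idx]], dim=0)
    W_dec = torch.cat([small_sae.decoder.weight,
                       large_sae.decoder.weight[:, novel_idx]], dim=1)
    # The decoder bias is left as the small SAE's in this sketch; only the
    # feature-indexed weights grow.
    return W_enc, b_enc, W_dec
```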

Train, Auto-explain and Evaluate

SAE evaluation

Auto-explain

Steer evaluation

Resources

Use Dictionary Learning to Interpret Specific Aspects of Large Models

Social bias of LLMs

Knowledge conflict of LLMs

Personality of LLMs

Protein Language Models