
dr-darryl-wright/toy_model_interpretability_penalised

 
 


Two-Layer Toy Model

The file model3.py implements a two-layer toy model that can be trained on feature decoding tasks. It is primarily intended for interpretability experiments.

Calling run(...) trains a model and returns the loss curve, the final model, checkpointed models (spaced uniformly in the logarithm of the training step), and a dictionary containing details of the model and task setup.
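To illustrate what "uniformly spaced in log training steps" means, here is a minimal sketch of one way to pick checkpoint steps. The helper name log_spaced_steps and its arguments are hypothetical and are not taken from model3.py; the actual checkpoint schedule used by run(...) may differ.

```python
import math


def log_spaced_steps(total_steps, num_checkpoints):
    """Return sorted, unique step indices spaced uniformly in log(step).

    Hypothetical helper for illustration; not part of model3.py.
    """
    if num_checkpoints < 2:
        return [total_steps]
    steps = set()
    for i in range(num_checkpoints):
        frac = i / (num_checkpoints - 1)
        # Interpolate linearly between log(1) = 0 and log(total_steps).
        steps.add(round(math.exp(frac * math.log(total_steps))))
    return sorted(steps)


print(log_spaced_steps(1000, 4))  # [1, 10, 100, 1000]
```

Spacing checkpoints this way gives dense coverage early in training, where the loss typically changes fastest, and sparse coverage later on.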

The parameters N, m, and k specify the number of features, the embedding dimension, and the nonlinear dimension of the model, respectively. The parameter eps specifies the mean feature frequency (so the feature sparsity is S=1-eps). The parameter sample_kind specifies whether to use uniform ("equal") or power-law ("power_law") feature frequencies. init_bias is the initial mean bias of the nonlinear units. nonlinearity specifies which nonlinearity to use: "ReLU", "GeLU", or "SoLU". task specifies the task: either "decoder" (the feature decoder) or "abs" (the absolute-value feature decoder). decay specifies the weight decay rate on the biases; it is multiplied by the learning rate to give the per-step weight decay.
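The following sketch shows how eps, sample_kind, task, and decay could fit together. It is a set of assumptions for illustration, not the code in model3.py: the function names are hypothetical, the power-law form (frequency proportional to 1/rank) is assumed, active features are assumed to take values uniform in [-1, 1], and the targets are assumed to be the features themselves ("decoder") or their absolute values ("abs").

```python
import random


def feature_probs(N, eps, sample_kind="equal"):
    """Per-feature activation probabilities whose mean is eps."""
    if sample_kind == "equal":
        return [eps] * N
    if sample_kind == "power_law":
        # Assumed form: p_i proportional to 1 / (i + 1), rescaled to mean eps.
        # For large eps the leading probabilities can exceed 1 and would
        # need clipping; that case is not handled here.
        raw = [1.0 / (i + 1) for i in range(N)]
        scale = eps * N / sum(raw)
        return [r * scale for r in raw]
    raise ValueError(f"unknown sample_kind: {sample_kind!r}")


def sample_features(probs, rng):
    """Draw one sparse input: feature i is nonzero with probability probs[i]."""
    # Assumption: active features are uniform in [-1, 1].
    return [rng.uniform(-1.0, 1.0) if rng.random() < p else 0.0 for p in probs]


def target(x, task="decoder"):
    """Regression target for the chosen task (assumed definitions)."""
    if task == "decoder":
        return list(x)
    if task == "abs":
        return [abs(v) for v in x]
    raise ValueError(f"unknown task: {task!r}")


def per_step_decay(decay, lr):
    """Per-step weight decay: the decay rate multiplied by the learning rate."""
    return decay * lr


rng = random.Random(0)
probs = feature_probs(N=8, eps=0.05, sample_kind="power_law")
x = sample_features(probs, rng)
y = target(x, task="abs")
sparsity = 1 - sum(probs) / len(probs)  # S = 1 - eps
```

With eps=0.05, each feature is active in about 5% of samples on average, so most entries of x are zero; under the assumed power-law form, low-index features are active more often than high-index ones.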
