We have released ELM models for various base models on Hugging Face here. To use one of the models:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "baulab/elm-zephyr-7b-beta"
device = 'cuda:0'
dtype = torch.float32
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)
model = model.to(device)
model.requires_grad_(False)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False)
# generate text
prompt = "Your prompt here"  # replace with your own prompt
inputs = tokenizer(prompt, return_tensors='pt', padding=True)
inputs = inputs.to(device)  # keep token ids as integer tensors; only move them to the device
outputs = model.generate(**inputs,
                         max_new_tokens=300,
                         do_sample=True,
                         top_p=0.95,
                         temperature=1.2)
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(outputs[0])
To set up your Python environment:
conda create -n elm python=3.9
conda activate elm
git clone https://github.com/rohitgandikota/erasing-llm.git
cd erasing-llm
pip install -r requirements.txt
The bio-forget dataset is gated by the WMDP team, so you must request access to it separately. See here for more details, and use the official request link from the WMDP team.
If you want to train multiple ELM models for the same base model, it is beneficial to pre-generate the consistency training data so that you do not have to regenerate it for every run (generation is time-consuming). We therefore provide a script that pre-generates the data and stores it as a text file that can be used directly during training, which makes training much faster. Pre-generation code:
cd trainscripts
python prepare_consistency_data.py --model_id 'HuggingFaceH4/zephyr-7b-beta' --num_samples 5000 --dataset_idx '0,1' --pregenerated_consistency_path '../consistency_data' --device 'cuda:0'
To train an ELM model that erases WMDP knowledge from Zephyr-7B-beta:
cd trainscripts
python erase.py --dataset_idx '0,0,1' --model_id 'HuggingFaceH4/zephyr-7b-beta' --num_samples 3000 --eta 1000 --experiment_name 'zephyr-elm-wmdp'
The trained ELM PEFT model will be saved to the ./elm_models/zephyr-elm-wmdp/checkpoint-final folder.
To train an ELM model with a different base model, e.g. Llama-2-7B-chat:
cd trainscripts
python erase.py --lora_rank 256 --eta 1000 --num_samples 5000 --dataset_idx '2,2,2' --model_id 'meta-llama/Llama-2-7b-chat-hf' --experiment_name 'llama2-elm-hp'
The trained ELM PEFT model will be saved to the ./elm_models/llama2-elm-hp/checkpoint-final folder.
To test the pre-trained models we provide, or your own trained model, use the notebooks/inference.ipynb notebook.
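If you prefer a quick script instead of the notebook, the following is a minimal sketch of loading a trained checkpoint with the peft library; the checkpoint path matches the Zephyr training example above, the prompt is a placeholder, and the notebook may load the model differently:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = 'HuggingFaceH4/zephyr-7b-beta'
ckpt = './elm_models/zephyr-elm-wmdp/checkpoint-final'  # saved by erase.py above

# load the frozen base model and attach the trained ELM adapter
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float32).to('cuda:0')
model = PeftModel.from_pretrained(base, ckpt)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(base_id)
inputs = tokenizer("Your prompt here", return_tensors='pt').to('cuda:0')
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))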
When erasing a piece of knowledge from a language model, it is easy to either destroy the model or not erase anything at all. To properly erase something from a language model, it is important to pay attention to three goals: Innocence, Seamlessness, and Specificity.
Innocence: the erased model should not exhibit any traces of the erased knowledge. Seamlessness: the model should not generate gibberish text upon encountering the concept, but rather act as if it has never heard of it. Specificity: the erasure should not affect the general capabilities of the original model.
We introduce a new method called Erasure of Language Memory (ELM).
To erase a concept from a language model, we construct multiple objectives in the following way:
The unlearned model should assign a sequence of text roughly the same probability as the original model does, but with a reduced likelihood that the text reflects the concept c_n (e.g. "expert in bioweapons") and an increased likelihood that it reflects the concept c_p (e.g. "novice in bioweapons").
P'(x) = P(x) * (P(c_p|x) / P(c_n|x))^eta
where x is the text, c_n is the concept we wish to erase, and c_p is the concept we wish the model to adopt instead. This equation can be simplified using Bayes' theorem and taking the log:
log P'(x) ∝ log P(x) + eta * (log P(x|c_p) - log P(x|c_n))
where P(x|c) can be characterised as the probability of the text x being generated by the model with the concept prompt c as a prefix. For an autoregressive model, this can be expanded as:
log P'(x_i | x_1:i-1) ∝ log P(x_i | x_1:i-1) + eta * (log P(x_i | c_p, x_1:i-1) - log P(x_i | c_n, x_1:i-1))
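As a concrete illustration, below is a minimal PyTorch sketch of how this target distribution could be computed from a frozen reference model. The helper name, the assumption of a single unpadded sequence per batch, and the KL training step in the trailing comment are illustrative assumptions, not the exact implementation in erase.py.

import torch

def elm_target_distribution(ref_model, plain_ids, pos_ids, neg_ids, eta):
    # plain_ids: token ids of the text x (shape [1, T], no padding assumed)
    # pos_ids:   the same text prefixed with the positive concept prompt c_p
    # neg_ids:   the same text prefixed with the negative concept prompt c_n
    T = plain_ids.shape[1]
    with torch.no_grad():
        # positions predicting x_2 ... x_T (x_1 has no plain-context prediction)
        logp_plain = ref_model(plain_ids).logits[:, :-1].log_softmax(dim=-1)
        logp_pos = ref_model(pos_ids).logits[:, -T:-1].log_softmax(dim=-1)
        logp_neg = ref_model(neg_ids).logits[:, -T:-1].log_softmax(dim=-1)
    # log P'(x_i | x_1:i-1) ∝ log P(x_i | x_1:i-1) + eta * (log P_pos - log P_neg)
    return (logp_plain + eta * (logp_pos - logp_neg)).softmax(dim=-1)

# The edited model (e.g. a LoRA adapter) can then be trained toward this target,
# for example with a KL divergence on the same positions:
# target = elm_target_distribution(original_model, plain_ids, pos_ids, neg_ids, eta=1000)
# loss = torch.nn.functional.kl_div(
#     edited_model(plain_ids).logits[:, :-1].log_softmax(-1), target, reduction='batchmean')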
Similarly, we train the model to maintain its general capabilities by using a retain set.
Finally, we train the model to maintain fluency under attack, i.e., we want the model to remain fluent when prompted about the erased concept (a schematic sketch of combining these objectives is shown below). Please refer to the paper for more details.
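The sketch below shows one way the three objectives could be combined into a single training loss. The specific loss forms (KL for erasure and retention, cross-entropy for fluency) and the weights are illustrative assumptions, not the exact settings from the paper:

import torch.nn.functional as F

def combined_elm_loss(erase_logits, erase_target,        # edited-model logits vs. the ELM target above
                      retain_logits, ref_retain_logits,  # retain-set logits vs. the frozen original model
                      attack_logits, attack_labels,      # fluent continuations for concept-related prompts
                      w_erase=1.0, w_retain=1.0, w_fluency=1.0):
    erase = F.kl_div(erase_logits.log_softmax(-1), erase_target, reduction='batchmean')
    retain = F.kl_div(retain_logits.log_softmax(-1), ref_retain_logits.softmax(-1), reduction='batchmean')
    fluency = F.cross_entropy(attack_logits.transpose(1, 2), attack_labels)
    return w_erase * erase + w_retain * retain + w_fluency * fluency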
The preprint can be cited as follows:
@article{gandikota2024elm,
title={Erasing Conceptual Knowledge from Language Models},
author={Rohit Gandikota and Sheridan Feucht and Samuel Marks and David Bau},
journal={arXiv preprint arXiv:2410.02760},
year={2024}
}