LoRD: Locality Reinforced Distillation

This repository contains the source code of our preprint paper Alignment-Aware Model Extraction Attacks on Large Language Models.

Feel free to give us feedback via issues or email ([email protected]) when reproducing our work.

Introduction to LoRD

Model extraction attacks (MEAs) on large language models (LLMs) have received increasing attention in recent research. However, existing attack methods typically adapt extraction strategies originally developed for deep neural networks (DNNs) and neglect the underlying inconsistency between the training task of MEA and LLM alignment, leading to suboptimal attack performance. To tackle this issue, we propose Locality Reinforced Distillation (LoRD), a novel model extraction algorithm designed specifically for LLMs. In particular, LoRD employs a newly defined policy-gradient-style training task that uses the victim model's responses as the signal to guide the crafting of preferences for the local model. Theoretical analyses demonstrate that (i) the convergence procedure of LoRD in model extraction is consistent with the alignment procedure of LLMs, and (ii) LoRD can reduce query complexity while mitigating watermark protection through exploration-based stealing. Extensive experiments on domain-specific extractions validate the superiority of our method in extracting various state-of-the-art commercial LLMs.

images/intro.png

This figure compares vanilla MEAs on conventional DNNs (left) with MEAs on LLMs with alignment (right).

Consistent with the training procedure of conventional DNNs, the vanilla extraction procedure employs a supervised loss. But when extra training tasks such as reinforcement learning are integrated and play an important role in the training of LLMs, this consistency no longer exists, which challenges the effectiveness of vanilla MEAs on LLMs. The question is: can a supervised loss (e.g., MLE) extract an RL-aligned LLM?

In our paper, we show that the answer is yes. However, stealing LLMs suffers from two potential drawbacks:

  • Low query efficiency. Ideally, a supervised loss requires query complexity on the order of $O(V^{N_q} \cdot V^{N_r})$ to learn from an LLM, where $V$ is the vocabulary size, and $N_q$ and $N_r$ denote the sequence lengths of the query and the response, respectively.
  • Vulnerability to text watermarks. Current MEAs learn a watermarked local model when stealing.

We aim to address these two drawbacks in our research.

LoRD

images/lord.png

The core idea of LoRD is to let the local model explore correct responses under the guidance of the victim model's gold responses; the victim model acts as the “Lord”. This brings three advantages (a minimal illustrative sketch follows the list below):

images/po.png

  • Query efficiency. LoRD explores multiple responses to the same input query, reducing the complexity from $O(V^{N_q} \cdot V^{N_r})$ to $O(V^{N_q} \cdot C)$, where $C$ is a constant.
  • Watermark resistance. It achieves a trade-off between stealing performance and watermark residue.

images/cp.png

  • Stealing consistency. Its stealing procedure is consistent with the RL alignment procedure of LLMs.
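
To make this concrete, below is a minimal, hypothetical sketch of one preference-style extraction step in the spirit described above. It is not the exact objective implemented in lord_train.py; the model name, the sampling settings, and the log-sigmoid loss form are illustrative assumptions.

# Minimal sketch (assumptions, not the repo's exact objective): one preference-style
# update that pushes the local model toward the victim's response y+ and away from
# the local model's own sampled draft y-.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in for the local model; the paper uses larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def seq_logprob(model, query_ids, resp_ids):
    """Sum of log-probabilities the model assigns to resp_ids given query_ids."""
    input_ids = torch.cat([query_ids, resp_ids]).unsqueeze(0)
    logps = F.log_softmax(model(input_ids).logits[0, :-1], dim=-1)
    positions = range(query_ids.size(0) - 1, input_ids.size(1) - 1)
    return sum(logps[p, input_ids[0, p + 1]] for p in positions)

query = "Translate to English: Guten Morgen."
victim_response = " Good morning."  # the gold response returned by the victim model

q_ids = tok(query, return_tensors="pt").input_ids[0]
y_plus = tok(victim_response, return_tensors="pt").input_ids[0]

# The local model explores its own draft response for the same query.
with torch.no_grad():
    draft = model.generate(q_ids.unsqueeze(0), do_sample=True,
                           max_new_tokens=16, pad_token_id=tok.eos_token_id)
y_minus = draft[0, q_ids.size(0):]

# Preference-style loss: raise log p(y+ | q) relative to log p(y- | q).
loss = -F.logsigmoid(seq_logprob(model, q_ids, y_plus)
                     - seq_logprob(model, q_ids, y_minus))
loss.backward()
optimizer.step()

Repeating such locally sampled comparisons under a single victim response is what underlies the query-efficiency argument above: one victim query can supervise many explored local responses.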

Evaluation

Environments

You need a Python (>3.8) environment with an NVIDIA GPU (CUDA).

Clone this repository, cd into the cloned directory, and run pip install -r re.txt to install the dependencies.

Look into the Source Code / Use LoRD

The most convenient way to get started is to reuse train_pod2.py.

Explanations of the Source Code

  • scripts: all of the commands to evaluate a method. Run a script with bash XXXX.sh.
  • .py files: core code.
    • eval_ prefix: evaluation
    • draw_ or plot_ prefix: plotting
    • _process suffix: data processing
    • train in the name: different training methods
  • watermark: code for the watermark experiments.

    All other directories are scratch space for storing checkpoints or results.

Experiments

Effectiveness comparison

images/mea-table.png

images/mea-table2.png

images/mea-table3.png

The above experiments can be reproduced by running the 6.X.xxxxx.sh scripts in ./scripts. Here is an example:

#!/bin/bash

echo "HOME: ${HOME}"
export python=${HOME}/anaconda3/envs/align/bin/python3
export CUDA_VISIBLE_DEVICES="1"
export TORCH_USE_CUDA_DSA="1"
export root_dir="${HOME}/alignmentExtraction/"
export POD_save_dir="${root_dir}/wmt16_ckpts/"
export from_path="meta-llama/Meta-Llama-3-8B-Instruct"
export TRAIN_NUMS=(16)
export train_times=(1 2 3 4 5)
export msl=256
export task_ls=("cs-en" "de-en" "fi-en")
export train_taskls=("LoRD-II")

export is_black_box=1
export use_lora=1

export epoch=2
export period=1

export sub_set_num=1
export sub_stage_num=256
export max_new_tokens=64
export infer_batch_size=1
export batch_size=1

export beta=-1
export temperature=-1

export use_old_logits=1
export use_vic_logits=1
export use_kld=0
export use_entropy=0

# export tau1=0.85
export tau1=0.80
export tau2=0.85

for train_num in ${TRAIN_NUMS[*]}
do
    for train_time in ${train_times[*]}
    do
        for task in ${task_ls[*]}
        do
            for train_task in ${train_taskls[*]}
            do
                echo "====================================================="
                echo "+++++++train_num: ${train_num}+++++++"
                echo "+++++++train_time: ${train_time}+++++++"
                echo "+++++++task: ${task}+++++++"
                echo "+++++++train_task: ${train_task}+++++++"
                echo "====================================================="

                export save_path="${POD_save_dir}WMTTT0519${task}${train_num}${train_time}${train_task}"

                $python ${root_dir}lord_train.py\
                    --use_lora=$use_lora \
                    --from_path=$from_path \
                    --is_black_box=$is_black_box \
                    --sub_set_num=$sub_set_num \
                    --sub_stage_num=$sub_stage_num\
                    --infer_batch_size=$infer_batch_size\
                    --tau1=$tau1 \
                    --tau2=$tau2 \
                    --task=$train_task \
                    --device="cuda" \
                    --epoch=$epoch \
                    --period_num=$period \
                    --acc_step=1 \
                    --log_step=50 \
                    --train_num=$train_num \
                    --max_new_tokens=$max_new_tokens \
                    --LR="3e-5" \
                    --save_step=$sub_stage_num \
                    --beta=$beta \
                    --temperature=$temperature \
                    --batch_size=$batch_size \
                    --use_old_logits=$use_old_logits\
                    --use_vic_logits=$use_vic_logits\
                    --use_kld=$use_kld\
                    --max_length=$msl \
                    --dataset_task=$task \
                    --save_path=$save_path
                echo "DONE FOR ONE TRAIN NUMBERS...."
            done
        done
    done
done


$python ${root_dir}wmt_process.py

In the above script, you can simply replace the dataset with any of the other tasks defined in ./lord_train.py:

tasks_glue = [
    "cola", "mnli",
    "mrpc",
    "qnli", "qqp", "rte", "sst2",
    "wnli",]

tasks_wmt16 = [
    "cs-en",
    "de-en",
    "fi-en",
    "ro-en",
    "ru-en",
    "tr-en",
]

tasks_wmt16_wrmk=[
    "cs-en@wrmk",
    "de-en@wrmk",
    "fi-en@wrmk",
    "ro-en@wrmk",
    ]

tasks_qa = [
    "piqa",
    "truthful_qa",
    "allenai/ai2_arc",
]

tasks_code = [
    "deepmind/code_contests",
    ]

tasks_data2text = [
    "e2e_nlg",
    "allenai/common_gen",
]

tasks_data2text_wrmk=[
    "e2e_nlg@wrmk",
    "allenai/common_gen@wrmk",
    ]

tasks_sum = [
    "UCL-DARK/openai-tldr-filtered",
    "cnn_dailymail",
    "samsum",
]

tasks_text2sql = [
    "wikisql",
    "spider",
]

tasks_safety = [
    "PKU-Alignment/PKU-SafeRLHF",
    "thu-coai/diasafety",
    ]

tasks_general = [
    "liangzid/claude3_chat3.3k",
    "liangzid/claude3_short256",
    "teknium/GPT4-LLM-Cleaned",
    "BAAI/Infinity-Instruct",
]
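
If you only want to see what the victim model will be queried with, here is a minimal sketch (independent of the repo's own preprocessing in lord_train.py) that loads one of the listed datasets from the Hugging Face Hub and inspects a sample; the chosen dataset and slice are arbitrary.

# Illustrative only: peek at one of the listed datasets before running an extraction.
from datasets import load_dataset

ds = load_dataset("cnn_dailymail", "3.0.0", split="train[:16]")  # small slice
sample = ds[0]
print(sample["article"][:200])   # source text that would form the query
print(sample["highlights"])      # reference summary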

The figure below presents a spectrum of the results.

images/spectrum.png

Watermark Resistance Experiments

We use the green-list-based watermarking scheme of Kirchenbauer et al. to implement our text watermarks.

The original code comes from here. All rights are reserved by the original repository.
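
For intuition, here is a simplified, self-contained sketch of green-list detection in the style of Kirchenbauer et al. It is not the code in ./watermark; the hash scheme, the green fraction GAMMA, and the decision threshold are illustrative assumptions.

# Simplified green-list watermark detection (illustrative, not the ./watermark code).
# Each token's "green list" is a pseudo-random subset of the vocabulary seeded by the
# previous token; watermarked text over-uses green tokens, which a z-test can detect.
import math
import random

VOCAB_SIZE = 32000
GAMMA = 0.5  # fraction of the vocabulary placed in the green list (assumption)

def is_green(prev_token, token):
    rng = random.Random(prev_token)  # re-derive the partition from the previous token
    green = set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))
    return token in green

def green_z_score(token_ids):
    """z-score of the observed green-token count against the unwatermarked expectation."""
    hits = sum(is_green(p, t) for p, t in zip(token_ids, token_ids[1:]))
    n = len(token_ids) - 1
    return (hits - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)

# Unwatermarked random tokens should give a z-score near 0; watermarked text,
# whose generator boosted green-token logits, gives a large positive z-score.
print(green_z_score([random.randrange(VOCAB_SIZE) for _ in range(200)]))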

images/wm-ex.png

Our evaluation code is in ./watermark.

./watermark/llama3_watermark_gen.py shows how to generate watermarked texts with Llama3-70B.

You can simply run bash ./watermark/1.1.train_with_wtmk.sh to reproduce all of the watermark experiments.

Detection and visualization can be run with:

$python ${root_dir}watermark/watermark_detect.py

$python ${root_dir}plot_watermark_curve.py

Hyper-parameter Experiments

images/querytime-ex.png

images/model-ex.png

Fidelity

images/fidelity.png

Distributional Distance to Victim Models

images/corre-dist.png

Reference

@misc{liang2024alignmentawaremodelextractionattacks,
      title={Alignment-Aware Model Extraction Attacks on Large Language Models},
      author={Zi Liang and Qingqing Ye and Yanyun Wang and Sen Zhang and Yaxin Xiao and Ronghua Li and Jianliang Xu and Haibo Hu},
      year={2024},
      eprint={2409.02718},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2409.02718}
}
