QA-LoRA has been accepted by ICLR 2024!
This repository provides the official PyTorch implementation of QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models.
QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM's weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy.
Update: the conflict with the newest AutoGPTQ version has been fixed.
conda create -n qalora python=3.8
conda activate qalora
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
git clone -b v0.3.0 https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
pip install .
cd ..
pip install bitsandbytes
pip install -r requirements.txt
pip install protobuf==3.20.*
Replace the peft_utils.py in your own auto-gptq path (python path/auto_gptq/utils/peft_utils.py) with the new one.
For users of GPTQLORA, only the peft_utils.py file needs to be changed.
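If it is unclear where auto-gptq is installed, the file to replace can be located from Python. The following is a minimal sketch; the helper name `locate_module_file` is hypothetical and not part of this repository, and it assumes the package is importable under the name `auto_gptq` in the active environment.

```python
import importlib.util
from pathlib import Path


def locate_module_file(package, relative_path):
    """Return the path of relative_path inside an installed package, or None.

    find_spec looks the package up without importing it.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package not installed (or a namespace package)
    return Path(spec.origin).parent / relative_path


# Prints None when auto_gptq is not installed in the current environment.
print(locate_module_file("auto_gptq", "utils/peft_utils.py"))
```

The printed path is the file that should be overwritten with the peft_utils.py from this repository.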
We use GPTQ for quantization, with the following settings:
bits=4, group-size=32, act-order=False
If you change the group-size, you need to change the group_size in peft_utils.py and merge.py accordingly.
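Because group_size has to agree in several places, it can help to check a quantized checkpoint against the settings above before training. This is a sketch only: the helper `quantize_config_mismatches` is hypothetical, and the JSON key names (`bits`, `group_size`, `desc_act`) are assumptions based on the quantize_config.json that AutoGPTQ typically writes, with act-order assumed to correspond to `desc_act`.

```python
import json
from pathlib import Path

# Settings stated above. Key names are assumed, not confirmed by this repo.
EXPECTED = {"bits": 4, "group_size": 32, "desc_act": False}


def quantize_config_mismatches(model_path, expected=EXPECTED):
    """Return {key: (found, wanted)} for every setting that differs."""
    config_file = Path(model_path) / "quantize_config.json"
    config = json.loads(config_file.read_text())
    return {
        key: (config.get(key), wanted)
        for key, wanted in expected.items()
        if config.get(key) != wanted
    }
```

An empty result means the checkpoint matches the defaults; otherwise the reported group_size is the value to mirror in peft_utils.py and merge.py.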
python qalora.py --model_path <path>
The file structure of the model checkpoint is as follows:
config.json llama7b-4bit-32g.bin special_tokens_map.json tokenizer_config.json
generation_config.json quantize_config.json tokenizer.model
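A quick way to confirm that a checkpoint directory has this layout is sketched below. The helper `missing_checkpoint_files` is hypothetical; the file names are copied from the listing above, and the weight file name (llama7b-4bit-32g.bin) is expected to differ for other models or quantization settings.

```python
from pathlib import Path

# File names taken from the listing above; adjust the .bin name for your model.
EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "quantize_config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.model",
    "llama7b-4bit-32g.bin",
]


def missing_checkpoint_files(model_path, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_path."""
    root = Path(model_path)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_checkpoint_files("<path>")` with the same path passed to `--model_path` returns an empty list when all files are present.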
Note that our trained LoRA modules can be perfectly merged into the quantized model. We offer a simple merge script in this repo.
There are two implementations of the dimension reduction (x from D_in to D_in // group_size). Both are mathematically equivalent.
Adopt an average-pooling operation. In this case, the adapter weights are divided by group_size during the merge (refer to merge.py).
adapter_result = (lora_B(lora_A(lora_dropout(self.qa_pool(x)))) * scale).type_as(result)
model[tmp_key+'.qzeros'] -= (lora['base_model.model.'+tmp_key+'.lora_B.weight'] @ lora['base_model.model.'+tmp_key+'.lora_A.weight']).t() * scale / group_size / model[tmp_key+'.scales']
Use a sum operation. In this case, the adapter weights do not need to be divided during the merge.
adapter_result = (lora_B(lora_A(lora_dropout(self.qa_pool(x) * group_size))) * scale).type_as(result)
model[tmp_key+'.qzeros'] -= (lora['base_model.model.'+tmp_key+'.lora_B.weight'] @ lora['base_model.model.'+tmp_key+'.lora_A.weight']).t() * scale / model[tmp_key+'.scales']
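The two variants and their merge rules can be checked numerically on a toy layer. The sketch below uses NumPy rather than PyTorch to stay lightweight, and it rests on assumptions that are not stated in this README: the dequantization convention W = scales * (q - zeros) with one scale and zero per group of input features, weights stored as (D_in, D_out), and small made-up shapes. It shows that each variant, merged with its own rule, reproduces the un-merged output, and that for identical adapter weights the two adapter outputs differ only by the factor group_size (which the adapter can absorb during training).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, group_size, scale = 8, 4, 2, 4, 0.5
n_groups = d_in // group_size

x = rng.normal(size=(3, d_in))               # batch of 3 inputs
lora_A = rng.normal(size=(rank, n_groups))   # acts on the pooled input
lora_B = rng.normal(size=(d_out, rank))


def avg_pool(x):
    """Average consecutive blocks of group_size features (stand-in for qa_pool)."""
    return x.reshape(x.shape[0], -1, group_size).mean(axis=-1)


def dequant(q, scales, zeros):
    """Assumed convention: W = scales * (q - zeros), shared within each group."""
    s = np.repeat(scales, group_size, axis=0)
    z = np.repeat(zeros, group_size, axis=0)
    return s * (q - z)


# Toy quantized layer with already-decoded floating-point zeros.
q = rng.integers(0, 16, size=(d_in, d_out)).astype(np.float64)
scales = rng.uniform(0.01, 0.1, size=(n_groups, d_out))
zeros = rng.uniform(0.0, 15.0, size=(n_groups, d_out))
base = x @ dequant(q, scales, zeros)

# Variant 1: average pooling; the merge divides by group_size.
adapter_avg = avg_pool(x) @ lora_A.T @ lora_B.T * scale
zeros_avg = zeros - (lora_B @ lora_A).T * scale / group_size / scales

# Variant 2: sum pooling; the merge does not divide.
adapter_sum = (avg_pool(x) * group_size) @ lora_A.T @ lora_B.T * scale
zeros_sum = zeros - (lora_B @ lora_A).T * scale / scales

avg_merge_ok = np.allclose(base + adapter_avg, x @ dequant(q, scales, zeros_avg))
sum_merge_ok = np.allclose(base + adapter_sum, x @ dequant(q, scales, zeros_sum))
same_up_to_factor = np.allclose(adapter_sum, group_size * adapter_avg)
print(avg_merge_ok, sum_merge_ok, same_up_to_factor)
```

The merge works because the pooled adapter adds the same value to every weight row within a group, which is exactly the effect of shifting that group's zero point by the added value divided by the scale.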
Some GPTQ implementations, such as GPTQ-for-llama, further compress the zeros into qzeros. In that case, you need to decode the qzeros first and restore the zeros in fp16 format.
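As an illustration of what decoding can look like, the sketch below unpacks 4-bit zeros from 32-bit words. It assumes eight 4-bit values per int32 word with the first value in the lowest bits; the real bit layout, and whether a constant offset was subtracted before packing (the `offset` parameter here), depend on the GPTQ implementation and should be checked against its packing code. The function name `unpack_qzeros` is hypothetical.

```python
import numpy as np


def unpack_qzeros(qzeros, bits=4, offset=0):
    """Unpack int32 words (n_groups, n_words) into float16 zeros.

    Assumed layout: 32 // bits values per word, lowest bits first.
    offset is added back after unpacking (assumption; set it to match
    whatever the packing code subtracted, if anything).
    """
    per_word = 32 // bits
    mask = (1 << bits) - 1
    # Widen to int64 and keep the unsigned 32-bit pattern of each word.
    words = np.asarray(qzeros).astype(np.int64) & 0xFFFFFFFF
    shifts = np.arange(per_word, dtype=np.int64) * bits
    values = (words[:, :, None] >> shifts) & mask   # (n_groups, n_words, per_word)
    return (values.reshape(words.shape[0], -1) + offset).astype(np.float16)


# One word holding the nibbles 0..7, lowest nibble first.
packed = np.array([[0x76543210]], dtype=np.int32)
print(unpack_qzeros(packed))  # values 0 through 7 as float16
```

After decoding, the fp16 zeros can be updated with the merge formula and, if the target format requires it, packed again.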