cannot reproduce the MMLU accuracy claimed in paper, could you release the script? #28

Open
wenjingk-xilinx opened this issue Mar 23, 2024 · 4 comments

@wenjingk-xilinx

Hi, I tried to reproduce the LLaMA-7B fine-tuning results on MMLU. The maximum 5-shot eval accuracy I get is 36.5% with 4 bits and 32.8% with 3 bits. Could you please specify your training configuration or release the training script?

@xxw11
Collaborator

xxw11 commented Mar 25, 2024

Hi, using the default parameter settings in qalora.py can replicate the results of the paper.
For details on the evaluation, please refer to the paper.

@freeSoul-SNU

@xxw11 Hi. I checked the results without merging the LoRA adapters into the quantized parameters. I used the quantized Llama7B model with 4-bit quantization on the Alpaca dataset.

For the MMLU evaluation, the results were as follows:

  • Humanities: 35.2 (Paper: 36.6)
  • STEM: 32.0 (Paper: 32.4)
  • Social Science: 34.7 (Paper: 44.8)
  • Other: 39.6 (Paper: 44.9)
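
To make the per-category gap explicit, here is a small sketch that tabulates the numbers above. The helper names are hypothetical, and the unweighted mean is an assumption for illustration only: the MMLU average usually reported is weighted by the number of questions per subject, so it may differ from this figure.

```python
# Scores reported in this comment vs. the paper values quoted alongside them.
reproduced = {"Humanities": 35.2, "STEM": 32.0, "Social Science": 34.7, "Other": 39.6}
paper = {"Humanities": 36.6, "STEM": 32.4, "Social Science": 44.8, "Other": 44.9}


def gaps(ours, ref):
    """Per-category difference (ours - ref), rounded to one decimal place."""
    return {name: round(ours[name] - ref[name], 1) for name in ref}


def macro_avg(scores):
    """Unweighted mean over categories (an assumption, not the official MMLU average)."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    for name, diff in gaps(reproduced, paper).items():
        print(f"{name:15s} {diff:+.1f}")
    print(f"macro avg: {macro_avg(reproduced):.1f} vs. {macro_avg(paper):.1f}")
```

Most of the shortfall sits in Social Science and Other, while Humanities and STEM are within about 1.5 points of the paper.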

Regarding hyperparameters: I used the code's default learning rate of 0.0002, which differs from the paper's setting of 0.00002.
I set the batch size to 16 and the gradient accumulation steps to 1. I also uninstalled the Triton package, as suggested in another issue.

With these default parameters I could not replicate the performance reported in the paper. As a test, I also trained with the learning rate given in the paper (0.00002), but the results were even worse.
Notably, the MMLU loss increased as training progressed.

Do you have any tips on what might be causing this?

Thanks.

@xxw11
Collaborator

xxw11 commented Oct 31, 2024

@freeSoul-SNU In my reproduction process, I found that using a linear learning rate, as specified in the original paper, might lead to unstable results. However, the MMLU score should be around 38-40. Your parameters are basically the same as mine; I set the batch size to 1 and the gradient accumulation step to 16, but theoretically, this shouldn't affect the final results. Did you use the MMLU evaluation repository mentioned in the QA-LoRA paper?
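
As a rough illustration of how a linear schedule changes the step size over a run, the sketch below compares a linear decay to zero with a constant schedule. The function names, the absence of warmup, and the example values (base rate 2e-5, 10,000 steps) are assumptions for illustration; this is not the scheduler code used by qalora.py.

```python
def linear_lr(step, base_lr, max_steps):
    """Linear decay from base_lr at step 0 down to 0 at max_steps (no warmup)."""
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max_steps


def constant_lr(step, base_lr, max_steps):
    """Constant schedule: the learning rate never changes."""
    return base_lr


if __name__ == "__main__":
    base_lr, max_steps = 2e-5, 10_000  # illustrative values only
    for step in (0, 2_500, 5_000, 7_500, 10_000):
        print(step, linear_lr(step, base_lr, max_steps), constant_lr(step, base_lr, max_steps))
```

If the training script exposes Hugging Face `TrainingArguments` (an assumption here), the schedule would normally be chosen through `lr_scheduler_type`, which accepts values such as `linear` and `constant`.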

@freeSoul-SNU

freeSoul-SNU commented Oct 31, 2024

@xxw11 Thank you very much for your response.

For the MMLU evaluation, I used the mmlu_evaluation function in qalora.py shared by the author. The dataset wasn't included in the qalora Git repository, but I found it in the following GPTQ LoRA repository and used it: https://github.com/qwopqwop200/gptqlora

I also believe that increasing the batch size to 16 with gradient_accumulation set to 1 should not make a significant difference, but the MMLU evaluation results are still very low.
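
The reasoning behind that expectation can be sketched with a toy example: for a loss averaged per sample and equally sized micro-batches, accumulating 16 micro-batches of size 1 gives the same gradient as one batch of 16. The model (a single weight fitted with squared error) and all names below are hypothetical and only serve to show the arithmetic.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the batch mean of 0.5 * (w * x - y) ** 2."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Average the per-micro-batch mean gradients, which is what gradient
    accumulation does when each micro-batch loss is divided by the number
    of accumulation steps."""
    chunks = [data[i:i + micro_batch_size] for i in range(0, len(data), micro_batch_size)]
    return sum(grad_mse(w, chunk) for chunk in chunks) / len(chunks)


if __name__ == "__main__":
    data = [(float(i), 2.0 * i + 1.0) for i in range(16)]  # 16 toy samples
    w = 0.5
    g_16x1 = accumulated_grad(w, data, micro_batch_size=16)  # batch 16, 1 accumulation step
    g_1x16 = accumulated_grad(w, data, micro_batch_size=1)   # batch 1, 16 accumulation steps
    print(g_16x1, g_1x16)
```

The equivalence is exact only under those assumptions. With token-level losses over sequences of different lengths, the averaging inside each micro-batch can differ from averaging over the full batch, so small discrepancies between the two configurations are possible in practice.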

I used the following script for fine-tuning:

CUDA_VISIBLE_DEVICES=1 HF_DATASETS_OFFLINE=1 python qalora.py --model_path "AutoGPTQ/examples/quantization/llama7b-quant4bit-g32/" \
    --output_dir output_alpaca \
    --dataset alpaca \
    --do_eval True \
    --do_mmlu_eval True \
    --do_train True \
    --mmlu_dataset 'mmlu-fs' \
    --save_strategy 'no' \
    --save_steps 1000 \
    --max_steps 10000 \
    --optim paged_adamw_32bit \

If it’s not too much trouble, could you share the MMLU evaluation script you used?

One more question: before training, did you change the part of qalora.py that loads the model in float32 so that it loads in float16 instead?

Thank you so much.
