cannot reproduce the results reported in the Espresso paper #80

Open
Alex357853 opened this issue Jun 14, 2024 · 4 comments

Comments

@Alex357853

Hi, this is a really good and useful codebase. I tried to reproduce the results reported in the paper but failed. I used the code in README_ESE.md:

WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 -m angle_emb.angle_trainer \
--model_name_or_path WhereIsAI/UAE-Large-V1 \
--train_name_or_path SeanLee97/nli_for_simcse --save_dir ckpts/UAE-Large-Espresso \
--ibn_w 10.0 --cosine_w 0. --angle_w 1.0 --angle_tau 20.0 --learning_rate 1e-6 --maxlen 75 \
--workers 16 \
--pooling_strategy cls \
--epochs 1 \
--batch_size 128 \
--logging_steps 100 \
--warmup_steps 200 \
--save_steps 1000 \
--fp16 1 \
--gradient_accumulation_steps 4 \
--apply_ese 1 \
--ese_compression_size 128 \
--ese_kl_temperature 1.0

However, it only gave the following results:

sts12 sts13 sts14 sts15 sts16 STSB SICKR Avg.
79.25 88.63 84.15 89.61 85.99 87.79 79.59 85.00

I also changed --cosine_w 0. to --cosine_w 1.0 and --ibn_w 10.0 to --ibn_w 35.0, but the results were even worse.

The results reported in your paper are:

sts12 sts13 sts14 sts15 sts16 STSB SICKR Avg.
79.64 90.40 85.76 90.33 86.64 88.54 81.09 86.06

If I purely evaluate the WhereIsAI/UAE-Large-V1 model, the results are:

sts12 sts13 sts14 sts15 sts16 STSB SICKR Avg.
79.09 89.62 85.02 89.51 86.61 89.06 82.09 85.86

This means fine-tuning gave me worse performance. In addition, I noticed that the more epochs I train, the worse the performance gets.
I also tried the command in examples/NLI/README.md to train Qwen1.5-0.5B:

CUDA_VISIBLE_DEVICES=1,2,3,4 torchrun --nproc_per_node=4 --master_port=1234 train_angle.py \
--task NLI-STS --save_dir ckpts/NLI-STS-angle-Qwen1.5-0.5B \
--model_name Qwen/Qwen1.5-0.5B \
--w2 35 --learning_rate 1e-4 --maxlen 50 \
--lora_r 32 --lora_alpha 32 --lora_dropout 0.1 \
--save_steps 500 --batch_size 120 --seed 42 --do_eval 0 --load_kbit 4 --gradient_accumulation_steps 4 --epochs 1

It gave me an average score of 70.23, whereas the paper reports 82.82.

I wonder whether these scripts are the ones you used to train your model, especially regarding the parameter values. It would be really helpful if you could assist me in reproducing the results so I can use this codebase. I really appreciate your time and help! Thank you!

@SeanLee97
Owner

SeanLee97 commented Jun 15, 2024

Hi @Alex357853, thanks for following our work. Since ESE is under review, we didn't provide many details.

  1. For UAE, you can try increasing ibn_w to 20 and evaluating with cls_avg pooling (while still training with cls).

  2. For Qwen, we use bi-directional LLMs, i.e., we remove the causal mask of the LLM. For more details, you can refer to this documentation: https://angle.readthedocs.io/en/latest/notes/training.html#angle-trainer-recommended (in 3. Examples / b. LLaMA-based). Specifically, we set --apply_billm 1, --billm_model_class Qwen2ForCausalLM, --load_kbit 8, and --epochs 2 (a sketch of such a training command follows at the end of this comment). I've uploaded the evaluation script here and made the ese-qwen weights public. You can try evaluating the public model to check whether the evaluation works as expected.

The evaluation script is as follows:

BiLLM_START_INDEX=0 CUDA_VISIBLE_DEVICES=0 python eval_ese_nli.py --pooling_strategy avg --model_name_or_path Qwen/Qwen1.5-0.5B  --lora_weight WhereIsAI/ese-qwen-0.5b-nli --billm_model_class Qwen2ForCausalLM

BTW, you can try increasing gradient_accumulation_steps to a multiple of the GPU count; it might help improve performance further.
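
For reference, a hedged sketch of what such a training command could look like, assuming the remaining arguments mirror the UAE command above (the save_dir path is hypothetical, and the exact recommended settings, e.g. LoRA options, are in the linked documentation):

# Sketch only; see the linked documentation for the full recommended settings.
BiLLM_START_INDEX=0 WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 --master_port=1234 -m angle_emb.angle_trainer \
--model_name_or_path Qwen/Qwen1.5-0.5B \
--train_name_or_path SeanLee97/nli_for_simcse --save_dir ckpts/ese-qwen-0.5b-nli \
--apply_billm 1 \
--billm_model_class Qwen2ForCausalLM \
--load_kbit 8 \
--epochs 2 \
--apply_ese 1 \
--ese_compression_size 128 \
--ese_kl_temperature 1.0 \
--pooling_strategy avg \
--maxlen 75 \
--batch_size 128 \
--gradient_accumulation_steps 4 \
--save_steps 1000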

@SeanLee97
Owner

BTW, you can also try the newly released Qwen/Qwen2-0.5B; it might boost the performance further.

@Alex357853
Author

Hi @SeanLee97, thanks for your prompt reply! I am still struggling with the code. I noticed that your trainer cannot train properly with the "last" pooling strategy. The potential bug I found is in:

AnglE/angle_emb/angle.py, lines 667 to 672 in 191ca1b:

features['attention_mask'] = self.tokenizer.pad(
{'input_ids': [feature['attention_mask'] for feature in new_features]},
padding=self.padding,
max_length=self.max_length,
return_tensors=return_tensors,
)['input_ids']

For example, after

AnglE/angle_emb/angle.py, lines 661 to 666 in 191ca1b:

features = self.tokenizer.pad(
{'input_ids': [feature['input_ids'] for feature in new_features]},
padding=self.padding,
max_length=self.max_length,
return_tensors=return_tensors,
)

we already get features['attention_mask'] = tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]). However, after lines 667 to 672 it becomes tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643, 151643]). I think this may hurt the model's performance from the start, and other pooling strategies may be affected as well. Could you please clarify whether this is an issue in your code? Thank you for your time and help!

@SeanLee97
Owner

@Alex357853 Thank you for reporting this issue! It is indeed a bug.

The code uses the tokenizer's pad token to pad the attention mask; however, in Qwen the pad token id is 151643, not 0: https://huggingface.co/Qwen/Qwen2-0.5B-Instruct/blob/main/tokenizer_config.json#L36
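
For illustration, a minimal repro sketch of this behavior (assuming the Qwen tokenizer's pad token is set to <|endoftext|>, id 151643):

from transformers import AutoTokenizer

# Repro sketch: attention masks passed to tokenizer.pad under the 'input_ids'
# key get padded with pad_token_id (151643 for Qwen) instead of 0.
tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen1.5-0.5B')
masks = [[1, 1, 1], [1, 1, 1, 1, 1]]  # attention masks, mislabeled as input_ids
padded = tokenizer.pad({'input_ids': masks}, padding='longest', return_tensors='pt')['input_ids']
print(padded)  # the shorter mask is padded with 151643 rather than 0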

I am fixing this issue in PR #89.
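
A minimal sketch of one possible workaround, padding the attention masks with zeros directly instead of going through tokenizer.pad (the actual change in #89 may differ; pad_attention_masks is a hypothetical helper):

import torch

# Sketch only: build a zero-padded attention-mask tensor without tokenizer.pad,
# so padding positions stay 0 instead of pad_token_id.
def pad_attention_masks(new_features):
    masks = [feature['attention_mask'] for feature in new_features]
    max_length = max(len(mask) for mask in masks)
    padded = torch.zeros(len(masks), max_length, dtype=torch.long)
    for i, mask in enumerate(masks):
        padded[i, :len(mask)] = torch.as_tensor(mask, dtype=torch.long)
    return padded

# e.g. features['attention_mask'] = pad_attention_masks(new_features)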

Thank you again!
