
HumanEval and MBPP results of deepseek-coder-6.7b-instruct are lower than the official report from the DeepSeek team #262

Open
jessyford opened this issue Aug 9, 2024 · 2 comments

@jessyford

Hello, thanks for your great work. However, here are three questions that confuse me.

  1. I ran the HumanEval and MBPP evaluation on deepseek-coder-6.7b-instruct.
    Official report at https://deepseekcoder.github.io/:
    HumanEval: 78.6
    MBPP: 65.4
    My pass@1 results with bigcode:
    HumanEval: 71.95
    MBPP: 58.8

Each score in my own run is lower than the official report. Do you have any idea why?

  2. As you know, the DeepSeek instruct model has a specific prompt format, which you have already implemented in the HumanEvalPack class in humanevalpack.py. I am wondering: when I run bigcode's evaluation (humaneval or mbpp), do I need to set any specific parameter to tell the program to wrap the question in the DeepSeek format? Or does bigcode automatically detect that it should use the DeepSeek format? (If so, how?)

  3. In DeepSeek's paper they mention that they run HumanEval zero-shot and MBPP few-shot. I also noticed that the official MBPP paper uses three-shot. So I am wondering which strategy bigcode uses: one-shot, three-shot, or something else?

Here is my script for running mbpp (for humaneval, only the --tasks value is changed to humaneval):

accelerate launch main.py --model /models/huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct --tasks mbpp --limit 1000 --max_length_generation 4096 --temperature 0 --do_sample False --n_samples 1 --batch_size 1 --allow_code_execution --save_generations --precision fp16

The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 1
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Selected Tasks: ['mbpp']
Loading model in fp16
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.32s/it]
generations were saved at generations_mbpp.json
Evaluating generations...
{
  "mbpp": {
    "pass@1": 0.588
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.0,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "/models/huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "mbpp",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 4096,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": 1000,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": true,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

@SefaZeng

SefaZeng commented Sep 1, 2024

Have you solved this? I evaluated Qwen2 and also got lower results than reported.

@loubnabnl
Collaborator

1- I'm not sure what the evaluation setup was for the DeepSeek paper.
2- HumanEval just uses plain HumanEval prompts, not an instruct format. If you want to use HumanEvalPack you need to use the humanevalsynthesize-python task and specify which prompt format you want, in this case deepseek:

accelerate launch main.py \
  --model <MODEL_NAME> \
  --max_length_generation 2048 \
  --prompt deepseek \
  --tasks humanevalsynthesize-python \
  --do_sample False \
  --n_samples 1 \
  --batch_size 1 \
  --allow_code_execution
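
For reference, the deepseek prompt option roughly wraps each instruction as sketched below. This is a simplified illustration, not the harness's exact code; the wording of the system preamble is an assumption, so check bigcode_eval/tasks/humanevalpack.py for the real template.

# Simplified sketch of what --prompt deepseek roughly produces for each task;
# the system preamble wording here is an approximation, see
# bigcode_eval/tasks/humanevalpack.py for the actual template.
def build_deepseek_prompt(instruction: str, prompt_base: str = "") -> str:
    system = (
        "You are an AI programming assistant, utilizing the DeepSeek Coder model, "
        "developed by DeepSeek Company, and you only answer questions related to computer science."
    )
    return f"{system}\n### Instruction:\n{instruction}\n### Response:\n{prompt_base}"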

3- MBPP is one-shot; we include one test case in the prompt. You can see how the prompt is built here: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/mbpp.py#L48
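
For reference, that one-shot MBPP prompt has roughly this shape (a simplified sketch, not the exact code; see mbpp.py linked above for the actual implementation):

# Simplified sketch of the MBPP prompt shape: the task description plus the
# first assert from the test list, wrapped in a docstring.
def get_mbpp_prompt(doc: dict) -> str:
    description = doc["text"]            # natural-language task description
    test_example = doc["test_list"][0]   # single test case acting as the "one shot"
    return f'"""\n{description}\n{test_example}\n"""\n'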
