
HumanEval and MBPP results of deepseek-coder-6.7b-instruct are lower than the official report from the DeepSeek team #262

Open
jessyford opened this issue Aug 9, 2024 · 2 comments

@jessyford

Hello, thanks for your great work. However, here are three questions that confuse me.

  1. I ran the HumanEval and MBPP evaluation on deepseek-coder-6.7b-instruct.
    Official report at https://deepseekcoder.github.io/:
    HumanEval: 78.6
    MBPP: 65.4
    My pass@1 results with bigcode:
    HumanEval: 71.95
    MBPP: 58.8

Each score in my own run is lower than the official report. Do you have any idea why?

  2. As you know, the DeepSeek instruct model has a specific prompt format, which you have already implemented in the HumanEvalPack class in humanevalpack.py. I am wondering: when I run bigcode's evaluation (humaneval or mbpp), do I need to set any specific parameter to tell the program to wrap the question in the DeepSeek format? Or does bigcode automatically detect that it should use the DeepSeek format? (If so, how?)

  3. In DeepSeek's paper they mention that they run HumanEval zero-shot and MBPP few-shot. I also noticed that the official MBPP paper uses three-shot. So I am wondering which strategy bigcode uses: one-shot, three-shot, or something else?

Here is my script for running mbpp (for humaneval, only the --tasks value is changed to humaneval):

accelerate launch main.py --model /models/huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct --tasks mbpp --limit 1000 --max_length_generation 4096 --temperature 0 --do_sample False --n_samples 1 --batch_size 1 --allow_code_execution --save_generations --precision fp16

The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 1
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Selected Tasks: ['mbpp']
Loading model in fp16
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.32s/it]
generations were saved at generations_mbpp.json
Evaluating generations...
{
  "mbpp": {
    "pass@1": 0.588
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.0,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "/models/huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "mbpp",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 4096,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": 1000,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": true,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

@SefaZeng

SefaZeng commented Sep 1, 2024

Have you solved this? I evaluated Qwen2 and also got lower results than reported.

@loubnabnl
Collaborator

1- I'm not sure what the evaluation setup was for the DeepSeek paper.
2- HumanEval just uses plain HumanEval prompts, not an instruct format. If you want to use HumanEvalPack you need to use the humanevalsynthesize-python task and specify which prompt format you want, in this case deepseek:

accelerate launch main.py \
  --model <MODEL_NAME> \
  --max_length_generation 2048 \
  --prompt deepseek \
  --tasks humanevalsynthesize-python \
  --do_sample False \
  --n_samples 1 \
  --batch_size 1 \
  --allow_code_execution
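
For reference, the deepseek prompt option roughly wraps each instruction as sketched below. This is a simplified illustration, not the harness's exact code; the wording of the system preamble is an assumption, so check bigcode_eval/tasks/humanevalpack.py for the real template.

# Simplified sketch of what --prompt deepseek roughly produces for each task;
# the system preamble wording here is an approximation, see
# bigcode_eval/tasks/humanevalpack.py for the actual template.
def build_deepseek_prompt(instruction: str, prompt_base: str = "") -> str:
    system = (
        "You are an AI programming assistant, utilizing the DeepSeek Coder model, "
        "developed by DeepSeek Company, and you only answer questions related to computer science."
    )
    return f"{system}\n### Instruction:\n{instruction}\n### Response:\n{prompt_base}"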

3- MBPP is one-shot; we include one test case in the prompt. You can see how the prompt is built here: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/mbpp.py#L48
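
For reference, that one-shot MBPP prompt has roughly this shape (a simplified sketch, not the exact code; see mbpp.py linked above for the actual implementation):

# Simplified sketch of the MBPP prompt shape: the task description plus the
# first assert from the test list, wrapped in a docstring.
def get_mbpp_prompt(doc: dict) -> str:
    description = doc["text"]            # natural-language task description
    test_example = doc["test_list"][0]   # single test case acting as the "one shot"
    return f'"""\n{description}\n{test_example}\n"""\n'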
