Hello, thanks for your great work. However, here are three questions that confuse me.
I ran the HumanEval and MBPP evaluations on deepseek-coder-6.7b-instruct.
Official report at https://deepseekcoder.github.io/:
HumanEval: 78.6
MBPP: 65.4
My pass@1 results with bigcode:
HumanEval: 71.95
MBPP: 58.8
Each score from my own run is lower than the official report. Do you have any idea why?
As you know, the deepseek instruct model has a specific prompt format, which you have already implemented in the HumanEvalPack class in humanevalpack.py. When I run bigcode's evaluation (humaneval or mbpp), do I need to set any specific parameter to tell the program to wrap each question in the deepseek format? Or does bigcode automatically detect that it should use the deepseek format (and if so, how)?
In deepseek's paper they mention running HumanEval zero-shot and MBPP few-shot. I also noticed that the official MBPP paper uses three-shot. So which strategy does bigcode use: one-shot, three-shot, or something else?
Here is my script for running MBPP (for HumanEval, only --tasks is changed to humaneval):
accelerate launch main.py --model /models/huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct --tasks mbpp --limit 1000 --max_length_generation 4096 --temperature 0 --do_sample False --n_samples 1 --batch_size 1 --allow_code_execution --save_generations --precision fp16
The following values were not passed to accelerate launch and had defaults used instead:
    --num_processes was set to a value of 1
    --num_machines was set to a value of 1
    --mixed_precision was set to a value of 'no'
    --dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Selected Tasks: ['mbpp']
Loading model in fp16
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:04<00:00, 2.32s/it]
generations were saved at generations_mbpp.json
Evaluating generations...
{
  "mbpp": {
    "pass@1": 0.588
  },
  "config": {
    "prefix": "",
    "do_sample": false,
    "temperature": 0.0,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "/models/huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "mbpp",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 4096,
    "precision": "fp16",
    "load_in_8bit": false,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": 1000,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": true,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
1- I'm not sure what the evaluation setup was for the DeepSeek paper.
2- HumanEval just uses plain HumanEval prompts, not an instruct format. If you want to use HumanEvalPack you need to use the humanevalsynthesize-python task and specify which prompt format you want, in this case deepseek:
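For example, something along these lines (a sketch that just carries over the settings from your MBPP command and swaps the task; adjust the generation settings as you prefer):

accelerate launch main.py \
  --model /models/huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct \
  --tasks humanevalsynthesize-python \
  --prompt deepseek \
  --max_length_generation 4096 \
  --temperature 0 \
  --do_sample False \
  --n_samples 1 \
  --batch_size 1 \
  --allow_code_execution \
  --save_generations \
  --precision fp16

The --prompt deepseek flag is what tells the HumanEvalPack task to wrap each problem in the DeepSeek instruction template from humanevalpack.py instead of using the plain completion-style prompt.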