max_length_generation parameter #207

Closed
icoderzqliu opened this issue Mar 21, 2024 · 4 comments

@icoderzqliu

You mentioned in the README that max_length_generation=512 is enough for tasks like HumanEval and MBPP, but when I tested phi-1.5 and deepseek-coder-1.3b-base on the MBPP task, the following error occurred with max_length_generation=512:

ValueError: Input length of input_ids is 512, but `max_length` is set to 512. This can lead to unexpected behavior. You should consider increasing `max_length` or, better yet, setting `max_new_tokens`.

How should this parameter be set so that my results align with the reported ones? Does the value of this parameter have a significant impact on the results?
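For reference, here is a minimal toy sketch of the distinction the error message points at; the model name and prompt below are only placeholders, not what the harness passes internally:

```python
# Toy illustration (not harness code): in transformers, max_length counts the
# prompt *plus* the generated tokens, while max_new_tokens counts only the
# newly generated ones. Model name and prompt are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-coder-1.3b-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def add(a, b):\n    return a + b\n\n" * 60   # deliberately long prompt
inputs = tokenizer(prompt, return_tensors="pt")
print("prompt tokens:", inputs["input_ids"].shape[-1])

# If the prompt alone already reaches max_length, there is no budget left for
# generation and transformers raises the ValueError quoted above:
# model.generate(**inputs, max_length=512)

# max_new_tokens reserves room for new tokens regardless of prompt length:
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:]))
```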

icoderzqliu changed the title from "MAX_LENGTH_GEN parameter" to "max_length_generation parameter" on Mar 21, 2024
@loubnabnl
Collaborator

The default is 512, which works fine for HumanEval, but some tasks need more; try setting it to 1024 for MBPP. Regarding the impact on the results: if the benchmark has long prompts, you want a higher max_length so there is room left for generation, otherwise the solutions won't be complete.

@toptechie156 commented Apr 14, 2024

@loubnabnl I was facing the same issue for multiple-java and multiple-cpp while trying to reproduce the leaderboard score for codellama-7b using the steps given in the leaderboard README here:
https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/leaderboard#2--generation

Is it supposed to be 1024 for multiple-cpp and multiple-java as well?

I was confused because the leaderboard's About section mentions:

All models were evaluated with the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main) with top-p=0.95, temperature=0.2, max_length_generation 512, and n_samples=50.

@loubnabnl
Collaborator

Hi, sorry for the confusion. If this happens, try 1024; some tokenizers produce more tokens than others, which takes more space. I will update the "About" section of the leaderboard.

nikita1503 pushed a commit to nikita1503/bigcode-evaluation-harness that referenced this issue Apr 17, 2024
@loubnabnl
Collaborator

It seems MBPP has a prompt of around 1700 tokens with some tokenizers. After PR #244 you should be able to run the evaluation with a smaller max_length, but you might get lower scores, as the solutions to some long prompts won't be generated.
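If it helps, here is a rough way to check prompt token lengths yourself. This is a sketch, not harness code: it tokenizes only the MBPP task text plus its tests, and the harness's actual prompt format may add more text, so treat the numbers as a lower bound. The model names are just examples.

```python
# Rough check (not harness code): how many tokens the longest MBPP task needs
# under different tokenizers. The harness's real prompts may include extra
# text, so these counts are only a lower bound.
from datasets import load_dataset
from transformers import AutoTokenizer

mbpp = load_dataset("mbpp", split="test")

for name in ["codellama/CodeLlama-7b-hf", "deepseek-ai/deepseek-coder-1.3b-base"]:
    tok = AutoTokenizer.from_pretrained(name)
    longest = max(
        len(tok(ex["text"] + "\n" + "\n".join(ex["test_list"])).input_ids)
        for ex in mbpp
    )
    print(f"{name}: longest prompt ~ {longest} tokens")
```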
