Prompting the 7B Llama model to reply with an integer score only results in subpar evaluations #449
ayooshkathuria started this conversation in General
-
I played around a bit more with this:
Here is the modified generation function.
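(The code attachment from this comment isn't reproduced here. As a rough stand-in, below is a minimal sketch of an Ollama-based scoring helper with a flag to request either a short justification or an integer-only reply. The function names, prompt wording, and the seed/temperature options are illustrative and not the exact code from the chapter 7 notebook.)

```python
# Minimal sketch of a scoring helper against a local Ollama server.
# Assumes Ollama is running at localhost:11434 with the "llama3" model pulled;
# query_model and generate_model_score are illustrative names, not the exact
# functions from ch07.ipynb.
import json
import urllib.request


def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return the whole reply as a single JSON object
        "options": {"seed": 123, "temperature": 0},  # make scoring deterministic
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["message"]["content"]


def generate_model_score(entry, model_response, descriptive=True):
    # Either ask for a brief justification before the score, or force an
    # integer-only reply (the behavior being compared in this discussion).
    prompt = (
        f"Given the input `{entry['instruction']}` "
        f"and the correct output `{entry['output']}`, "
        f"score the model response `{model_response}` "
        f"on a scale from 0 to 100, where 100 is the best score. "
    )
    if descriptive:
        prompt += "Briefly explain your reasoning, then state the final score."
    else:
        prompt += "Respond with the integer number only."
    return query_model(prompt)
```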
In some cases, the responses are ranked correctly when we don't force the model to answer with an integer only, but not otherwise. In other cases, the evaluation is off even when we ask for a descriptive response.
-
Hi Sebastian,
Thanks for writing this book; it has been a great help. However, there is something I would like to bring to your attention.
I was fiddling with the chapter 7 notebook, and while looking at the scores generated by the Llama 3 model via Ollama, I found some very subpar generations ranked fairly decently. This prompted me to run the ch07/01_main-chapter-code/ch07.ipynb notebook with num_epochs=0 in the cell where the training is done, which basically evaluates the pretrained foundation model's performance on the task. I was looking to quantify the improvement that instruction finetuning brings to the model's performance.

The pre-loaded GPT-2 Medium model mostly just repeats something from the prompt. However, even these bogus responses are ranked quite highly by Llama 3, so much so that the untrained model gets a score of 44.38, whereas the score after 2 epochs of training is 48.30.
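For context, those overall numbers are just an average of per-example scores returned by the judge model. A minimal sketch of that averaging loop is below; it assumes a query_model(prompt) helper that returns the Llama 3 judge's text reply and test entries carrying "instruction", "output", and "model_response" fields, with the prompt wording being illustrative rather than the notebook's exact text.

```python
# Sketch of how the 44.38 / 48.30 figures come about: average the judge's
# integer scores over the test set. Assumes a query_model(prompt) helper that
# returns the Llama 3 judge's text reply, and test_data entries with
# "instruction", "output", and "model_response" fields (names illustrative).
def average_judge_score(test_data):
    scores = []
    for entry in test_data:
        prompt = (
            f"Given the input `{entry['instruction']}` "
            f"and the correct output `{entry['output']}`, "
            f"score the model response `{entry['model_response']}` "
            f"on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        reply = query_model(prompt)
        try:
            scores.append(int(reply.strip()))
        except ValueError:
            continue  # judge replied with something other than a bare integer
    return sum(scores) / len(scores)
```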
I've forked the repo and pushed the modified notebook with some examples of such poor decisions here: https://github.com/ayooshkathuria/LLMs-from-scratch/blob/ollama_eval_weirdness/ch07/01_main-chapter-code/ch07.ipynb
I'm pasting one of them in this thread, though you can play around on your own.
Is this expected, or am I missing something about the proper use of LLMs here?
EDIT: I can't run Llama 3 70B on my workstation for now to evaluate this, as there are some memory issues on my end (ollama/ollama#941), so it would be nice if you could check whether using the 70B model alleviates this issue.
System Specs:
Ubuntu 24.04.1 LTS
Ryzen 5950, RTX 3090