Prompting the 7B Llama model to reply with an integer score only results in subpar evaluations #449
ayooshkathuria started this conversation in General
-
I played around a bit more with this:
Here is the modified generation function.
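(The code attachment from this comment isn't reproduced here. As a rough stand-in, below is a minimal sketch of an Ollama-based scoring helper with a flag to request either a short justification or an integer-only reply. The function names, prompt wording, and the seed/temperature options are illustrative and not the exact code from the chapter 7 notebook.)

```python
# Minimal sketch of a scoring helper against a local Ollama server.
# Assumes Ollama is running at localhost:11434 with the "llama3" model pulled;
# query_model and generate_model_score are illustrative names, not the exact
# functions from ch07.ipynb.
import json
import urllib.request


def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return the whole reply as a single JSON object
        "options": {"seed": 123, "temperature": 0},  # make scoring deterministic
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["message"]["content"]


def generate_model_score(entry, model_response, descriptive=True):
    # Either ask for a brief justification before the score, or force an
    # integer-only reply (the behavior being compared in this discussion).
    prompt = (
        f"Given the input `{entry['instruction']}` "
        f"and the correct output `{entry['output']}`, "
        f"score the model response `{model_response}` "
        f"on a scale from 0 to 100, where 100 is the best score. "
    )
    if descriptive:
        prompt += "Briefly explain your reasoning, then state the final score."
    else:
        prompt += "Respond with the integer number only."
    return query_model(prompt)
```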
In some cases, the responses are ranked correctly when we don't force the model to answer with an integer only, but not otherwise. In other cases, the evaluation is off even when we ask for a descriptive response.
-
Hi Sebastian,
Thanks for writing this book; it has been a great help. However, there is something I would like to bring to your attention.
I was fiddling with the chapter 7 notebook, and while looking at the scores generated by the Llama 3 model via Ollama, I found some very subpar generations ranked fairly decently. This prompted me to run the ch07/01_main-chapter-code/ch07.ipynb notebook with num_epochs=0 in the cell where the training is done, which basically evaluates the pretrained foundation model's performance on the task. I was looking to quantify the improvement that instruction finetuning brings to the model's performance.

The pre-loaded GPT-2 Medium model mostly just repeats something from the prompt. However, even these bogus responses are ranked quite highly by Llama 3, so much so that the untrained model gets a score of 44.38, whereas the score after 2 epochs of training is 48.30.
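For context, those overall numbers are just an average of per-example scores returned by the judge model. A minimal sketch of that averaging loop is below; it assumes a query_model(prompt) helper that returns the Llama 3 judge's text reply and test entries carrying "instruction", "output", and "model_response" fields, with the prompt wording being illustrative rather than the notebook's exact text.

```python
# Sketch of how the 44.38 / 48.30 figures come about: average the judge's
# integer scores over the test set. Assumes a query_model(prompt) helper that
# returns the Llama 3 judge's text reply, and test_data entries with
# "instruction", "output", and "model_response" fields (names illustrative).
def average_judge_score(test_data):
    scores = []
    for entry in test_data:
        prompt = (
            f"Given the input `{entry['instruction']}` "
            f"and the correct output `{entry['output']}`, "
            f"score the model response `{entry['model_response']}` "
            f"on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        reply = query_model(prompt)
        try:
            scores.append(int(reply.strip()))
        except ValueError:
            continue  # judge replied with something other than a bare integer
    return sum(scores) / len(scores)
```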
I've forked the repo and pushed the modified notebook with some examples of such poor decisions here: https://github.com/ayooshkathuria/LLMs-from-scratch/blob/ollama_eval_weirdness/ch07/01_main-chapter-code/ch07.ipynb
I'm pasting one of them in this thread, though you can play around on your own.
Is this expected, or am I missing something about the proper use of LLMs here?
EDIT: I can't run Llama 3 70B on my workstation for now to evaluate this, as there are some memory issues on my end (ollama/ollama#941), so it would be nice if you could check whether using the 70B model alleviates this issue.
System Specs:
Ubuntu 24.04.1 LTS
Ryzen 5950, RTX 3090