-
While I appreciate the value of learning to use GPT-4, I find evaluating via the OpenAI API quite costly for each notebook run. Given that your book is focused on building LLMs from scratch, using APIs (and GPT-4's architecture is undisclosed) might be somewhat out of scope. Regarding Ollama, it seems that the performance issues with Llama 3 may be due to the 8B model size. The 70B model appears to perform better in evaluations, but it requires a lot of RAM. Perhaps a different open-source model (phi-3 at 3.8B or even 7B), and/or one with a "sweet spot" size such as 14B, might yield better results? What do you think about this alternative for an external evaluation LLM?
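For what it's worth, here is a minimal sketch of how a different evaluator model could be swapped in via a local ollama server. It assumes ollama is running on its default port and that the model has been pulled beforehand (e.g., via `ollama pull phi3`); the model tag and the scoring prompt are just illustrative, not from the book:

```python
# Minimal sketch: query a local ollama server with a swappable evaluator model.
# Assumptions: ollama is running on localhost:11434 and the "phi3" tag is pulled.
import json
import urllib.request

def query_ollama(prompt, model="phi3", url="http://localhost:11434/api/generate"):
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        # With stream=False, ollama returns a single JSON object;
        # the generated text is in the "response" field.
        return json.loads(response.read())["response"]

# Hypothetical scoring prompt for illustration only
score = query_ollama(
    "Given the instruction 'Name three primary colors.' and the model "
    "response 'Red, yellow, and blue.', score the response on a scale "
    "from 0 to 100. Respond with the integer number only."
)
print(score)
```

Swapping in a 14B-class model would then just be a matter of changing the `model` argument.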
-
Unfortunately, an evaluator model that does better than GPT-4 does not exist yet. GPT-4 is the gold standard at the moment next to human evaluation (e.g., see AlpacaEval), and both Prometheus and PHUDGE have been trained to approximate GPT-4. Prometheus-2 7B seems to be no better than Llama 3 8B: when it was trained to approximate GPT-4, its correlation with GPT-4 was between 0.64 and 0.89 (see Table 3 in https://arxiv.org/pdf/2405.01535). Based on my assessment here, Llama 3 also has a 0.8 correlation with GPT-4: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/03_model-evaluation/scores/correlation-analysis.ipynb

PHUDGE looks a bit better, but similar to Prometheus-2, it has also been trained to approximate GPT-4, so you won't get anything better than GPT-4 here.

Btw, you also previously mentioned that running the evaluation with the GPT-4 API is too expensive (it costs about 20 cents). And if readers want to run Prometheus-2 or PHUDGE without ollama, they'll need to rent a GPU. This is likely going to cost them at least $1, which is about 5x as much as using the GPT-4 API (btw, the latter comes with $18 worth of free credits).

So, I think that Llama 3 8B via ollama is maybe a nice sweet spot for evaluation, and referencing GPT-4 for the more sophisticated evaluation that most researchers use would probably be best then.
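For anyone who wants to reproduce this kind of agreement check on their own scores, here is a minimal sketch. The two score lists below are made-up placeholders, not the numbers from the linked notebook:

```python
# Minimal sketch: correlation between two evaluators' scores
# for the same set of model responses.
from scipy.stats import pearsonr, spearmanr

gpt4_scores = [85, 60, 92, 40, 75, 55]      # hypothetical GPT-4 scores
llama3_scores = [80, 65, 90, 35, 70, 60]    # hypothetical Llama 3 scores

r_pearson, _ = pearsonr(gpt4_scores, llama3_scores)
r_spearman, _ = spearmanr(gpt4_scores, llama3_scores)
print(f"Pearson r: {r_pearson:.2f}, Spearman rho: {r_spearman:.2f}")
```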
-
Hi readers,
I would like to ask you for your preference, and your feedback would be really appreciated!
Suppose I have some space to include a method for evaluating our LLM via an external LLM (somewhat analogous to AlpacaEval). For this, I was thinking of using either GPT-4 or Llama 3 via Ollama.
Which one would you rather like to see in the main chapter?
I already have examples of both in this repo (under ch07/03_model-evaluation) if you want to check them out.
I'd say there are a few tradeoffs:

- Llama 3 via ollama
- GPT-4 via the ChatGPT API
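For context, here is a minimal sketch of what the GPT-4 option could look like. It assumes the `openai>=1.0` Python package is installed and an `OPENAI_API_KEY` environment variable is set; the scoring prompt is illustrative, not the exact one from the chapter:

```python
# Minimal sketch: score a model response with GPT-4 via the OpenAI API.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical scoring prompt for illustration only
prompt = (
    "Given the instruction 'Name three primary colors.' and the model "
    "response 'Red, yellow, and blue.', score the response on a scale "
    "from 0 to 100. Respond with the integer number only."
)

completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # keep the scoring as deterministic as possible
)
print(completion.choices[0].message.content)
```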