-
While I appreciate the value of learning to use GPT-4, I find evaluating via the OpenAI API quite costly for each notebook run. Given that your book is focused on building LLMs from scratch, using APIs (and GPT-4's architecture is undisclosed) might be somewhat out of scope. Regarding Ollama, it seems that the performance issues with Llama 3 may be due to the 8B model size. The 70B model appears to perform better in evaluations, but it requires a lot of RAM. Perhaps a different open-source model (phi-3 at 3.8B or even 7B), and/or one with a "sweet spot" size such as 14B, might yield better results? What do you think about this alternative for an external evaluation LLM?
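For what it's worth, here is a minimal sketch of how a different evaluator model could be swapped in via a local ollama server. It assumes ollama is running on its default port and that the model has been pulled beforehand (e.g., via `ollama pull phi3`); the model tag and the scoring prompt are just illustrative, not from the book:

```python
# Minimal sketch: query a local ollama server with a swappable evaluator model.
# Assumptions: ollama is running on localhost:11434 and the "phi3" tag is pulled.
import json
import urllib.request

def query_ollama(prompt, model="phi3", url="http://localhost:11434/api/generate"):
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        # With stream=False, ollama returns a single JSON object;
        # the generated text is in the "response" field.
        return json.loads(response.read())["response"]

# Hypothetical scoring prompt for illustration only
score = query_ollama(
    "Given the instruction 'Name three primary colors.' and the model "
    "response 'Red, yellow, and blue.', score the response on a scale "
    "from 0 to 100. Respond with the integer number only."
)
print(score)
```

Swapping in a 14B-class model would then just be a matter of changing the `model` argument.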
-
Unfortunately, an evaluator model that does better than GPT-4 does not exist yet. GPT-4 is the gold standard at the moment next to human evaluation (e.g., see AlpacaEval), and both Prometheus and PHUDGE have been trained to approximate GPT-4. Prometheus-2 7B seems to be no better than Llama 3 8B: when it was trained to approximate GPT-4, its correlation with GPT-4 was between 0.64 and 0.89 (see Table 3 in https://arxiv.org/pdf/2405.01535). Based on my assessment here, Llama 3 also has a 0.8 correlation with GPT-4: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/03_model-evaluation/scores/correlation-analysis.ipynb

PHUDGE looks a bit better, but similar to Prometheus-2, it has also been trained to approximate GPT-4, so you won't get anything better than GPT-4 here.

Btw, you also previously mentioned that running the evaluation with the GPT-4 API is too expensive (it costs about 20 cents). And if readers want to run Prometheus-2 or PHUDGE without ollama, they'll need to rent a GPU. This is likely going to cost them at least $1, which is about 5x as much as using the GPT-4 API (btw, the latter comes with $18 worth of free credits).

So, I think that Llama 3 8B via ollama is maybe a nice sweet spot for evaluation, and referencing GPT-4 for the more sophisticated evaluation that most researchers use would probably be best then.
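For anyone who wants to reproduce this kind of agreement check on their own scores, here is a minimal sketch. The two score lists below are made-up placeholders, not the numbers from the linked notebook:

```python
# Minimal sketch: correlation between two evaluators' scores
# for the same set of model responses.
from scipy.stats import pearsonr, spearmanr

gpt4_scores = [85, 60, 92, 40, 75, 55]      # hypothetical GPT-4 scores
llama3_scores = [80, 65, 90, 35, 70, 60]    # hypothetical Llama 3 scores

r_pearson, _ = pearsonr(gpt4_scores, llama3_scores)
r_spearman, _ = spearmanr(gpt4_scores, llama3_scores)
print(f"Pearson r: {r_pearson:.2f}, Spearman rho: {r_spearman:.2f}")
```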
-
Hi readers,
I would like to ask you for your preference, and your feedback would be really appreciated!
Suppose I have some space to include a method for evaluating our LLM via an external LLM (somewhat analogous to AlpacaEval). For this, I was thinking of using either GPT-4 or Llama 3 via Ollama.
Which one would you rather like to see in the main chapter?
I already have examples of both in this repo (under ch07/03_model-evaluation) if you want to check them out.
I'd say there are a few tradeoffs:

- Llama 3 via ollama
- GPT-4 via the ChatGPT API
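For context, here is a minimal sketch of what the GPT-4 option could look like. It assumes the `openai>=1.0` Python package is installed and an `OPENAI_API_KEY` environment variable is set; the scoring prompt is illustrative, not the exact one from the chapter:

```python
# Minimal sketch: score a model response with GPT-4 via the OpenAI API.
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical scoring prompt for illustration only
prompt = (
    "Given the instruction 'Name three primary colors.' and the model "
    "response 'Red, yellow, and blue.', score the response on a scale "
    "from 0 to 100. Respond with the integer number only."
)

completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # keep the scoring as deterministic as possible
)
print(completion.choices[0].message.content)
```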