diff --git a/README.md b/README.md
index 97ebbc9d..803cea6c 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,8 @@ AlpacaEval provides the following:
 - [**Automatic evaluator**](#evaluators): an automatic evaluator that has high agreement with humans (validated on 20K
   annotations). We evaluate a model by
- measuring the fraction of times an powerful LLM (e.g. GPT 4 or Claude) prefers the outputs from that model over
+ measuring the fraction of times a powerful LLM (e.g., GPT-4, Claude, or ChatGPT) prefers the outputs from that model
+ over
  outputs from a reference model. Our evaluators enable caching and output randomization by default.
 - [**Leaderboard**](https://tatsu-lab.github.io/alpaca_eval/): a leaderboard of common models on the AlpacaEval
   evaluation set.
@@ -67,6 +68,7 @@ Details in [limitations](#limitations).
 - [Data Release](#data-release)
 - [Differences with AlpacaFarm](#differences-with-alpacafarm)
 - [Related work](#related-work)
+ - [Major updates](#major-updates)
@@ -97,11 +99,12 @@ Important parameters are the following:
 - **model_outputs** : A path to a json file for the outputs of the model to add to the leaderboard. Each dictionary
   should contain the keys `instruction` and `output`.
-- **annotators_config**: This is the annotator to use (e.g., `alpaca_eval_gpt4` or `claude`). `alpaca_eval_gpt4` (
+- **annotators_config**: This is the annotator to use (e.g., `alpaca_eval_gpt4` or `claude`
+ or `chatgpt_fn`). `alpaca_eval_gpt4` (
  default) has the
- highest agreement rate with our human annotation data. `claude` has a decent agreement and is free for academics. For
- a comparison of
- annotators see [here](#evaluators).
+ highest agreement rate with our human annotation data. `claude` has a decent agreement and is free for
+ academics. `chatgpt_fn` is the worst of the three, but is available to everyone, cheap, and has a 2x larger context
+ window (16K tokens). For a comparison of annotators see [here](#evaluators).
 - **reference_outputs**: The outputs of the reference model. Same format as `model_outputs`. By default, this
   is `text-davinci003` outputs on AlpacaEval dataset.
@@ -145,8 +148,9 @@ For more information about each function use `alpaca_eval -- --help`.
 ## Models
 Our leaderboards are computed on the [AlpacaEval dataset](https://huggingface.co/datasets/tatsu-lab/alpaca_eval).
-We precomputed the leaderboard for important models both using `gpt4` (best quality) and `claude` (free for academics,
-and high quality). Our full leaderboards can be found at [on this page](https://tatsu-lab.github.io/alpaca_eval/), but
+We precomputed the leaderboard for important models using `alpaca_eval_gpt4` (best quality), `claude` (free for
+academics and high quality), and `chatgpt_fn` (cheap and available for everyone). Our full leaderboards can be found
+[on this page](https://tatsu-lab.github.io/alpaca_eval/), but
 we give minimal leaderboards below.
 Later we also show how to [add your model](https://github.com/tatsu-lab/alpaca_eval#evaluating-a-model) to the
 leaderboard and how to make
@@ -241,6 +245,26 @@ Details in [Related work](#related-work).
+
+ chatgpt_fn minimal leaderboard + +| | Win Rate | Std Err. | +|:----------------------|---------:|---------:| +| gpt4 | 73.8 | 1.5 | +| claude | 70.4 | 1.6 | +| chatgpt | 66.1 | 1.7 | +| wizardlm-13b | 65.2 | 1.7 | +| vicuna-13b | 64.1 | 1.7 | +| guanaco-65b | 62.4 | 1.7 | +| oasst-rlhf-llama-33b | 62.0 | 1.7 | +| alpaca-farm-ppo-human | 60.2 | 1.7 | +| falcon-40b-instruct | 56.5 | 1.7 | +| text_davinci_003 | 50.0 | 0.0 | +| alpaca-7b | 45.2 | 1.7 | +| text_davinci_001 | 28.1 | 1.6 | + +
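As a usage sketch for the `model_outputs` and `annotators_config` parameters described above: the snippet below writes a tiny outputs file and invokes the CLI on it. It assumes the `alpaca_eval` CLI is installed (`pip install alpaca-eval`) and that an OpenAI key is set in the environment; the file name, the two toy records, and the instance limit are illustrative only.

```python
import json
import subprocess

# Toy model outputs: each record needs the keys `instruction` and `output`,
# as described for `model_outputs` above. These records are illustrative
# placeholders, not real model generations.
model_outputs = [
    {"instruction": "Name one use of a paperclip.", "output": "Holding sheets of paper together."},
    {"instruction": "Give a synonym for 'fast'.", "output": "Quick."},
]

with open("example_outputs.json", "w") as f:
    json.dump(model_outputs, f, indent=2)

# Run the evaluator on the file. `chatgpt_fn` is the cheap, widely available
# annotator added in this change; `alpaca_eval_gpt4` has the highest human
# agreement. `--max_instances` keeps the run small while testing.
subprocess.run(
    [
        "alpaca_eval",
        "--model_outputs", "example_outputs.json",
        "--annotators_config", "chatgpt_fn",
        "--max_instances", "2",
    ],
    check=True,
)
```

The same flags can of course be passed directly on the command line; wrapping them in `subprocess` just keeps the outputs file and the call in one place.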
+ ## Evaluators We evaluate different automatic annotators on the AlpacaEval set by comparing to @@ -250,7 +274,7 @@ Below we show metrics for our suggested evaluator (`alpaca_eval_gpt4`), for prio automatic evaluators ([`alpaca_farm_greedy_gpt4`](https://github.com/tatsu-lab/alpaca_farm),[`aviary_gpt4`](https://aviary.anyscale.com/),[`lmsys_gpt4`](https://chat.lmsys.org/)), for humans (`humans`), and for different base models with essentially the same -prompt (`gpt4`,`claude`,`text_davinci_003`,`guanaco_33b`, `chatgpt`). +prompt (`gpt4`,`claude`,`text_davinci_003`,`chatgpt_fn`,`guanaco_33b`, `chatgpt`). See [here](https://github.com/tatsu-lab/alpaca_eval/tree/main/src/alpaca_eval/evaluators_configs) for the configs of all evaluators that are available out of the box and their associated metrics. @@ -260,11 +284,11 @@ evaluators that are available out of the box and their associated metrics. | aviary_gpt4 | 69.1 | 12.8 | 1869 | 29.5 | 13.1 | 0.70 | | gpt4 | 66.9 | 12.5 | 1037 | 31.5 | 14.6 | 0.65 | | alpaca_farm_greedy_gpt4 | 66.4 | 15.3 | 878 | 30.2 | 19.3 | 0.60 | -| humans | 65.7 | 300.0 | 36800 | 0.0 | | 0.64 | +| humans | 65.7 | 300.0 | 36800 | 0.0 | 34.3 | 0.64 | | claude | 65.5 | 11.1 | 173 | 31.9 | 18.0 | 0.62 | | text_davinci_003 | 64.1 | 8.7 | 121 | 33.8 | 22.7 | 0.70 | | lmsys_gpt4 | 63.2 | 13.9 | 17982 | 34.7 | 16.1 | 0.74 | -| guanaco_33b | 59.1 | | 930 | 54.5 | 27.1 | 0.70 | +| chatgpt_fn | 60.0 | 1.0 | 530 | 36.9 | 27.7 | 0.62 | | chatgpt | 57.2 | 0.8 | 285 | 39.4 | 34.1 | 0.59 |
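The "Human agreement [%]" column above is, roughly, how often an automatic annotator picks the same output as the majority of the human annotators for that instruction. A minimal sketch of that idea follows; it is a simplification of the actual protocol (which uses held-out human annotations and multiple seeds), so treat it as illustration rather than the repository's implementation.

```python
from collections import Counter

def human_agreement(auto_prefs, human_prefs_per_example):
    """Percentage of examples where the automatic preference matches the human majority.

    Preferences are encoded as 1 (Output (a)) or 2 (Output (b)).
    """
    matches = 0
    for auto, humans in zip(auto_prefs, human_prefs_per_example):
        majority, _ = Counter(humans).most_common(1)[0]
        matches += int(auto == majority)
    return 100 * matches / len(auto_prefs)

# Toy data: 3 instructions, each with 4 human annotations.
auto = [1, 2, 1]
humans = [[1, 1, 2, 1], [2, 2, 1, 2], [2, 1, 2, 2]]
print(round(human_agreement(auto, humans), 1))  # 66.7
```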
@@ -360,8 +384,9 @@ due to resource (time and price) constraints. This explains why the #parsed is 648.
Tips for choosing evaluators -Overall we recommend using `annotators_config=alpaca_eval_gpt4` if you want the highest agreement with humans, and -`annotators_config=claude` if you have academic (free) access to Claude and have a low budget. +Overall we recommend using `annotators_config=alpaca_eval_gpt4` if you want the highest agreement with humans, +`annotators_config=claude` if you have academic (free) access to Claude and have a low budget, and +`annotators_config=chatgpt_fn` if you don't have access to the other two models. When choosing an annotator we recommend you to consider the following (the first three are obvious): @@ -434,7 +459,7 @@ Details in [limitations](#limitations). [//]: # () -[//]: # ( key) `alpaca_eval --model_outputs 'example/outputs.json' --annotators_config 'text_davinci_003' --max_instances 3 --caching_path None`) +[//]: # ( key) `alpaca_eval --model_outputs 'example/outputs.json' --annotators_config 'text_davinci_003' ~~--max_instances 3~~ --caching_path None`) [//]: # () @@ -611,7 +636,8 @@ directly use `alpaca_eval evaluate_from_model` to also take care of generating o want to use a different model or a different dataset follow the same steps as (1.). 3. Choose an evaluator specified via `annotators_config`. We recommend using `alpaca_eval_gpt4` or `claude` (if you are an - academic). For options and comparisons see [this table](#evaluators). Depending on the evaluator you might need to + academic) or `chatgpt_fn` (if you don't have access to the other two). For options and comparisons + see [this table](#evaluators). Depending on the evaluator you might need to set the appropriate API_KEY in your environment or [here](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/constants.py#L7). @@ -1024,9 +1050,9 @@ downloading [alpaca_eval_all_outputs.json](https://huggingface.co/datasets/tatsu ```bash alpaca_eval make_leaderboard \ - --leaderboard_path \ + --leaderboard_path src/alpaca_eval/leaderboards/data_AlpacaEval/_leaderboard.csv \ --all_model_outputs alpaca_eval_all_outputs.json \ - --annotators_config + --annotators_config ``` Then, please create a PR with the annotator config and leaderboard csv. @@ -1249,3 +1275,15 @@ For example: annotators favor style (e.g. use of list, tone, word choice, length) over factuality.
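For the `make_leaderboard` command quoted above, here is a sketch of a complete invocation. The leaderboard path and annotator name are hypothetical placeholders (substitute your own), and `alpaca_eval_all_outputs.json` is the file downloaded in the step described above.

```python
import subprocess

# Hypothetical placeholder values -- substitute your own.
leaderboard_csv = "my_leaderboard.csv"    # where the resulting leaderboard CSV is written
annotators_config = "chatgpt_fn"          # any config under src/alpaca_eval/evaluators_configs/

subprocess.run(
    [
        "alpaca_eval", "make_leaderboard",
        "--leaderboard_path", leaderboard_csv,
        "--all_model_outputs", "alpaca_eval_all_outputs.json",
        "--annotators_config", annotators_config,
    ],
    check=True,
)
```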
+
+
+# Major updates
+
+- 19th June 2023: add a `chatgpt_fn` leaderboard that anyone can use (no waiting lists).
+- 19th June 2023: add support
+  for [OpenAI's function calling](https://openai.com/blog/function-calling-and-other-api-updates).
+  Examples: [`chatgpt_fn`](https://github.com/tatsu-lab/alpaca_eval/tree/main/src/alpaca_eval/evaluators_configs/chatgpt_fn)
+  and [`alpaca_eval_gpt4_fn`](https://github.com/tatsu-lab/alpaca_eval/tree/main/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4_fn).
+
\ No newline at end of file diff --git a/docs/index.html b/docs/index.html index 07f2b5a5..68ee6333 100644 --- a/docs/index.html +++ b/docs/index.html @@ -8,7 +8,11 @@ + gpt4Radio.addEventListener('click', function () { + currentUrl = urls['gpt4']; + updateTable(currentUrl); + }); + + claudeRadio.addEventListener('click', function () { + currentUrl = urls['claude']; + updateTable(currentUrl); + }); + + // chatgptRadio.addEventListener('click', function () { + // currentUrl = urls['chatgpt']; + // updateTable(currentUrl); + // }); + + communityRadio.addEventListener('click', function () { + updateTable(currentUrl); + }); + + verifiedRadio.addEventListener('click', function () { + updateTable(currentUrl); + }); + + minimalRadio.addEventListener('click', function () { + updateTable(currentUrl); + }); + diff --git a/src/alpaca_eval/evaluators_configs/README.md b/src/alpaca_eval/evaluators_configs/README.md index 2479ceb2..f1d884d1 100644 --- a/src/alpaca_eval/evaluators_configs/README.md +++ b/src/alpaca_eval/evaluators_configs/README.md @@ -7,33 +7,32 @@ annotators. We compute those metrics on our suggested evaluator `alpaca_eval_gpt4`, on prior evaluators (`aviary_gpt4`, `lmsys_gpt4`, `alpaca_farm_greedy_gpt4`), and on different base models with which we use essentially the same prompt (`gpt4`, `text_davinci_003`, `claude`, `chatgpt`). +We also provide partial metrics (only 1 seed) for other evaluators, which include our evaluator using OpenAI's function +calls (`alpaca_eval_gpt4_fn`), prior work that we +improved (`improved_aviary_gpt4` and `improved_lmsys_gpt4`), prior work that was not meant to be used as a final +evaluator (`guanaco_33b`), and a ranking evaluator (`alpaca_farm`), and secondary models that use the same prompt as the +models above (`cohere`, `guanaco_33b`): | | Human agreement [%] | Price [$/1000 examples] | Time [seconds/1000 examples] | Bias | Variance | Proba. prefer longer | Proba. prefer lists | Proba. 
prefer 1 | # parsed | mode | |:------------------------|--------------------:|------------------------:|-----------------------------:|-----:|---------:|---------------------:|--------------------:|----------------:|---------:|:---------| +| alpaca_eval_gpt4_fn | 71.0 | 14.5 | 5046 | 27.6 | 11.1 | 0.75 | 0.63 | 0.48 | 2592 | verified | +| improved_aviary_gpt4 | 69.8 | 12.8 | 1831 | | | 0.73 | 0.68 | 0.49 | 648 | verified | | alpaca_eval_gpt4 | 69.2 | 13.6 | 1455 | 28.4 | 14.6 | 0.68 | 0.69 | 0.50 | 2592 | minimal | | aviary_gpt4 | 69.1 | 12.8 | 1869 | 29.5 | 13.1 | 0.70 | 0.65 | 0.53 | 2592 | minimal | +| claude_ranking | 67.6 | 5.0 | 218 | | | 0.73 | 0.63 | 0.46 | 648 | verified | | gpt4 | 66.9 | 12.5 | 1037 | 31.5 | 14.6 | 0.65 | 0.61 | 0.54 | 2592 | minimal | | alpaca_farm_greedy_gpt4 | 66.4 | 15.3 | 878 | 30.2 | 19.3 | 0.60 | 0.59 | 0.54 | 2592 | minimal | -| humans | 65.7 | 300.0 | 36800 | 0.0 | | 0.64 | 0.61 | 0.52 | 2592 | minimal | +| humans | 65.7 | 300.0 | 36800 | 0.0 | 34.3 | 0.64 | 0.61 | 0.52 | 2592 | minimal | | claude | 65.5 | 11.1 | 173 | 31.9 | 18.0 | 0.62 | 0.58 | 0.49 | 2592 | minimal | | text_davinci_003 | 64.1 | 8.7 | 121 | 33.8 | 22.7 | 0.70 | 0.64 | 0.47 | 2592 | minimal | | lmsys_gpt4 | 63.2 | 13.9 | 17982 | 34.7 | 16.1 | 0.74 | 0.64 | 0.56 | 2592 | minimal | +| guanaco_33b | 62.7 | | 911 | | | 0.70 | 0.72 | 0.43 | 451 | verified | +| improved_lmsys_gpt4 | 62.3 | 13.9 | 5398 | | | 0.75 | 0.67 | 0.51 | 648 | verified | | longest | 62.2 | 0.0 | 0 | 37.8 | 0.0 | 1.00 | 0.85 | 0.42 | 2592 | verified | +| alpaca_farm | 60.0 | 11.5 | 820 | | | 0.60 | 0.63 | 0.52 | 648 | verified | +| chatgpt_fn | 60.0 | 1.0 | 530 | 36.9 | 27.7 | 0.62 | 0.65 | 0.49 | 2592 | minimal | | chatgpt | 57.2 | 0.8 | 285 | 39.4 | 34.1 | 0.59 | 0.56 | 0.49 | 2589 | minimal | - -We also provide partial metrics (only 1 seed) for the following evaluators, which include prior work that we -improved (`improved_aviary_gpt4` and `improved_lmsys_gpt4`), prior work that was not meant to be used as a final -evaluator (`guanaco_33b`), and a ranking evaluator (`alpaca_farm`), and secondary models that use the same prompt as the -models above (`cohere`, `guanaco_33b`): - -| | Human agreement [%] | Price [$/1000 examples] | Time [seconds/1000 examples] | Bias | Variance | Proba. prefer longer | Proba. prefer lists | Proba. prefer 1 | # parsed | mode | -|:---------------------|--------------------:|------------------------:|-----------------------------:|-----:|---------:|---------------------:|--------------------:|----------------:|---------:|:---------| -| improved_aviary_gpt4 | 69.8 | 12.8 | 1831 | | | 0.73 | 0.68 | 0.49 | 648 | verified | -| claude_ranking | 67.6 | 5.0 | 218 | | | 0.73 | 0.63 | 0.46 | 648 | verified | -| guanaco_33b | 62.7 | | 911 | | | 0.70 | 0.72 | 0.43 | 451 | verified | -| improved_lmsys_gpt4 | 62.3 | 13.9 | 5398 | | | 0.75 | 0.67 | 0.51 | 648 | verified | -| alpaca_farm | 60.0 | 11.5 | 820 | | | 0.60 | 0.63 | 0.52 | 648 | verified | -| cohere | 53.4 | 3.5 | 217 | | | 0.50 | 0.51 | 0.47 | 648 | verified | +| cohere | 53.4 | 3.5 | 217 | | | 0.50 | 0.51 | 0.47 | 648 | verified | [//]: # (| | Human agreement [%] | Price [$/1000 examples] | Time [seconds/1000 examples] | Bias | Variance | Proba. prefer longer | Proba. prefer lists | Proba. 
prefer 1 | # parsed | mode |) diff --git a/src/alpaca_eval/evaluators_configs/chatgpt_fn/basic_function_prompt.txt b/src/alpaca_eval/evaluators_configs/chatgpt_fn/basic_function_prompt.txt new file mode 100644 index 00000000..f097efe4 --- /dev/null +++ b/src/alpaca_eval/evaluators_configs/chatgpt_fn/basic_function_prompt.txt @@ -0,0 +1,35 @@ +<|im_start|>system +You are a helpful instruction-following assistant that prints the best model by selecting the best outputs for a given instruction. +<|im_end|> +<|im_start|>user +Select the output (a) or (b) that best matches the given instruction. Choose your preferred output, which can be subjective. Your answer should ONLY contain: Output (a) or Output (b). Here's an example: + +# Example: +## Instruction: +Give a description of the following job: "ophthalmologist" + +## Output (a): +An ophthalmologist is a medical doctor who specializes in the diagnosis and treatment of eye diseases and conditions. + +## Output (b): +An ophthalmologist is a medical doctor who pokes and prods at your eyes while asking you to read letters from a chart. + +## Which is best, Output (a) or Output (b)? +Output (a) + +Here the answer is Output (a) because it provides a comprehensive and accurate description of the job of an ophthalmologist. In contrast, output (b) is more of a joke. + +# Task: +Now is the real task, do not explain your answer, just say Output (a) or Output (b). + +## Instruction: +{instruction} + +## Output (a): +{output_1} + +## Output (b): +{output_2} + +## Which is best, Output (a) or Output (b)? +<|im_end|> \ No newline at end of file diff --git a/src/alpaca_eval/evaluators_configs/chatgpt_fn/configs.yaml b/src/alpaca_eval/evaluators_configs/chatgpt_fn/configs.yaml new file mode 100644 index 00000000..b857a327 --- /dev/null +++ b/src/alpaca_eval/evaluators_configs/chatgpt_fn/configs.yaml @@ -0,0 +1,24 @@ +chatgpt_fn: + prompt_template: "chatgpt_fn/basic_function_prompt.txt" + fn_completions: "openai_completions" + completions_kwargs: + model_name: "gpt-3.5-turbo-16k-0613" + max_tokens: 50 + temperature: 0 + function_call: + name: "print_best_model" + functions: + - name: "print_best_model" + description: "Print the best model given the preferred output." + parameters: + type: "object" + properties: + best_output: + type: "string" + description: "Name of the best output, should be 'Output (a)' or 'Output (b)'" + "required": [ "best_output" ] + completion_parser_kwargs: + outputs_to_match: + 1: '(?i)output \(a\)' + 2: '(?i)output \(b\)' + batch_size: 1 diff --git a/src/alpaca_eval/leaderboards/evaluators/evaluators_leaderboard.csv b/src/alpaca_eval/leaderboards/evaluators/evaluators_leaderboard.csv index f3b42aa7..5e20dcca 100644 --- a/src/alpaca_eval/leaderboards/evaluators/evaluators_leaderboard.csv +++ b/src/alpaca_eval/leaderboards/evaluators/evaluators_leaderboard.csv @@ -1,4 +1,5 @@ ,Human agreement [%],Price [$/1000 examples],Time [seconds/1000 examples],Bias,Variance,Proba. prefer longer,Proba. prefer lists,Proba. 
prefer 1,# parsed,mode +alpaca_eval_gpt4_fn,70.98765432098766,14.471944444444444,5046.056233910331,27.623456790123456,11.111111111111104,0.750561797752809,0.6339285714285714,0.4799382716049383,2592,verified improved_aviary_gpt4,69.75308641975309,12.781435185185186,1831.2850013,,,0.7280898876404495,0.6785714285714286,0.4861111111111111,648,verified alpaca_eval_gpt4,69.1743827160494,13.601944444444444,1455.4169713998845,28.395061728395067,14.621913580246916,0.6831460674157304,0.6875,0.5011574074074074,2592,minimal aviary_gpt4,69.05864197530865,12.781666666666668,1868.680324340008,29.475308641975307,13.117283950617288,0.701123595505618,0.6517857142857143,0.533179012345679,2592,minimal @@ -13,5 +14,6 @@ guanaco_33b,62.74944567627494,,910.8929739450112,,,0.6991150442477876,0.71951219 improved_lmsys_gpt4,62.34567901234568,13.938055555555556,5397.837981725772,,,0.7534883720930232,0.6727272727272727,0.5138888888888888,648,verified longest,62.19135802469136,0.0,0.0,37.808641975308646,0.0,1.0,0.8482142857142857,0.4166666666666667,2592,verified alpaca_farm,60.03086419753087,11.547508744135802,820.2330700344137,,,0.6,0.6339285714285714,0.5246913580246915,648,verified +chatgpt_fn,59.992283950617285,1.0088333333333337,529.928419875,36.88271604938272,27.73919753086419,0.6247191011235955,0.6517857142857143,0.4911265432098766,2592,minimal chatgpt,57.21450617283951,0.8342726921591347,284.9753823429895,39.35185185185185,34.080370942812976,0.5910112359550562,0.5625,0.488991888760139,2589,minimal cohere,53.39506172839506,3.452932098765432,216.8668793200617,,,0.503370786516854,0.5089285714285714,0.4737654320987654,648,verified
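To make the new `chatgpt_fn` configuration concrete, the sketch below sends the kind of function-calling request that `configs.yaml` describes and parses the reply with the same `outputs_to_match` regexes. It talks to the OpenAI chat-completions REST endpoint directly via `requests` rather than going through the project's `openai_completions` helper, and the prompt string is a shortened stand-in for the rendered `basic_function_prompt.txt`, so treat it as an approximation of the mechanism rather than the actual implementation.

```python
import json
import os
import re

import requests

# Shortened stand-in for basic_function_prompt.txt after filling
# {instruction}, {output_1} and {output_2}.
prompt = (
    "Select the output (a) or (b) that best matches the given instruction. ...\n"
    "## Which is best, Output (a) or Output (b)?"
)

payload = {
    # Parameters mirrored from chatgpt_fn/configs.yaml.
    "model": "gpt-3.5-turbo-16k-0613",
    "max_tokens": 50,
    "temperature": 0,
    "messages": [{"role": "user", "content": prompt}],
    "function_call": {"name": "print_best_model"},
    "functions": [
        {
            "name": "print_best_model",
            "description": "Print the best model given the preferred output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "best_output": {
                        "type": "string",
                        "description": "Name of the best output, should be 'Output (a)' or 'Output (b)'",
                    }
                },
                "required": ["best_output"],
            },
        }
    ],
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# With function calling, the answer arrives as a JSON string of arguments.
arguments = json.loads(response.json()["choices"][0]["message"]["function_call"]["arguments"])
answer = arguments["best_output"]  # e.g. "Output (a)"

# Same regexes as `outputs_to_match` in the config: 1 -> Output (a), 2 -> Output (b).
outputs_to_match = {1: r"(?i)output \(a\)", 2: r"(?i)output \(b\)"}
preference = next(k for k, pattern in outputs_to_match.items() if re.search(pattern, answer))
print(preference)
```

Constraining the reply through a function schema is what lets the parser get away with two simple regexes instead of free-text parsing.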