diff --git a/README.md b/README.md
index 97ebbc9d..803cea6c 100644
--- a/README.md
+++ b/README.md
@@ -15,7 +15,8 @@ AlpacaEval provides the following:
 - [**Automatic evaluator**](#evaluators): an automatic evaluator that has high agreement with humans (validated on 20K
   annotations). We evaluate a model by
- measuring the fraction of times an powerful LLM (e.g. GPT 4 or Claude) prefers the outputs from that model over
+ measuring the fraction of times a powerful LLM (e.g., GPT-4, Claude, or ChatGPT) prefers the outputs from that model
+ over
  outputs from a reference model. Our evaluators enable caching and output randomization by default.
 - [**Leaderboard**](https://tatsu-lab.github.io/alpaca_eval/): a leaderboard of common models on the AlpacaEval
   evaluation set.
@@ -67,6 +68,7 @@ Details in [limitations](#limitations).
 - [Data Release](#data-release)
 - [Differences with AlpacaFarm](#differences-with-alpacafarm)
 - [Related work](#related-work)
+ - [Major updates](#major-updates)
@@ -97,11 +99,12 @@ Important parameters are the following:
 - **model_outputs** : A path to a json file for the outputs of the model to add to the leaderboard. Each dictionary
   should contain the keys `instruction` and `output`.
-- **annotators_config**: This is the annotator to use (e.g., `alpaca_eval_gpt4` or `claude`). `alpaca_eval_gpt4` (
+- **annotators_config**: This is the annotator to use (e.g., `alpaca_eval_gpt4` or `claude`
+ or `chatgpt_fn`). `alpaca_eval_gpt4` (
  default) has the
- highest agreement rate with our human annotation data. `claude` has a decent agreement and is free for academics. For
- a comparison of
- annotators see [here](#evaluators).
+ highest agreement rate with our human annotation data. `claude` has a decent agreement and is free for
+ academics. `chatgpt_fn` is the worst of the three, but is available to everyone, cheap, and has a 2x larger context
+ window (16K tokens). For a comparison of annotators see [here](#evaluators).
 - **reference_outputs**: The outputs of the reference model. Same format as `model_outputs`. By default, this
   is `text-davinci003` outputs on AlpacaEval dataset.
@@ -145,8 +148,9 @@ For more information about each function use `alpaca_eval -- --help`.
 ## Models
 Our leaderboards are computed on the [AlpacaEval dataset](https://huggingface.co/datasets/tatsu-lab/alpaca_eval).
-We precomputed the leaderboard for important models both using `gpt4` (best quality) and `claude` (free for academics,
-and high quality). Our full leaderboards can be found at [on this page](https://tatsu-lab.github.io/alpaca_eval/), but
+We precomputed the leaderboard for important models using `alpaca_eval_gpt4` (best quality), `claude` (free for
+academics and high quality), and `chatgpt_fn` (cheap and available for everyone). Our full leaderboards can be found
+[on this page](https://tatsu-lab.github.io/alpaca_eval/), but
 we give minimal leaderboards below.
 Later we also show how to [add your model](https://github.com/tatsu-lab/alpaca_eval#evaluating-a-model) to the
 leaderboard and how to make
@@ -241,6 +245,26 @@ Details in [Related work](#related-work).
+
+ chatgpt_fn minimal leaderboard + +| | Win Rate | Std Err. | +|:----------------------|---------:|---------:| +| gpt4 | 73.8 | 1.5 | +| claude | 70.4 | 1.6 | +| chatgpt | 66.1 | 1.7 | +| wizardlm-13b | 65.2 | 1.7 | +| vicuna-13b | 64.1 | 1.7 | +| guanaco-65b | 62.4 | 1.7 | +| oasst-rlhf-llama-33b | 62.0 | 1.7 | +| alpaca-farm-ppo-human | 60.2 | 1.7 | +| falcon-40b-instruct | 56.5 | 1.7 | +| text_davinci_003 | 50.0 | 0.0 | +| alpaca-7b | 45.2 | 1.7 | +| text_davinci_001 | 28.1 | 1.6 | + +
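As a usage sketch for the `model_outputs` and `annotators_config` parameters described above: the snippet below writes a tiny outputs file and invokes the CLI on it. It assumes the `alpaca_eval` CLI is installed (`pip install alpaca-eval`) and that an OpenAI key is set in the environment; the file name, the two toy records, and the instance limit are illustrative only.

```python
import json
import subprocess

# Toy model outputs: each record needs the keys `instruction` and `output`,
# as described for `model_outputs` above. These records are illustrative
# placeholders, not real model generations.
model_outputs = [
    {"instruction": "Name one use of a paperclip.", "output": "Holding sheets of paper together."},
    {"instruction": "Give a synonym for 'fast'.", "output": "Quick."},
]

with open("example_outputs.json", "w") as f:
    json.dump(model_outputs, f, indent=2)

# Run the evaluator on the file. `chatgpt_fn` is the cheap, widely available
# annotator added in this change; `alpaca_eval_gpt4` has the highest human
# agreement. `--max_instances` keeps the run small while testing.
subprocess.run(
    [
        "alpaca_eval",
        "--model_outputs", "example_outputs.json",
        "--annotators_config", "chatgpt_fn",
        "--max_instances", "2",
    ],
    check=True,
)
```

The same flags can of course be passed directly on the command line; wrapping them in `subprocess` just keeps the outputs file and the call in one place.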
+ ## Evaluators We evaluate different automatic annotators on the AlpacaEval set by comparing to @@ -250,7 +274,7 @@ Below we show metrics for our suggested evaluator (`alpaca_eval_gpt4`), for prio automatic evaluators ([`alpaca_farm_greedy_gpt4`](https://github.com/tatsu-lab/alpaca_farm),[`aviary_gpt4`](https://aviary.anyscale.com/),[`lmsys_gpt4`](https://chat.lmsys.org/)), for humans (`humans`), and for different base models with essentially the same -prompt (`gpt4`,`claude`,`text_davinci_003`,`guanaco_33b`, `chatgpt`). +prompt (`gpt4`,`claude`,`text_davinci_003`,`chatgpt_fn`,`guanaco_33b`, `chatgpt`). See [here](https://github.com/tatsu-lab/alpaca_eval/tree/main/src/alpaca_eval/evaluators_configs) for the configs of all evaluators that are available out of the box and their associated metrics. @@ -260,11 +284,11 @@ evaluators that are available out of the box and their associated metrics. | aviary_gpt4 | 69.1 | 12.8 | 1869 | 29.5 | 13.1 | 0.70 | | gpt4 | 66.9 | 12.5 | 1037 | 31.5 | 14.6 | 0.65 | | alpaca_farm_greedy_gpt4 | 66.4 | 15.3 | 878 | 30.2 | 19.3 | 0.60 | -| humans | 65.7 | 300.0 | 36800 | 0.0 | | 0.64 | +| humans | 65.7 | 300.0 | 36800 | 0.0 | 34.3 | 0.64 | | claude | 65.5 | 11.1 | 173 | 31.9 | 18.0 | 0.62 | | text_davinci_003 | 64.1 | 8.7 | 121 | 33.8 | 22.7 | 0.70 | | lmsys_gpt4 | 63.2 | 13.9 | 17982 | 34.7 | 16.1 | 0.74 | -| guanaco_33b | 59.1 | | 930 | 54.5 | 27.1 | 0.70 | +| chatgpt_fn | 60.0 | 1.0 | 530 | 36.9 | 27.7 | 0.62 | | chatgpt | 57.2 | 0.8 | 285 | 39.4 | 34.1 | 0.59 |
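The "Human agreement [%]" column above is, roughly, how often an automatic annotator picks the same output as the majority of the human annotators for that instruction. A minimal sketch of that idea follows; it is a simplification of the actual protocol (which uses held-out human annotations and multiple seeds), so treat it as illustration rather than the repository's implementation.

```python
from collections import Counter

def human_agreement(auto_prefs, human_prefs_per_example):
    """Percentage of examples where the automatic preference matches the human majority.

    Preferences are encoded as 1 (Output (a)) or 2 (Output (b)).
    """
    matches = 0
    for auto, humans in zip(auto_prefs, human_prefs_per_example):
        majority, _ = Counter(humans).most_common(1)[0]
        matches += int(auto == majority)
    return 100 * matches / len(auto_prefs)

# Toy data: 3 instructions, each with 4 human annotations.
auto = [1, 2, 1]
humans = [[1, 1, 2, 1], [2, 2, 1, 2], [2, 1, 2, 2]]
print(round(human_agreement(auto, humans), 1))  # 66.7
```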
@@ -360,8 +384,9 @@ due to resource (time and price) constraints. This explains why the #parsed is 648.
Tips for choosing evaluators -Overall we recommend using `annotators_config=alpaca_eval_gpt4` if you want the highest agreement with humans, and -`annotators_config=claude` if you have academic (free) access to Claude and have a low budget. +Overall we recommend using `annotators_config=alpaca_eval_gpt4` if you want the highest agreement with humans, +`annotators_config=claude` if you have academic (free) access to Claude and have a low budget, and +`annotators_config=chatgpt_fn` if you don't have access to the other two models. When choosing an annotator we recommend you to consider the following (the first three are obvious): @@ -434,7 +459,7 @@ Details in [limitations](#limitations). [//]: # () -[//]: # ( key) `alpaca_eval --model_outputs 'example/outputs.json' --annotators_config 'text_davinci_003' --max_instances 3 --caching_path None`) +[//]: # ( key) `alpaca_eval --model_outputs 'example/outputs.json' --annotators_config 'text_davinci_003' ~~--max_instances 3~~ --caching_path None`) [//]: # () @@ -611,7 +636,8 @@ directly use `alpaca_eval evaluate_from_model` to also take care of generating o want to use a different model or a different dataset follow the same steps as (1.). 3. Choose an evaluator specified via `annotators_config`. We recommend using `alpaca_eval_gpt4` or `claude` (if you are an - academic). For options and comparisons see [this table](#evaluators). Depending on the evaluator you might need to + academic) or `chatgpt_fn` (if you don't have access to the other two). For options and comparisons + see [this table](#evaluators). Depending on the evaluator you might need to set the appropriate API_KEY in your environment or [here](https://github.com/tatsu-lab/alpaca_eval/blob/main/src/alpaca_eval/constants.py#L7). @@ -1024,9 +1050,9 @@ downloading [alpaca_eval_all_outputs.json](https://huggingface.co/datasets/tatsu ```bash alpaca_eval make_leaderboard \ - --leaderboard_path \ + --leaderboard_path src/alpaca_eval/leaderboards/data_AlpacaEval/_leaderboard.csv \ --all_model_outputs alpaca_eval_all_outputs.json \ - --annotators_config + --annotators_config ``` Then, please create a PR with the annotator config and leaderboard csv. @@ -1249,3 +1275,15 @@ For example: annotators favor style (e.g. use of list, tone, word choice, length) over factuality.
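For the `make_leaderboard` command quoted above, here is a sketch of a complete invocation. The leaderboard path and annotator name are hypothetical placeholders (substitute your own), and `alpaca_eval_all_outputs.json` is the file downloaded in the step described above.

```python
import subprocess

# Hypothetical placeholder values -- substitute your own.
leaderboard_csv = "my_leaderboard.csv"    # where the resulting leaderboard CSV is written
annotators_config = "chatgpt_fn"          # any config under src/alpaca_eval/evaluators_configs/

subprocess.run(
    [
        "alpaca_eval", "make_leaderboard",
        "--leaderboard_path", leaderboard_csv,
        "--all_model_outputs", "alpaca_eval_all_outputs.json",
        "--annotators_config", annotators_config,
    ],
    check=True,
)
```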
+
+
+# Major updates
+
+- 19th June 2023: add a `chatgpt_fn` leaderboard that anyone can use (no waiting lists).
+- 19th June 2023: add support
+  for [OpenAI's function calling](https://openai.com/blog/function-calling-and-other-api-updates).
+  Examples: [`chatgpt_fn`](https://github.com/tatsu-lab/alpaca_eval/tree/main/src/alpaca_eval/evaluators_configs/chatgpt_fn)
+  and [`alpaca_eval_gpt4_fn`](https://github.com/tatsu-lab/alpaca_eval/tree/main/src/alpaca_eval/evaluators_configs/alpaca_eval_gpt4_fn).
+
\ No newline at end of file diff --git a/docs/index.html b/docs/index.html index 07f2b5a5..68ee6333 100644 --- a/docs/index.html +++ b/docs/index.html @@ -8,7 +8,11 @@ + gpt4Radio.addEventListener('click', function () { + currentUrl = urls['gpt4']; + updateTable(currentUrl); + }); + + claudeRadio.addEventListener('click', function () { + currentUrl = urls['claude']; + updateTable(currentUrl); + }); + + // chatgptRadio.addEventListener('click', function () { + // currentUrl = urls['chatgpt']; + // updateTable(currentUrl); + // }); + + communityRadio.addEventListener('click', function () { + updateTable(currentUrl); + }); + + verifiedRadio.addEventListener('click', function () { + updateTable(currentUrl); + }); + + minimalRadio.addEventListener('click', function () { + updateTable(currentUrl); + }); + diff --git a/src/alpaca_eval/evaluators_configs/README.md b/src/alpaca_eval/evaluators_configs/README.md index 2479ceb2..f1d884d1 100644 --- a/src/alpaca_eval/evaluators_configs/README.md +++ b/src/alpaca_eval/evaluators_configs/README.md @@ -7,33 +7,32 @@ annotators. We compute those metrics on our suggested evaluator `alpaca_eval_gpt4`, on prior evaluators (`aviary_gpt4`, `lmsys_gpt4`, `alpaca_farm_greedy_gpt4`), and on different base models with which we use essentially the same prompt (`gpt4`, `text_davinci_003`, `claude`, `chatgpt`). +We also provide partial metrics (only 1 seed) for other evaluators, which include our evaluator using OpenAI's function +calls (`alpaca_eval_gpt4_fn`), prior work that we +improved (`improved_aviary_gpt4` and `improved_lmsys_gpt4`), prior work that was not meant to be used as a final +evaluator (`guanaco_33b`), and a ranking evaluator (`alpaca_farm`), and secondary models that use the same prompt as the +models above (`cohere`, `guanaco_33b`): | | Human agreement [%] | Price [$/1000 examples] | Time [seconds/1000 examples] | Bias | Variance | Proba. prefer longer | Proba. prefer lists | Proba. 
prefer 1 | # parsed | mode | |:------------------------|--------------------:|------------------------:|-----------------------------:|-----:|---------:|---------------------:|--------------------:|----------------:|---------:|:---------| +| alpaca_eval_gpt4_fn | 71.0 | 14.5 | 5046 | 27.6 | 11.1 | 0.75 | 0.63 | 0.48 | 2592 | verified | +| improved_aviary_gpt4 | 69.8 | 12.8 | 1831 | | | 0.73 | 0.68 | 0.49 | 648 | verified | | alpaca_eval_gpt4 | 69.2 | 13.6 | 1455 | 28.4 | 14.6 | 0.68 | 0.69 | 0.50 | 2592 | minimal | | aviary_gpt4 | 69.1 | 12.8 | 1869 | 29.5 | 13.1 | 0.70 | 0.65 | 0.53 | 2592 | minimal | +| claude_ranking | 67.6 | 5.0 | 218 | | | 0.73 | 0.63 | 0.46 | 648 | verified | | gpt4 | 66.9 | 12.5 | 1037 | 31.5 | 14.6 | 0.65 | 0.61 | 0.54 | 2592 | minimal | | alpaca_farm_greedy_gpt4 | 66.4 | 15.3 | 878 | 30.2 | 19.3 | 0.60 | 0.59 | 0.54 | 2592 | minimal | -| humans | 65.7 | 300.0 | 36800 | 0.0 | | 0.64 | 0.61 | 0.52 | 2592 | minimal | +| humans | 65.7 | 300.0 | 36800 | 0.0 | 34.3 | 0.64 | 0.61 | 0.52 | 2592 | minimal | | claude | 65.5 | 11.1 | 173 | 31.9 | 18.0 | 0.62 | 0.58 | 0.49 | 2592 | minimal | | text_davinci_003 | 64.1 | 8.7 | 121 | 33.8 | 22.7 | 0.70 | 0.64 | 0.47 | 2592 | minimal | | lmsys_gpt4 | 63.2 | 13.9 | 17982 | 34.7 | 16.1 | 0.74 | 0.64 | 0.56 | 2592 | minimal | +| guanaco_33b | 62.7 | | 911 | | | 0.70 | 0.72 | 0.43 | 451 | verified | +| improved_lmsys_gpt4 | 62.3 | 13.9 | 5398 | | | 0.75 | 0.67 | 0.51 | 648 | verified | | longest | 62.2 | 0.0 | 0 | 37.8 | 0.0 | 1.00 | 0.85 | 0.42 | 2592 | verified | +| alpaca_farm | 60.0 | 11.5 | 820 | | | 0.60 | 0.63 | 0.52 | 648 | verified | +| chatgpt_fn | 60.0 | 1.0 | 530 | 36.9 | 27.7 | 0.62 | 0.65 | 0.49 | 2592 | minimal | | chatgpt | 57.2 | 0.8 | 285 | 39.4 | 34.1 | 0.59 | 0.56 | 0.49 | 2589 | minimal | - -We also provide partial metrics (only 1 seed) for the following evaluators, which include prior work that we -improved (`improved_aviary_gpt4` and `improved_lmsys_gpt4`), prior work that was not meant to be used as a final -evaluator (`guanaco_33b`), and a ranking evaluator (`alpaca_farm`), and secondary models that use the same prompt as the -models above (`cohere`, `guanaco_33b`): - -| | Human agreement [%] | Price [$/1000 examples] | Time [seconds/1000 examples] | Bias | Variance | Proba. prefer longer | Proba. prefer lists | Proba. prefer 1 | # parsed | mode | -|:---------------------|--------------------:|------------------------:|-----------------------------:|-----:|---------:|---------------------:|--------------------:|----------------:|---------:|:---------| -| improved_aviary_gpt4 | 69.8 | 12.8 | 1831 | | | 0.73 | 0.68 | 0.49 | 648 | verified | -| claude_ranking | 67.6 | 5.0 | 218 | | | 0.73 | 0.63 | 0.46 | 648 | verified | -| guanaco_33b | 62.7 | | 911 | | | 0.70 | 0.72 | 0.43 | 451 | verified | -| improved_lmsys_gpt4 | 62.3 | 13.9 | 5398 | | | 0.75 | 0.67 | 0.51 | 648 | verified | -| alpaca_farm | 60.0 | 11.5 | 820 | | | 0.60 | 0.63 | 0.52 | 648 | verified | -| cohere | 53.4 | 3.5 | 217 | | | 0.50 | 0.51 | 0.47 | 648 | verified | +| cohere | 53.4 | 3.5 | 217 | | | 0.50 | 0.51 | 0.47 | 648 | verified | [//]: # (| | Human agreement [%] | Price [$/1000 examples] | Time [seconds/1000 examples] | Bias | Variance | Proba. prefer longer | Proba. prefer lists | Proba. 
prefer 1 | # parsed | mode |) diff --git a/src/alpaca_eval/evaluators_configs/chatgpt_fn/basic_function_prompt.txt b/src/alpaca_eval/evaluators_configs/chatgpt_fn/basic_function_prompt.txt new file mode 100644 index 00000000..f097efe4 --- /dev/null +++ b/src/alpaca_eval/evaluators_configs/chatgpt_fn/basic_function_prompt.txt @@ -0,0 +1,35 @@ +<|im_start|>system +You are a helpful instruction-following assistant that prints the best model by selecting the best outputs for a given instruction. +<|im_end|> +<|im_start|>user +Select the output (a) or (b) that best matches the given instruction. Choose your preferred output, which can be subjective. Your answer should ONLY contain: Output (a) or Output (b). Here's an example: + +# Example: +## Instruction: +Give a description of the following job: "ophthalmologist" + +## Output (a): +An ophthalmologist is a medical doctor who specializes in the diagnosis and treatment of eye diseases and conditions. + +## Output (b): +An ophthalmologist is a medical doctor who pokes and prods at your eyes while asking you to read letters from a chart. + +## Which is best, Output (a) or Output (b)? +Output (a) + +Here the answer is Output (a) because it provides a comprehensive and accurate description of the job of an ophthalmologist. In contrast, output (b) is more of a joke. + +# Task: +Now is the real task, do not explain your answer, just say Output (a) or Output (b). + +## Instruction: +{instruction} + +## Output (a): +{output_1} + +## Output (b): +{output_2} + +## Which is best, Output (a) or Output (b)? +<|im_end|> \ No newline at end of file diff --git a/src/alpaca_eval/evaluators_configs/chatgpt_fn/configs.yaml b/src/alpaca_eval/evaluators_configs/chatgpt_fn/configs.yaml new file mode 100644 index 00000000..b857a327 --- /dev/null +++ b/src/alpaca_eval/evaluators_configs/chatgpt_fn/configs.yaml @@ -0,0 +1,24 @@ +chatgpt_fn: + prompt_template: "chatgpt_fn/basic_function_prompt.txt" + fn_completions: "openai_completions" + completions_kwargs: + model_name: "gpt-3.5-turbo-16k-0613" + max_tokens: 50 + temperature: 0 + function_call: + name: "print_best_model" + functions: + - name: "print_best_model" + description: "Print the best model given the preferred output." + parameters: + type: "object" + properties: + best_output: + type: "string" + description: "Name of the best output, should be 'Output (a)' or 'Output (b)'" + "required": [ "best_output" ] + completion_parser_kwargs: + outputs_to_match: + 1: '(?i)output \(a\)' + 2: '(?i)output \(b\)' + batch_size: 1 diff --git a/src/alpaca_eval/leaderboards/evaluators/evaluators_leaderboard.csv b/src/alpaca_eval/leaderboards/evaluators/evaluators_leaderboard.csv index f3b42aa7..5e20dcca 100644 --- a/src/alpaca_eval/leaderboards/evaluators/evaluators_leaderboard.csv +++ b/src/alpaca_eval/leaderboards/evaluators/evaluators_leaderboard.csv @@ -1,4 +1,5 @@ ,Human agreement [%],Price [$/1000 examples],Time [seconds/1000 examples],Bias,Variance,Proba. prefer longer,Proba. prefer lists,Proba. 
prefer 1,# parsed,mode +alpaca_eval_gpt4_fn,70.98765432098766,14.471944444444444,5046.056233910331,27.623456790123456,11.111111111111104,0.750561797752809,0.6339285714285714,0.4799382716049383,2592,verified improved_aviary_gpt4,69.75308641975309,12.781435185185186,1831.2850013,,,0.7280898876404495,0.6785714285714286,0.4861111111111111,648,verified alpaca_eval_gpt4,69.1743827160494,13.601944444444444,1455.4169713998845,28.395061728395067,14.621913580246916,0.6831460674157304,0.6875,0.5011574074074074,2592,minimal aviary_gpt4,69.05864197530865,12.781666666666668,1868.680324340008,29.475308641975307,13.117283950617288,0.701123595505618,0.6517857142857143,0.533179012345679,2592,minimal @@ -13,5 +14,6 @@ guanaco_33b,62.74944567627494,,910.8929739450112,,,0.6991150442477876,0.71951219 improved_lmsys_gpt4,62.34567901234568,13.938055555555556,5397.837981725772,,,0.7534883720930232,0.6727272727272727,0.5138888888888888,648,verified longest,62.19135802469136,0.0,0.0,37.808641975308646,0.0,1.0,0.8482142857142857,0.4166666666666667,2592,verified alpaca_farm,60.03086419753087,11.547508744135802,820.2330700344137,,,0.6,0.6339285714285714,0.5246913580246915,648,verified +chatgpt_fn,59.992283950617285,1.0088333333333337,529.928419875,36.88271604938272,27.73919753086419,0.6247191011235955,0.6517857142857143,0.4911265432098766,2592,minimal chatgpt,57.21450617283951,0.8342726921591347,284.9753823429895,39.35185185185185,34.080370942812976,0.5910112359550562,0.5625,0.488991888760139,2589,minimal cohere,53.39506172839506,3.452932098765432,216.8668793200617,,,0.503370786516854,0.5089285714285714,0.4737654320987654,648,verified
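To make the new `chatgpt_fn` configuration concrete, the sketch below sends the kind of function-calling request that `configs.yaml` describes and parses the reply with the same `outputs_to_match` regexes. It talks to the OpenAI chat-completions REST endpoint directly via `requests` rather than going through the project's `openai_completions` helper, and the prompt string is a shortened stand-in for the rendered `basic_function_prompt.txt`, so treat it as an approximation of the mechanism rather than the actual implementation.

```python
import json
import os
import re

import requests

# Shortened stand-in for basic_function_prompt.txt after filling
# {instruction}, {output_1} and {output_2}.
prompt = (
    "Select the output (a) or (b) that best matches the given instruction. ...\n"
    "## Which is best, Output (a) or Output (b)?"
)

payload = {
    # Parameters mirrored from chatgpt_fn/configs.yaml.
    "model": "gpt-3.5-turbo-16k-0613",
    "max_tokens": 50,
    "temperature": 0,
    "messages": [{"role": "user", "content": prompt}],
    "function_call": {"name": "print_best_model"},
    "functions": [
        {
            "name": "print_best_model",
            "description": "Print the best model given the preferred output.",
            "parameters": {
                "type": "object",
                "properties": {
                    "best_output": {
                        "type": "string",
                        "description": "Name of the best output, should be 'Output (a)' or 'Output (b)'",
                    }
                },
                "required": ["best_output"],
            },
        }
    ],
}

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()

# With function calling, the answer arrives as a JSON string of arguments.
arguments = json.loads(response.json()["choices"][0]["message"]["function_call"]["arguments"])
answer = arguments["best_output"]  # e.g. "Output (a)"

# Same regexes as `outputs_to_match` in the config: 1 -> Output (a), 2 -> Output (b).
outputs_to_match = {1: r"(?i)output \(a\)", 2: r"(?i)output \(b\)"}
preference = next(k for k, pattern in outputs_to_match.items() if re.search(pattern, answer))
print(preference)
```

Constraining the reply through a function schema is what lets the parser get away with two simple regexes instead of free-text parsing.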