This directory contains end-to-end pipelines for AI-enhanced evaluation. We will introduce the evaluation pipeline and the data format in this document.
Make sure you have set up the OpenAI API key in your environment. Then run:

```bash
python qa_baseline_gpt35.py --question table/question.jsonl --output table/answer/answer_gpt35.jsonl
```
Unfortunately, Bard has not released a public API yet. You may have to enter the answers manually, or find a third-party project that interfaces with Bard.
To generate answers with Vicuna or other models, specify the path to the model checkpoint. Then run:

```bash
python model_qa.py --model-name /model/path --question-file table/question.jsonl --answer-file table/answer/answer.jsonl
```
PS: If you do not currently have access to the GPT-4 API but do have access to the GPT-4 chatbot, you can evaluate the answers manually, following the instructions in the Data Format section. The files in `table/review/*.jsonl` are examples of reviews.
TODO: add instructions
You can generate the data for the webpage by running:

```bash
python eval/generate_webpage_data_from_table.py
```

Then you can serve a static website from the `webpage` directory to see the results.
If you want to have a deeper understanding of our evaluation pipeline or want to contribute to the evaluation process, you need to learn the data format we used for evaluation.
Our evaluation data are encoded with JSON Lines.
We use the `shortuuid` Python library to generate short random UUIDs:

```python
import shortuuid
shortuuid.uuid() -> str
```
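For reference, JSON Lines stores one JSON object per line. A minimal read/write sketch using only the standard `json` module (the helper names are illustrative, not part of the pipeline):

```python
import json

def read_jsonl(path):
    """Read a JSON Lines file into a list of records (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def write_jsonl(path, records):
    """Write records to a JSON Lines file (one JSON object per line)."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```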
`model.jsonl` contains model information we used for generating answers. Each row contains a record of a model with the following fields:

- `model_id` (str): A unique ID for a model. Models with different IDs are supposed to have different performance. This ID is generated as `{model_name}:{model_version}`.
- `model_name` (str): The name of a model. This is not unique, because a model could be trained and updated continuously, but it is still considered the same model with different versions.
- `model_version` (str): The version of a model.
- `model_metadata` (Any): Any metadata of a model (descriptions etc). This is optional.
For example:

```json
{
  "model_id": "vicuna-13b:v1",
  "model_name": "vicuna-13b",
  "model_version": "v1",
  "model_metadata": "learning rate 1e-5, 3 epochs, 13b"
}
```
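A record like the one above can be assembled from its parts. A small sketch of the `{model_name}:{model_version}` ID convention (the helper function is illustrative, not part of the repo):

```python
def make_model_record(model_name, model_version, model_metadata=None):
    """Build a model.jsonl record; model_id follows '{model_name}:{model_version}'."""
    return {
        "model_id": f"{model_name}:{model_version}",
        "model_name": model_name,
        "model_version": model_version,
        "model_metadata": model_metadata,
    }

record = make_model_record("vicuna-13b", "v1", "learning rate 1e-5, 3 epochs, 13b")
assert record["model_id"] == "vicuna-13b:v1"
```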
We store prompts in `prompt.jsonl`. Each row contains a record of a prompt with the following fields:

- `prompt_id` (int): A unique integer ID for a prompt. Prompts with different IDs are supposed to have different purposes.
- `system_prompt` (str): The system prompt given to a model. This is the prompt that the model sees first.
- `prompt_template` (str): The prompt body. This is the user prompt that the model sees after the system prompt. It is a Python f-string template, so that we can fill in the inputs later.
- `defaults` (dict): A dictionary of default values for the prompt template. It can be empty.
- `description` (str): A description of the functionality of the prompt.
For example:

```json
{
  "prompt_id": 1,
  "system_prompt": "You are a helpful assistant.",
  "prompt_template": "[Question]\n{question}\n\n[Assistant 1]\n{answer_1}\n\n[End of Assistant 1]\n\n[Assistant 2]\n{answer_2}\n\n[End of Assistant 2]\n\n[System]\n{prompt}\n\n",
  "defaults": {"prompt": "Which assistant is more helpful?"},
  "description": "Compare two assistants' answers to a question."
}
```
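Using a record like the one above, the template placeholders can be filled with `str.format`, letting explicit inputs override the `defaults`. A minimal sketch (the question and answers are placeholders, and the helper is ours, not part of the repo):

```python
prompt_record = {
    "prompt_template": "[Question]\n{question}\n\n[Assistant 1]\n{answer_1}\n\n"
                       "[End of Assistant 1]\n\n[Assistant 2]\n{answer_2}\n\n"
                       "[End of Assistant 2]\n\n[System]\n{prompt}\n\n",
    "defaults": {"prompt": "Which assistant is more helpful?"},
}

def fill_prompt(record, **inputs):
    """Merge the record's defaults with explicit inputs, then fill the template."""
    values = {**record["defaults"], **inputs}
    return record["prompt_template"].format(**values)

filled = fill_prompt(prompt_record, question="What is 2+2?", answer_1="4", answer_2="5")
assert filled.startswith("[Question]\nWhat is 2+2?")
```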
`reviewer.jsonl` contains reviewer information we used for reviewing answers generated by different models. Each row contains a record of a reviewer with the following fields:

- `reviewer_id` (str): A unique ID for a reviewer. Reviewers with different IDs are supposed to have different reviewing performance.
- `prompt_id` (str): The ID of the prompt given to the reviewer (e.g., an AI assistant). Different prompts could result in different reviewing performance.
- `metadata` (dict): Metadata of a reviewer about its configurations.
- `description` (str): A description of the reviewer.
For example:

```json
{
  "reviewer_id": "gpt-4-0328-default",
  "prompt_id": 1,
  "metadata": {"temperature": 0.2, "max_tokens": 8192},
  "description": "GPT-4 for generic questions."
}
```
`question.jsonl` contains questions we used for evaluation. Each row contains a record of a question with the following fields:

- `question_id` (int): A unique integer for a question. Questions with different IDs are supposed to be different.
- `text` (str): The question text.
- `category` (str): The category of the question. Questions with the same category are supposed to be similar or originate from the same source.
`answer/xxx.jsonl` contains answers generated by different models. Each row contains a record of an answer with the following fields:

- `answer_id` (str): A unique UUID for an answer. Answers with different IDs are supposed to be different.
- `question_id` (int): The ID of the question the answer is generated for.
- `model_id` (str): The ID of the model the answer is generated by.
- `text` (str): The answer text.
- `metadata` (dict): Any metadata of the answer.
Example:

```json
{
  "answer_id": "[short uuid]",
  "question_id": 1,
  "model_id": "vicuna-13b:v1",
  "text": "Here are five tips...",
  "metadata": {}
}
```
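An answer record can be assembled as below. This is a sketch: the pipeline uses `shortuuid` for `answer_id`, but the stdlib `uuid4` hex stands in here to keep the example dependency-free, and the helper function is ours:

```python
import uuid

def make_answer_record(question_id, model_id, text, metadata=None):
    """Build an answer.jsonl record. The repo generates answer_id with
    shortuuid; uuid4().hex is a stdlib stand-in for this sketch."""
    return {
        "answer_id": uuid.uuid4().hex,
        "question_id": question_id,
        "model_id": model_id,
        "text": text,
        "metadata": metadata or {},
    }

record = make_answer_record(1, "vicuna-13b:v1", "Here are five tips...")
```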
`review/xxx.jsonl` contains reviews given by reviewers, comparing performance between a pair of models. Each row contains a record of a review with the following fields:

- `review_id` (str): A unique UUID for a review. Reviews with different IDs are supposed to be different.
- `question_id` (int): The ID of the question the review is given for.
- `answer1_id` (str): The ID of the first answer.
- `answer2_id` (str): The ID of the second answer.
- `text` (str): The review text.
- `score` (list): A list of scores given by the reviewer. The first score is for the first answer, and the second score is for the second answer.
- `reviewer_id` (str): The ID of the reviewer.
- `metadata` (dict): Any metadata of the review.
Example:

```json
{
  "review_id": "[short uuid]",
  "question_id": 1,
  "answer1_id": "[answer1_id]",
  "answer2_id": "[answer2_id]",
  "text": "Assistant 2 is better...",
  "score": [9.0, 7.5],
  "reviewer_id": "gpt-4-0328-default",
  "metadata": {}
}
```
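Pairwise `score` lists can then be aggregated into win/tie/loss counts for the first model against the second. A minimal sketch (the field name matches the schema above; the helper is illustrative):

```python
def tally_reviews(reviews):
    """Count wins/ties/losses for answer 1 vs. answer 2 across review records."""
    tally = {"win": 0, "tie": 0, "loss": 0}
    for review in reviews:
        score1, score2 = review["score"]
        if score1 > score2:
            tally["win"] += 1
        elif score1 < score2:
            tally["loss"] += 1
        else:
            tally["tie"] += 1
    return tally

reviews = [{"score": [9.0, 7.5]}, {"score": [6.0, 8.0]}, {"score": [7.0, 7.0]}]
assert tally_reviews(reviews) == {"win": 1, "tie": 1, "loss": 1}
```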