Structured output eval #152

Merged (7 commits, Nov 12, 2024)
2 changes: 1 addition & 1 deletion factgenie/config/default_prompts.yml
@@ -21,7 +21,7 @@ llm_eval: |
```
Instructions for annotating the text:

Output the errors as a JSON list "annotations" in which each object contains fields "reason", "text", and "type". The value of "reason" is the reason for the annotation. The value of "text" is the literal value of the text inside the highlighted span, so that the span can later be identified using string matching. The value of "type" is an integer index of the error based on the following list:
Output the errors as a JSON list "annotations" in which each object contains fields "reason", "text", and "annotation_type". The value of "reason" is the reason for the annotation. The value of "text" is the literal value of the text inside the highlighted span, so that the span can later be identified using string matching. The value of "annotation_type" is an integer index of the error based on the following list:

{error_list}

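The renamed "annotation_type" field gives the response a fixed JSON shape, which is what makes structured-output validation possible. Below is a minimal sketch of a schema mirroring the prompt's field names, assuming Pydantic; it is illustrative only, not factgenie's actual implementation.

```python
# Illustrative only: a Pydantic schema mirroring the JSON structure requested
# by the prompt above. Field names come from the prompt; the classes themselves
# are an assumption, not factgenie's actual code.
from typing import List
from pydantic import BaseModel

class Annotation(BaseModel):
    reason: str           # why the span was annotated
    text: str             # literal span text, used later for string matching
    annotation_type: int  # integer index into the configured error list

class AnnotationList(BaseModel):
    annotations: List[Annotation]

# Example: validate a raw model response (hypothetical variable `llm_output`)
# parsed = AnnotationList.model_validate_json(llm_output)
```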
6 changes: 3 additions & 3 deletions factgenie/config/llm-eval/example-ollama-llama3-eval.yaml
@@ -1,5 +1,5 @@
type: ollama_metric
model: llama3
model: llama3.1:8b
# You can run ollama also on a machine other than factgenie
# e.g. we run it on a machine tdll-3gpu3 and access it from any machine within the same firewall
# in that case we use api_url: http://tdll-3gpu3.ufal.hide.ms.mff.cuni.cz:11434/api/
@@ -33,7 +33,7 @@ prompt_template: |
```
{text}
```
Output the errors as a JSON list "annotations" in which each object contains fields "reason", "text", and "type". The value of "text" is the text of the error. The value of "reason" is the reason for the error. The value of "type" is one of {0, 1, 2, 3} based on the following list:
Output the errors as a JSON list "annotations" in which each object contains fields "reason", "text", and "annotation_type". The value of "text" is the text of the error. The value of "reason" is the reason for the error. The value of "annotation_type" is one of {0, 1, 2, 3} based on the following list:
- 0: Incorrect fact: The fact in the text contradicts the data.
- 1: Not checkable: The fact in the text cannot be checked in the data.
- 2: Misleading: The fact in the text is misleading in the given context.
@@ -54,6 +54,6 @@ prompt_template: |
Nokia 3310 is produced in Finland and features a 320x320 display. It is available in black color. The data seem to provide only partial information about the phone.
```
output:
```{ "annotations": [{"reason": "The country where the phone is produced is not mentioned in the data.", "text": "produced in Finland", "type": 1}, {"reason": "The data mentions that the display has resolution 320x240px.", "text": "320x320", type: 0}, {"reason": "Misleadingly suggests that the phone is not available in other colors.", "text": "available in black color", type: 2}, {"reason": "The note is irrelevant for the phone description.", "text": "The data seem to provide only partial information about the phone.", type: 3}] }
```{ "annotations": [{"reason": "The country where the phone is produced is not mentioned in the data.", "text": "produced in Finland", "annotation_type": 1}, {"reason": "The data mentions that the display has resolution 320x240px.", "text": "320x320", "annotation_type": 0}, {"reason": "Misleadingly suggests that the phone is not available in other colors.", "text": "available in black color", "annotation_type": 2}, {"reason": "The note is irrelevant for the phone description.", "text": "The data seem to provide only partial information about the phone.", "annotation_type": 3}] }
```
Note that some details may not be mentioned in the text: do not count omissions as errors. Also do not be too strict: some facts can be less specific than in the data (rounded values, shortened or abbreviated text, etc.), do not count these as errors. If there are no errors in the text, "annotations" will be an empty list.
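The prompt requires "text" to be the literal span from the evaluated output so that it can later be located by string matching. A rough post-processing sketch under that assumption follows; the helper below is illustrative, not code from this repository.

```python
# Hypothetical post-processing sketch: parse the model's JSON answer and locate
# each annotated span in the evaluated text via plain string matching.
import json

def locate_spans(llm_output: str, text: str):
    """Return (start, end, annotation_type) triples for spans found in `text`."""
    data = json.loads(llm_output)
    spans = []
    for ann in data.get("annotations", []):
        start = text.find(ann["text"])  # literal match, as the prompt requires
        if start == -1:
            continue                    # span not found verbatim; skip it
        spans.append((start, start + len(ann["text"]), ann["annotation_type"]))
    return spans
```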
@@ -25,7 +25,7 @@ prompt_template: |
```
{text}
```
Output the errors as a JSON list "annotations" in which each object contains fields "reason", "text", and "type". The value of "text" is the text of the error. The value of "reason" is the reason for the error. The value of "type" is one of {0, 1, 2, 3} based on the following list:
Output the errors as a JSON list "annotations" in which each object contains fields "reason", "text", and "annotation_type". The value of "text" is the text of the error. The value of "reason" is the reason for the error. The value of "annotation_type" is one of {0, 1, 2, 3} based on the following list:
- 0: Incorrect fact: The fact in the text contradicts the data.
- 1: Not checkable: The fact in the text cannot be checked in the data.
- 2: Misleading: The fact in the text is misleading in the given context.
@@ -46,6 +46,6 @@ prompt_template: |
Nokia 3310 is produced in Finland and features a 320x320 display. It is available in black color. The data seem to provide only partial information about the phone.
```
output:
```{ "annotations": [{"reason": "The country where the phone is produced is not mentioned in the data.", "text": "produced in Finland", "type": 1}, {"reason": "The data mentions that the display has resolution 320x240px.", "text": "320x320", type: 0}, {"reason": "Misleadingly suggests that the phone is not available in other colors.", "text": "available in black color", type: 2}, {"reason": "The note is irrelevant for the phone description.", "text": "The data seem to provide only partial information about the phone.", type: 3}] }
```{ "annotations": [{"reason": "The country where the phone is produced is not mentioned in the data.", "text": "produced in Finland", "annotation_type": 1}, {"reason": "The data mentions that the display has resolution 320x240px.", "text": "320x320", "annotation_type": 0}, {"reason": "Misleadingly suggests that the phone is not available in other colors.", "text": "available in black color", "annotation_type": 2}, {"reason": "The note is irrelevant for the phone description.", "text": "The data seem to provide only partial information about the phone.", "annotation_type": 3}] }
```
Note that some details may not be mentioned in the text: do not count omissions as errors. Also do not be too strict: some facts can be less specific than in the data (rounded values, shortened or abbreviated text, etc.), do not count these as errors. If there are no errors in the text, "annotations" will be an empty list.
59 changes: 59 additions & 0 deletions factgenie/config/llm-eval/example-vllm-llama3-eval.yaml
@@ -0,0 +1,59 @@
type: vllm_metric
model: meta-llama/Meta-Llama-3-8B-Instruct
# You can run vllm also on a machine other than factgenie
# e.g. we run it on a machine tdll-3gpu3 and access it from any machine within the same firewall
# in that case we use api_url: http://tdll-3gpu3.ufal.hide.ms.mff.cuni.cz:8000/v1/
# If you run vllm on the same machine as factgenie, just use localhost.
api_url: http://localhost:8000/v1/
model_args:
num_predict: 1024
temperature: 0.0
top_p: 1.0
top_k: 0.0
seed: 42
annotation_span_categories:
- name: "Incorrect"
color: "#ffbcbc"
description: "The fact in the text contradicts the data."
- name: "Not checkable"
color: "#e9d2ff"
description: "The fact in the text cannot be checked given the data."
- name: "Misleading"
color: "#fff79f"
description: "The fact in the text is misleading in the given context."
- name: "Other"
color: "#bbbbbb"
description: "The text is problematic for another reason, e.g. grammatically or stylistically incorrect, irrelevant, or repetitive."
prompt_template: |
Given the data:
```
{data}
```
Annotate all the errors in the following text:
```
{text}
```
Output the errors as a JSON list "annotations" in which each object contains fields "reason", "text", and "annotation_type". The value of "text" is the text of the error. The value of "reason" is the reason for the error. The value of "annotation_type" is one of {0, 1, 2, 3} based on the following list:
- 0: Incorrect fact: The fact in the text contradicts the data.
- 1: Not checkable: The fact in the text cannot be checked in the data.
- 2: Misleading: The fact in the text is misleading in the given context.
- 3: Other: The text is problematic for another reason, e.g. grammatically or stylistically incorrect, irrelevant, or repetitive.

The list should be sorted by the position of the error in the text. Make sure that the annotations are not overlapping.

*Example:*
data:
```
Nokia 3310
-----
- **color**: black, blue, grey
- **display**: 320x240px
```
text (product description):
```
Nokia 3310 is produced in Finland and features a 320x320 display. It is available in black color. The data seem to provide only partial information about the phone.
```
output:
```{ "annotations": [{"reason": "The country where the phone is produced is not mentioned in the data.", "text": "produced in Finland", "annotation_type": 1}, {"reason": "The data mentions that the display has resolution 320x240px.", "text": "320x320", "annotation_type": 0}, {"reason": "Misleadingly suggests that the phone is not available in other colors.", "text": "available in black color", "annotation_type": 2}, {"reason": "The note is irrelevant for the phone description.", "text": "The data seem to provide only partial information about the phone.", "annotation_type": 3}] }
```
Note that some details may not be mentioned in the text: do not count omissions as errors. Also do not be too strict: some facts can be less specific than in the data (rounded values, shortened or abbreviated text, etc.), do not count these as errors. If there are no errors in the text, "annotations" will be an empty list.
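The new config targets a vLLM server through its OpenAI-compatible endpoint. A sketch of how such a server could be queried with these settings is shown below; the exact request factgenie builds (and how it fills {data} and {text}) is not part of this diff, so treat the details as assumptions.

```python
# Sketch only: querying a vLLM OpenAI-compatible server with the settings from
# the config above. How factgenie actually builds and sends the request is not
# shown in this diff; the prompt contents here are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1/", api_key="EMPTY")  # vLLM ignores the key

prompt = "..."  # prompt_template with {data} and {text} filled in

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
    top_p=1.0,
    max_tokens=1024,
    seed=42,
)
print(response.choices[0].message.content)  # expected: the JSON "annotations" object
```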
2 changes: 1 addition & 1 deletion factgenie/data/datasets_TEMPLATE.yml
@@ -11,4 +11,4 @@ example-dataset-id:
- list
- of
- dataset
- splits
- splits