Benchmarking of LLMs
- Entailed Polarity Dataset: https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/entailed_polarity/task.json
- Analytical Entailment Dataset: https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/analytic_entailment/task.json
- Link to the original dataset used: Original Dataset
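
Both tasks ship as BIG-bench `task.json` files. A minimal loading sketch, assuming the standard BIG-bench task layout (a top-level `"examples"` list whose entries carry an `"input"` string and a `"target_scores"` mapping over the answer options; the exact option keys can differ between tasks, so inspect one example first). The local paths are placeholders.

```python
import json

def load_examples(path):
    # Assumes the standard BIG-bench task.json layout with an "examples" list.
    with open(path, "r", encoding="utf-8") as f:
        task = json.load(f)
    return task["examples"]

examples = load_examples("entailed_polarity/task.json")  # hypothetical local path
print(examples[0])  # e.g. {"input": "...", "target_scores": {"Yes": ..., "No": ...}}
```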
- Type of benchmarking: entailment recognition, answered as Yes/No.
- Models: https://learn.microsoft.com/en-us/azure/cognitive-services/openai/concepts/models#gpt-3-models
- text-davinci-003
- text-curie-001
- text-babbage-001
- text-ada-001
- gpt-35-turbo
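
A hedged sketch of querying one of the listed models through Azure OpenAI, using the pre-1.0 `openai` Python SDK in Azure mode. The endpoint, API version, and the assumption that deployment names match the model names above are placeholders; `gpt-35-turbo` is normally accessed through the chat completions API instead.

```python
import os
import openai

# Azure OpenAI configuration (values are placeholders for this sketch).
openai.api_type = "azure"
openai.api_base = os.environ["AZURE_OPENAI_ENDPOINT"]  # e.g. https://<resource>.openai.azure.com/
openai.api_version = "2023-05-15"
openai.api_key = os.environ["AZURE_OPENAI_KEY"]

MODELS = ["text-davinci-003", "text-curie-001", "text-babbage-001",
          "text-ada-001", "gpt-35-turbo"]

def complete(deployment, prompt):
    # Greedy, short completions: a single yes/no answer is all we need.
    response = openai.Completion.create(
        engine=deployment,  # Azure expects the deployment name here
        prompt=prompt,
        max_tokens=5,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()

print(complete("text-davinci-003",
               "The cat is on the mat. Is there a cat? yes or no?"))
```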
- Prompt1: "{premise}. {hypothesis} yes or no?", # no instruction; the response options (yes/no) are given inline (see the assembly sketch after this list)
- Prompt2: "{prefix}. {premise}. {hypothesis}.", # instruction given only as a prefix
- Prompt3: "{premise}. {intermediate}. {hypothesis}.", # instruction given only between premise and hypothesis
- Prompt4: "{premise}. {hypothesis}? {suffix}.", # instruction given only as a suffix
- prefix = "Given the fact, answer the following question with yes/no."
- suffix = "Given the previous fact, answer the following question with yes/no."
- intermediate = "Given the previous fact, answer the following question with yes/no."
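
A sketch of how the four templates above can be assembled and scored. The `complete` helper from the Azure sketch and the assumption that a premise/hypothesis pair and a gold Yes/No label can be recovered from each example are hypothetical; only the templates and instruction strings come from the list above.

```python
PREFIX = "Given the fact, answer the following question with yes/no."
SUFFIX = "Given the previous fact, answer the following question with yes/no."
INTERMEDIATE = "Given the previous fact, answer the following question with yes/no."

def build_prompts(premise, hypothesis):
    # The four prompt variants described above, filled from one example.
    return {
        "prompt1": f"{premise}. {hypothesis} yes or no?",
        "prompt2": f"{PREFIX}. {premise}. {hypothesis}.",
        "prompt3": f"{premise}. {INTERMEDIATE}. {hypothesis}.",
        "prompt4": f"{premise}. {hypothesis}? {SUFFIX}.",
    }

def to_yes_no(completion):
    # Map a free-text completion onto the Yes/No label space.
    text = completion.strip().lower()
    if text.startswith("yes"):
        return "Yes"
    if text.startswith("no"):
        return "No"
    return None  # unparsable answers count as incorrect

def accuracy(model, examples, variant="prompt1"):
    # Hypothetical evaluation loop: `complete` is the Azure helper sketched
    # earlier, and each example is assumed to expose premise/hypothesis/gold.
    correct = 0
    for ex in examples:
        prompt = build_prompts(ex["premise"], ex["hypothesis"])[variant]
        if to_yes_no(complete(model, prompt)) == ex["gold"]:
            correct += 1
    return correct / len(examples)
```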