A library of five metrics for evaluating large language models' pragmatic competence:
- Naturalness: the LLM generates a surprisal score for each sentence in a minimal pair as a proxy for text naturalness, reflecting how unexpected the sentence is given the preceding context. If an LLM is pragmatically sensitive, it should assign a lower surprisal score to the intended implied meaning in an appropriate context.
- Sensitivity to different Shades of Meaning (SSM)
- Pragmatic Reasoning Chains (PRC)
- Implicature Recovery Rate (IRR)
- Pragmatic Sensitivity Index (PSI)
Benchmark datasets (work in progress), examples, and documentation are also provided.
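The Naturalness metric above can be sketched as follows. This is a minimal illustration, not the library's implementation: it uses a hypothetical hand-built table of word log-probabilities in place of a real LLM, which would instead supply token log-probabilities conditioned on the preceding context. The example sentences and all probability values are invented for illustration.

```python
import math

# Hypothetical per-word log-probabilities, standing in for an LLM's
# context-conditioned token log-probabilities. Context (assumed):
# "Some of the students passed." (scalar implicature: not all passed)
TOY_LOGPROBS = {
    "not": math.log(0.20), "all": math.log(0.15),
    "passed": math.log(0.10), "every": math.log(0.01),
    "student": math.log(0.05),
}
OOV_LOGPROB = math.log(1e-4)  # fallback for words not in the toy table

def surprisal(sentence: str) -> float:
    """Total surprisal in bits: -sum over words of log2 P(word)."""
    words = sentence.lower().rstrip(".").split()
    return -sum(TOY_LOGPROBS.get(w, OOV_LOGPROB) for w in words) / math.log(2)

# Minimal pair: the pragmatically implied reading vs. the literal one.
implied, literal = "Not all passed", "Every student passed"
scores = {s: surprisal(s) for s in (implied, literal)}
# A pragmatically sensitive model should score the implied reading
# as less surprising (lower surprisal) in this context.
```

With a real model, the same comparison would be run over each minimal pair in the benchmark, conditioning on the pair's shared context.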