## Mock calls
You can now mock functions that your chain or agent might be calling:
```yaml
input: I live in London, can I expect rain today?
expected: ["no"]
calls:
  - name: forecast.get_n_day_weather_forecast
    returns: It's sunny in London.
    arguments:
      location: London
      num_days: 1
```
This will replace `get_n_day_weather_forecast` in the `forecast` module with a mocked function that always returns `It's sunny in London.` See `examples/weather_functions` for some examples, or the sketch below.
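For context, here is a minimal sketch of the kind of function being mocked. The `forecast` module and the function signature are assumptions inferred from the test file above, not code from the repository:

```python
# forecast.py -- hypothetical module matching the mock above.
def get_n_day_weather_forecast(location: str, num_days: int) -> str:
    """Fetch a real forecast; during `bench run` the mock replaces this."""
    raise NotImplementedError  # a real implementation would call a weather API
```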
## Embedding Distance
The new `EmbeddingEvaluator` embeds both the model output and the expected values and compares their cosine distance. Currently the threshold is hardcoded to 0.9, but it will be dynamic in the future.
```sh
$ bench run . --evaluator embedding
```
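For intuition, this is roughly what a cosine comparison between two embedding vectors looks like. This is a standalone sketch, not BenchLLM's internal implementation, and it assumes the 0.9 threshold above is applied to the similarity score:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# A prediction passes when its embedding is close enough to an expected
# value's embedding, e.g. cosine_similarity(output, expected) >= 0.9.
```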
## Scoring
Evaluators now return `List[Evaluator.Candidate]` instead of `Optional[Evaluator.Match]`. This lets us inspect the score (for example, the cosine distance) of failed evaluations. Note that this change is incompatible with the old caching format.
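To illustrate why this matters, the shape change corresponds roughly to the following. The field names here are illustrative assumptions, not BenchLLM's actual class definitions:

```python
from dataclasses import dataclass

@dataclass
class Match:  # old: returned only when some expected value passed
    expected: str

@dataclass
class Candidate:  # new: one per expected value, pass or fail
    expected: str
    score: float  # e.g. the cosine distance
    passed: bool

# Old API: Optional[Match] -- a failed evaluation carried no information.
# New API: List[Candidate] -- failed evaluations still expose their scores.
```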
## Multiple test functions in the same file
You can now have multiple `@benchllm.test` functions in the same Python file, and the function name is now shown in the BenchLLM output.
```python
import benchllm

ChatInput = str  # placeholder: use whatever input type your suite provides

def my_model(input: ChatInput, model: str = "gpt-3.5-turbo"):
    ...  # implementation

@benchllm.test(suite=".")
def gpt_3_5(input: ChatInput):
    return my_model(input)

@benchllm.test(suite=".")
def gpt_4(input: ChatInput):
    return my_model(input, model="gpt-4")
```
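Running the suite from this directory picks up both functions, and each result is labeled `gpt_3_5` or `gpt_4` in the output:

```sh
$ bench run .
```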