In reference to this PR and the mention of integration tests to track updates and performance, I was wondering whether integration tests for this kind of project should look more like benchmarks.
Like, there's always some pass/fail test that could be run to tell us whether something still works. However, with a non-deterministic system such as an agent, it seems like testing how well the system "works" needs to be a bit broader. Essentially, I'm thinking of a test suite containing a series of pass/fail tests, each designed to test a specific criterion. Then, whenever changes were made to UC (or other agents, for that matter), they could be run against the benchmark to see whether the update improved performance or not.
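To make that concrete, here's a minimal sketch of what I mean by a criteria-based suite. None of this exists in UC; `agent.run()` is just a stand-in for however the agent actually gets invoked:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str                     # e.g. "tool_selection"
    task: str                     # the prompt/task handed to the agent
    check: Callable[[str], bool]  # judges one run's output as pass/fail

def run_benchmark(agent, criteria: list[Criterion], runs: int = 5) -> dict[str, float]:
    """Run every criterion several times (the agent is non-deterministic)
    and report the pass rate per criterion as a score from 0.0 to 1.0."""
    scores = {}
    for c in criteria:
        # agent.run() is a placeholder for whatever UC's entry point ends up being
        passes = sum(c.check(agent.run(c.task)) for _ in range(runs))
        scores[c.name] = passes / runs
    return scores
```

Running the same criteria list against two versions of the agent would give two score dicts that can be diffed to see which criteria improved and which regressed.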
There are definitely benchmarks that exist already, and I'm gonna look around some more. However, I haven't seen very comprehensive ones, and either way UC would need an adapter to actually use an existing one. In any case, I'm thinking I'll see about getting UC hooked up to some benchmarking suite, so that something like an update to the file editing tool shows up as either a performance improvement or a regression.
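If we do adopt an existing suite, I imagine the adapter is mostly translation glue. A rough sketch, where all the UC-side names (`start_session`, `run`, `final_output`) are placeholders for whatever the real entry point turns out to be:

```python
# Hypothetical adapter: lets an external benchmark drive UC through a single
# "here's a task and a workspace, give me back the graded artifact" call.
class UCAdapter:
    def __init__(self, agent):
        self.agent = agent  # placeholder for UC's programmatic entry point

    def solve(self, task_description: str, workspace: str) -> str:
        """Run one benchmark task in the given workspace and return whatever
        the benchmark grades (e.g. a patch, a file tree, or a final message)."""
        session = self.agent.start_session(cwd=workspace)  # placeholder API
        result = session.run(task_description)             # placeholder API
        return result.final_output                         # placeholder API
```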
I think I'm going to try to work out a manual benchmark first, just getting things down conceptually to start. Then, I can see about automating it.
So far I'm thinking about covering these key areas (there's a rough sketch of how they might roll up into a single score after the list):

Tools
- Tool Selection: How well does the agent select which tool to use in a specific situation?
- Tool Use: How well does the agent use a tool for maximum effectiveness?
- Editing: How well does it use editing tools?
- Commands: How well does it execute commands which are fitting for the task?

Reasoning
- Plan Formation: How well does it form specific plans?
- Steps Required: How many steps between task initialization and completion?
- Task Understanding: How well does it understand instructions, versus misinterpreting them?
- Redundancy: How often are actions repeated when they don't need to be?
- Multi-Step Operations: How well are multi-step operations performed?
- Assumptions: Does the model make assumptions about tasks which could lead to accuracy deficits?
  - Assumption Rate: Given a task, how many assumptions are made about the nature of the task?
  - Assumption Accuracy: Given a task, how accurate are the assumptions which are made?

Code
- Quality: How is the code quality based on several metrics?
- Optimization: How well can it optimize existing code?
- Testing: How well does the agent test its code?
  - Test Accuracy: How often do the tests fail when they shouldn't?

Robustness
- Inter-model Ability: How well does the agent perform when using different models?

Memory (if memory gets added)
- Memory Creation: How well does it decide to make memories?
- Memory Recall: How well does it pull old memories accurately into working memory?
- Memory Management: How well is the line between proposed and working memory handled?
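As a sketch of how the areas above could roll up into a single comparable number (the weights and keys here are made up; this is just to show the shape):

```python
# Illustrative rubric: each area has sub-criteria scored 0.0-1.0 (e.g. the pass
# rates from a harness), and a weighted average gives one number per version.
RUBRIC = {
    "tools":     {"tool_selection": 0.3, "tool_use": 0.3, "editing": 0.2, "commands": 0.2},
    "reasoning": {"plan_formation": 0.25, "task_understanding": 0.25, "redundancy": 0.15,
                  "multi_step": 0.20, "assumption_accuracy": 0.15},
    "code":      {"quality": 0.3, "optimization": 0.2, "testing": 0.3, "test_accuracy": 0.2},
}

def overall_score(scores: dict[str, dict[str, float]]) -> float:
    """Weighted average across all sub-criteria; `scores` mirrors RUBRIC's shape."""
    total = sum(w * scores[area][key]
                for area, weights in RUBRIC.items()
                for key, w in weights.items())
    weight_sum = sum(w for weights in RUBRIC.values() for w in weights.values())
    return total / weight_sum
```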
But this is kind of a rough draft. I'm opening this issue as a reference point and the start of a discussion around this topic!