In reference to this PR and the mention of integration tests to track updates and performance, I was wondering whether integration tests for this kind of project should look more like benchmarks.
Like, there's always some pass/fail test that could be run to tell us whether something still works. However, with a non-deterministic system such as an agent, it seems like testing how well the system "works" needs to be a bit broader. Essentially, I'm thinking of a test suite containing a series of pass/fail tests, each designed to test a specific criterion. Then, whenever changes were made to UC (or other agents, for that matter), they could be run against the benchmark to see whether the update improved performance or not.
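To make that concrete, here's a minimal sketch of what I mean by a criteria-based suite. None of this exists in UC; `agent.run()` is just a stand-in for however the agent actually gets invoked:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str                     # e.g. "tool_selection"
    task: str                     # the prompt/task handed to the agent
    check: Callable[[str], bool]  # judges one run's output as pass/fail

def run_benchmark(agent, criteria: list[Criterion], runs: int = 5) -> dict[str, float]:
    """Run every criterion several times (the agent is non-deterministic)
    and report the pass rate per criterion as a score from 0.0 to 1.0."""
    scores = {}
    for c in criteria:
        # agent.run() is a placeholder for whatever UC's entry point ends up being
        passes = sum(c.check(agent.run(c.task)) for _ in range(runs))
        scores[c.name] = passes / runs
    return scores
```

Running the same criteria list against two versions of the agent would give two score dicts that can be diffed to see which criteria improved and which regressed.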
There are definitely benchmarks that exist already, and I'm gonna look around some more. However, I haven't seen very comprehensive ones, and either way UC would need an adapter to actually use an existing one. In any case, I'm thinking I'll see about getting UC hooked up to some benchmarking suite, so that something like an update to the file editing tool shows up as either a performance improvement or a regression.
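If we do adopt an existing suite, I imagine the adapter is mostly translation glue. A rough sketch, where all the UC-side names (`start_session`, `run`, `final_output`) are placeholders for whatever the real entry point turns out to be:

```python
# Hypothetical adapter: lets an external benchmark drive UC through a single
# "here's a task and a workspace, give me back the graded artifact" call.
class UCAdapter:
    def __init__(self, agent):
        self.agent = agent  # placeholder for UC's programmatic entry point

    def solve(self, task_description: str, workspace: str) -> str:
        """Run one benchmark task in the given workspace and return whatever
        the benchmark grades (e.g. a patch, a file tree, or a final message)."""
        session = self.agent.start_session(cwd=workspace)  # placeholder API
        result = session.run(task_description)             # placeholder API
        return result.final_output                         # placeholder API
```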
I think I'm going to try to work out a manual benchmark first, just getting things down conceptually to start. Then, I can see about automating it.
So far I'm thinking about covering these key areas (there's a rough sketch of how they might roll up into a single score after the list):

Tools
- Tool Selection: How well does the agent select which tool to use in a specific situation?
- Tool Use: How well does the agent use a tool for maximum effectiveness?
- Editing: How well does it use editing tools?
- Commands: How well does it execute commands which are fitting for the task?

Reasoning
- Plan Formation: How well does it form specific plans?
- Steps Required: How many steps between task initialization and completion?
- Task Understanding: How well does it understand instructions, versus misinterpreting them?
- Redundancy: How often are actions repeated when they don't need to be?
- Multi-Step Operations: How well are multi-step operations performed?
- Assumptions: Does the model make assumptions about tasks which could lead to accuracy deficits?
  - Assumption Rate: Given a task, how many assumptions are made about the nature of the task?
  - Assumption Accuracy: Given a task, how accurate are the assumptions which are made?

Code
- Quality: How is the code quality based on several metrics?
- Optimization: How well can it optimize existing code?
- Testing: How well does the agent test its code?
  - Test Accuracy: How often do the tests fail when they shouldn't?

Robustness
- Inter-model Ability: How well does the agent perform when using different models?

Memory (if memory gets added)
- Memory Creation: How well does it decide to make memories?
- Memory Recall: How well does it pull old memories accurately into working memory?
- Memory Management: How well is the line between proposed and working memory handled?
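As a sketch of how the areas above could roll up into a single comparable number (the weights and keys here are made up; this is just to show the shape):

```python
# Illustrative rubric: each area has sub-criteria scored 0.0-1.0 (e.g. the pass
# rates from a harness), and a weighted average gives one number per version.
RUBRIC = {
    "tools":     {"tool_selection": 0.3, "tool_use": 0.3, "editing": 0.2, "commands": 0.2},
    "reasoning": {"plan_formation": 0.25, "task_understanding": 0.25, "redundancy": 0.15,
                  "multi_step": 0.20, "assumption_accuracy": 0.15},
    "code":      {"quality": 0.3, "optimization": 0.2, "testing": 0.3, "test_accuracy": 0.2},
}

def overall_score(scores: dict[str, dict[str, float]]) -> float:
    """Weighted average across all sub-criteria; `scores` mirrors RUBRIC's shape."""
    total = sum(w * scores[area][key]
                for area, weights in RUBRIC.items()
                for key, w in weights.items())
    weight_sum = sum(w for weights in RUBRIC.values() for w in weights.values())
    return total / weight_sum
```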
But this is kind of a rough draft. I'm opening this issue as a reference point and the start of a discussion around this topic!