Some models feel competent despite scoring poorly on benchmarks like MMLU, GPQA, MATH, or NIAH.
AidanBench rewards:
- Creativity
- Reliability
- Contextual attention
- Instruction following
AidanBench is weakly correlated with LMSYS Arena scores, has no score ceiling, and aligns with real-world open-ended use.
We give LLMs a set of open-ended questions spanning various domains:
"Provide an explanation for Japan's Lost Decades.",
"How might you use a brick and a blanket?",
"What architectural features might you include in a tasteful house?",
"Propose a solution to Los Angeles traffic.",
"What activities might I include at a party for firefighters?",
"How could we redesign schools to better prepare students for the 22nd century?",
# ... and many more
For each question, we ask the model to generate novel answers while avoiding previous responses. The benchmark continues generating answers until either:
- The answer becomes incoherent (coherence score $C \leq 15/100$)
- The answer is too similar to previous responses (novelty score $N \leq 0.15$)
Given a language model $M$ and a question $q$, we sample answers $r_1, r_2, \ldots$, where each new answer is generated with all previous answers in context and an instruction to avoid them. Each answer $r_i$ receives two scores:

**Coherence Score** $C(r_i) \in [0, 100]$: an LLM judge rates how coherent and well-reasoned the answer is.

**Novelty Score** $N(r_i) = 1 - \max_{j < i} \operatorname{sim}(e_i, e_j)$, where $e_i$ is the embedding of answer $r_i$ and $\operatorname{sim}$ is cosine similarity (the first answer gets $N(r_1) = 1$).

Generation stops as soon as $C(r_i) \leq \tau_c$ or $N(r_i) \leq \tau_n$; a question's score is the number of answers accepted before stopping. The final AidanBench score aggregates performance across all questions by summing these per-question counts.
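To make the loop concrete, here is a minimal Python sketch of scoring a single question. The callables `generate_answer`, `judge_coherence`, and `embed` are placeholders for whatever model, judge, and embedding backend you wire in, not the repository's actual functions; the thresholds use the values above.

```python
import numpy as np

# Stopping thresholds from the definitions above.
COHERENCE_THRESHOLD = 15    # tau_c: stop when C <= 15 (on a 0-100 scale)
NOVELTY_THRESHOLD = 0.15    # tau_n: stop when N <= 0.15


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def score_question(question, generate_answer, judge_coherence, embed):
    """Count how many coherent, novel answers a model produces for one question."""
    answers, embeddings = [], []
    while True:
        # Ask the model for a new answer that avoids all previous ones.
        answer = generate_answer(question, previous_answers=answers)

        # Coherence C in [0, 100], rated by an LLM judge.
        coherence = judge_coherence(question, answer)

        # Novelty N = 1 - max cosine similarity to any previous answer's embedding.
        embedding = embed(answer)
        novelty = 1.0 if not embeddings else 1.0 - max(
            cosine_similarity(embedding, prev) for prev in embeddings
        )

        # Stop once the answer is incoherent or too similar to earlier answers.
        if coherence <= COHERENCE_THRESHOLD or novelty <= NOVELTY_THRESHOLD:
            break

        answers.append(answer)
        embeddings.append(embedding)

    return len(answers)
```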
Here are the latest benchmark results across various models; all models are tested at temperature = 0.7.

To run the benchmark yourself, you will need:
- Python 3.x
- OpenAI API key
- OpenRouter API key
- Clone the repository:

  ```bash
  git clone https://github.com/aidanmclaughlin/Aidan-Bench.git
  cd Aidan-Bench
  ```

- Install required packages:

  ```bash
  pip install numpy openai colorama retry
  ```

- Set up environment variables:

  ```bash
  export OPENAI_API_KEY="your-openai-key"
  export OPEN_ROUTER_KEY="your-openrouter-key"
  ```
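As a quick sanity check that both keys are visible from Python, something like the sketch below works; OpenRouter exposes an OpenAI-compatible API, so the same client class can point at its endpoint. The variable names match the exports above, but this is only an illustration, not necessarily how `main.py` builds its clients.

```python
import os

from openai import OpenAI

# Standard OpenAI client, keyed by OPENAI_API_KEY.
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# OpenRouter speaks the OpenAI-compatible API at this base URL.
openrouter_client = OpenAI(
    api_key=os.environ["OPEN_ROUTER_KEY"],
    base_url="https://openrouter.ai/api/v1",
)
```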
Run the benchmark with:
```bash
python main.py
```
The script will guide you through several choices:
- Select model(s) to benchmark
  - Choose from a list of supported models
  - Option to test multiple models in sequence

- Configure test parameters
  - Threading mode (multi-threaded or single-threaded)
  - Temperature setting (default: 0.7)
  - Number of questions to test
  - Use of an LLM judge for similarity scoring

- Configure thresholds. The benchmark uses three key thresholds:
  - Coherence threshold $\tau_c$ controls minimum answer quality
  - Embedding threshold $\tau_n$ prevents semantic redundancy
  - LLM similarity threshold (optional) provides additional diversity checking
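For illustration, here is a rough sketch of how those three knobs could fit together. The `Thresholds` dataclass and `should_stop` helper are hypothetical, the defaults come from the stopping criteria above, and the direction of the optional LLM-similarity check is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    coherence: float = 15.0          # tau_c: minimum acceptable coherence (0-100 scale)
    novelty: float = 0.15            # tau_n: minimum embedding-based novelty
    llm_similarity: Optional[float] = None  # optional LLM-judged similarity cutoff


def should_stop(coherence: float, novelty: float,
                llm_similarity: Optional[float], t: Thresholds) -> bool:
    """Return True once any configured threshold says the latest answer is no good."""
    if coherence <= t.coherence or novelty <= t.novelty:
        return True
    # Optional extra check: an LLM judge flags answers that are too similar.
    if t.llm_similarity is not None and llm_similarity is not None:
        return llm_similarity >= t.llm_similarity
    return False
```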
Results will be saved to `results.json`.
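If you want to poke at the raw output programmatically, a minimal sketch follows; the exact schema of `results.json` isn't documented here, so it only inspects the top level.

```python
import json

with open("results.json") as f:
    results = json.load(f)

# Peek at the structure without assuming a particular schema.
if isinstance(results, dict):
    print("Top-level keys:", list(results))
else:
    print(f"Top-level: {type(results).__name__} with {len(results)} entries")
```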
After running the benchmark, you can visualize results using the included visualization tool:
```bash
cd visualize_results
python -m http.server 8000
```

Then open http://localhost:8000/visualization in your browser to explore the results interactively.