This document describes a project that evaluates the capabilities of large language models (LLMs) using clembench. The chosen game is Ludo, a classic turn-based board game where players strategically move tokens across a board.
Traditional LLM evaluation relies on static datasets and tasks focused on language understanding. However, as LLMs evolve, there's a need to assess them in more dynamic, goal-oriented environments. This work introduces Ludo as a testbed for evaluating LLMs in a situated, strategic setting.
The Ludo adaptation features both single-player and multiplayer modes, allowing testing with varying numbers of tokens controlled by the model. Here's a breakdown of the key aspects:
- Board: A 1x23 ASCII board with movement only allowed from left to right.
- Tokens: Each player controls up to 4 tokens.
- Goal: Navigate all tokens successfully across the board in the fewest turns possible.
- Evaluation: The benchmark assesses decision-making, spatial reasoning, and the ability to follow game rules.
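The board setup above can be illustrated with a minimal sketch. This is not the project's implementation: the `_` cell symbol, the token labels, the convention that square 0 means "not yet on the board", and the rule that overshooting the last square is invalid are all assumptions made for illustration.

```python
from typing import Dict

BOARD_SIZE = 23  # the 1x23 board described above


def render_board(tokens: Dict[str, int]) -> str:
    """Render the board as one line of ASCII text.

    `tokens` maps a token label (e.g. "X") to its 1-based square.
    Square 0 means the token has not entered the board (assumption).
    """
    cells = ["_"] * BOARD_SIZE  # "_" as the empty-cell symbol is an assumption
    for label, square in tokens.items():
        if 1 <= square <= BOARD_SIZE:
            cells[square - 1] = label
    return " ".join(cells)


def move_token(tokens: Dict[str, int], label: str, roll: int) -> Dict[str, int]:
    """Return a new mapping with `label` advanced `roll` squares to the right.

    Raises ValueError for moves this sketch treats as invalid.
    """
    if label not in tokens:
        raise ValueError(f"unknown token: {label}")
    if roll < 1:
        raise ValueError("movement is only allowed from left to right")
    target = tokens[label] + roll
    if target > BOARD_SIZE:
        # Assumed rule: a move past the last square is invalid.
        raise ValueError("move overshoots the board")
    updated = dict(tokens)
    updated[label] = target
    return updated
```

Calling `render_board({"X": 1, "Y": 4})` yields a single line of 23 space-separated cells with `X` in the first cell and `Y` in the fourth.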
The evaluation involved eight LLMs and explored the impact of various factors on their performance:
- Chain of Thought (CoT) prompting: Guides the model's thought process by providing a structured approach.
- Reprompting: Allows the model to receive additional prompts after making an invalid move.
- Board representation: Text-based vs. no board representation.
- Single vs. multi-token control: Tests the model's ability to manage multiple tokens.
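Reprompting, as described above, can be sketched as a small retry loop. The function names, the wording of the retry message, and the default of one reprompt are hypothetical. `ask_model` and `is_valid` are stand-ins supplied by the caller, not part of clembench.

```python
from typing import Callable, Optional


def play_turn(
    ask_model: Callable[[str], str],
    is_valid: Callable[[str], bool],
    prompt: str,
    max_reprompts: int = 1,
) -> Optional[str]:
    """Ask for a move and reprompt after each invalid reply.

    Returns the first valid reply, or None if every attempt was invalid.
    """
    message = prompt
    for _ in range(max_reprompts + 1):
        reply = ask_model(message)
        if is_valid(reply):
            return reply
        # Tell the model its reply was rejected, then repeat the original prompt.
        message = f"Your last move was invalid: {reply!r}. {prompt}"
    return None
```

With `max_reprompts=0` the loop reduces to the no-reprompting condition: a single invalid reply ends the turn.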
The main findings were:
- Overall Performance: The game proved challenging for LLMs, with an average abort rate of 83.1%.
- Single Token vs. Multitoken: Models performed significantly better when controlling a single token.
- Chain of Thought: CoT prompting generally improved performance, especially in multitoken scenarios.
- Reprompting: Reprompting had a slight positive impact overall and seemed more effective in multitoken games.
- Parsing Errors: Most errors were parsing errors, which suggests that the models had difficulty following the game instructions.
- Board Representation: Providing a board representation might help reduce parsing errors.
These findings highlight the challenges LLMs face in complex, interactive environments and the importance of strategic reasoning, situational understanding, and efficient information processing.
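Two of the quantities in these findings, parsing errors and the abort rate, can be made concrete with a short sketch. The `MOVE: X -> 5` reply format is a hypothetical example, not the format used in the actual game prompts, and the abort rate is computed here simply as the percentage of episodes flagged as aborted.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical reply format: "MOVE: <token label> -> <target square>"
MOVE_PATTERN = re.compile(r"^MOVE:\s*([A-Z])\s*->\s*(\d+)$")


def parse_move(reply: str) -> Optional[Tuple[str, int]]:
    """Return (token, target square), or None if the reply is a parsing error."""
    match = MOVE_PATTERN.match(reply.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def abort_rate(episode_aborted: List[bool]) -> float:
    """Percentage of episodes that were aborted."""
    if not episode_aborted:
        raise ValueError("no episodes to evaluate")
    return 100.0 * sum(episode_aborted) / len(episode_aborted)
```

A strict pattern like this one rejects free-form answers such as "I will move X to square 5", which is one way a reply can count as a parsing error even when the intended move is legal.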
The authors suggest further studies on:
- Prompt design: Optimizing prompts to provide clearer instructions and guidance.
- Action space exploration: Investigating strategies for LLMs to explore different game actions.
- Multimodal learning: Combining textual information with visual representations (e.g., board image) for improved understanding.
clembench: A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents
UPDATE (16.02.24): We released v0.3 of the benchmark code. The main branch will continue as v1.0-beta, which has changes that affect the game code. Follow this guide to update your game.
The cLLM (chat-optimized Large Language Model, "clem") framework tests such models' ability to engage in games – rule-constituted activities played using language. The framework is a systematic way of probing the situated language understanding of language-using agents.
This repository contains the code for setting up the framework and implements a number of games that are further discussed in:
Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents (arXiv:2305.13455). arXiv. https://doi.org/10.48550/arXiv.2305.13455
Evaluation results are available on the main project website, under leaderboard.
- A Simple Word Game: taboo
- A Word-Guessing Game Based on Clues: wordle
- Drawing Instruction Giving and Following: image
- An ASCII Picture Reference Game: reference
- Scorekeeping: private and shared
This repository is tested on Python 3.8+.
We welcome you to contribute to or extend the benchmark with your own games and models. Please simply open a pull request. You can find more information on how to use the benchmark in the links below.