
Ludo Benchmark for Evaluating Chat-Optimized Language Models using the clembench Framework.

This repository contains a project that evaluates the capabilities of large language models (LLMs) using clembench. The chosen game is Ludo, a classic turn-based board game in which players strategically move tokens across a board.

Introduction

Traditional LLM evaluation relies on static datasets and tasks focused on language understanding. However, as LLMs evolve, there's a need to assess them in more dynamic, goal-oriented environments. This work introduces Ludo as a testbed for evaluating LLMs in a situated, strategic setting.

Ludo Benchmark Design

The Ludo adaptation features both single-player and multiplayer modes, allowing testing with varying numbers of tokens controlled by the model. Here's a breakdown of the key aspects:

  • Board: A 1x23 ASCII board with movement only allowed from left to right (see the rendering sketch after this list).
  • Tokens: Each player controls up to 4 tokens.
  • Goal: Navigate all tokens successfully across the board in the fewest turns possible.
  • Evaluation: The benchmark assesses decision-making, spatial reasoning, and the ability to follow game rules.
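For illustration, here is a minimal sketch of how such a 1x23 board could be rendered as text. This is not the repository's actual implementation; the field glyph, token labels, and positioning convention are assumptions.

```python
from typing import Dict

BOARD_LENGTH = 23  # the benchmark uses a 1x23 board


def render_board(token_positions: Dict[str, int]) -> str:
    """Render the board as a single left-to-right row of fields.

    token_positions maps a token label (e.g. "X", "Y") to a 1-based field
    index; tokens not yet on the board are simply omitted.
    """
    fields = ["□"] * BOARD_LENGTH            # empty fields (glyph is an assumption)
    for token, pos in token_positions.items():
        if 1 <= pos <= BOARD_LENGTH:
            fields[pos - 1] = token           # place the token on its field
    return " ".join(fields)


# Example: token "X" on field 5, token "Y" on field 12.
print(render_board({"X": 5, "Y": 12}))
```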

Evaluation Methodology

The evaluation involved eight LLMs and explored the impact of various factors on their performance:

  • Chain of Thought (CoT) prompting: Guides the model's thought process by providing a structured approach.
  • Reprompting: Allows the model to receive additional prompts after making an invalid move (see the sketch after this list).
  • Board representation: Text-based vs. no board representation.
  • Single vs. multi-token control: Tests the model's ability to manage multiple tokens.
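As a simple illustration of the reprompting mechanic, the following sketch shows a turn loop in which an invalid move triggers a corrective follow-up prompt before the episode is given up. The function names, answer format, message wording, and reprompt limit are assumptions for illustration, not clembench's actual API.

```python
import re

MAX_REPROMPTS = 3  # assumed limit; the benchmark's actual setting may differ


def parse_move(response: str):
    """Toy parser: expects a line like 'MY MOVE: X -> 7' (the real answer format is assumed here)."""
    match = re.search(r"MY MOVE:\s*(\w+)\s*->\s*(\d+)", response)
    return None if match is None else (match.group(1), int(match.group(2)))


def play_turn(generate, is_legal, prompt):
    """Ask the model for a move, reprompting after invalid responses.

    `generate` and `is_legal` are hypothetical callables standing in for the
    model backend and the game's rule check; they are not part of clembench's API.
    """
    for _ in range(1 + MAX_REPROMPTS):
        response = generate(prompt)
        move = parse_move(response)
        if move is not None and is_legal(move):
            return move
        # Reprompt: point out the violation and ask for a corrected move.
        prompt = ("Your last move was invalid. Tokens may only move from left "
                  "to right by exactly the rolled number of fields. Try again.")
    return None  # no valid move produced -> the episode is counted as aborted
```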

Key Findings

  • Overall Performance: The game proved challenging for the LLMs, with an average rate of aborted games of 83.1%.
  • Single-Token vs. Multi-Token: Models performed significantly better when controlling a single token.
  • Chain of Thought: CoT prompting generally improved performance, especially in multi-token scenarios.
  • Reprompting: Reprompting showed only a slight positive impact overall, though it seemed more effective in multi-token games.
  • Parsing Errors: Most errors were parsing errors, indicating difficulty understanding game instructions.
  • Board Representation: Providing a board representation might help reduce parsing errors.

These findings highlight the challenges LLMs face in complex, interactive environments and the importance of strategic reasoning, situational understanding, and efficient information processing.

Future Work

The authors suggest further studies on:

  • Prompt design: Optimizing prompts to provide clearer instructions and guidance.
  • Action space exploration: Investigating strategies for LLMs to explore different game actions.
  • Multimodal learning: Combining textual information with visual representations (e.g., board image) for improved understanding.

clembench: A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents

UPDATE (16.02.24): We released v0.3 of the benchmark code. The main branch will continue as v1.0-beta, which has changes that affect the game code. Follow this guide to update your game.

The cLLM (chat-optimized Large Language Model, "clem") framework tests such models' ability to engage in games – rule-constituted activities played using language. The framework is a systematic way of probing for the situated language understanding of language-using agents.

This repository contains the code for setting up the framework and implements a number of games that are further discussed in

Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents (arXiv:2305.13455). arXiv. https://doi.org/10.48550/arXiv.2305.13455

Evaluation Results

On the main project website, under the leaderboard.

Game details

Using the benchmark

This repository has been tested on Python 3.8+.

We welcome you to contribute to or extend the benchmark with your own games and models. Please simply open a pull request. You can find more information on how to use the benchmark in the links below.
