LLMTestSubjects

Series of experiments to evaluate whether LLMs can be used to replace participants in linguistic experiments.

Pronomina

Based on two papers (Patterson and Schumacher 2021 and Patterson et al. 2022, henceforth ExpB and ExpA respectively), we analyse whether LLMs

  • make judgements about the acceptability of pronoun references that are similar to those of human subjects,
  • prefer the same R-expressions for pronouns as human subjects, and
  • can be used reliably in anaphora resolution.

Step 1: Collect data from the experiments and generate lists for participants/LLMs

The data from the two papers' experiments is available online; references can be found on the linked pages. We received the combined data from the authors in a single file, which we pre-processed with the script 01_CreateParticipantsList.R.

The script generates two data frames (ExpAData, ExpBData) that we use to generate the prompts for the large language models (folder ExperimentParticipantsLists) and exports them to RDS files. These files allow us (and you) to feed the LLMs and to use our scripts (02/03, A/B) to analyse the results without having to rely on the original data.
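For orientation, here is a minimal sketch of loading the exported files in R; the file names are assumptions derived from the data frame names, so check the repository for the actual paths:

```r
# Minimal sketch: load the exported data frames. The file names are
# assumptions based on the data frame names; adjust them to the actual
# paths in this repository.
ExpAData <- readRDS("ExpAData.rds")
ExpBData <- readRDS("ExpBData.rds")

# Inspect the structure before generating prompts.
str(ExpAData)
str(ExpBData)
```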

Step 2: Use LLMs as simulated participants

We used the generated participant lists to build the prompts for the various LLMs (which served as simulated participants).

The notation for LLaMA-based models is as follows:

ModelAbbreviation_Parameters_Temperature_Experiment

The OpenAI GPT notation is similar, but omits the Parameters component.
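For illustration, such an identifier can be split back into its components in base R; the identifier below is hypothetical, not an actual file from this repository:

```r
# Hypothetical example of the naming scheme:
# ModelAbbreviation_Parameters_Temperature_Experiment
run_id <- "ML3_70B_0.7_ExpB"  # illustrative only, not a real file name

parts <- strsplit(run_id, "_", fixed = TRUE)[[1]]
run <- list(
  model       = parts[1],
  parameters  = parts[2],
  temperature = as.numeric(parts[3]),
  experiment  = parts[4]
)
str(run)
```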

The following models were used:

| Round | Abbreviation | Parameters | Model                                     |
|-------|--------------|------------|-------------------------------------------|
| 1     | EGLM         | 7B         | EM German Leo Mistral                     |
| 1     | EMG          | 70B        | EM German 70b v01                         |
| 1     | SKLM         | 7B         | SauerkrautLM HerO                         |
| 1     | GPT4         | NA         | OpenAI GPT4 Turbo                         |
| 2     | ML3          | 8B         | Meta-Llama-3-8B-Instruct                  |
| 2     | ML3          | 70B        | Meta-Llama-3-70B-Instruct                 |
| 2     | DL3          | 8B         | Llama3-DiscoLeo-Instruct 8B (version 0.1) |
| 2     | SK3          | 8B         | Llama-3-SauerkrautLM-8b-Instruct          |
| 2     | SK3          | 70B        | Llama-3-SauerkrautLM-70b-Instruct         |
| 2     | KFK3         | 8B         | Llama-3-KafkaLM-8B-v0.1                   |
| 2     | PHI3         | 8B         | Phi-3-medium-4k-instruct                  |
| 3     | GEMMA2       | 9B         | Google Gemma 2                            |
| 3     | GRAN         | 8B         | IBM Granite 8b                            |
| 3     | GRAN_MoE     | 3B MoE     | IBM Granite 3b a800m MoE                  |
| 3     | MISTRALNEMO  | 12B        | Mistral Nemo                              |
| 3     | ML3.1        | 8B         | Meta Llama 3.1 8B Instruct                |
| 3     | ML3.2        | 3B         | Meta Llama 3.2 3B Instruct                |
| 3     | Ministral    | 8B         | Ministral 8B Instruct 2410                |
| 3     | OC3.6        | 8B         | OpenChat 3.6 8B                           |
| 3     | QWEN2.5      | 14B        | Qwen 2.5 14B Instruct                     |
| 3     | SKGEMMA2     | 9B         | Sauerkraut Gemma 2                        |
| 3     | SKv2         | 14B        | Sauerkraut v2 14B DPO                     |

Differences between Round 1 and Round 2

Since Round 1 was run prior to the release of Meta Llama 3, there are some differences between the two rounds:

  • Round 1 experiments were run using different quantizations (Q4 and Q5), while Round 2 experiments were run using Q6.
  • Round 2 models had some issues adhering to the Experiment B question, so we had to use a 1-shot prompt (see the sketch below).
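A 1-shot prompt prepends a single worked example to the actual test item. The sketch below illustrates the idea only; the instruction wording, the rating scale, and the demonstration item are placeholders, not the prompt actually used in our runs:

```r
# Sketch of a 1-shot prompt: one worked example precedes the test item.
# Instruction text, scale, and example are illustrative placeholders.
one_shot_prompt <- function(item) {
  paste0(
    "Rate the acceptability of the pronoun reference on a scale from 1 to 7.\n\n",
    "Example:\n",
    "Sentence: <demonstration sentence with a resolved pronoun>\n",
    "Answer: 5\n\n",
    "Sentence: ", item, "\n",
    "Answer:"
  )
}

cat(one_shot_prompt("<test sentence from the participant list>"))
```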

Step 3: Read out the answers given by the LLMs and aggregate them with the experiment data

Since ExpA (completion) examined ditransitive verbs (ExpA1) and benefactive verbs (ExpA2), we generated two new data frames for ExpA (ExpA1DataAnswers and ExpA2DataAnswers). Data from ExpB and the rating answers from the LLMs were collected in a single data frame (ExpBDataAnswers). See 02A_CollectCompletionAnswers.R and 02B_CollectRatingAnswers.R.
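Conceptually, the collection step joins each LLM answer onto the corresponding experimental item. Here is a minimal base-R sketch, assuming an ItemID key and that ExpBData from Step 1 is loaded; the column names are assumptions, and the actual logic lives in the two scripts above:

```r
# Sketch of the aggregation step: join LLM answers onto the experiment
# data by item. The column names ("ItemID", "Model", "Answer") are
# assumptions; see 02A/02B for the actual merge logic.
llm_answers <- data.frame(
  ItemID = c(1, 2),
  Model  = "ML3_70B",
  Answer = c(4, 6)
)

ExpBDataAnswers <- merge(ExpBData, llm_answers, by = "ItemID", all.x = TRUE)
head(ExpBDataAnswers)
```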

Step 4: Analyse the behaviour of the LLMs compared to that of the participants in the original experiments

See 03A_AnalyseCompletionAnswers.R and 03B_AnalyseRatingAnswers.R.
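One natural comparison is to correlate per-item mean ratings from an LLM with the per-item mean ratings of the human participants. The following is a sketch under assumed column names; the actual analyses are in the scripts above:

```r
# Sketch of one possible comparison: Spearman correlation between
# per-item mean human ratings and per-item mean LLM ratings.
# Column names ("ItemID", "HumanRating", "Answer") are assumptions.
item_means <- aggregate(
  cbind(HumanRating, Answer) ~ ItemID,
  data = ExpBDataAnswers,
  FUN  = mean
)

cor.test(item_means$HumanRating, item_means$Answer, method = "spearman")
```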
