📄 Dutch README | 🤖️ GEITje-chat-v2 demo️
GEITje is a large open Dutch language model with 7 billion parameters, based on Mistral 7B. It has been further trained on 10 billion tokens of Dutch text. This has improved its Dutch language skills and increased its knowledge of Dutch topics.
Update 18 December 2023: Release of GEITje-7B-chat-v2, trained on a lot more translated chat conversations.
Update 4 February 2024: Bram Vanroy has made GEITje-7B-ultra: a superior chatbot, trained on more chat data using DPO.
GEITje is open-source under the Apache 2.0 license. This means that, unlike ChatGPT, for example, you can run GEITje yourself, on your own infrastructure and with any (confidential) data you want. You can also modify or further train the code or the model.
GEITje is a hobby project by Edwin Rijgersberg. Have you made something cool with GEITje? I'd love to hear about it! Send me an email or a message on Twitter or Mastodon. Or open an issue here on GitHub, of course.
More background on the development of GEITje can be found on my blog: GoingDutch.ai.
GEITje is based on Mistral 7B. It's a large open language model with 7 billion parameters, trained by Mistral AI. According to Mistral AI, the 7B model performs better than Llama 2 13B on all (English-language) benchmarks they tested it on. Mistral 7B has been released under the Apache 2.0 open source license.
GEITje was created by further training Mistral 7B on no less than 10 billion tokens of Dutch text from the Dutch Gigacorpus and the MADLAD-400 web crawling corpus. It is a so-called full-parameter finetune: performed on all parameters. It is not a PEFT or LoRA finetune. Like Mistral, GEITje has a context length of 8,192 tokens.
As a demonstration of GEITje's capabilities for chat applications, two initial chat variants of GEITje have also been finetuned: GEITje-chat and GEITje-chat-v2. They can follow instructions, answer questions, and hold dialogues on a variety of topics. GEITje-ultra is a more advanced chatbot, trained on more data and optimized for dialogues with Direct Preference Optimization.
| Model | Parameters | Type | Link to 🤗 Hugging Face Models | Based on |
|---|---|---|---|---|
| GEITje | 7B | foundation | GEITje-7B | Mistral-7B-v0.1 |
| GEITje-chat | 7B | chat SFT | GEITje-7B-chat (gguf, gptq, awq) | GEITje-7B |
| GEITje-chat-v2 | 7B | chat SFT | GEITje-7B-chat-v2 (gguf) | GEITje-7B |
| GEITje-ultra\* (\*contributed by Bram Vanroy) | 7B | chat SFT + DPO | BramVanroy/GEITje-7B-ultra (gguf) | GEITje-7B |
Chat with GEITje-chat-v2 in the demo on 🤗 Hugging Face Spaces.
GEITje is best used with 🤗 Hugging Face Transformers.
```python
from transformers import pipeline, Conversation

chatbot = pipeline(task='conversational', model='Rijgersberg/GEITje-7B-chat-v2',
                   device_map='auto')

print(chatbot(
    Conversation('Welk woord hoort er niet in dit rijtje thuis: "auto, vliegtuig, geitje, bus"?')
))
# Conversation id: 602cfe35-614d-4df1-bdb5-2e29038f1d04
# user: Welk woord hoort er niet in dit rijtje thuis: "auto, vliegtuig, geitje, bus"?
# assistant: "Geitje" is het woord dat niet in dit rijtje thuishoort. Het rijtje bestaat uit allemaal vervoersmiddelen.
```
Or, if you prefer more control:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model_name = 'Rijgersberg/GEITje-7B-chat-v2'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                             low_cpu_mem_usage=True, use_flash_attention_2=True,
                                             device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate(conversation, temperature=0.2, top_k=50, max_new_tokens=1_000):
    tokenized = tokenizer.apply_chat_template(conversation, add_generation_prompt=True,
                                              return_tensors='pt').to(device)
    outputs = model.generate(tokenized, do_sample=True, temperature=temperature,
                             top_k=top_k, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

conversation = [
    {
        'role': 'user',
        'content': 'Welk woord hoort er niet in dit rijtje thuis: "auto, vliegtuig, geitje, bus"?'
    }
]
print(generate(conversation))
# <|user|>
# Welk woord hoort er niet in dit rijtje thuis: "auto, vliegtuig, geitje, bus"?
# <|assistant|>
# Het woord dat niet op zijn plaats staat is 'geit'. Een geit zou niet tussen een lijst van vervoersmiddelen moeten staan. Het past beter bij een boerderijthema of dierenlijst.
```
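To hold a multi-turn dialogue, you can extend the `conversation` list with the assistant's reply and a follow-up user message, then call `generate` again. A small usage sketch, assuming the helper defined above (the assistant content is illustrative; normally you would extract it from the generated output):

```python
# Continue the dialogue: append the previous answer and a follow-up question.
conversation += [
    {'role': 'assistant',
     'content': '"Geitje" is het woord dat niet in dit rijtje thuishoort.'},
    {'role': 'user',
     'content': 'Kun je uitleggen waarom?'},
]
print(generate(conversation))
```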
You can also use GEITje with LM Studio.

- Use the built-in search to find a model, for example `Rijgersberg/GEITje-7B-chat-v2-gguf`.
- Use the Zephyr preset for the correct settings.
- Set the `temperature` to approximately 0.2 for the best experience.
GEITje also works with Ollama.

- Download a gguf variant of GEITje, for example GEITje-7B-chat-v2.gguf.
- Copy the Modelfile from this repo.
- Create an Ollama model: `ollama create GEITje-7B-chat-v2 -f Modelfile`.
- Run the model in Ollama:

```shell
$ ollama run GEITje-7B-chat-v2
>>> Vraagje: welk woord hoort er niet in dit rijtje thuis: "auto, vliegtuig, geit, bus"?
Geit hoort niet in het rijtje thuis. De andere drie zijn voertuigen.
```
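If you prefer to call the model from code instead of the interactive prompt, Ollama also exposes a local REST API. A minimal sketch, assuming Ollama is running on its default port (11434) and that you created the model as above:

```python
# Minimal sketch: query the locally running Ollama model over its REST API.
import requests

response = requests.post(
    'http://localhost:11434/api/chat',
    json={
        'model': 'GEITje-7B-chat-v2',
        'messages': [
            {'role': 'user',
             'content': 'Welk woord hoort er niet in dit rijtje thuis: "auto, vliegtuig, geit, bus"?'},
        ],
        'stream': False,  # return a single JSON response instead of a stream
    },
)
print(response.json()['message']['content'])
```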
GEITje is a foundation model. It is trained to complete texts and is not optimized for dialogue applications.
To apply GEITje yourself, you can fine-tune it on your own dataset. If you deploy it in production, make sure you have sufficient guardrails in your training set to safely apply the model.
The GEITje-chat dialogue dataset also contains some examples of chat conversations where the assistant refuses to respond, but the model has not undergone advanced alignment. Therefore, it is possible that it generates problematic output, especially if it is prompted to do so.
Also note that while Mistral 7B as a model is open source, Mistral AI has not disclosed which data it was trained on. It is therefore also unknown whether undesirable material was included. The training data for GEITje and GEITje-chat, on the other hand, are transparent; see the following sections.
The evaluation of Dutch language models is still in its infancy, but significant progress has been made recently.
Want to contribute? Make a PR or open an issue here on GitHub!
Measured perplexity on yhavinga/mc4_nl_cleaned, the validation split of the tiny subset. Reproducible with `eval.py`; a simplified sketch of such a perplexity computation follows the table below.
| Model | Parameters | Perplexity (lower is better) |
|---|---|---|
| GEITje | 7B | 4.70 |
| Mistral | 7B | 7.99 |
| LLaMA 2 | 7B | 8.91 |
| LLaMA 2 | 13B | 7.87 |
| LLaMA 2 | 70B (8-bit) | 6.44 |
| BramVanroy/llama2-13b-ft-mc4_nl_cleaned_tiny | 13B | 6.62 |
| Falcon | 7B | 25.13 |
| Falcon | 40B (8-bit) | 6.70 |
| BramVanroy/falcon-7b-ft-mc4_nl_cleaned_tiny | 7B | 9.22 |
| BLOOM | 7B | 34.80 |
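For reference, here is a simplified sketch of how such a perplexity figure can be computed with 🤗 Transformers. It is not the actual `eval.py`, which may differ in details such as batching, striding, and the exact subset used:

```python
# Simplified perplexity evaluation sketch (not the actual eval.py).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'Rijgersberg/GEITje-7B'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                             device_map=device).eval()

# Assumed dataset config: the 'tiny' subset of yhavinga/mc4_nl_cleaned, validation split
dataset = load_dataset('yhavinga/mc4_nl_cleaned', 'tiny', split='validation')

nlls, n_tokens = [], 0
for example in dataset.select(range(100)):  # subsample for a quick estimate
    input_ids = tokenizer(example['text'], return_tensors='pt',
                          truncation=True, max_length=8192).input_ids.to(device)
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss  # mean NLL per predicted token
    n = input_ids.numel() - 1  # number of predicted tokens
    nlls.append(loss.float() * n)
    n_tokens += n

print(f'Perplexity: {torch.exp(torch.stack(nlls).sum() / n_tokens).item():.2f}')
```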
Good datasets for evaluating language models containing untranslated, original Dutch data are scarce.
Recently, however, a leaderboard has been initiated for the performance of LLMs in Dutch: The Open Dutch LLM Evaluation Leaderboard. It uses four datasets (automatically translated from English) from the Language Model Evaluation Harness.
I've used it to evaluate these four models:
| Model name | Foundation model | Continued training on | Finetuned on |
|---|---|---|---|
| GEITje-7B | mistral-7b-v0.1 | GigaCorpusNL, MADLAD-400-nl | |
| Mistral-7B-v0.1-chat-nl | mistral-7b-v0.1 | | no_robots_nl, ultrachat_10k_nl |
| GEITje-7B-chat | mistral-7b-v0.1 | GigaCorpusNL, MADLAD-400-nl | no_robots_nl, ultrachat_10k_nl |
| GEITje-7B-chat-v2 | mistral-7b-v0.1 | GigaCorpusNL, MADLAD-400-nl | no_robots_nl, ultrachat_10k_nl, dutch_chat_datasets |
Below is a snapshot as of December 2023. Models trained by me are displayed in italics.
Model | Average | ARC | HellaSwag | MMLU | TruthfulQA |
---|---|---|---|---|---|
zephyr-7b-beta | 0.49 | 0.43 | 0.58 | 0.43 | 0.53 |
mistral-7b-v0.1-chat-nl* | 0.48 | 0.42 | 0.63 | 0.37 | 0.49 |
GEITje-7B-chat | 0.47 | 0.42 | 0.67 | 0.33 | 0.46 |
neural-chat-7b-v3-1 | 0.47 | 0.43 | 0.58 | 0.34 | 0.51 |
GEITje-7B-chat-v2* | 0.46 | 0.42 | 0.65 | 0.33 | 0.45 |
mistral-7b-v0.1 | 0.46 | 0.42 | 0.58 | 0.37 | 0.45 |
orca-2-13b | 0.45 | 0.42 | 0.54 | 0.37 | 0.50 |
GEITje-7B | 0.45 | 0.38 | 0.65 | 0.32 | 0.43 |
llama-2-13b-chat-hf | 0.44 | 0.41 | 0.55 | 0.37 | 0.43 |
llama2-13b-ft-mc4_nl_cleaned_tiny | 0.44 | 0.40 | 0.58 | 0.35 | 0.42 |
llama-2-13b-chat-dutch | 0.43 | 0.38 | 0.56 | 0.35 | 0.44 |
llama-2-13b-hf | 0.43 | 0.38 | 0.57 | 0.36 | 0.41 |
orca-2-7b | 0.41 | 0.37 | 0.49 | 0.33 | 0.45 |
llama-2-7b-chat-hf | 0.41 | 0.36 | 0.49 | 0.33 | 0.44 |
llama-2-7b-hf | 0.40 | 0.36 | 0.51 | 0.32 | 0.41 |
A preliminary conclusion from this could be that further training has especially helped to significantly increase the HellaSwag score for common sense reasoning. However, this apparently did not lead to better results in the other benchmarks.
Note: only after running the evaluations did I realize that the Open Dutch LLM Evaluation Leaderboard evaluates all models in 8-bit mode.
I evaluated my models in `bfloat16`, so the results may not be directly comparable. Those models are marked with an asterisk (*). As soon as the GEITje models are officially included in the leaderboard, I will update the table above with their scores.
Wietse de Vries, Martijn Wieling, and Malvina Nissim from GroNLP have recently compiled the DUMB benchmark (website, paper).
Although it seems designed for masked language models like BERT and RoBERTa, it would be interesting to see how well GEITje performs on it in zero-shot, few-shot, and finetuned settings.
The source code of GEITje is available, and the training dataset is compiled from other sources. The code for compiling the dataset is also available. So, you can reproduce the model yourself.
GEITje was trained on a subset of the Dutch Gigacorpus and MADLAD-400.
| Source | Subset | Tokens in source | Selected tokens | Epochs | Percentage of total |
|---|---|---|---|---|---|
| Gigacorpus | subtitles | 300 M | 100 M | 0.33 | 1 % |
| Gigacorpus | wiki | 375 M | 1,125 M | 3 | 11 % |
| Gigacorpus | | 545 M | 545 M | 1 | 5 % |
| Gigacorpus | recht | 2,300 M | 250 M | 0.11 | 3 % |
| Gigacorpus | books | 11,100 M | 1,800 M | 0.16 | 18 % |
| Gigacorpus | articles | 107 M | 321 M | 3 | 3 % |
| Gigacorpus | fora | 42,500 M | 1,000 M | 0.02 | 10 % |
| Gigacorpus-extra | dbnl | 2,000 M | 100 M | 0.05 | 1 % |
| Gigacorpus-extra | kamerstukken | 2,900 M | 250 M | 0.09 | 3 % |
| MADLAD-400 | nl, clean | 115,000 M | 4,500 M | 0.04 | 45 % |
| Total | | 177,100 M | 9,997 M | 0.06 | 100 % |
Follow this process to reproduce the dataset yourself:

- Download the Gigacorpus torrent from gigacorpus.nl.
- Extract all files.
- Run `gigacorpus2hf.py` to parse the large text files into separate documents in Hugging Face Datasets. Note: this can take up quite a bit of disk space. By default, the datasets are immediately uploaded to the Hugging Face Hub. The files from Gigacorpus-extra are currently not public.
- Run `create_geitje_dataset.py` to compile the training dataset from the Hugging Face Datasets of Gigacorpus and MADLAD-400 (a rough sketch of this mixing step follows this list).
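As an illustration of the final mixing step, here is a rough sketch with 🤗 Datasets. It is not the actual `create_geitje_dataset.py`: the file names, the three-source selection, and the (normalized) sampling ratios are placeholders:

```python
# Rough sketch of mixing sources into one training dataset (not the real script).
from datasets import load_dataset, interleave_datasets

# Placeholder local files, each with a 'text' column per source
wiki = load_dataset('json', data_files='gigacorpus-wiki.jsonl', split='train')
fora = load_dataset('json', data_files='gigacorpus-fora.jsonl', split='train')
madlad_nl = load_dataset('json', data_files='madlad-400-nl-clean.jsonl', split='train')

# Sample the sources roughly in the proportions of the table above (normalized to sum to 1)
mixed = interleave_datasets([wiki, fora, madlad_nl],
                            probabilities=[0.17, 0.15, 0.68],
                            seed=42)

mixed.push_to_hub('my-geitje-pretrain-data')  # hypothetical Hub dataset name
```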
Pretraining code is available in `pretrain.py`. The code is based on Hugging Face Transformers and uses the `Trainer` API. Flash Attention 2 is used for more efficient training on modern GPUs, and Hugging Face Accelerate for multi-GPU support; a simplified sketch of this setup is shown below.
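The sketch below illustrates continued pretraining with the `Trainer` API. It is not the actual `pretrain.py`; the data file, hyperparameters, and output path are placeholders:

```python
# Simplified continued-pretraining sketch (not the actual pretrain.py).
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = 'mistralai/Mistral-7B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16,
                                             use_flash_attention_2=True)

# Placeholder: a JSON Lines file with a 'text' column of Dutch documents
dataset = load_dataset('json', data_files='dutch-corpus.jsonl', split='train')

def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=8192)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir='geitje-pretrain',
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        lr_scheduler_type='cosine',
        warmup_steps=100,
        bf16=True,
        logging_steps=10,
        save_steps=1000,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM labels
)
trainer.train()
```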
First, install the requirements:
$ python3 -m pip install -r requirements.txt
Optionally, log in to the Hugging Face Hub and to Weights & Biases with your API keys:
$ huggingface-cli login
$ wandb login
Start training:
$ python3 pretrain.py             # on a single GPU, or
$ accelerate launch pretrain.py   # on multiple GPUs
For more details about the pretraining, see the report on Weights & Biases, or the loss chart below.
GEITje-chat is a first demo of possible applications of GEITje.
Unfortunately, there is very little example data of Dutch-language chat conversations publicly available. Bram Vanroy has made dutch_chat_datasets publicly available: a dataset of automatically translated question-answer pairs from Dolly, Alpaca, Stack Overflow, and Quora. But I wanted to train with examples of chats with multiple question-answer rounds, to better simulate the use of a chatbot.
Therefore, I had two new chat datasets automatically translated by GPT-3.5. See the scripts in `./data/chat/` for the code.
- no_robots_nl: A translated version of all 10k examples from HuggingFaceH4/no_robots.
- ultrachat_10k_nl: A translated version of 10k randomly selected examples from the 200k examples in HuggingFaceH4/ultrachat_200k.
These two datasets together form the training data for GEITje-chat.
During the finetuning of GEITje-chat, the `SFTTrainer` API from `trl` and NEFTune were applied. Once again, training was done on all parameters; a simplified sketch of this setup is shown below.
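For illustration, here is a simplified sketch of such a chat finetune with `SFTTrainer` and NEFTune. It is not the actual `finetune.py`: the dataset name, column names, and hyperparameters are assumptions:

```python
# Simplified chat-SFT sketch with trl's SFTTrainer and NEFTune (not the actual finetune.py).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = 'Rijgersberg/GEITje-7B'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assumption: a chat dataset with a 'messages' column of {'role', 'content'} dicts
dataset = load_dataset('Rijgersberg/no_robots_nl', split='train')

def to_text(example):
    # Render each conversation with a Zephyr-style template (<|user|> / <|assistant|> turns)
    text = ''
    for message in example['messages']:
        text += f"<|{message['role']}|>\n{message['content']}</s>\n"
    return {'text': text}

dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field='text',
    max_seq_length=8192,
    args=TrainingArguments(
        output_dir='geitje-chat-sft',
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        num_train_epochs=3,
        bf16=True,
        neftune_noise_alpha=5,  # NEFTune: add noise to embeddings during finetuning
    ),
)
trainer.train()
```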
Optionally, log in to the Hugging Face Hub and to Weights & Biases with your API keys:
$ huggingface-cli login
$ wandb login
Start finetuning:
$ python3 finetune.py
GEITje-chat was trained for 3 epochs. To investigate the effect of pretraining, I also subjected the base model Mistral 7B v0.1 to the exact same training. This model is called Mistral-7B-v0.1-chat-nl.
For more details about the finetuning, see the report on Weights & Biases, or the loss chart below.
GEITje-chat-v2 is trained on the same dataset as v1, supplemented with BramVanroy/dutch_chat_datasets.
It has been trained for a single epoch. See the loss chart below.
GEITje was trained in the Lambda Labs Cloud, on an instance with 8x NVIDIA H100 80 GB GPUs. Training took 526 GPU hours, with an estimated energy consumption of 350 kWh. For comparison: training Llama 2 7B from scratch by Meta used 184,320 GPU hours and consumed about 74,000 kWh.
GEITje-chat and GEITje-chat-v2 were both trained in the cloud of RunPod, on an instance with 1x NVIDIA A100 80GB. Training took 10.5 GPU hours each, with an estimated energy consumption of 5 kWh.
Since GEITje is based on Mistral 7B, it also uses the Mistral 7B tokenizer. Be aware that this tokenizer is not optimized for Dutch: your texts may be split into more tokens than you are used to.
Visualizations by Tokenwiz. For more comparisons between tokenizers for Dutch: Yeb Havinga has recently made the Dutch Tokenizer Arena!
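To get a feel for this, you can simply count tokens for a piece of Dutch text. A small sketch (the example sentence and any resulting counts are illustrative):

```python
# Count how many tokens the (Mistral-based) GEITje tokenizer uses for a Dutch sentence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Rijgersberg/GEITje-7B')
text = 'Het Nederlandse taalgebied telt ongeveer 25 miljoen sprekers.'
tokens = tokenizer.tokenize(text)
print(len(tokens), tokens)
```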
A special thanks to Bob Lucassen, without whom GEITje could never have existed. Not only did he compile and publish the Dutch Gigacorpus, but he also actively contributed to processing the corpus and provided additional data. Also check out his Petje af.
While GEITje is one of the first large Dutch language models trained further on a large amount of Dutch text, it was trained on a budget that is minuscule compared to the millions allocated to language models for other languages. I hope that GEITje is a starting point for a series of innovative, open applications from the Dutch-speaking AI community.
TNO, SURF, and NFI have received funding from the Ministry of Economic Affairs and Climate to develop a large Dutch language model in 2024 and 2025: GPT-NL. It is not yet confirmed whether its weights will be publicly available, and if so, under what license. Hopefully, GEITje can be an example of what is possible with open source.
If you are also working with large language models and are planning to take it to a grander scale than GEITje, please contact me!
GEITje-7B, or one of its derivative models, has been used in the following scientific works:
- Vanroy, Bram. "Language resources for Dutch large language modelling." arXiv preprint arXiv:2312.12852 (2023).
- Terryn, Ayla Rigouts, and Miryam de Lhoneux. "Exploratory Study on the Impact of English Bias of Generative Large Language Models in Dutch and French." Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval)@ LREC-COLING 2024. 2024.
- Suijkerbuijk, Michelle, et al. "BLiMP-NL."
- Sinnaeve, Wout, Orphée De Clercq, and Joni Kruijsbergen. "Controlled Text Simplification for Dutch Using Generative Large Language Models."
- Snoeij, C. AI for GovTech - Exploring the use of LLMs for GovTech Benchmark Operationalization. Master Thesis, Delft University of Technology, 2024.
- Bakker, Femke. Assessing Large Language Models for Accurate Long Dutch Document Classification. Master Thesis, University of Amsterdam, 2024.
- Rogiers, Alexander, et al. "KamerRaad: Enhancing Information Retrieval in Belgian National Politics Through Hierarchical Summarization and Conversational Interfaces." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Cham: Springer Nature Switzerland, 2024.
- Weering, Sanne, and Tommaso Caselli. "FC_RUG at CheckThat! 2024: few-shot learning using GEITje for check-worthiness detection in Dutch." 25th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2024. CEUR Workshop Proceedings (CEUR-WS.org), 2024.
- Thakkar, Gaurish, et al. "Building a Large Language Model for Moderately Resourced Language: A Case of Croatian." 35th International Scientific Conference Central European Conference on Information and Intelligent Systems Proceedings. Fakultet organizacije i informatike, 2024.
- Jain, Devansh, et al. "PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models." arXiv preprint arXiv:2405.09373 (2024).
- Redelaar, F., et al. "Attributed Question Answering for the Dutch Law using Retrieval augmented Large language models", The 34th Meeting of Computational Linguistics in The Netherlands, 2024.
- Debaene, F., et al. "Do we still need specialized transformers? A comparison to generative foundation models for non-normative Dutch", The 34th Meeting of Computational Linguistics in The Netherlands, 2024.
- Sinnaeve, W., et al. "Controlled Text Simplification for Dutch using Generative Large Language Models", The 34th Meeting of Computational Linguistics in The Netherlands, 2024.
- Chen, S., et al. "Automated Pass/Fail Classification for Dutch as a Second Language Using Large Language Models", The 34th Meeting of Computational Linguistics in The Netherlands, 2024.
- Grotti, L. "Evaluating Grapheme-to-Phoneme Conversion and Syllable Counting of Large Language Models for Dutch", The 34th Meeting of Computational Linguistics in The Netherlands, 2024.
- Noels, Sander, Jorne De Blaere, and Tijl De Bie. "A Dutch Financial Large Language Model." arXiv preprint arXiv:2410.12835 (2024).
- Vanroy, Bram. "GEITje 7B Ultra: A Conversational Model for Dutch." arXiv preprint arXiv:2412.04092 (2024).
If you use GEITje, you can use the following reference:

```bibtex
@misc{rijgersberg2023geitje,
  title = {GEITje: een groot open Nederlands taalmodel},
  shorttitle = {GEITje},
  author = {Rijgersberg, Edwin and Lucassen, Bob},
  year = {2023},
  month = dec,
  url = {https://github.com/Rijgersberg/GEITje}
}
```
This README was translated from the original Dutch with the help of GPT-4.