Configure a model
The parameters for training a model are stored in the experiment folder in a file named 'config.yml'. The file uses the YAML format. Related settings are grouped together in sections.
These are the sections of a config file.
data:
eval:
infer:
model:
params:
train:
It is not necessary to specify options for all of these sections for every training run. Only sections with parameters that differ from the default values need to be specified. See Parameter Definitions for a full list of supported parameters and their definitions.
A minimal config.yml file looks like this:
data:
  corpus_pairs:
    - type: train,val,test
      src: src-text
      trg: trg-text
  share_vocab: false
  src_vocab_size: 24000
  trg_vocab_size: 32000
model: facebook/nllb-200-distilled-600M
This minimal config file gives the system the following instructions. Train a model to translate between the src and trg languages. Split the texts into three parts: one for training, one for validation and one for testing. Use the default sizes for the validation and test sets and all the remaining data for training. Create a separate vocab file for the source and target languages. Instruct SentencePiece to create a source vocab of 24000 tokens and a target vocab of 32000 tokens. Use the defaults for all the other settings, including the default model architecture and default early stopping conditions.
Another way to learn how to configure training is by examining the effective config file that is produced when an experiment is run.
The parallel texts available for low-resource languages are translations of Scripture that are aligned by verse reference.
When the aligned Scripture files are used as a corpus pair it is possible to select parts of the data for training and testing without having to split the text files prior to training.
We have added a corpus_books config option for this function. There is also a similar option, test_books, to specify which books to include in the test set. Another option in the terms section, filter_books, specifies which books to include for key terms. It has the same syntax at the book level (chapters cannot be specified).
The example below shows the corpus_pairs section for restricting the entire model to only the data in the New Testament. The training, validation and test sets are all drawn only from that data.
corpus_pairs:
  - type: train,test,val
    corpus_books: NT
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_size: 250
The following example shows how to specify a corpus_pairs entry that uses the New Testament, Genesis and Psalms for the training and validation sets. It also shows how to restrict the test set to verses from the book of Exodus.
corpus_pairs:
  - type: train,val,test
    corpus_books: NT,GEN,PSA
    src: src-bible
    trg: trg-bible
    val_size: 250
    test_books: EXO
    test_size: 250
seed: 111
In this example the book of Exodus is reserved for the test set and the remaining books of the Bible are available for training and validation. The test_books parameter excludes the books listed there from appearing in the training or validation sets. So even though only 250 verses of Exodus are used for the test set, none of the remaining verses are included in either the training or validation sets. Therefore the test_books parameter may be used to restrict the training to a smaller set of data without having to modify the data files.
No error is raised if you specify a test_size larger than the number of verses in the test_books. In that case all of the verses in the test_books will be used as the test set.
model: SILTransformerBase
data:
  corpus_pairs:
    - type: train,val,test
      src: src-bible
      trg: trg-bible
      val_size: 250
      test_books: EXO
      test_size: 250
Alternative syntax for corpus_books, test_books, and filter_books to use chapter specification, book ranges, and subtraction.
In addition to using comma-separated lists to specify the books used for training and testing, it is also possible to specify data at the chapter level, with book ranges, and with subtraction. To do this, use a semicolon-separated list, where each section has one of the following formats:
- A comma-separated list of chapters and chapter ranges for a specific book, e.g. MAT1,2,6-10. filter_books does not allow chapter specification.
- A range of books, e.g. GEN-DEU
- A single book or testament, e.g. MAT, OT
- To subtract some data from the selection, use one of the above types preceded by -, e.g. -MAT1-4, -GEN-LEV. Sections are evaluated in the order that they appear, so make sure the selection being subtracted has already been added to the data set.
Examples:
GEN;EXO;LEV
OT;MAT-ROM;-ACT4-28
NT;-3JN
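The left-to-right evaluation of this syntax can be sketched in Python. This is a minimal illustration, not the silnlp implementation: the function names and the result shape (book mapped to included and excluded chapters) are assumptions made for the example, only the 66 Old and New Testament codes are handled, and chapter counts are not known to the sketch, so chapters subtracted from a whole book are tracked as exclusions.

```python
OT_BOOKS = ("GEN EXO LEV NUM DEU JOS JDG RUT 1SA 2SA 1KI 2KI 1CH 2CH EZR NEH EST JOB PSA PRO "
            "ECC SNG ISA JER LAM EZK DAN HOS JOL AMO OBA JON MIC NAM HAB ZEP HAG ZEC MAL").split()
NT_BOOKS = ("MAT MRK LUK JHN ACT ROM 1CO 2CO GAL EPH PHP COL 1TH 2TH 1TI 2TI TIT PHM HEB JAS "
            "1PE 2PE 1JN 2JN 3JN JUD REV").split()
BOOKS = OT_BOOKS + NT_BOOKS


def parse_chapters(spec):
    """Expand a chapter list such as '1,2,6-10' into a set of chapter numbers."""
    chapters = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            chapters.update(range(int(first), int(last) + 1))
        else:
            chapters.add(int(part))
    return chapters


def expand_section(section):
    """Map one section (no leading '-') to {book: chapter set, or None for the whole book}."""
    if section == "OT":
        return {book: None for book in OT_BOOKS}
    if section == "NT":
        return {book: None for book in NT_BOOKS}
    first, rest = section[:3], section[3:]
    if first not in BOOKS:
        raise ValueError(f"Unknown book in section: {section!r}")
    if rest.startswith("-") and rest[1:] in BOOKS:  # book range such as GEN-DEU
        start, end = BOOKS.index(first), BOOKS.index(rest[1:])
        return {book: None for book in BOOKS[start:end + 1]}
    if rest == "":
        return {first: None}
    return {first: parse_chapters(rest)}


def select_books(spec):
    """Evaluate sections in order; returns {book: [included or None, excluded]}."""
    selection = {}
    for raw in spec.split(";"):
        raw = raw.strip()
        subtract = raw.startswith("-")
        for book, chapters in expand_section(raw[1:] if subtract else raw).items():
            if subtract:
                if book not in selection:
                    continue  # nothing has been added yet, so nothing to subtract
                included, excluded = selection[book]
                if chapters is None:
                    del selection[book]
                elif included is None:
                    excluded |= chapters  # whole book selected: record exclusions
                else:
                    included -= chapters
                    if not included:
                        del selection[book]
            elif chapters is None or book not in selection:
                selection[book] = [None if chapters is None else set(chapters), set()]
            elif selection[book][0] is None:
                selection[book][1] -= chapters  # re-adding previously excluded chapters
            else:
                selection[book][0] |= chapters
    return selection
```

For example, select_books("OT;MAT-ROM;-ACT4-28") keeps Acts as a whole-book entry with chapters 4 to 28 recorded as excluded, which is why the order of sections matters.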
There are several ways to use more than one source in your experiment data. If you want to use different sources to get data from different parts of a text, you can define multiple corpus pairs. This is useful when a source has incomplete data, or when you want to use different sources for training vs evaluation and testing.
data:
  corpus_pairs:
    - type: train,val,test
      src: src-bible1
      trg: trg-bible
      corpus_books: GEN,EXO
      test_books: LEV
    - type: train,val,test
      src: src-bible2
      trg: trg-bible
      corpus_books: NUM,DEU
      test_books: JOS
If you instead want to use multiple sources but want to select data from the same portion of the texts, you can define a mixed-source corpus pair. This will equally and randomly choose verses from each text without overlap.
data:
  corpus_pairs:
    - mapping: mixed_src
      type: train,val,test
      src:
        - src-bible1
        - src-bible2
      trg: trg-bible
      corpus_books: GEN,EXO
      test_books: LEV
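One way to read "equally and randomly choose verses from each text without overlap" is sketched below. This is an assumption about the behavior, not code taken from silnlp: the function name, the round-robin dealing and the verse reference strings are all hypothetical.

```python
import random


def assign_mixed_sources(verse_refs, sources, seed=111):
    """Assign every verse to exactly one source, with near-equal shares per source."""
    rng = random.Random(seed)
    shuffled = list(verse_refs)
    rng.shuffle(shuffled)
    # Deal the shuffled verses out round-robin so the shares differ by at most one
    # and no verse is taken from more than one source.
    return {ref: sources[i % len(sources)] for i, ref in enumerate(shuffled)}
```

With two sources and ten verses, each source contributes five verses, and repeating the call with the same seed gives the same assignment.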
Additionally, the many_to_many mapping allows you to map multiple sources to multiple targets.
data:
  corpus_pairs:
    - mapping: many_to_many
      type: train,val,test
      src:
        - src-bible1
        - src-bible2
      trg:
        - trg-bible1
        - trg-bible2
      corpus_books: GEN,EXO
      test_books: LEV
Abbreviations for Old Testament Books
GEN EXO LEV NUM DEU JOS JDG RUT 1SA 2SA 1KI 2KI 1CH 2CH EZR NEH EST JOB PSA PRO
ECC SNG ISA JER LAM EZK DAN HOS JOL AMO OBA JON MIC NAM HAB ZEP HAG ZEC MAL
Abbreviations for New Testament Books
MAT MRK LUK JHN ACT ROM 1CO 2CO GAL EPH PHP COL 1TH 2TH 1TI 2TI TIT PHM HEB JAS 1PE 2PE 1JN 2JN 3JN JUD REV
Abbreviations for Deuterocanonical Books
TOB JDT ESG WIS SIR BAR LJE S3Y SUS BEL 1MA 2MA 3MA 4MA 1ES 2ES MAN PS2 ODA PSS JSA JDB TBS SST DNT BLT
3ES EZA 5EZ 6EZ INT CNC GLO TDX NDX DAG PS3 2BA LBA JUB ENO 1MQ 2MQ 3MQ REP 4BA LAO
The seed parameter is used as a seed for a random number generator. The benefit of setting it explicitly is that the same random selection of validation and test set verses is chosen from the available data. Other training runs that use the same seed therefore select the same verses, which makes it possible to compare the effect of changing other parameters against an identical test set. If the seed is not set explicitly, then the contents of the training, validation and test sets will vary between one training run and the next.
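The effect of a fixed seed can be illustrated with a small Python sketch. It assumes a simple shuffle-and-slice split, which is a stand-in chosen for the example and not necessarily how silnlp selects verses; the function name and arguments are hypothetical.

```python
import random


def split_verses(verse_ids, val_size, test_size, seed):
    """Split verse ids into train/val/test using a seeded shuffle."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    shuffled = list(verse_ids)
    rng.shuffle(shuffled)
    test = sorted(shuffled[:test_size])
    val = sorted(shuffled[test_size:test_size + val_size])
    train = sorted(shuffled[test_size + val_size:])
    return train, val, test
```

Calling split_verses twice with the same seed returns identical validation and test sets, so two experiments that differ only in other parameters are scored on the same verses.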
YAML is designed to be easy to read. It is useful to know that there are various ways to specify a list. Inline lists are separated with commas, and square brackets are optional for a simple list. For a list that is too long for a single line, each item can be placed on a separate line preceded by a hyphen and a space.
These are three ways of indicating the same list:
test_books: GEN,EXO,LEV,NUM,DEU
test_books: [GEN,EXO,LEV,NUM,DEU]
test_books:
- GEN
- EXO
- LEV
- NUM
- DEU
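A standard YAML parser reads the unbracketed form as a single string and the other two forms as lists, so the three spellings are only equivalent if the consumer normalizes them. The sketch below shows one way that normalization could look; the function name is hypothetical, it is not taken from silnlp, and it does not handle the semicolon syntax described earlier.

```python
def as_book_list(value):
    """Turn 'GEN,EXO' or ['GEN', 'EXO'] into a clean list of book codes."""
    if isinstance(value, str):
        items = value.split(",")  # unbracketed inline form arrives as one string
    else:
        items = list(value)       # bracketed and hyphenated forms arrive as lists
    cleaned = [str(item).strip() for item in items]
    return [item for item in cleaned if item]
```

All three ways of writing test_books above would then produce the same five-item list.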
The hyphen and space (- ) on the line after the corpus_pairs parameter indicate that these settings are part of a list. In the examples above only one corpus pair is specified. Here is an example of a complete config.yml file, the one we used to train our German to English parent model. There are three corpus pairs, one for each of the training, validation and test sets.
model: SILTransformerBaseAlignmentEnhanced
data:
  terms:
    dictionary: true
  corpus_pairs:
    - type: train
      src: de-WMT2020+Bibles
      trg: en-WMT2020+Bibles
    - type: val
      src: de-newstest2014_ende
      trg: en-newstest2014_ende
    - type: test
      src: de-newstest2017_ende
      trg: en-newstest2017_ende
  seed: 111
  share_vocab: false
  src_casing: lower
  src_vocab_size: 32000
  trg_casing: preserve
  trg_vocab_size: 32000
params:
  coverage_penalty: 0.1
  word_dropout: 0
train:
  keep_checkpoint_max: 5
  max_step: 1000000
  sample_buffer_size: 10000000
eval:
  steps: 10000
  export_on_best: bleu
  early_stopping: null
  export_format: checkpoint
  max_exports_to_keep: 100
The files required for training, validation, and testing will be tokenized using the tokenizer of the specified model and the outputs will be written to the experiment folder. These are named:
train.src.txt
train.src.detok.txt
train.trg.txt
train.trg.detok.txt
train.vref.txt
val.src.txt
val.src.detok.txt
val.trg.txt
val.trg.detok.txt
val.vref.txt
test.src.txt
test.src.detok.txt
test.trg.detok.txt
test.vref.txt
The seed in the config file is used in the selection of verses for each of the training splits, and this behavior is enabled by default for consistent experimentation.
The effective config file is created as soon as the training begins. A good way to learn about all the default parameters is to compare a simple config file like this one to the effective config that it creates. Although there may be more than 100 parameters in the effective config file they all have default values. Typically we've found very few areas where we can get better results by changing a default value. They have been the subject of many experiments and are chosen by the OpenNMT project according to the results of the latest research.
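The relationship between a short config.yml and the effective config can be pictured as a recursive merge of user settings over defaults. The sketch below is only an illustration of that idea: the DEFAULTS dictionary is a small hand-picked subset of the default values listed in the parameter definitions of this document, and the merge function is hypothetical rather than the one silnlp uses.

```python
import copy

# Illustrative subset of defaults; the real effective config contains many more entries.
DEFAULTS = {
    "data": {"seed": 111, "mirror": False, "share_vocab": False},
    "eval": {"eval_steps": 1000, "metric_for_best_model": "bleu"},
    "infer": {"infer_batch_size": 16, "num_beams": 2},
    "train": {"save_steps": 1000, "per_device_train_batch_size": 16},
}


def effective_config(defaults, overrides):
    """Return defaults with user overrides applied section by section."""
    result = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = effective_config(result[key], value)  # merge nested sections
        else:
            result[key] = copy.deepcopy(value)  # scalars and lists replace the default
    return result


user_config = {
    "model": "facebook/nllb-200-distilled-600M",
    "data": {"seed": 42},
    "train": {"save_steps": 100},
}
merged = effective_config(DEFAULTS, user_config)
```

Here merged keeps every default that the user did not mention (for example num_beams) and carries the two overridden values, which mirrors why only non-default parameters need to appear in config.yml.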
Definitions of every configurable experiment parameter and their default values. Information about Hugging Face parameters can be found here. Selected HF parameters are defined below for convenience, and default values are only given if they are explicitly defined in silnlp.
data:
- add_new_lang_code=True: Add any language codes in language_codes to the tokenizer if they do not already exist.
- aligner="fast_align": Aligner to use.
- corpus_pairs:
  - augment=[]: List of data augmentation methods and their arguments to apply to the data. See example below.
      augment:
        - subword:
          - encodings: 2
  - corpus_books=[]: Books to be included in the dataset. See Selection of books or chapters for training on Scripture data.
  - disjoint_test=True: Use the same test set across all source-target pairs in the corpus pair to ensure no overlap between any train and test sets.
  - disjoint_val=True: Use the same evaluation set across all source-target pairs in the corpus pair to ensure no overlap between any train and evaluation sets.
  - lexical=False: Whether data is made up of lexical items rather than sentences.
  - mapping="one_to_one": How to map sources to targets. Options are one_to_one, mixed_src, or many_to_many. See Using Multiple Sources.
  - score_threshold=0.0: If <1, it is the minimum alignment score sentence pairs must have to be included in the training data. If >=1, that number of training sentence pairs with the lowest alignment scores will be filtered out of the training data.
  - size=1.0: Size of training split. If size is a float between 0 and 1, it will be interpreted as a ratio of the total size, otherwise if it is >1 or an integer, it will be interpreted as an absolute size.
  - src: Required argument. List of sources. Sources can be a mix of strings and dictionaries. Passing a dictionary allows the user to configure the source file object. See the DataFile class for a list of the configurable properties. Targets (see trg) can also be defined in this way. See example below.
      src:
        - name: aaa-SRC_BT
          include_test: false
        - aaa-SRC
  - src_noise=[]: List of noise-adding methods and their arguments to apply to source sentences. See example below.
      src_noise:
        - dropout: .1
        - replacement: [.1, <blank>]
        - permutation: 2
  - tags=[]: Tags to prefix to each source sentence.
  - test_books=[]: Books to be included in the test set. See Selection of books or chapters for training on Scripture data.
  - test_size=250: Size of test split. If test_size is a float between 0 and 1, it will be interpreted as a ratio of the total size, otherwise if it is >1 or an integer, it will be interpreted as an absolute size.
  - trg: Required argument. List of targets. See src.
  - type="train,test,val": What the data in the corpus pair will be used for. Possible values are any combination of train, test, and val.
  - use_test_set_from="": Use the set of verses in the given experiment's test set for this experiment.
  - val_size=250: Size of evaluation split. If val_size is a float between 0 and 1, it will be interpreted as a ratio of the total size, otherwise if it is >1 or an integer, it will be interpreted as an absolute size.
- lang_codes: Mapping of ISO language codes to their NLLB equivalents for each language included in the data. See example below.
    lang_codes:
      en: eng_Latn
      npi: npi_Deva
- mirror=False: Add mirrored data to the dataset, where the source and target are flipped.
- seed=111: Seed for random verse selection. See A note about the seed parameter.
- share_vocab=False: Use the same vocab file for the source and target languages.
- stats_max_size=100000: Maximum number of sentence pairs allowed for a stats file to be generated.
- terms:
  - categories="PN": Which categories of key terms to include.
  - dictionary=False: Write dictionary with key terms.
  - include_glosses=True: Include glosses of key terms. Can also be set to the ISO language code of the gloss to include. The lang_codes parameter must include this ISO to NLLB mapping for accurate results.
  - train=True: Train on key terms data.
  - filter_books=[]: Which books of key terms to include. See Selection of books or chapters for training on Scripture data.
- tokenize=True: Tokenize data.
- tokenizer:
  - update_src=False: Update the tokenizer for the source language.
  - update_trg=False: Update the tokenizer for the target language.
  - trained_tokens=False: If True, train a new tokenizer on the source and/or target (specified by the update_src and update_trg parameters) to obtain trained tokens tailored to the source and/or target. All of the resulting tokens that are not present in the existing tokenizer are then added to the existing tokenizer. If False, only individual characters that are present in the source and/or target text and not present in the existing tokenizer will be added to the existing tokenizer, rather than trained tokens.
  - src_vocab_size: Only applicable if update_src and trained_tokens are True. This sets the vocab size for the new tokenizer for the source side. There is no default value, so it must be explicitly specified when update_src and trained_tokens are True.
  - trg_vocab_size: Only applicable if update_trg and trained_tokens are True. This sets the vocab size for the new tokenizer for the target side. There is no default value, so it must be explicitly specified when update_trg and trained_tokens are True.
  - share_vocab=False: Only applicable if update_src, update_trg, and trained_tokens are True. Rather than create new tokenizers for the source and target separately, use a single new tokenizer for both the source and target combined with a vocab size of src_vocab_size + trg_vocab_size.
  - init_unk=False: Initialize new token embeddings using the embedding for the unk token rather than using the model's default initialization behavior.
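The rule shared by the size, test_size and val_size parameters (a float up to 1 is a ratio, anything else is an absolute count) can be sketched as below. The function name is hypothetical, and three details are assumptions rather than documented behavior: a float of exactly 1.0 is treated as "all of the data", ratios are truncated to whole verses, and every request is capped at the number of available verses in the same way the document describes for test_size with test_books.

```python
def resolve_split_size(value, available):
    """Convert a size setting into a number of verses, capped at what is available."""
    if isinstance(value, float) and 0 < value <= 1:
        requested = int(value * available)  # ratio of the available data
    else:
        requested = int(value)              # absolute number of verses
    return min(requested, available)        # never more than what exists
```

For instance, with 1000 available verses a setting of 0.25 and a setting of 250 both resolve to 250 verses, while 250 requested from a 100-verse book resolves to 100.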
eval:
HF Arguments: eval_accumulation_steps, eval_delay, eval_steps=1000, evaluation_strategy="steps", greater_is_better, include_inputs_for_metrics, load_best_model_at_end=True, per_device_eval_batch_size=16, predict_with_generate=True
- eval_steps=1000: Number of update steps between two evaluations if evaluation_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
- metric_for_best_model="bleu": Metric to use for evaluation during training. Supported values in silnlp are 'bleu', 'chrf3', 'chrf3+', and 'chrf3++'.
Other Arguments:
- detokenize=True: Detokenize verses before computing metrics during evaluation/testing.
- early_stopping:
  - min_improvement=0.2: How much the metric_for_best_model metric must improve for training to continue.
  - steps=4: The number of times in a row that an evaluation can improve by less than min_improvement before training is stopped.
- multi_ref_eval=False: Evaluate outputs against multiple targets.
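The interaction of min_improvement and steps in early_stopping can be sketched as a counter over successive evaluation scores. This is an interpretation, not the silnlp or Hugging Face implementation: the function name is hypothetical, and measuring improvement against the best score seen so far is an assumption.

```python
def early_stop_index(scores, min_improvement=0.2, steps=4):
    """Return the index of the evaluation that triggers early stopping, or None."""
    best = None
    stale = 0  # consecutive evaluations that improved by less than min_improvement
    for index, score in enumerate(scores):
        if best is None or score - best >= min_improvement:
            stale = 0
        else:
            stale += 1
            if stale >= steps:
                return index
        if best is None or score > best:
            best = score
    return None
```

With the default values, a run whose BLEU scores creep up by less than 0.2 for four evaluations in a row stops at the fourth such evaluation, while a single large jump resets the counter.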
infer:
- infer_batch_size=16: Batch size for inference.
- num_beams=2: Number of beams for beam search during translation.
model: Required argument. Name of base model to be used. Defined at the top level of the config, i.e. at the same level as data, eval, etc.
params:
HF Arguments: adafactor, adam_beta1, adam_beta2, adam_epsilon, full_determinism, generation_max_length, generation_num_beams, label_smoothing_factor=0.2, learning_rate, lr_scheduler_type, max_grad_norm, optim="adamw_torch", warmup_ratio, warmup_steps=4000, weight_decay, attn_implementation="eager"
Other Arguments:
- activation_dropout=0.0: Dropout rate for activation layers.
- attention_dropout=0.1: Dropout rate for attention layers.
- dropout=0.1: Dropout rate for all other layers.
train:
HF Arguments: gradient_accumulation_steps=4, gradient_checkpointing=True, "gradient_checkpointing_kwargs"={"use_reentrant": True}, group_by_length=True, log_level="info", logging_dir, logging_first_step, logging_nan_inf_filter, logging_steps, logging_strategy, max_steps=100000, num_train_epochs, output_dir=str(exp_dir / "run"), per_device_train_batch_size=16, save_on_each_node, save_steps=1000, save_strategy="steps", save_total_limit=2
- gradient_accumulation_steps=4: Number of update steps to accumulate the gradients for before performing a backward/update pass.
- gradient_checkpointing=True: Use gradient checkpointing to save memory at the expense of a slower backward pass.
- "gradient_checkpointing_kwargs"={"use_reentrant": True}: Use the reentrant implementation of gradient checkpointing. (If errors occur with gradient checkpointing and LoRA or some other method that freezes parameters/layers, try setting use_reentrant to False.)
- logging_steps=500: Number of update steps between two logs if logging_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
- max_steps=100000: The total number of training steps to perform. For a finite dataset, training is reiterated through the dataset (if all data is exhausted) until max_steps is reached. Overrides num_train_epochs. Set to -1 to instead use num_train_epochs.
- num_train_epochs=3.0: Total number of training epochs to perform (if not an integer, the decimal part is the fraction of the last epoch that is performed before training stops).
- per_device_train_batch_size=16: The batch size per GPU core/CPU for training.
- save_steps=1000: Number of update steps between two checkpoint saves if save_strategy="steps". Should be an integer or a float in range [0,1). If smaller than 1, will be interpreted as ratio of total training steps.
- attn_implementation="eager": Sets the attention implementation for a model. Possible values are "sdpa", "eager", or "flash_attention_2". Note that "flash_attention_2" is not currently compatible with NLLB.
Other Arguments:
- auto_grad_acc=False: Find and use the largest possible batch size and adjust the number of gradient accumulation steps accordingly to maintain an effective batch size of 64. The per_device_train_batch_size and gradient_accumulation_steps arguments are ignored while using this option.
- delete_checkpoint_optimizer_state=True: Delete optimizer state from every saved checkpoint after training.
- delete_checkpoint_tokenizer=True: Delete tokenizer from every saved checkpoint after training.
- lora_config: Optional configuration for LoRA. See Common LoRA Parameters in PEFT.
  - alpha=32: Value for lora_alpha. "The alpha parameter for Lora scaling."
  - dropout=0.1: Value for lora_dropout. "The dropout probability for Lora layers."
  - modules_to_save: "List of modules apart from LoRA layers to be set as trainable and saved in the final checkpoint." Default value depends on the model being trained, but it normally includes "embed_tokens" and "lm_head".
  - r=4: "Lora attention dimension."
  - target_modules: "The names of the modules to apply Lora to." Default value depends on the model being trained, but it normally includes all linear layers.
- max_source_length=200: Maximum length of a source segment. Segments longer than this value are truncated.
- max_target_length=200: Maximum length of a target segment. Segments longer than this value are truncated.
- use_lora=False: Train model using LoRA through the peft library. See here for more information.
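The arithmetic behind auto_grad_acc can be sketched as follows. The effective batch size of 64 comes from the description above, and the documented defaults (a batch size of 16 with 4 accumulation steps) give the same product. How silnlp searches for the largest batch size, and what it does when 64 is not evenly divisible, is not described here, so the function name, the num_devices argument and the error on uneven division are assumptions for illustration.

```python
TARGET_EFFECTIVE_BATCH_SIZE = 64  # value stated for auto_grad_acc in this document


def accumulation_steps(per_device_batch_size, num_devices=1):
    """Gradient accumulation steps needed to reach the target effective batch size."""
    per_step = per_device_batch_size * num_devices
    if per_step <= 0 or TARGET_EFFECTIVE_BATCH_SIZE % per_step != 0:
        raise ValueError(f"{per_step} does not divide {TARGET_EFFECTIVE_BATCH_SIZE} evenly")
    return TARGET_EFFECTIVE_BATCH_SIZE // per_step
```

If a batch size of 32 fits in memory, two accumulation steps keep the effective batch size at 64, and if 64 fits, no accumulation is needed.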
The following are some parameters that can be useful to change when running experiments for the purpose of testing during development. This is mostly to reduce training time while still making sure each part of the process is run.
eval:
  eval_steps
  per_device_eval_batch_size
infer:
  infer_batch_size
params:
  warmup_steps
train:
  max_steps
  num_train_epochs
  per_device_train_batch_size
  save_steps
- save_steps determines how often a model checkpoint is saved during training. For example, if you wanted to quickly get a model to run inference with, you could set both max_steps and save_steps to 100.
Using the --translate option when running an experiment allows drafts to be created immediately following the training of a model. The configuration for each translation request must be specified in translate_config.yml in the experiment folder. The behavior of this process is identical to using the translate.py script, and so the possible arguments for a configuration match the command line options of the script (with the exception of the clearml_queue and debug options). The format of translate_config.yml is a list of dictionaries, where each dictionary represents a translation request. See the example below, as well as the translate.py usage documentation for descriptions of the arguments.
translate:
  - books: 1JN
  - src_project: NASB
    trg_project: NNRV
    books: 1JN1-2;2JN
In this example, the first request will translate 1 John from the experiment's source project to the target language. The second request will translate the specified chapters in the NASB to the target language, filling in incomplete books with text from the NNRV.
Originally, the default configuration for training a model in silnlp used a small learning rate and a large maximum number of steps. Rather than training each model for the maximum number of steps, it used "early stopping" to detect when the model was adequately trained by comparing the model's evaluation scores over the course of training. The default configuration has since been updated to use a larger learning rate and a smaller maximum number of steps, and models are now always trained for the maximum number of steps. While models now train for far fewer steps (5k steps vs. 10-20k steps), the adjustments made to the learning rate and learning rate schedule allow the models to achieve equal performance compared to the previous setup in the majority of cases. However, there are still some situations, mainly more experimental ones, where the original configuration is better suited to the task. In those cases, the original behavior can be restored by adding the fields listed under the previous training configuration to their respective sections in the configuration file of an experiment. The current default values are also given for comparison.
Current training configuration (Oct 2024):
eval:
  early_stopping: null
params:
  warmup_steps: 1000
  learning_rate: .0002
  lr_scheduler_type: cosine
train:
  max_steps: 5000
Previous training configuration:
eval:
  early_stopping:
    min_improvement: 0.2
    steps: 4
params:
  warmup_steps: 4000
  learning_rate: .00005
  lr_scheduler_type: linear
train:
  max_steps: 100000
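To see how the two configurations differ over the course of training, the learning rate at a given step can be approximated with the usual "linear warmup, then linear or cosine decay" shapes. The sketch below follows the standard form of those schedules and is an approximation for illustration, not code extracted from silnlp or Hugging Face; the function name is hypothetical.

```python
import math


def learning_rate_at(step, base_lr, warmup_steps, max_steps, scheduler):
    """Approximate learning rate at a step for a warmup + linear/cosine decay schedule."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)  # linear warmup from 0
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    if scheduler == "linear":
        return base_lr * max(0.0, 1.0 - progress)
    if scheduler == "cosine":
        return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
    raise ValueError(f"Unsupported scheduler: {scheduler!r}")


# Values taken from the two configurations above.
current = dict(base_lr=0.0002, warmup_steps=1000, max_steps=5000, scheduler="cosine")
previous = dict(base_lr=0.00005, warmup_steps=4000, max_steps=100000, scheduler="linear")
```

Under these assumptions the current setup reaches its peak rate of 0.0002 at step 1000 and decays to zero by step 5000, whereas the previous setup is still warming up at step 2000, at half of its peak rate of 0.00005.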