This repository implements experiments/baselines for SuperLim 2.
Environment setup:

1. `wget https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-x86_64.sh`
2. `bash Anaconda3-2022.10-Linux-x86_64.sh`
3. Answer yes to the installer prompts.
4. `conda create -n ptgpu_venv python=3.9`
5. `conda activate ptgpu_venv`
6. `conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia`
Repository setup:

1. `git clone git@github.com:aidotse/superlim-baselines.git`
2. `cd superlim-baselines`
3. `pip install -r requirements.txt`
4. `tmux new -s exp`
5. `tmux attach -t exp`
Running the experiments:

1. Download all data to the `data` directory.
2. Put your wandb API key in `api_wandb_key.txt`.
3. Configure which GPUs to use in `run_bert_experiments.sh`.
4. Specify models and tasks in `run_bert_experiments.sh`.
5. Specify accumulation sizes for the models in `bert/bert_experiment_driver.py` to suit your available GPU memory.
6. `./run_dummy_experiments.sh`
7. `./run_bert_experiments.sh` (this also runs the GPT experiments, if enabled)
8. Run the `collect_results/create_table.py` script; it collects the results from the `results/experiments/metrics` directory and creates the `results/experiments/metrics/model_deliverables` directory with JSON files (a sketch of this step follows below).
In total this fine-tunes 672 models if the 4 GPT models are included (they are not by default). 84 models are saved, one for each task-model combination.
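A minimal sketch of what that collection step amounts to, assuming the results layout listed below; the actual `create_table.py` may differ:

```python
import json
from pathlib import Path

METRICS_DIR = Path("results/experiments/metrics")
DELIVERABLES_DIR = METRICS_DIR / "model_deliverables"

def collect_metrics():
    """Group every <task_name>/<model_name>/metrics.json by model."""
    per_model = {}
    for metrics_file in METRICS_DIR.rglob("metrics.json"):
        rel = metrics_file.relative_to(METRICS_DIR)
        task_name = rel.parts[0]
        model_name = "/".join(rel.parts[1:-1])  # model names may contain "/"
        per_model.setdefault(model_name, {})[task_name] = json.loads(
            metrics_file.read_text())
    return per_model

DELIVERABLES_DIR.mkdir(parents=True, exist_ok=True)
for model_name, task_metrics in collect_metrics().items():
    out_path = DELIVERABLES_DIR / (model_name.replace("/", "__") + ".json")
    out_path.write_text(json.dumps(task_metrics, indent=2))
```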
Results are written to the following locations:

1. Individual experiment results: `results/experiments/metrics/`
2. Packed results grouped by model: `results/experiments/metrics/model_deliverables/`
3. Model predictions on the dev and test files: `results/experiments/predictions/`
4. Best fine-tuned models: `results/experiments/models/`
The tasks are named slightly differently in this repository, since it was created while SuperLim 2 was still under development. See `task_to_info_dict` in `Experiment.py` for the task names used; when the final results files are created, these names are mapped to the official ones.
Note: Winogender is evaluated automatically when training/evaluating with SweMNLI.
The class hierarchy handles much of the execution flow, like a framework.
`paths.py` defines paths that can be imported; these should work cross-platform, in Docker containers, etc. `compute_metrics.py` implements the functions that compute the metrics required for the experiments. `collect_metrics.py` loads the experiment results from the pre-defined results file hierarchy and produces a table.
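As a sketch of what `paths.py` might look like (the constant names here are assumptions, not the repo's actual ones), `pathlib` keeps the paths OS-agnostic:

```python
# paths.py (sketch): importable, cross-platform path constants.
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent
DATA_DIR = REPO_ROOT / "data"
RESULTS_DIR = REPO_ROOT / "results" / "experiments"
METRICS_DIR = RESULTS_DIR / "metrics"
PREDICTIONS_DIR = RESULTS_DIR / "predictions"
MODELS_DIR = RESULTS_DIR / "models"
```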
`Experiment.py` defines the abstract class `Experiment`, which handles the experiment execution flow and keeps track of the model name, task name, and other meta-data. It also handles the writing of results to file, and contains `task_to_info_dict`, which should contain the required meta-data for each task, such as which metric is used.
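The exact schema of `task_to_info_dict` is defined in `Experiment.py`; purely as an illustration, an entry might carry meta-data along these lines (the keys are invented for the example):

```python
# Illustrative shape only -- see Experiment.py for the real keys.
task_to_info_dict = {
    "swepara": {
        "official_name": "SweParaphrase",   # used when final results are written
        "metric": "krippendorff_interval",  # interval alpha => regression task
    },
}
```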
`ExperimentBert.py` defines another abstract class, `ExperimentBert`, which handles the loading of tokenizers and models, default hyperparameters, hyperparameter tuning, the HuggingFace Trainer, evaluation, etc. Children of `ExperimentBert` simply define the task name, which must be supported in `Experiment.py`, and implement the abstract method `create_dataset`, which calls the `_load_data` method and preprocesses the data to make it ready for training.
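A minimal sketch of such a child class, assuming the base class exposes `self.tokenizer` and a `_load_data` method returning a HuggingFace `DatasetDict`; the constructor signature, internal task name, and column names are illustrative:

```python
from ExperimentBert import ExperimentBert

class ExperimentBertSweParaphrase(ExperimentBert):
    def __init__(self, model_name):
        # "swepara" must have an entry in task_to_info_dict in Experiment.py
        # (the internal name used here is made up for the example).
        super().__init__(task_name="swepara", model_name=model_name)

    def create_dataset(self):
        dataset = self._load_data()

        def tokenize(batch):
            # Column names are placeholders for the task's sentence pair.
            return self.tokenizer(batch["sentence_1"], batch["sentence_2"],
                                  truncation=True)

        return dataset.map(tokenize, batched=True)
```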
The dummy baseline class inherits `Experiment` and uses `sklearn.dummy` to provide a dummy baseline. This works for any task that has training data and is a classification or regression problem.
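As a minimal sketch of the idea (the surrounding `Experiment` plumbing is omitted, and the helper below is not from the repo):

```python
from sklearn.dummy import DummyClassifier, DummyRegressor

def fit_dummy_baseline(X_train, y_train, is_regression):
    # Predicts the training mean (regression) or the most frequent
    # class (classification), ignoring the features entirely.
    model = (DummyRegressor(strategy="mean") if is_regression
             else DummyClassifier(strategy="most_frequent"))
    return model.fit(X_train, y_train)
```

This presumably corresponds to the MaxFreq/Avg rows in the result tables below.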
To add a new task:

- Make sure the data is accessible, preferably via HF Datasets, or otherwise through JSONL/TSV files.
- Create a function `reformat_dataset_<task_name>` in `dataset_loader.py` that converts the dataset into the desired format (see the sketch after this list).
- Add a pointer to this function in the dictionary `TASK_TO_REFORMAT_FUN`, above the `load_dataset_by_task` function in the same file.
- Add an entry with the required meta-data for this task in `Experiment.py`, in the dictionary `task_to_info_dict`.
- Add an entry in `bert_experiment_driver.py` for the corresponding `ExperimentBert` class.
- The dataset, reformatting, and meta-data loading are then done automatically when a child instance of `Experiment` is used.
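A hypothetical example of the reformat-and-register steps; the task name and column names are invented:

```python
# dataset_loader.py (sketch)
def reformat_dataset_mytask(dataset):
    # Convert the raw columns into whatever format the Experiment
    # classes expect -- the names below are placeholders.
    return dataset.rename_columns({"text": "sentence_1", "gold": "labels"})

TASK_TO_REFORMAT_FUN = {
    # ... existing tasks ...
    "mytask": reformat_dataset_mytask,
}
```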
- Metric results are stored in JSON files at the following path: `results/experiments/metrics/<task_name>/<model_name>/metrics.json`
- Ray (the library used for hyperparameter search) stores its results in `results/ray_results`, which can be cleared after the experiments are done; `clear_result_checkpoints.sh` does this automatically.
- HF Trainer checkpoints are stored in `results/trainer_output` and can likewise be cleared after the experiments are done. This is done automatically in the `ExperimentBert` class, and by `clear_result_checkpoints.sh`.
All tasks are evaluated using Krippendorff's alpha (nominal for classification tasks, interval for regression tasks).
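The repository computes its metrics in `compute_metrics.py`; purely as an illustration (the repo may compute it differently), alpha between gold labels and model predictions can be obtained with the `krippendorff` package, treating the two as two annotators:

```python
import numpy as np
import krippendorff  # pip install krippendorff

def alpha(gold, predictions, is_regression):
    # Gold labels and predictions act as two "annotators" rating every item;
    # class labels are assumed to be numerically encoded.
    data = np.array([gold, predictions], dtype=float)
    level = "interval" if is_regression else "nominal"
    return krippendorff.alpha(reliability_data=data, level_of_measurement=level)
```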
Model | ABSA | Argumentation | DaLAJ | SweMNLI | SweParaphrase | SweWiC | SweWinograd | Swedish FAQ | Avg ↑ |
---|---|---|---|---|---|---|---|---|---|
KBLab/megatron-bert-large-swedish-cased-165k | 0.571564 | 0.946352 | 0.727593 | 0.776482 | 0.908897 | 0.312644 | 0.258708 | 0.883748 | 0.673248 |
AI-Nordics/bert-large-swedish-cased | 0.533768 | 0.936037 | 0.716007 | 0.746819 | 0.895458 | 0.289614 | 0.240471 | 0.83509 | 0.649158 |
KB/bert-base-swedish-cased | 0.513309 | 0.929816 | 0.715983 | 0.720911 | 0.900527 | 0.408507 | 0.188203 | 0.747808 | 0.640633 |
KBLab/megatron-bert-base-swedish-cased-600k | 0.497364 | 0.936032 | 0.690426 | 0.744255 | 0.895231 | 0.299891 | 0.179137 | 0.835171 | 0.634688 |
xlm-roberta-large | 0.554599 | 0.933846 | 0.71126 | 0.783054 | 0.908649 | 0.345008 | 0.151711 | 0.679484 | 0.633452 |
NbAiLab/nb-bert-base | 0.440776 | 0.923843 | 0.638692 | 0.717407 | 0.87952 | 0.351808 | 0.177085 | 0.718147 | 0.60591 |
KBLab/bert-base-swedish-cased-new | 0.476216 | 0.915303 | 0.726241 | 0.727923 | 0.812494 | 0.225232 | 0.0651066 | 0.553061 | 0.562697 |
xlm-roberta-base | 0.398698 | 0.917526 | 0.671511 | 0.720407 | 0.871351 | 0.253994 | -0.251163 | 0.619826 | 0.525269 |
SVM | 0.336347 | 0.906789 | 0.501149 | 0.126261 | 0.175726 | 0.012941 | 0.0981801 | 0.0473189 | 0.275589 |
Decision Tree | 0.206274 | 0.862748 | 0.298952 | 0.14797 | 0.223647 | 0.0405762 | -0.0220252 | -0.02735 | 0.216349 |
Random | -0.0566656 | 0.00489102 | -0.0357937 | 0.00267875 | -0.033005 | 0.0568082 | 0.0924728 | -0.118317 | -0.0108663 |
Random Forest | 0.0120151 | -0.256239 | -0.31043 | -0.255086 | 0.159126 | 0.0272561 | -0.251163 | 0.00746468 | -0.108382 |
MaxFreq/Avg | -0.0309908 | -0.256239 | -0.347135 | -0.343683 | -0.0246646 | -0.332 | -0.251163 | -0.316185 | -0.237757 |

Model | ABSA | Argumentation | DaLAJ | SweMNLI | SweParaphrase | SweWiC | SweWinograd | Swedish FAQ | Avg ↑ |
---|---|---|---|---|---|---|---|---|---|
KBLab/megatron-bert-large-swedish-cased-165k | 0.508299 | 0.627597 | 0.753261 | 0.231612 | 0.873878 | 0.307947 | 0.188953 | 0.777436 | 0.533623 |
AI-Nordics/bert-large-swedish-cased | 0.480036 | 0.563173 | 0.745449 | 0.240594 | 0.862311 | 0.316317 | 0.191522 | 0.718673 | 0.51476 |
KB/bert-base-swedish-cased | 0.529183 | 0.555028 | 0.739715 | 0.179116 | 0.844865 | 0.37619 | 0.139458 | 0.640648 | 0.500525 |
xlm-roberta-large | 0.51631 | 0.583698 | 0.737508 | 0.20472 | 0.881687 | 0.3672 | 0.0806007 | 0.583791 | 0.494439 |
KBLab/megatron-bert-base-swedish-cased-600k | 0.449322 | 0.562494 | 0.718029 | 0.217683 | 0.866812 | 0.277146 | 0.0614488 | 0.709154 | 0.482761 |
NbAiLab/nb-bert-base | 0.389723 | 0.540602 | 0.64446 | 0.171583 | 0.822616 | 0.325909 | 0.120361 | 0.659844 | 0.459387 |
KBLab/bert-base-swedish-cased-new | 0.427938 | 0.553602 | 0.753263 | 0.16292 | 0.754713 | 0.140347 | 0.0420433 | 0.446627 | 0.410182 |
xlm-roberta-base | 0.365947 | 0.497157 | 0.700577 | 0.185628 | 0.812797 | 0.181145 | -0.177215 | 0.473112 | 0.379893 |
SVM | 0.285916 | 0.353759 | 0.517739 | 0.000204149 | 0.23909 | 0.0422635 | 0.0549607 | 0.0381895 | 0.191515 |
Decision Tree | 0.117238 | 0.155629 | 0.268636 | 0.0132697 | 0.199644 | 0.0398626 | -0.24 | 0.0399946 | 0.0742843 |
Random | 0.00783217 | 0.0132383 | 0.00702486 | -0.0906326 | -0.0427819 | -0.00954447 | 0.0806007 | -0.150356 | -0.0230774 |
Random Forest | 0.00537142 | -0.272389 | -0.312481 | -0.411051 | 0.142812 | 0.00334587 | -0.177215 | 0.0318551 | -0.123719 |
MaxFreq/Avg | -0.0517904 | -0.272389 | -0.340028 | -0.433837 | -0.00149459 | -0.332667 | -0.177215 | -0.309699 | -0.23989 |

Model | SweWinogender (Parity) | SweWinogender (Alpha) |
---|---|---|
KBLab/megatron-bert-large-swedish-cased-165k | 0.995192 | -0.29472 |
xlm-roberta-base | 0.995192 | -0.298024 |
xlm-roberta-large | 0.985577 | -0.315893 |
KBLab/megatron-bert-base-swedish-cased-600k | 0.990385 | -0.320952 |
KBLab/bert-base-swedish-cased-new | 1 | -0.328369 |
AI-Nordics/bert-large-swedish-cased | 1 | -0.332265 |
KB/bert-base-swedish-cased | 1 | -0.332265 |
NbAiLab/nb-bert-base | 0.990385 | -0.332306 |
The search space for all tasks and transformer models was the following:

```
{
    "learning_rate": [1e-5, 2e-5, 3e-5, 4e-5],
    "batch_size": [16, 32]
}
```

The exception is SweMNLI, whose search space was reduced because of its immense training set size:

```
{
    "learning_rate": [1e-5, 4e-5],
    "batch_size": [16, 32]
}
```
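Hyperparameter search is run with Ray (see the results notes above). As a hedged sketch, such a grid could be expressed through the HF Trainer's Ray Tune backend; the wiring below is illustrative, not the repo's actual code:

```python
from ray import tune

def hp_space(_trial):
    # Grid over the search space above; batch size maps to the
    # corresponding HF TrainingArguments field.
    return {
        "learning_rate": tune.grid_search([1e-5, 2e-5, 3e-5, 4e-5]),
        "per_device_train_batch_size": tune.grid_search([16, 32]),
    }

def run_search(trainer):
    # n_trials is Ray's num_samples: 1 means one pass over the full grid.
    return trainer.hyperparameter_search(
        direction="maximize", backend="ray", hp_space=hp_space, n_trials=1)
```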
Furthermore, all models use the following hyperparameters along with HuggingFace Trainer's default arguments:
```
{
    "warmup_ratio": 0.06,
    "weight_decay": 0.1,   # 0.0 if gpt
    "num_train_epochs": 10,
    "fp16": true
}
```
`num_train_epochs=10` is the maximum number of epochs; training uses early stopping with `patience=5`.
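A minimal sketch of how these fixed settings map onto HuggingFace Trainer arguments; the exact wiring in `ExperimentBert.py` may differ, and the argument names assume a reasonably recent `transformers` version:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="results/trainer_output",   # cleared afterwards, see above
    warmup_ratio=0.06,
    weight_decay=0.1,                      # 0.0 for the GPT models
    num_train_epochs=10,                   # upper bound, cut by early stopping
    fp16=True,
    evaluation_strategy="epoch",           # "eval_strategy" in newer versions
    save_strategy="epoch",
    load_best_model_at_end=True,           # required by EarlyStoppingCallback
)
callbacks = [EarlyStoppingCallback(early_stopping_patience=5)]
# Trainer(model=..., args=training_args, callbacks=callbacks, ...)
```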
Below follow the selected hyperparameters for each model and task, along with the standard deviation of the evaluation metric across the different hyperparameter configurations (hps std).

AI-Nordics/bert-large-swedish-cased

Task | LR | BS | hps std |
---|---|---|---|
ABSA | 2e-05 | 16 | 0.0120759 |
Argumentation | 1e-05 | 32 | 0.00375441 |
DaLAJ | 4e-05 | 32 | 0.0138314 |
SweMNLI | 1e-05 | 16 | 0.00832644 |
SweParaphrase | 3e-05 | 16 | 0.00558327 |
SweWiC | 3e-05 | 32 | 0.0153254 |
SweWinograd | 3e-05 | 32 | 0.0308409 |
Swedish FAQ | 1e-05 | 32 | 0.0166277 |

KB/bert-base-swedish-cased

Task | LR | BS | hps std |
---|---|---|---|
ABSA | 2e-05 | 16 | 0.02115 |
Argumentation | 3e-05 | 32 | 0.00774597 |
DaLAJ | 2e-05 | 32 | 0.00690644 |
SweMNLI | 1e-05 | 32 | 0.0118903 |
SweParaphrase | 4e-05 | 32 | 0.00267101 |
SweWiC | 2e-05 | 16 | 0.0111782 |
SweWinograd | 1e-05 | 16 | 0.0618928 |
Swedish FAQ | 1e-05 | 16 | 0.0258529 |

KBLab/bert-base-swedish-cased-new

Task | LR | BS | hps std |
---|---|---|---|
ABSA | 2e-05 | 16 | 0.010503 |
Argumentation | 4e-05 | 32 | 0.463319 |
DaLAJ | 2e-05 | 16 | 0.00939234 |
SweMNLI | 1e-05 | 16 | 0.00648224 |
SweParaphrase | 4e-05 | 16 | 0.0423114 |
SweWiC | 1e-05 | 32 | 0.171214 |
SweWinograd | 1e-05 | 32 | 0.132972 |
Swedish FAQ | 3e-05 | 32 | 0.144674 |

KBLab/megatron-bert-base-swedish-cased-600k

Task | LR | BS | hps std |
---|---|---|---|
ABSA | 4e-05 | 16 | 0.0215247 |
Argumentation | 3e-05 | 16 | 0.0777753 |
DaLAJ | 4e-05 | 16 | 0.0171051 |
SweMNLI | 1e-05 | 16 | 0.00194938 |
SweParaphrase | 4e-05 | 16 | 0.00612823 |
SweWiC | 4e-05 | 16 | 0.0291987 |
SweWinograd | 4e-05 | 32 | 0.114922 |
Swedish FAQ | 3e-05 | 32 | 0.00878437 |

KBLab/megatron-bert-large-swedish-cased-165k

Task | LR | BS | hps std |
---|---|---|---|
ABSA | 3e-05 | 16 | 0.0126327 |
Argumentation | 4e-05 | 16 | 0.0226433 |
DaLAJ | 3e-05 | 32 | 0.0174812 |
SweMNLI | 1e-05 | 32 | 0.00384093 |
SweParaphrase | 4e-05 | 16 | 0.00475201 |
SweWiC | 4e-05 | 32 | 0.0130878 |
SweWinograd | 3e-05 | 16 | 0.0664638 |
Swedish FAQ | 4e-05 | 16 | 0.00752451 |

NbAiLab/nb-bert-base

Task | LR | BS | hps std |
---|---|---|---|
ABSA | 4e-05 | 16 | 0.0263801 |
Argumentation | 3e-05 | 16 | 0.0194445 |
DaLAJ | 1e-05 | 16 | 0.00804185 |
SweMNLI | 1e-05 | 16 | 0.0108116 |
SweParaphrase | 4e-05 | 32 | 0.00655906 |
SweWiC | 4e-05 | 32 | 0.0228019 |
SweWinograd | 4e-05 | 32 | 0.029244 |
Swedish FAQ | 3e-05 | 16 | 0.330018 |

xlm-roberta-base

Task | LR | BS | hps std |
---|---|---|---|
ABSA | 4e-05 | 16 | 0.0325399 |
Argumentation | 2e-05 | 16 | 0.029516 |
DaLAJ | 2e-05 | 32 | 0.0173028 |
SweMNLI | 1e-05 | 16 | 0.0144903 |
SweParaphrase | 1e-05 | 16 | 0.00433707 |
SweWiC | 1e-05 | 16 | 0.233132 |
SweWinograd | 2e-05 | 16 | 0 |
Swedish FAQ | 1e-05 | 16 | 0.352092 |

xlm-roberta-large

Task | LR | BS | hps std |
---|---|---|---|
ABSA | 2e-05 | 16 | 0.240555 |
Argumentation | 3e-05 | 32 | 0.512098 |
DaLAJ | 1e-05 | 32 | 0.477851 |
SweMNLI | 1e-05 | 32 | 0.471841 |
SweParaphrase | 1e-05 | 16 | 0.00389993 |
SweWiC | 1e-05 | 32 | 0.31005 |
SweWinograd | 2e-05 | 32 | 0.128864 |
Swedish FAQ | 2e-05 | 32 | 0.454154 |
The following table shows the average standard deviation of the hyperparameter configuration performances, sorted by avg std. This can serve as an indication of how sensitive each model is to the choice of hyperparameters.

Model | avg std |
---|---|
AI-Nordics/bert-large-swedish-cased | 0.0132957 |
KBLab/megatron-bert-large-swedish-cased-165k | 0.0185533 |
KB/bert-base-swedish-cased | 0.018661 |
KBLab/megatron-bert-base-swedish-cased-600k | 0.0346735 |
NbAiLab/nb-bert-base | 0.0566626 |
xlm-roberta-base | 0.0854263 |
KBLab/bert-base-swedish-cased-new | 0.122608 |
xlm-roberta-large | 0.324914 |
The following table shows the average of the mean distances to the maximum achieved performance: for each task's hyperparameter search, take the mean of the distances between every configuration's metric and the best configuration's metric, then average these means over all tasks (see the sketch after the table).

Model | avg mean distance |
---|---|
KBLab/megatron-bert-large-swedish-cased-165k | 0.0242377 |
AI-Nordics/bert-large-swedish-cased | 0.0243757 |
KB/bert-base-swedish-cased | 0.0294655 |
KBLab/megatron-bert-base-swedish-cased-600k | 0.0472022 |
NbAiLab/nb-bert-base | 0.0486684 |
xlm-roberta-base | 0.0871857 |
KBLab/bert-base-swedish-cased-new | 0.122801 |
xlm-roberta-large | 0.35873 |
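One way to read these two summary statistics, sketched with made-up numbers (`hps_std` and `mean_distance_to_best` are illustrative helpers, not repo code):

```python
import numpy as np

def hps_std(scores):
    """Std of the eval metric over one task's hyperparameter configurations."""
    return float(np.std(scores))

def mean_distance_to_best(scores):
    """Mean distance from each configuration's metric to the best one."""
    scores = np.asarray(scores, dtype=float)
    return float(np.mean(scores.max() - scores))

# Per model: average each statistic over the eight task-level searches.
searches = [[0.88, 0.90, 0.91, 0.89], [0.70, 0.74, 0.72, 0.73]]  # made-up
avg_std = np.mean([hps_std(s) for s in searches])
avg_mean_distance = np.mean([mean_distance_to_best(s) for s in searches])
```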
The random baseline draws random values from the range of all labels seen in the data, not from the set of possible answers for the current sample.
The traditional ML baselines take a random answer from among the candidates that the model independently predicts to be correct.
For these traditional ML baselines, only 5% of the training set (20,000 samples) is used for training. This did not seem to have a noticeable effect on the final performance, and the motivation was to reduce training time.