This repository contains the training and evaluation code and data used in the EMNLP 2022 paper "Lexical Generalization Improves with Larger Models and Longer Training".
NOTE: The evaluation test set presented in the paper, ALSQA, can be found on the 🤗 Hub at https://huggingface.co/datasets/biu-nlp/alsqa.
The fine-tuning of pretrained models for text-pair classification tasks was done with the code in `train.py`, modified from Hugging Face's fine-tuning example. The full hyperparameters can be found in the table in the next section.
- MNLI

python train.py \
--model_name_or_path [PRETRAINED_MODEL] \
--task_name mnli \
--do_train \
--do_eval \
--max_seq_length 512 \
--per_device_train_batch_size [BATCH_SIZE] \
--learning_rate [LR] \
--num_train_epochs 6 \
--output_dir ./outputs/ \
--seed [SEED] \
--lr_scheduler_type linear \
--pad_to_max_len False
- PAWS

python train.py \
--model_name_or_path [PRETRAINED_MODEL] \
--task_name qqp \
--do_train \
--do_eval \
--max_seq_length 512 \
--per_device_train_batch_size [BATCH_SIZE] \
--learning_rate [LR] \
--num_train_epochs 6 \
--output_dir ./outputs/ \
--seed [SEED] \
--lr_scheduler_type linear \
--pad_to_max_len False
- SQuAD2.0 (Answerability Only)

python train.py \
--model_name_or_path [PRETRAINED_MODEL] \
--task_name squad2 \
--do_train \
--do_eval \
--max_seq_length 512 \
--per_device_train_batch_size [BATCH_SIZE] \
--learning_rate [LR] \
--num_train_epochs 6 \
--output_dir ./outputs/ \
--seed [SEED] \
--lr_scheduler_type linear \
--pad_to_max_len False
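The `squad2` task above treats SQuAD2.0 as a binary answerability classification problem over question-context pairs. As a rough illustration of that framing (the exact preprocessing lives in `train.py` and may differ; the field names below are only illustrative):

```python
# Minimal sketch of casting SQuAD2.0 as binary answerability classification;
# the exact preprocessing in train.py may differ, field names are illustrative.
from datasets import load_dataset

squad2 = load_dataset("squad_v2", split="train")

def to_answerability_example(example):
    # Label 1 if the question is answerable from the context, else 0.
    return {
        "text_a": example["question"],
        "text_b": example["context"],
        "label": int(len(example["answers"]["text"]) > 0),
    }

classification_data = squad2.map(to_answerability_example)
```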
checkpoint | task | scheduler | warmup proportion | batch size | lr | epochs | seeds |
---|---|---|---|---|---|---|---|
prajjwal1/bert-tiny | mnli | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
prajjwal1/bert-mini | mnli | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
prajjwal1/bert-medium | mnli | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
bert-base-uncased | mnli | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
bert-large-uncased | mnli | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
roberta-base | mnli | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
roberta-large | mnli | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
roberta-base | qqp | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
roberta-large | qqp | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
electra-small | qqp | linear | 0.0 | 32 | 2e-5 | 6 | [1,...,6] |
electra-base | qqp | linear | 0.0 | 32 | 2e-5 | 6 | [1,...,6] |
electra-large | qqp | linear | 0.0 | 32 | 2e-5 | 6 | [1,...,6] |
roberta-base | squad2 | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
roberta-large | squad2 | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6]* |
electra-small | squad2 | linear | 0.0 | 32 | 2e-5 | 6 | [1,...,6] |
electra-base | squad2 | linear | 0.0 | 32 | 2e-5 | 6 | [1,...,6] |
electra-large | squad2 | linear | 0.0 | 32 | 2e-5 | 6 | [1,...,6] |
* Where two learning rates are listed, the first half of the seeds were used with the first learning rate and the second half with the second, so each model was fine-tuned 6 times in total.
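For the starred configurations, the seed-to-learning-rate pairing can be scripted when launching runs. The sketch below only illustrates that pairing (the repository does not necessarily include such a launcher); the checkpoint and output-directory names are placeholders.

```python
# Illustrative launcher for the seed/learning-rate grid described above:
# seeds 1-3 use the first learning rate, seeds 4-6 the second.
import subprocess

learning_rates = ["2e-5", "3e-5"]
for seed in range(1, 7):
    lr = learning_rates[0] if seed <= 3 else learning_rates[1]
    subprocess.run([
        "python", "train.py",
        "--model_name_or_path", "roberta-base",   # placeholder checkpoint
        "--task_name", "mnli",
        "--do_train", "--do_eval",
        "--max_seq_length", "512",
        "--per_device_train_batch_size", "32",
        "--learning_rate", lr,
        "--num_train_epochs", "6",
        "--seed", str(seed),
        "--lr_scheduler_type", "linear",
        "--pad_to_max_len", "False",
        "--output_dir", f"./outputs/mnli_lr{lr}_seed{seed}",  # placeholder naming
    ], check=True)
```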
The fine-tuning of pretrained models for extractive question answering on SQuAD2.0 was done with `run_qa.py` from Hugging Face's fine-tuning examples.

python run_qa.py \
--model_name_or_path [PRETRAINED_MODEL] \
--dataset_name squad_v2 \
--version_2_with_negative \
--do_train \
--do_eval \
--max_seq_length 384 \
--doc_stride 128 \
--per_device_train_batch_size [BATCH_SIZE] \
--learning_rate [LR] \
--num_train_epochs 6 \
--output_dir ./outputs/ \
--seed [SEED] \
--lr_scheduler_type linear \
--pad_to_max_len False
checkpoint | task | scheduler | warmup proportion | batch size | lr | epochs | seeds |
---|---|---|---|---|---|---|---|
roberta-base | squad2 | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6] |
roberta-large | squad2 | linear | 0.06 | 32 | 2e-5,3e-5 | 6 | [1,...,6] |
electra-small | squad2 | linear | 0.0 | 32 | 2e-5 | 6 | [1,...,6] |
electra-base | squad2 | linear | 0.0 | 32 | 2e-5 | 6 | [1,...,6] |
electra-large | squad2 | linear | 0.0 | 32 | 2e-5 | 6 | [1,...,6] |
In each example below, pass a fine-tuned model from the Hugging Face Hub (or a path to a local model) to the corresponding evaluation function.
- HANS

from datasets import load_dataset
from evaluation import evaluate_entailment_classification

hans = load_dataset('hans', split='validation')
dataset = hans.filter(lambda e: e['heuristic'] == 'lexical_overlap')
evals = evaluate_entailment_classification(MNLI_MODEL_PATH, dataset)
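HANS has two labels (0 = entailment, 1 = non-entailment) while MNLI models predict three classes, so the three-way prediction has to be collapsed before scoring. The sketch below illustrates that collapsing with plain `transformers`; it is not the repository's `evaluate_entailment_classification`, and it assumes the checkpoint's `id2label` uses MNLI-style label names (`roberta-large-mnli` is just a convenient public example).

```python
# Illustrative HANS evaluation with 3-way -> 2-way label collapsing;
# not the repo's evaluate_entailment_classification.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"  # example MNLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

hans = load_dataset("hans", split="validation")
lexical = hans.filter(lambda e: e["heuristic"] == "lexical_overlap")

correct = 0
for example in lexical:
    inputs = tokenizer(example["premise"], example["hypothesis"],
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        pred_id = model(**inputs).logits.argmax(dim=-1).item()
    # Collapse the 3-way MNLI prediction into HANS's 2-way label space:
    # HANS label 0 = entailment, 1 = non-entailment (neutral or contradiction).
    label_name = model.config.id2label[pred_id].lower()
    pred_binary = 0 if "entail" in label_name else 1
    correct += int(pred_binary == example["label"])

print("lexical-overlap accuracy:", correct / len(lexical))
```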
- PAWS-QQP

First reconstruct the PAWS-QQP test set based on the instructions here, then save it in the `data` folder as `paws_qqp_test.tsv`.

from datasets import load_dataset
from evaluation import evaluate_paraphrase_detection

dataset = load_dataset("csv", data_files="data/paws_qqp_test.tsv", delimiter='\t')['train']
evals = evaluate_paraphrase_detection(QQP_MODEL_PATH, dataset)
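A quick sanity check of the reconstructed file can catch generation mistakes early; the sketch below assumes the standard PAWS columns (`id`, `sentence1`, `sentence2`, `label`) that the official generation scripts produce.

```python
# Quick sanity check of the reconstructed PAWS-QQP test file.
# Assumes the standard PAWS columns: id, sentence1, sentence2, label.
from datasets import load_dataset

dataset = load_dataset("csv", data_files="data/paws_qqp_test.tsv", delimiter="\t")["train"]

expected = {"id", "sentence1", "sentence2", "label"}
missing = expected - set(dataset.column_names)
assert not missing, f"missing columns: {missing}"

print(dataset)      # number of rows and column names
print(dataset[0])   # one example: a sentence pair and its 0/1 label
```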
- ALSQA (For Answerability Classification Models)

from datasets import load_dataset
from evaluation import evaluate_answerability_classification

dataset = load_dataset('biu-nlp/alsqa', split='test')
evals = evaluate_answerability_classification(ANSWERABILITY_MODEL_PATH, dataset)
- ALSQA (For SQuAD2.0 Models)

from datasets import load_dataset
from evaluation import evaluate_answerability_squad

dataset = load_dataset('biu-nlp/alsqa', split='test')
evals = evaluate_answerability_squad(SQUAD2_MODEL_PATH, dataset)
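For SQuAD2.0 models, the answerability decision corresponds to whether the model prefers the null ("no answer") prediction over every candidate span. Below is a minimal sketch of that idea using the `transformers` question-answering pipeline with `handle_impossible_answer=True`, assuming ALSQA follows the SQuAD2.0 schema it is built on (`question`, `context`, `answers`); it is only an illustration, not the repository's `evaluate_answerability_squad`.

```python
# Illustration of turning a SQuAD2.0 model's null-answer prediction into an
# answerability decision on ALSQA; not the repo's evaluate_answerability_squad.
from datasets import load_dataset
from transformers import pipeline

# Assumes ALSQA uses SQuAD2.0-style fields: question, context, answers.
alsqa = load_dataset("biu-nlp/alsqa", split="test")

# Any SQuAD2.0-finetuned checkpoint works; this one is just a public example.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

correct = 0
for example in alsqa:
    pred = qa(question=example["question"], context=example["context"],
              handle_impossible_answer=True)
    # With handle_impossible_answer=True the pipeline returns an empty answer
    # string when the null prediction beats every candidate span.
    predicted_answerable = pred["answer"] != ""
    gold_answerable = len(example["answers"]["text"]) > 0
    correct += int(predicted_answerable == gold_answerable)

print("answerability accuracy:", correct / len(alsqa))
```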
@inproceedings{bandel-etal-2022-lexical,
title = "Lexical Generalization Improves with Larger Models and Longer Training",
author = "Bandel, Elron and
Goldberg, Yoav and
Elazar, Yanai",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2022",
month = dec,
year = "2022",
address = "Abu Dhabi, United Arab Emirates",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.findings-emnlp.323",
pages = "4398--4410",
abstract = "While fine-tuned language models perform well on many language tasks, they were also shown to rely on superficial surface features such as lexical overlap. Excessive utilization of such heuristics can lead to failure on challenging inputs. We analyze the use of lexical overlap heuristics in natural language inference, paraphrase detection, and reading comprehension (using a novel contrastive dataset), and find that larger models are much less susceptible to adopting lexical overlap heuristics. We also find that longer training leads models to abandon lexical overlap heuristics. Finally, we provide evidence that the disparity between models size has its source in the pre-trained model.",
}