This repository contains code for our COLM'24 paper "HDT: Hierarchical Document Transformer"
We present HDT, a novel sparse Transformer architecture tailored for structured hierarchical documents. HDT exploits document structure by introducing auxiliary anchor tokens and redesigning the attention mechanism into a sparse multi-level hierarchy. By developing a novel sparse attention kernel that considers the hierarchical structure of documents, HDT achieves computational efficiency as well as higher sample efficiency for pre-training and better performance on downstream tasks.
The required Python packages for running this repo are listed in requirements.txt. To install all of them at once, please run
pip install -r requirements.txt
To verify our hierarchical attention, we first run experiments on ListOps before training on language tasks, using the scripts in the ListOPs directory.
The entry point is run_experiment.py. You can provide model names and hyperparameters as command-line arguments.
For example, to run the HDT vs BERT vs HAT vs Longformer experiment we used in the paper:
cd ListOPs
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 HDT hdt_testrun
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 BERT bert_testrun
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 Longformer Longformer_testrun
python run_experiment.py 0.25 5 20 90000 12 128 1 512 300 120 0.0003 fixed blue 512 HAT HAT_testrun
Note
Currently, our customized attention kernel only supports a three-level hierarchy, so we do not use it for the ListOps tasks, where the depth can be much larger (e.g., 20). Instead, we create a hierarchical attention mask and apply it directly to the attention score matrix. A more flexible kernel supporting arbitrary levels of hierarchy will be released soon.
We use Cython to speed up the computation of our sparse attention mask. The Cython code needs to be compiled on your system before you run the code:
python setup.py build_ext --inplace
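For intuition, here is a minimal, simplified sketch of such a hierarchical mask. It is not the Cython implementation or the kernel from the repo; it assumes a flat two-level layout (tokens grouped into sentences, each with one anchor token) purely for illustration:

```python
# Simplified illustration (not the repo's Cython/CUDA kernel): tokens attend
# within their own sentence, and anchor tokens additionally attend to each
# other, giving a block-sparse hierarchical pattern. The mask is applied to
# the raw attention scores before the softmax.
import torch

def hierarchical_mask(sent_ids: torch.Tensor, is_anchor: torch.Tensor) -> torch.Tensor:
    same_sent = sent_ids[:, None] == sent_ids[None, :]     # intra-sentence attention
    anchor_pair = is_anchor[:, None] & is_anchor[None, :]  # anchors exchange information
    return same_sent | anchor_pair

# toy sequence: two sentences of three tokens, the first token of each is its anchor
sent_ids = torch.tensor([0, 0, 0, 1, 1, 1])
is_anchor = torch.tensor([True, False, False, True, False, False])
mask = hierarchical_mask(sent_ids, is_anchor)

scores = torch.randn(6, 6)                                  # e.g. Q @ K.T / sqrt(d)
attn = scores.masked_fill(~mask, float("-inf")).softmax(dim=-1)
```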
For pre-training, we collect structured documents from HUPD, unarXive, and Wikipedia. In our implementation, all documents are further preprocessed as a list of sections, where each section is a list of sentences. This hierarchical format allows the HDT model to efficiently process and exploit the structural information present in the document. Here is an example representation of the document data structure:
document = [
[
"Title",
"Abstract",
"This is the first sentence of abstract.",
"This is the second sentence of abstract.",
...
],
[
"Introduction",
"This is the first sentence of the introduction.",
...
],
...
]
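For concreteness, a document in this nested form can be flattened into a single input sequence by inserting the [DOC], [SEC], and [CLS] anchor tokens at the document, section, and sentence level. The sketch below is only illustrative; the actual preprocessing and tokenization live in the repo's data pipeline and may differ in details:

```python
# Illustrative only: flatten a nested document and insert the HDT anchor
# tokens ([DOC] per document, [SEC] per section, [CLS] per sentence).
def flatten_with_anchors(document):
    tokens = ["[DOC]"]                        # document-level anchor
    for section in document:
        tokens.append("[SEC]")                # section-level anchor
        for sentence in section:
            tokens.append("[CLS]")            # sentence-level anchor
            tokens.extend(sentence.split())   # naive whitespace tokenization
    return tokens

toy_doc = [
    ["Introduction", "HDT is a sparse transformer.", "It uses anchor tokens."],
    ["Method", "Attention follows the document hierarchy."],
]
print(flatten_with_anchors(toy_doc))
```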
To download the preprocessed data, run
from datasets import load_dataset
unarxive = load_dataset('howey/unarXive')
hupd = load_dataset('howey/hupd')
wikipedia = load_dataset('howey/wiki_en')
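To see what a preprocessed document looks like, you can inspect the loaded dataset; note that the split and column names below are not guaranteed (check the dataset cards on the Hub):

```python
# Quick inspection; split/column names depend on the hub dataset cards.
split = next(iter(unarxive))                # first available split, e.g. "train"
print(unarxive[split].column_names)         # which field holds the nested sections
print(unarxive[split][0])                   # one preprocessed document
```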
In our experiments, we extend SciRepEval with publicly accessible arXiv full-text data, leading to a subset called FullTextSciRepEval that contains full-text scientific papers together with the labels from SciRepEval. FullTextSciRepEval is used to benchmark long-document representation in our paper.
Pre-training of HDT uses pretrain.py. Note that the encoder-only and encoder-decoder models share the same training script, only with different argument settings. For instance, to train an HDT encoder-only model (HDT-E) on the Masked Language Modeling (MLM) task, run
python pretrain.py --encoder_only --tok_name google-bert/bert-base-uncased
Here we directly use the BERT tokenizer for simplicity.
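If you want to experiment with the tokenizer outside of pretrain.py, the anchor tokens can be registered as additional special tokens. This is only a sketch of one way to do it, not necessarily how the training script handles it internally ([CLS] already exists in the BERT vocabulary):

```python
# Sketch only: register the [DOC]/[SEC] anchors on top of the BERT tokenizer.
# pretrain.py may set this up differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")
tokenizer.add_special_tokens({"additional_special_tokens": ["[DOC]", "[SEC]"]})
print(tokenizer.tokenize("[DOC] [SEC] [CLS] This is the first sentence."))
```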
Following CRAMMING, we pre-train our models within an academic budget to evaluate the efficiency of our method. By default, the model is trained on 1 GPU for 24 hours.
In addition, to pre-train an encoder-decoder model for generation tasks on multiple GPUs with a longer time budget (e.g., 48 hours), run
python pretrain.py --tok_name google-t5/t5-base --num_encoder_layers 6 --num_decoder_layers 6 --num_gpus 4 --budget 48
Note
- We use UL2 as the pre-training objective for the encoder-decoder model and MLM for the encoder-only model, following the default configurations from the original papers.
- Anchor tokens [DOC], [SEC], and [CLS] are not masked during pre-training (see the sketch below).
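To make the second point concrete, here is a minimal sketch of MLM masking that never selects anchor positions. It is not the collator used by pretrain.py; the 15% masking rate and the always-replace-with-[MASK] behavior are simplifying assumptions:

```python
# Illustrative only (not the repo's collator): sample MLM positions while
# protecting the [DOC]/[SEC]/[CLS] anchor ids. The 15% rate is an assumption.
import torch

def mlm_mask(input_ids: torch.Tensor, anchor_ids: set, mask_token_id: int, p: float = 0.15):
    maskable = torch.tensor([int(tok) not in anchor_ids for tok in input_ids])
    selected = (torch.rand(input_ids.shape) < p) & maskable
    labels = input_ids.clone()
    labels[~selected] = -100                  # loss only on masked positions
    masked = input_ids.clone()
    masked[selected] = mask_token_id          # simplified: always replace with [MASK]
    return masked, labels

ids = torch.tensor([1, 2, 3, 101, 102, 103])  # toy ids: 1=[DOC], 2=[SEC], 3=[CLS]
masked, labels = mlm_mask(ids, anchor_ids={1, 2, 3}, mask_token_id=99)
```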
| Model Name | Encoder Layers | Decoder Layers | Hidden Units | Attention Heads | Vocab | Parameters |
|---|---|---|---|---|---|---|
| howey/HDT-E | 12 | - | 768 | 12 | 32,768 | 109M |
| howey/HDT-ED | 6 | 6 | 768 | 12 | 32,128 | 112M |
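The released checkpoints live on the Hugging Face Hub under these names. If they ship their own modeling code, loading them might look like the following; passing `trust_remote_code=True` is our assumption, since HDT is a custom architecture, so consult the model cards if this fails:

```python
# Assumption: the checkpoints bundle custom modeling code, hence
# trust_remote_code=True. See the model cards for the supported entry points.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("howey/HDT-E")
model = AutoModel.from_pretrained("howey/HDT-E", trust_remote_code=True)
```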
If you use or extend our work, please consider citing our paper. Thank you for your support! 🥰
@inproceedings{He2024COLM,
author = {He, Haoyu and Flicke, Markus and Buchman, Jan and Gurevych, Iryna and Geiger, Andreas},
title = {HDT: Hierarchical Document Transformer},
publisher = {Conference on Language Modeling},
year = {2024},
}