The official PyTorch implementation of VGT (ICCV 2023).
VGT is a two-stream multi-modal Vision Grid Transformer for document layout analysis, in which a Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. By fully leveraging multi-modal information and exploiting pre-training techniques to learn better representations, VGT achieves highly competitive scores on the DLA task and significantly outperforms previous state-of-the-art methods.
- [ICCV 2023]
- arXiv
- PyTorch version >= 1.8.0
- Python version >= 3.6
```
pip install -r requirements.txt
```

```
# Install `git lfs`
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
```
The required packages include PyTorch 1.9.0, torchvision 0.10.0, and timm 0.5.4, among others.
For mixed-precision training, please install apex:
```
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
For object detection, please additionally install the detectron2 library; refer to Detectron2's INSTALL.md.
```
# Install `detectron2`
python -m pip install detectron2==0.6 -f \
    https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.9/index.html
```
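After installation, a quick optional sanity check is to import the main dependencies and confirm that CUDA is visible; the expected versions below are the ones listed above and may differ on your setup:

```python
# Optional environment sanity check; expected versions are those listed above.
import torch
import torchvision
import detectron2

print("torch:", torch.__version__)              # e.g. 1.9.0
print("torchvision:", torchvision.__version__)  # e.g. 0.10.0
print("detectron2:", detectron2.__version__)    # e.g. 0.6
print("CUDA available:", torch.cuda.is_available())
```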
We provide the pre-trained GiT weights used in VGT, which are pre-trained with the proposed MGLM and SLM tasks.
GiT-pretrain |
---|
VGT-pretrain-model |
For the ViT weights in VGT, please download the DiT-base checkpoint: dit_base_patch16_224.
We load these two weights for VGT training.
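To verify a downloaded checkpoint before training, you can inspect it with `torch.load`; this is only a generic sketch, since the exact key layout inside the checkpoint depends on how it was saved:

```python
# Generic sketch: inspect a downloaded checkpoint (replace the placeholder path).
import torch

state = torch.load("<VGT-pretrain-model_file_path>", map_location="cpu")
# Checkpoints are commonly either a raw state_dict or a dict with a "model" entry.
weights = state.get("model", state) if isinstance(state, dict) else state
print(f"{len(weights)} entries")
print(list(weights)[:5])  # first few parameter names
```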
PubLayNet link
Download the data from this link (~96GB). PubLayNet provides the original PDFs, and we use pdfplumber to generate OCR information for grid generation. Download the grid pkl from link. The structure of the data folder is as follows.
```
publaynet
├── train
│   ├── 1.jpg
├── val
│   ├── 2.jpg
├── test
│   ├── 3.jpg
├── VGT_publaynet_grid_pkl
│   ├── 1.pdf.pkl
│   └── 2.pdf.pkl
├── train.json
├── val.json
├── test.json
```
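If you want to regenerate grid inputs yourself, word-level text and boxes can be extracted from a machine-readable PDF with pdfplumber. The snippet below is only a minimal sketch: the dictionary keys are illustrative, and the actual pkl schema expected by VGT is the one produced by the repo's grid-generation script (create_grid_input.py, described below).

```python
# Minimal sketch of word-level OCR extraction with pdfplumber.
# The dict keys below are illustrative; the real grid pkl format is the one
# produced by the repo's own grid-generation script.
import pickle
import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    words = page.extract_words()  # each entry has 'text', 'x0', 'top', 'x1', 'bottom'
    grid = {
        "texts": [w["text"] for w in words],
        "bboxes": [[w["x0"], w["top"], w["x1"], w["bottom"]] for w in words],
    }

with open("example.pdf.pkl", "wb") as f:
    pickle.dump(grid, f)
```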
DocBank
Download the original data DocBank_500K_ori_img.zip and MSCOCO_Format_Annotation.zip from the DocBank website link. However, the categories in MSCOCO_Format_Annotation.zip do not match the dataset, so we provide new annotations with fixed categories in DocBank.zip from link.
We use the duguang OCR parser to generate OCR information for grid generation. Download the grid pkl from link. The structure of the data folder is as follows.
```
DocBank
├── DocBank_500K_ori_img
│   ├── 1.jpg
├── VGT_docbank_grid_pkl
│   ├── 1.pkl
│   └── 2.pkl
├── 500K_train_VGT.json
├── 500K_valid_VGT.json
```
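To see what one of the downloaded grid pkl files actually contains, it can simply be loaded with pickle (the path below follows the folder layout above):

```python
# Inspect a downloaded grid pkl file; the path follows the layout shown above.
import pickle

with open("DocBank/VGT_docbank_grid_pkl/1.pkl", "rb") as f:
    grid = pickle.load(f)

print(type(grid))
if isinstance(grid, dict):
    print(list(grid.keys()))
```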
D4LA
Download the original data (images, annotations, and grids) from the D4LA website link. The structure of the data folder is as follows.
```
D4LA
├── train_images
│   ├── 1.jpg
├── test_images
│   ├── 2.jpg
├── VGT_D4LA_grid_pkl
│   ├── 1.pkl
│   └── 2.pkl
├── json
│   ├── train.json
│   └── test.json
```
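The annotation jsons are used with detectron2's Cascade R-CNN, so they are assumed here to follow the standard COCO detection schema; a quick peek at the category list looks like this (file path from the layout above):

```python
# Peek at the D4LA annotations, assuming the standard COCO detection schema.
import json

with open("D4LA/json/train.json") as f:
    coco = json.load(f)

print("images:", len(coco["images"]))
print("annotations:", len(coco["annotations"]))
print("categories:", [c["name"] for c in coco["categories"]])
```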
DocLayNet
Download the DocLayNet core dataset (~28GB) from the DocLayNet website link. DocLayNet also provides the original PDFs in DocLayNet extra files, and we use pdfplumber to generate OCR information for grid generation. Download the grid pkl from link. The structure of the data folder is as follows.
```
Doclaynet
├── COCO
│   ├── train.json
│   └── val.json
├── PNG
│   ├── 1.png
│   └── 2.png
├── VGT_DocLayNet_grid_pkl
│   ├── 1.pkl
│   └── 2.pkl
```
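The provided training configs already handle dataset registration; if you want to load one of these datasets in your own detectron2 code, a minimal sketch of COCO-style registration (the dataset name here is arbitrary) is:

```python
# Sketch: register DocLayNet's COCO annotations for custom detectron2 experiments.
# The name "doclaynet_train" is arbitrary; VGT's own training scripts and configs
# take care of registration when using train_VGT.py.
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    "doclaynet_train",            # arbitrary dataset name
    {},                           # extra metadata (none needed here)
    "Doclaynet/COCO/train.json",  # annotation file from the layout above
    "Doclaynet/PNG",              # image root from the layout above
)
```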
If we want to train VGT from scratch or without the pre-trained models, we need to set `MODEL.WORDGRID.MODEL_PATH` to `<embedding_file_path>` and `MODEL.WORDGRID.USE_PRETRAIN_WEIGHT` to `True`. Here, VGT supports `bert-base-uncased`, `bros-base-uncased`, and `layoutlm-base-uncased` embeddings.
We summarize the validation results as follows, and we also provide the fine-tuned weights reported in the paper.
name | dataset | detection algorithm | mAP | weight |
---|---|---|---|---|
VGT | Publaynet | Cascade R-CNN | 96.2 | link |
VGT | Docbank | Cascade R-CNN | 84.1 | link |
VGT | D4LA | Cascade R-CNN | 68.8 | link |
Besides PubLayNet, DocBank, and D4LA, we also evaluate VGT on the DocLayNet dataset.
name | dataset | detection algorithm | mAP | weight |
---|---|---|---|---|
X101 | Doclaynet | Cascade R-CNN | 74.6 | - |
LayoutlmV3 | Doclaynet | Cascade R-CNN | 76.8 | - |
DiT_base | Doclaynet | Cascade R-CNN | 80.3 | - |
VGT w/o pretrain | Doclaynet | Cascade R-CNN | 82.6 | - |
VGT with pretrain | Doclaynet | Cascade R-CNN | 83.7 | link |
The following command provides an example of evaluating a fine-tuned checkpoint. The config files can be found in Configs.
- Evaluate the fine-tuned checkpoint of VGT with Cascade R-CNN on PubLayNet:
```
python train_VGT.py --config-file Configs/cascade/publaynet_VGT_cascade_PTM.yaml --eval-only --num-gpus 1 MODEL.WEIGHTS <finetuned_checkpoint_file_path> OUTPUT_DIR <your_output_dir>
```
Before inference, a PDF file needs to be converted into images, and a pkl file needs to be generated for each page. One can convert a PDF into a set of images with the following command:
```
python pdf2img.py \
    --pdf 'input-pdf-path' \
    --output 'output-folder-path' \
    --format 'png'
```
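If you prefer to do the conversion directly from Python, the pdf2image package (which requires poppler) offers an equivalent; this is a generic alternative sketch, not necessarily what pdf2img.py does internally:

```python
# Generic alternative to pdf2img.py: convert a PDF to per-page PNGs with
# pdf2image (requires poppler to be installed on the system).
from pathlib import Path
from pdf2image import convert_from_path

output_dir = Path("output-folder-path")
output_dir.mkdir(parents=True, exist_ok=True)

pages = convert_from_path("input-pdf-path", dpi=200)
for i, page in enumerate(pages):
    page.save(output_dir / f"page_{i}.png", "PNG")
```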
Every file requires a pkl file that contains the grid information needed by the Grid Transformer. To create this file for a machine-readable PDF, run the following command:
```
python create_grid_input.py \
    --pdf 'path-to-pdf-file' \
    --output 'path-to-output-folder' \
    --tokenizer 'google-bert/bert-base-uncased' \
    --model 'doclaynet'
```
The default tokenizer is `google-bert/bert-base-uncased` and the default model is `doclaynet`. Depending on the selected model, the output extension may change from `.pkl` to `.pdf.pkl`.
One can run inference with the VGT model using the `inference.py` script, as follows:
```
python inference.py \
    --image_root '/DocBank_root_path/DocBank/DocBank_500K_ori_img/' \
    --grid_root '/DocBank_root_path/DocBank/VGT_docbank_grid_pkl/' \
    --image_name '1.tar_1401.0001.gz_infoingames_without_metric_arxiv_47_ori' \
    --dataset docbank \
    --output_root <your_output_dir>/ \
    --config Configs/cascade/docbank_VGT_cascade_PTM.yaml \
    --opts MODEL.WEIGHTS <finetuned_checkpoint_file_path>
```
The following commands fine-tune VGT on PubLayNet, DocBank, D4LA, and DocLayNet, respectively:

```
python train_VGT.py --config-file Configs/cascade/publaynet_VGT_cascade_PTM.yaml --num-gpus 8 MODEL.WEIGHTS <VGT-pretrain-model_file_path> OUTPUT_DIR <your_output_dir>

python train_VGT.py --config-file Configs/cascade/docbank_VGT_cascade_PTM.yaml --num-gpus 8 MODEL.WEIGHTS <VGT-pretrain-model_file_path> OUTPUT_DIR <your_output_dir>

python train_VGT.py --config-file Configs/cascade/D4LA_VGT_cascade_PTM.yaml --num-gpus 8 MODEL.WEIGHTS <VGT-pretrain-model_file_path> OUTPUT_DIR <your_output_dir>

python train_VGT.py --config-file Configs/cascade/doclaynet_VGT_cascade_PTM.yaml --num-gpus 8 MODEL.WEIGHTS <VGT-pretrain-model_file_path> OUTPUT_DIR <your_output_dir>
```
If you find this repository useful, please consider citing our work:
```
@inproceedings{da2023vgt,
  title={Vision Grid Transformer for Document Layout Analysis},
  author={Cheng Da and Chuwei Luo and Qi Zheng and Cong Yao},
  year={2023},
  booktitle={ICCV},
}
```
This repository is built using the timm library, the detectron2 library, the DeiT repository, the Dino repository, the BEiT repository, the MPViT repository and the DiT repository.
VGT is released under the terms of the Apache License, Version 2.0.
VGT is an algorithm for Document Layout Analysis, and the code and models herein, created by the authors from Alibaba, can only be used for research purposes.
Copyright (C) 1999-2022 Alibaba Group Holding Ltd.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.