The official PyTorch implementation of VGT (ICCV 2023).
VGT is a two-stream multi-modal Vision Grid Transformer for document layout analysis, in which a Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. By fully leveraging multi-modal information and exploiting pre-training techniques to learn better representations, VGT achieves highly competitive scores on the DLA task and significantly outperforms previous state-of-the-art methods.
- [ICCV 2023]
- arXiv
- PyTorch version >= 1.8.0
- Python version >= 3.6
```
pip install -r requirements.txt
```

```
# Install `git lfs`
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
sudo apt-get install git-lfs
```
The required packages include PyTorch 1.9.0, torchvision 0.10.0, and timm 0.5.4, among others.
For mixed-precision training, please install apex:
```
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
For object detection, please additionally install the detectron2 library; refer to Detectron2's INSTALL.md.
```
# Install `detectron2`
python -m pip install detectron2==0.6 -f \
    https://dl.fbaipublicfiles.com/detectron2/wheels/cu102/torch1.9/index.html
```
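After installation, a quick optional sanity check is to import the main dependencies and confirm that CUDA is visible; the expected versions below are the ones listed above and may differ on your setup:

```python
# Optional environment sanity check; expected versions are those listed above.
import torch
import torchvision
import detectron2

print("torch:", torch.__version__)              # e.g. 1.9.0
print("torchvision:", torchvision.__version__)  # e.g. 0.10.0
print("detectron2:", detectron2.__version__)    # e.g. 0.6
print("CUDA available:", torch.cuda.is_available())
```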
We provide the pre-trained GiT weights used in VGT, which are pre-trained with the proposed MGLM and SLM tasks.
GiT-pretrain |
---|
VGT-pretrain-model |
For the ViT weights in VGT, please download the DiT-base checkpoint: dit_base_patch16_224.
We load these two weights for VGT training.
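To verify a downloaded checkpoint before training, you can inspect it with `torch.load`; this is only a generic sketch, since the exact key layout inside the checkpoint depends on how it was saved:

```python
# Generic sketch: inspect a downloaded checkpoint (replace the placeholder path).
import torch

state = torch.load("<VGT-pretrain-model_file_path>", map_location="cpu")
# Checkpoints are commonly either a raw state_dict or a dict with a "model" entry.
weights = state.get("model", state) if isinstance(state, dict) else state
print(f"{len(weights)} entries")
print(list(weights)[:5])  # first few parameter names
```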
PubLayNet link
Download the data from this link (~96GB). PubLayNet provides the original PDFs, and we use pdfplumber to generate OCR information for grid generation. Download the grid pkl from link. The structure of the data folder is as follows.
```
publaynet
├── train
│   ├── 1.jpg
├── val
│   ├── 2.jpg
├── test
│   ├── 3.jpg
├── VGT_publaynet_grid_pkl
│   ├── 1.pdf.pkl
│   └── 2.pdf.pkl
├── train.json
├── val.json
├── test.json
```
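If you want to regenerate grid inputs yourself, word-level text and boxes can be extracted from a machine-readable PDF with pdfplumber. The snippet below is only a minimal sketch: the dictionary keys are illustrative, and the actual pkl schema expected by VGT is the one produced by the repo's grid-generation script (create_grid_input.py, described below).

```python
# Minimal sketch of word-level OCR extraction with pdfplumber.
# The dict keys below are illustrative; the real grid pkl format is the one
# produced by the repo's own grid-generation script.
import pickle
import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    page = pdf.pages[0]
    words = page.extract_words()  # each entry has 'text', 'x0', 'top', 'x1', 'bottom'
    grid = {
        "texts": [w["text"] for w in words],
        "bboxes": [[w["x0"], w["top"], w["x1"], w["bottom"]] for w in words],
    }

with open("example.pdf.pkl", "wb") as f:
    pickle.dump(grid, f)
```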
DocBank
Download the original data DocBank_500K_ori_img.zip and MSCOCO_Format_Annotation.zip from the DocBank website link. However, the categories in MSCOCO_Format_Annotation.zip do not match the dataset, so we provide new annotations with fixed categories in DocBank.zip from link.
We use the duguang OCR parser to generate OCR information for grid generation. Download the grid pkl from link. The structure of the data folder is as follows.
```
DocBank
├── DocBank_500K_ori_img
│   ├── 1.jpg
├── VGT_docbank_grid_pkl
│   ├── 1.pkl
│   └── 2.pkl
├── 500K_train_VGT.json
├── 500K_valid_VGT.json
```
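To see what one of the downloaded grid pkl files actually contains, it can simply be loaded with pickle (the path below follows the folder layout above):

```python
# Inspect a downloaded grid pkl file; the path follows the layout shown above.
import pickle

with open("DocBank/VGT_docbank_grid_pkl/1.pkl", "rb") as f:
    grid = pickle.load(f)

print(type(grid))
if isinstance(grid, dict):
    print(list(grid.keys()))
```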
D4LA
Download the original data (images, annotations, and grids) from the D4LA website link. The structure of the data folder is as follows.
```
D4LA
├── train_images
│   ├── 1.jpg
├── test_images
│   ├── 2.jpg
├── VGT_D4LA_grid_pkl
│   ├── 1.pkl
│   └── 2.pkl
├── json
│   ├── train.json
│   └── test.json
```
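The annotation jsons are used with detectron2's Cascade R-CNN, so they are assumed here to follow the standard COCO detection schema; a quick peek at the category list looks like this (file path from the layout above):

```python
# Peek at the D4LA annotations, assuming the standard COCO detection schema.
import json

with open("D4LA/json/train.json") as f:
    coco = json.load(f)

print("images:", len(coco["images"]))
print("annotations:", len(coco["annotations"]))
print("categories:", [c["name"] for c in coco["categories"]])
```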
DocLayNet
Download the DocLayNet core dataset (~28GB) from the DocLayNet website link. DocLayNet also provides the original PDFs in DocLayNet extra files, and we use pdfplumber to generate OCR information for grid generation. Download the grid pkl from link. The structure of the data folder is as follows.
```
Doclaynet
├── COCO
│   ├── train.json
│   └── val.json
├── PNG
│   ├── 1.png
│   └── 2.png
├── VGT_DocLayNet_grid_pkl
│   ├── 1.pkl
│   └── 2.pkl
```
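The provided training configs already handle dataset registration; if you want to load one of these datasets in your own detectron2 code, a minimal sketch of COCO-style registration (the dataset name here is arbitrary) is:

```python
# Sketch: register DocLayNet's COCO annotations for custom detectron2 experiments.
# The name "doclaynet_train" is arbitrary; VGT's own training scripts and configs
# take care of registration when using train_VGT.py.
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    "doclaynet_train",            # arbitrary dataset name
    {},                           # extra metadata (none needed here)
    "Doclaynet/COCO/train.json",  # annotation file from the layout above
    "Doclaynet/PNG",              # image root from the layout above
)
```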
If we want to train VGT from scratch or without the pre-trained models, we need to set `MODEL.WORDGRID.MODEL_PATH` to `<embedding_file_path>` and `MODEL.WORDGRID.USE_PRETRAIN_WEIGHT` to `True`. Here, VGT supports `bert-base-uncased`, `bros-base-uncased`, and `layoutlm-base-uncased` embeddings.
We summarize the validation results as follows, and we also provide the fine-tuned weights reported in the paper.
name | dataset | detection algorithm | mAP | weight |
---|---|---|---|---|
VGT | Publaynet | Cascade R-CNN | 96.2 | link |
VGT | Docbank | Cascade R-CNN | 84.1 | link |
VGT | D4LA | Cascade R-CNN | 68.8 | link |
Besides PubLayNet, DocBank, and D4LA, we also evaluate VGT on the DocLayNet dataset.
name | dataset | detection algorithm | mAP | weight |
---|---|---|---|---|
X101 | Doclaynet | Cascade R-CNN | 74.6 | - |
LayoutlmV3 | Doclaynet | Cascade R-CNN | 76.8 | - |
DiT_base | Doclaynet | Cascade R-CNN | 80.3 | - |
VGT w/o pretrain | Doclaynet | Cascade R-CNN | 82.6 | - |
VGT with pretrain | Doclaynet | Cascade R-CNN | 83.7 | link |
The following command provides an example of evaluating a fine-tuned checkpoint. The config files can be found in Configs.
- Evaluate the fine-tuned checkpoint of VGT with Cascade R-CNN on PubLayNet:
```
python train_VGT.py --config-file Configs/cascade/publaynet_VGT_cascade_PTM.yaml --eval-only --num-gpus 1 MODEL.WEIGHTS <finetuned_checkpoint_file_path> OUTPUT_DIR <your_output_dir>
```
Before inference, a PDF file needs to be converted into images, and a pkl file needs to be generated for each page. One can convert a PDF into a set of images with the following command:
```
python pdf2img.py \
    --pdf 'input-pdf-path' \
    --output 'output-folder-path' \
    --format 'png'
```
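If you prefer to do the conversion directly from Python, the pdf2image package (which requires poppler) offers an equivalent; this is a generic alternative sketch, not necessarily what pdf2img.py does internally:

```python
# Generic alternative to pdf2img.py: convert a PDF to per-page PNGs with
# pdf2image (requires poppler to be installed on the system).
from pathlib import Path
from pdf2image import convert_from_path

output_dir = Path("output-folder-path")
output_dir.mkdir(parents=True, exist_ok=True)

pages = convert_from_path("input-pdf-path", dpi=200)
for i, page in enumerate(pages):
    page.save(output_dir / f"page_{i}.png", "PNG")
```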
Every file requires a pkl file that contains the grid information needed by the Grid Transformer. To create this file for a machine-readable PDF, run the following command:
```
python create_grid_input.py \
    --pdf 'path-to-pdf-file' \
    --output 'path-to-output-folder' \
    --tokenizer 'google-bert/bert-base-uncased' \
    --model 'doclaynet'
```
The default tokenizer is `google-bert/bert-base-uncased` and the default model is `doclaynet`. Depending on the selected model, the output extension may change from `.pkl` to `.pdf.pkl`.
One can run inference with the VGT model using the `inference.py` script, as follows:
```
python inference.py \
    --image_root '/DocBank_root_path/DocBank/DocBank_500K_ori_img/' \
    --grid_root '/DocBank_root_path/DocBank/VGT_docbank_grid_pkl/' \
    --image_name '1.tar_1401.0001.gz_infoingames_without_metric_arxiv_47_ori' \
    --dataset docbank \
    --output_root <your_output_dir>/ \
    --config Configs/cascade/docbank_VGT_cascade_PTM.yaml \
    --opts MODEL.WEIGHTS <finetuned_checkpoint_file_path>
```
The following commands fine-tune VGT on PubLayNet, DocBank, D4LA, and DocLayNet, respectively:

```
python train_VGT.py --config-file Configs/cascade/publaynet_VGT_cascade_PTM.yaml --num-gpus 8 MODEL.WEIGHTS <VGT-pretrain-model_file_path> OUTPUT_DIR <your_output_dir>

python train_VGT.py --config-file Configs/cascade/docbank_VGT_cascade_PTM.yaml --num-gpus 8 MODEL.WEIGHTS <VGT-pretrain-model_file_path> OUTPUT_DIR <your_output_dir>

python train_VGT.py --config-file Configs/cascade/D4LA_VGT_cascade_PTM.yaml --num-gpus 8 MODEL.WEIGHTS <VGT-pretrain-model_file_path> OUTPUT_DIR <your_output_dir>

python train_VGT.py --config-file Configs/cascade/doclaynet_VGT_cascade_PTM.yaml --num-gpus 8 MODEL.WEIGHTS <VGT-pretrain-model_file_path> OUTPUT_DIR <your_output_dir>
```
If you find this repository useful, please consider citing our work:
```
@inproceedings{da2023vgt,
  title={Vision Grid Transformer for Document Layout Analysis},
  author={Cheng Da and Chuwei Luo and Qi Zheng and Cong Yao},
  year={2023},
  booktitle={ICCV},
}
```
This repository is built using the timm library, the detectron2 library, the DeiT repository, the Dino repository, the BEiT repository, the MPViT repository and the DiT repository.
VGT is released under the terms of the Apache License, Version 2.0.
VGT is an algorithm for Document Layout Analysis, and the code and models herein, created by the authors from Alibaba, can only be used for research purposes.
Copyright (C) 1999-2022 Alibaba Group Holding Ltd.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.