This repo is the official implementation for Historical Astronomical Diagrams Decomposition in Geometric Primitives.
This repo builds on the code for DINO-DETR, the official implementation of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection".
We present a model which modifies DINO-DETR to perform historical astronomical diagram vectorization by predicting simple geometric primitives, such as lines, circles, and arcs.
1. Installation
The model was trained with python=3.11.0, pytorch=2.1.0, cuda=11.8 and builds on the DETR variants DINO/DN/DAB and Deformable-DETR.
- Clone this repository and create a virtual environment
git clone git@github.com:vayvi/HDV.git
cd HDV/
python3 -m venv venv
source venv/bin/activate
- Follow the instructions to install a PyTorch version compatible with your system and CUDA version
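For example, for pytorch=2.1.0 with cuda=11.8 (verify the exact command for your setup on pytorch.org):
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118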
- Install other dependencies
pip install -r requirements.txt
- Compile the CUDA operators
python src/models/dino/ops/setup.py build install
# if you get a 'CUDA not available' error, run: export CUDA_HOME=/usr/local/cuda-<version>
# unit test (you should see that all checks are True); it may raise an out-of-memory error
python src/models/dino/ops/test.py
- Install the local package for synthetic data generation
pip install -e synthetic/.
2. Annotated Dataset and Model Checkpoint
Our annotated dataset, along with our main model checkpoints, can be found here. Annotations are in SVG format. We provide Python helper functions for parsing SVG files if you would like to process a custom annotated dataset.
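For a starting point, here is a minimal sketch of such a parser (an illustrative assumption using only the standard library, not the repo's actual helper API):

# Minimal sketch of extracting line/circle primitives from an annotation SVG.
# Illustrative assumption only; the repo's own helpers may use different names.
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

def parse_svg_primitives(svg_path):
    root = ET.parse(svg_path).getroot()
    primitives = []
    for elem in root.iter():
        if elem.tag == SVG_NS + "line":
            primitives.append({
                "type": "line",
                "points": [float(elem.get(k, 0)) for k in ("x1", "y1", "x2", "y2")],
            })
        elif elem.tag == SVG_NS + "circle":
            primitives.append({
                "type": "circle",
                "center": (float(elem.get("cx", 0)), float(elem.get("cy", 0))),
                "radius": float(elem.get("r", 0)),
            })
        # arcs are usually encoded as <path> elements with an 'A' command
    return primitives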
To download the manually annotated dataset, run:
bash scripts/download_eida_data.sh
Datasets should be organized as follows:
HDV/
data/
└── eida_dataset/
└── images_and_svgs/
└── custom_dataset/
└── images_and_svgs/
To download the pretrained models, run:
bash scripts/download_pretrained_models.sh
Checkpoints should be organized as follows:
HDV/
logs/
└── main_model/
└── checkpoint0012.pth
└── checkpoint0036.pth
└── config_cfg.py
└── other_model/
└── checkpoint0044.pth
└── config_cfg.py
...
You can process the ground-truth data for evaluation using:
bash scripts/process_annotated_data.sh "eida_dataset" # or "custom_dataset", etc.
3. Synthetic Dataset
The synthetic dataset generation process requires a collection of text and document backgrounds. We use the resources available in docExtractor and diagram-extraction. The code for generating the synthetic data is also heavily based on docExtractor.
To download the background resources for the synthetic dataset, run:
bash scripts/download_synthetic_resource.sh
Alternatively, download the synthetic resource folder here and unzip it in the data/ folder.
1. Evaluate our pretrained models
After downloading a model checkpoint and processing the evaluation dataset, you can evaluate a pretrained model as follows:
- model_name: corresponds to the folder inside logs/ where the checkpoint file is located
- epoch_number: epoch number of the checkpoint file to be used
- data_folder_name: name of the folder inside data/ where the evaluation dataset is located (defaults to eida_dataset)
bash scripts/evaluate_on_eida_final.sh <model_name> <epoch_number> <data_folder_name>
# for logs/main_model/checkpoint0036.pth on eida_dataset
bash scripts/evaluate_on_eida_final.sh main_model 0036 eida_dataset
# for logs/eida_demo_model/checkpoint0044.pth on eida_dataset
bash scripts/evaluate_on_eida_final.sh eida_demo_model 0044 eida_dataset
You should get the AP for different primitives and for different distance thresholds.
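For intuition, a prediction typically counts as a true positive when its parameters fall within the distance threshold of an unmatched ground-truth primitive. A minimal sketch of such endpoint-based matching (the function and matching criterion are assumptions, not the repo's actual metric code):

# Illustrative endpoint-distance matching; an assumption for intuition,
# not the repo's actual evaluation code.
import numpy as np

def count_true_positives(preds, gts, threshold):
    # preds, gts: arrays of shape (N, K, 2) holding K 2D keypoints
    # (e.g. the two endpoints of a line) per primitive
    matched = set()
    tp = 0
    for p in preds:
        for j, g in enumerate(gts):
            # a prediction matches if all K keypoints lie within `threshold`
            if j not in matched and np.linalg.norm(p - g, axis=-1).max() <= threshold:
                matched.add(j)
                tp += 1
                break
    return tp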
If you want to run evaluation on all checkpoints available for a given model, you can use the following script:
bash scripts/evaluate_models_on_gt.sh <ground_truth> <?model_name> <?device_nb> <?batch_size> <?max_size>
# to evaluate all available models on ground truth (cf. svg_to_train.py script)
bash scripts/evaluate_models_on_gt.sh eida_dataset/groundtruth
# to evaluate only one model
bash scripts/evaluate_models_on_gt.sh eida_dataset/groundtruth main_model
2. Inference and Visualization
To run inference and visualize results on custom images, you can use this notebook.
You can also use the following script to run inference on a whole dataset (jpg images located in data/<data_set>/images/):
bash scripts/run_inference.sh <model_name> <epoch_number> <data_set> <export_formats>
# for logs/main_model/checkpoint0036.pth on eida_dataset with svg and npz export formats
bash scripts/run_inference.sh main_model 0036 eida_dataset svg+npz
Results will be saved in data/<data_set>/<export_format>_preds_<model_name><epoch_number>/.
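For example, with the npz export format you can inspect predictions directly (the file name below is hypothetical, and the stored array names may differ; check preds.files for the actual contents):

# Hypothetical example of inspecting an exported .npz prediction file
import numpy as np

preds = np.load("data/eida_dataset/npz_preds_main_model0036/diagram_001.npz")
print(preds.files)  # list the arrays stored in the archive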
You can compare different inference runs on the same dataset (this outputs an HTML file data/<data_set>/<filename>.html):
python src/util/html.py --data_set <data_set> --filename <filename>
1. Training from scratch on synthetic data
To re-train the model from scratch on the synthetic dataset (generated on the fly), run:
bash scripts/train_model.sh
2. Training on a custom dataset
Turn SVG files into COCO-like annotations using the following script:
- data_set: folder inside data/ where the evaluation dataset is located (defaults to eida_dataset)
- sanity_check: add this flag if you want to visualize the processed annotations (will save the images in data/<data_set>/svgs/)
- train_portion: float between 0 and 1 to split the dataset into train and val (defaults to 0.8)
The input dataset should be organized as follows:
data/
└── <dataset_name>/
└── images/ # images annotated by the SVG files in the svgs folder
└── svgs/ # SVG files containing the ground truth for training
python src/svg_to_train.py --data_set <dataset_name> --sanity_check
# for eida_dataset
python src/svg_to_train.py --data_set eida_dataset --sanity_check
Training data will be created in data/<dataset_name>/groundtruth/. You can use it to run the finetuning script.
To train on a custom dataset, the ground truth annotations should be in a COCO-like format, structured as follows:
data/
└── <groundtruth_data>/
└── annotations/ # folder containing JSON files (one for train, one for val) in COCO-like format
└── train/ # train images (corresponding to train.json)
└── val/ # val images (corresponding to val.json)
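For reference, here is a minimal sketch of what such a COCO-like annotations file could contain (the field values and category names are illustrative assumptions; the exact encoding is produced by src/svg_to_train.py):

# Illustrative COCO-like annotation skeleton; field values and category
# names are assumptions, the exact encoding comes from src/svg_to_train.py
import json

coco = {
    "images": [{"id": 1, "file_name": "diagram_001.jpg", "width": 1000, "height": 800}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [120.0, 80.0, 300.0, 40.0]},  # [x, y, w, h]
    ],
    "categories": [{"id": 1, "name": "line"}, {"id": 2, "name": "circle"}, {"id": 3, "name": "arc"}],
}

with open("train.json", "w") as f:
    json.dump(coco, f, indent=2)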
Run the following script to train the model on the custom dataset:
- model_name: corresponds to the folder inside logs/ where the checkpoint file is located (the last checkpoint will be used)
- groundtruth_dir: relative path to a folder inside data/ where the ground truth dataset is located
- device_nb: GPU device number to use for training (defaults to 0)
- batch_size: batch size for training (defaults to 2)
- max_size: maximum image size for data augmentation (defaults to 1000), to prevent out-of-memory errors
- learning_rate: learning rate for training (defaults to 0.0001)
- epoch_nb: number of epochs to train (defaults to 50)
bash scripts/finetune_model.sh <model_dirname> <groundtruth_dir> <device_nb> <batch_size> <max_size> <learning_rate> <epoch_nb>
# to finetune main_model on device #2 using the data generated by the previous script
bash scripts/finetune_model.sh main_model eida_dataset/groundtruth 2
The outputs of your run will be logged with wandb.
If you find this work useful, please consider citing:
@misc{kalleli2024historical,
title={Historical Astronomical Diagrams Decomposition in Geometric Primitives},
author={Syrine Kalleli and Scott Trigg and Ségolène Albouy and Matthieu Husson and Mathieu Aubry},
year={2024},
eprint={2403.08721},
archivePrefix={arXiv},
primaryClass={cs.CV}
}