
Captioning Art-Historical Photographs

Table of Contents 📋

  1. Overview
  2. Report
  3. Repo Structure
  4. Setup
  5. Datasets
    1. Artpedia Dataset
    2. Wildenstein Plattner Institute (WPI) Dataset
  6. Models
  7. Experiments
    1. Finetuning BLIP on Artpedia
      1. Training
      2. Evaluation
    2. Finetuning on Grayscale Images
    3. Finetuning on Filtered Artpedia Dataset
    4. Constrained Caption Generation
    5. Artists
  8. Contribution
  9. Credits

Overview 🌃

The Wildenstein Plattner Institute owns a vast collection of art-historical photographs. To make them more accessible, captions are required, especially for visually impaired individuals. Because manually creating captions would be too time-consuming, an automatic solution is needed. We present an approach that is specifically adapted to art images, building on the language-image pre-training framework BLIP. To apply it, we leverage the LAVIS library, which provides an interface for working with BLIP. We finetune BLIP's pre-trained base model on the art image dataset Artpedia and investigate methods to improve the training dataset and the caption generation. In all approaches, the finetuned models show better results than the pre-trained ones: the captions become more detailed and sometimes contain art-specific content such as painting styles.

Report 📄

If you are interested in the details of our project, please have a look at our report. There, you will find the motivation behind this project and a detailed explanation of the methods used, in particular the BLIP model and the evaluation metrics. All of our experiments and results are presented and discussed in depth, including qualitative and quantitative analyses. We also discuss possible future work, in case you'd like to build on our results.

Repo Structure 🗂

| Folder / File | Description |
| --- | --- |
| LAVIS | Forked LAVIS submodule |
| artists | Evaluation of the artist experiment & querying of artists |
| artpedia | Artpedia dataset, formatting, analysis, training & evaluation |
| grayscaled_artpedia | Converting Artpedia images to grayscale, training & evaluation |
| inference | Constrained caption generation |
| plot | Script for plotting the evaluation results |
| results | Evaluation results (finetuned vs. base model) |
| wpi | WPI dataset, analysis & evaluation |
| lavis_example.ipynb | Google Colab Jupyter notebook with examples for using LAVIS to generate captions & attention maps |
| report | Our extensive report of the project |

Setup 🛠

Python and a package manager like pip and/or conda need to be installed on your system.

  1. Open your terminal and clone the repository, including the submodules.
git clone --recurse-submodules https://github.com/valeriatisch/captioning-art-photographs-blip.git

If you cloned the repo without the submodules, you can run:

git submodule update --init

To update the repo with the submodules, you can run:

git pull --recurse-submodules
  2. We recommend creating a virtual environment. You can create and activate a new environment with conda or another package manager of your preference.
conda create -n captioning_art pip python=3.8
conda activate captioning_art

To deactivate the environment, run:

conda deactivate

To remove the environment, run:

conda remove -n captioning_art --all
  3. To set up the LAVIS submodule, please follow this guide. You can also use our installs.sh script for installation or take a look at it in case you run into installation problems. You might need to install different PyTorch and CUDA versions to match your system's requirements.
chmod +x installs.sh
./installs.sh
  4. Install the remaining requirements. Some of them may already be satisfied by the previous installation, but that shouldn't be a problem.
pip install -r requirements.txt
  5. To run the code on your local machine, you can simply do as follows:
python path_to_file/file.py <args>

You will also find many shell scripts in the repo, named after the corresponding Python scripts. They are used to submit batch jobs on an HPC cluster using the Slurm workload manager. Each script sets various Slurm directives, such as the job name, email notifications, the partition to run the job on, the GPUs needed, memory, and the time limit. If you want to use the shell scripts, you need to adjust these settings inside them.
To run a job, execute:

sbatch path_to_file/file.sh

To train a model, run:

cd LAVIS
python train.py --cfg-path lavis/projects/blip/train/<CONFIG>.yaml

To generate captions, run:

cd LAVIS
python predict.py --image_path=<PATH_TO_IMAGE>

Please take a look into the script for more options.
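If you prefer to generate captions from your own Python code, here is a minimal sketch using LAVIS; the image path is a placeholder, and predict.py offers more options (e.g. loading a finetuned checkpoint):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained BLIP captioning model and its image preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

# Preprocess a single image and generate a caption with beam search.
raw_image = Image.open("path/to/painting.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
captions = model.generate({"image": image}, num_beams=3, max_length=40)
print(captions[0])
```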

To evaluate a model, run:

cd LAVIS
python evaluate.py --cfg-path lavis/projects/blip/eval/<CONFIG>.yaml

Please note that you need to adjust the YAML configs and fill in the correct paths, or create your own configs. You also need to adjust the paths in LAVIS/lavis/tasks/captioning.py.

Please refer to the LAVIS readme and the LAVIS documentation for more information and advanced usage.

How to get the datasets used for our experiments is described in the Datasets section.

How to run the individual experiments is described in more detail in the Experiments section.

  6. To run the Jupyter notebooks, please make sure Jupyter is installed on your system.
jupyter --version
pip install jupyter

Navigate to the directory containing the notebook you'd like to open and launch it:

cd path_to_directory_with_notebook
jupyter notebook the_notebook.ipynb

The notebook will be opened in your default browser.

There is one notebook that needs to be run in a Google Colab environment. If you are not familiar with Google Colab, please look up this guide.

Datasets 📚

Artpedia Dataset

The Artpedia Dataset and the corresponding paper can be found here.
To download the images, you can execute the first cells of this Jupyter notebook.
In the same notebook, we also provide an analysis of the Artpedia dataset.
All plots regarding the Artpedia dataset are saved in artpedia/plots.

We enhanced the Artpedia dataset so that each image ends up with the following attributes:

| Attribute | Description | Source |
| --- | --- | --- |
| title | the title of the painting | original |
| img_url | the Wikimedia URL from which the image can be downloaded | original |
| year | the year the painting was created | original |
| visual_sentences | a list of visual sentences describing the painting | original |
| contextual_sentences | a list of contextual sentences describing the painting | original |
| split | the dataset split: training (train), validation (val), or test (test) | original |
| got_img | yes or no, depending on whether the image could be downloaded | new |
| matching_scores | a list of matching scores between the image and each visual sentence, in the same order as the sentences in visual_sentences | new |
| cosine_similarities | a list of cosine similarities between the image and each visual sentence, in the same order as the sentences in visual_sentences | new |
| artist | the artist of the painting | new |

The enhanced dataset can be found here.
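As a small sketch for inspecting the enhanced dataset: the file name below is a placeholder, we assume the JSON maps image ids to entries with the attributes above, and the exact encoding of got_img may differ.

```python
import json

# Placeholder path: point this to the enhanced Artpedia JSON you downloaded.
with open("artpedia/artpedia_enhanced.json") as f:
    artpedia = json.load(f)

# Keep only entries whose image could actually be downloaded.
downloaded = {k: v for k, v in artpedia.items() if v.get("got_img") in (True, "yes")}
print(f"{len(downloaded)} of {len(artpedia)} images were downloaded")

# Peek at one entry: title, artist, split, and the first visual sentence with its score.
key, entry = next(iter(downloaded.items()))
print(entry["title"], "|", entry.get("artist"), "|", entry["split"])
print(entry["visual_sentences"][0], "->", entry["matching_scores"][0])
```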

Wildenstein Plattner Institute (WPI) Dataset

The annotations of the WPI dataset can be found in this JSON file.
You can execute the first cells of this Jupyter notebook to download the images.
In the same notebook, we also provide an analysis of the WPI dataset.
All plots regarding the WPI dataset are saved in wpi/plots.

The WPI dataset contains the following attributes for an image:

| Attribute | Description | Present |
| --- | --- | --- |
| img_urls | the URL from which the image can be downloaded | always |
| Title | the title of the image | always |
| img_path | the path to the downloaded image | always |
| Genres | a list of genres the image belongs to | sometimes |
| Topics | a list of topics the image contains | sometimes |
| Names | a list of names, e.g. the publisher or author of the image | sometimes |
| Places | a list of places the image displays | sometimes |
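To get a quick overview of these annotations, here is a hedged sketch; the file path is a placeholder, and we assume the JSON is a list of records with the attributes above.

```python
import json
from collections import Counter

# Placeholder path: point this to the WPI annotations JSON.
with open("wpi/annotations.json") as f:
    records = json.load(f)

# Count how often the optional attributes are actually present.
optional = ["Genres", "Topics", "Names", "Places"]
counts = Counter(attr for rec in records for attr in optional if rec.get(attr))

print(f"{len(records)} records in total")
for attr in optional:
    print(f"{attr}: present in {counts[attr]} records")
```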

Models 🧠

You can find the models we finetuned under this link. The password is blip_models.

Experiments 🧪

Do you want to recreate our experiments or use our scripts on other images? Here, we try to guide you through the process of preparing the datasets, finetuning, and evaluating the BLIP model as best we can.
In general, you just need to run the corresponding scripts as described in the Setup section.

Finetuning BLIP on Artpedia 🖼️

We finetune the BLIP base model on the Artpedia dataset. We use the training, validation, and test split provided by Artpedia.

Training 🏋️‍♀️

First, adjust the config file lavis/projects/blip/train/artpedia.yaml.
To train the model, adjust the shell script and then run:

sbatch artpedia/train_artpedia.sh

Or run the python script directly from the LAVIS directory specifying the path to the right config file:

cd LAVIS
python train.py --cfg-path lavis/projects/blip/train/artpedia.yaml

Evaluation 📊

Again, adjust lavis/projects/blip/eval/caption_artpedia_eval.yaml first.
To evaluate the model, adjust the shell script too, and run:

sbatch artpedia/eval_artpedia.sh

Or:

cd LAVIS
python evaluate.py --cfg-path lavis/projects/blip/eval/caption_artpedia_eval.yaml

Finetuning on Grayscale Images 🖤

We transform the images of the Artpedia dataset to grayscale and finetune the BLIP base model on them. Again, we use the training, validation, and test split provided by Artpedia.

To transform the images to grayscale, run:

sbatch grayscaled_artpedia/convert_bw.sh

Or run the Python script directly, specifying the input and output directories:

python grayscaled_artpedia/convert_bw.py artpedia/images/ artpedia/images/bw/
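Conceptually, the conversion boils down to the following sketch with Pillow; convert_bw.py may differ in details such as file handling.

```python
from pathlib import Path
from PIL import Image

def convert_to_grayscale(input_dir: str, output_dir: str) -> None:
    """Convert every JPEG/PNG image in input_dir to grayscale and save it to output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for img_path in Path(input_dir).glob("*"):
        if img_path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        # "L" produces a single-channel grayscale image; pipelines that expect RGB
        # can convert it back to three (identical) channels when loading.
        Image.open(img_path).convert("L").save(out / img_path.name)

convert_to_grayscale("artpedia/images/", "artpedia/images/bw/")
```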

Finetuning on Filtered Artpedia Dataset 🔍

We apply BLIP's filter to calculate matching scores for the images and visual sentences of the Artpedia dataset. We then filter out image-text pairs with a matching score below a threshold of 80%.

To generate matching scores for the dataset, run:

cd LAVIS
sbatch match.sh

Or run the python script directly specifying the args:

cd LAVIS
python caption_matching.py ../artpedia/imgs ../artpedia/artpedia_res.json ../artpedia/artpedia_scored.json

To train with a filtered version of the Artpedia dataset, use LAVIS/lavis/projects/blip/train/artpedia_filtered.yaml. You can also define your threshold in this config.
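The filtering step itself is conceptually simple. Here is a hedged sketch in plain Python (the output file name is illustrative; in the repo the filtering is driven by the config above), assuming the scored JSON stores matching_scores aligned with visual_sentences as in the enhanced dataset:

```python
import json

THRESHOLD = 0.8  # keep only image-text pairs with a matching score of at least 80 %

with open("artpedia/artpedia_scored.json") as f:
    data = json.load(f)

filtered = {}
for key, entry in data.items():
    # Keep only the visual sentences whose matching score clears the threshold.
    kept = [
        (sentence, score)
        for sentence, score in zip(entry["visual_sentences"], entry["matching_scores"])
        if score >= THRESHOLD
    ]
    if kept:
        entry["visual_sentences"], entry["matching_scores"] = map(list, zip(*kept))
        filtered[key] = entry

# Illustrative output path for the filtered annotations.
with open("artpedia/artpedia_filtered.json", "w") as f:
    json.dump(filtered, f)
```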

Constrained Caption Generation 🤖

We explore the potential of constrained caption generation by forcing tags or other words to be included in generated captions.

To generate captions with a constraint, you can run LAVIS/predict.py and pass the words to include with --forcewords.
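If you are curious how forcing words into a caption works under the hood, here is a hedged illustration using the Hugging Face Transformers BLIP implementation and its constrained beam search. This is not the code used by predict.py, just a sketch of the same idea; model name, image path, and the forced word are placeholders.

```python
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

image = Image.open("path/to/painting.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(device)

# Force the word "portrait" to appear in the caption via constrained beam search.
force_words_ids = processor.tokenizer(["portrait"], add_special_tokens=False).input_ids

out = model.generate(
    **inputs,
    num_beams=5,  # constrained generation requires beam search
    force_words_ids=force_words_ids,
    max_new_tokens=40,
)
print(processor.decode(out[0], skip_special_tokens=True))
```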

Artists 👩‍🎨

We investigate the model's ability to recognize the artists of given artworks.

Unfortunately, the original Artpedia dataset does not provide the artists. You can run artists/query_artpedia_artists.py or artists/query_artpedia_artists.sh to get the artists. Don't forget to specify the input and output paths.
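As a purely hypothetical illustration (this is not necessarily how query_artpedia_artists.py works), one way to look up a painting's artist by its title is to query Wikidata:

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def query_artist(painting_title):
    """Hypothetical example: look up the creator (property P170) of a painting on Wikidata."""
    sparql = f"""
    SELECT ?artistLabel WHERE {{
      ?painting rdfs:label "{painting_title}"@en ;
                wdt:P170 ?artist .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }} LIMIT 1
    """
    response = requests.get(
        WIKIDATA_SPARQL,
        params={"query": sparql, "format": "json"},
        headers={"User-Agent": "artpedia-artist-lookup-example/0.1"},
        timeout=30,
    )
    bindings = response.json()["results"]["bindings"]
    return bindings[0]["artistLabel"]["value"] if bindings else None

print(query_artist("Mona Lisa"))
```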

For evaluation, use lavis/projects/blip/eval/caption_artists_eval.yaml.

Contribution 🤝

You are more than welcome to contribute to this project.
Please feel free to open an issue, create a pull request or just share an idea.

Credits 🙏

This project makes use of the following dataset, model, and library:

  • Artpedia - "A New Visual-Semantic Dataset with Visual and Contextual Sentences" by Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Massimiliano Corsini, Rita Cucchiara
  • BLIP - "Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation" by Junnan Li, Dongxu Li, Caiming Xiong, Steven Hoi
  • LAVIS - "A Library for Language-Vision Intelligence" by Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, Steven C. H. Hoi

We want to say a special thank you to the developers for their hard work and for making their code publicly available for others to use.