Forked from the implementation of End-to-End Transformer Based Model for Image Captioning [PDF/AAAI] [PDF/Arxiv] [AAAI 2022], available from here.
Authors: Austin Lamb & Hassan Shah
University of Southern California (USC)
- Python 3.7.16
- PyTorch 1.13.1
- TorchVision 0.14.1
- coco-caption
- numpy 1.21.6
- tqdm 4.66.1
- transformers 4.30.2
See env.yaml, env.txt, and anaconda_env.yaml for more info.
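To confirm that your environment resolved to the pinned versions above, a quick sanity check like the following can help (a minimal sketch, nothing repo-specific):

```python
# Print the installed versions of the key dependencies listed above
# and confirm a CUDA device is visible to PyTorch.
import torch
import torchvision
import numpy
import tqdm
import transformers

print("PyTorch:       ", torch.__version__)        # expect 1.13.1
print("TorchVision:   ", torchvision.__version__)  # expect 0.14.1
print("NumPy:         ", numpy.__version__)        # expect 1.21.6
print("tqdm:          ", tqdm.__version__)         # expect 4.66.1
print("transformers:  ", transformers.__version__) # expect 4.30.2
print("CUDA available:", torch.cuda.is_available())
```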
As described in the coco-caption README.md, you first need to download the Stanford CoreNLP 3.6.0 code and models used by SPICE. To do this, run:
cd coco_caption
bash get_stanford_models.sh
The files needed for training and evaluation are stored in the mscoco folder, which is organized as follows:
mscoco/
|--feature/
|  |--coco2014/
|     |--train2014/
|     |--val2014/
|     |--test2014/
|     |--annotations/
|--misc/
|--sent/
|--txt/
where the mscoco/feature/coco2014 folder contains the raw images and annotation files of the MSCOCO 2014 dataset. You can download a zip file from here and unzip it at the root level of this repo.
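After unzipping, a quick check that the layout above is in place can save a failed training run later; the sketch below only verifies that the folders listed in the tree exist:

```python
# Verify the expected mscoco/ directory layout described above.
from pathlib import Path

expected = [
    "mscoco/feature/coco2014/train2014",
    "mscoco/feature/coco2014/val2014",
    "mscoco/feature/coco2014/test2014",
    "mscoco/feature/coco2014/annotations",
    "mscoco/misc",
    "mscoco/sent",
    "mscoco/txt",
]

for rel in expected:
    status = "ok" if Path(rel).is_dir() else "MISSING"
    print(f"{status:7s} {rel}")
```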
NOTE: To speed up training, you can also pre-extract image features of MSCOCO 2014 using Swin-Transformer (or another backbone) and save them as ***.npz files in mscoco/feature; refer to coco_dataset.py and data_loader.py to see how features are read and prepared.
In this case, you also need to modify pure_transformer.py to remove the backbone module, which should be straightforward.
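If you take the pre-extracted-feature route, the sketch below illustrates the general save/load pattern for per-image .npz files. The id-based file naming, the array key (`features`), and the example feature shape are assumptions for illustration only; check coco_dataset.py and data_loader.py for the exact convention the loaders expect.

```python
# Minimal sketch of caching per-image features as .npz files.
# The file naming (COCO image id) and the array key ("features")
# are illustrative assumptions; match what the repo's loaders expect.
import numpy as np
from pathlib import Path

FEATURE_DIR = Path("mscoco/feature")
FEATURE_DIR.mkdir(parents=True, exist_ok=True)

def save_features(image_id: int, features: np.ndarray) -> None:
    """Save one image's grid features, e.g. a (num_patches, dim) backbone output."""
    np.savez_compressed(FEATURE_DIR / f"{image_id}.npz", features=features)

def load_features(image_id: int) -> np.ndarray:
    """Load the cached features back for training."""
    with np.load(FEATURE_DIR / f"{image_id}.npz") as data:
        return data["features"]

if __name__ == "__main__":
    dummy = np.random.rand(144, 1536).astype(np.float32)  # placeholder features
    save_features(9, dummy)
    print(load_features(9).shape)  # (144, 1536)
```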
Download pre-trained Backbone models from here and place them in the root directory of this repo.
You can download the saved models for each experiment here and place them in your experiments_PureT folder.
Note: If any of the links here don't work, you should find all the files in a folder at this link.
Note: our repository is mainly based on JDAI-CV/image-captioning, and we directly reused their config.yml files, so our configs contain many unused parameters (to be cleaned up later).
Before training, check and modify the parameters in each experiment's config.yml and train.sh files. Then run the script(s) for the experiment(s) you want:
# for XE training
bash experiments_PureT/PureT_XE/train.sh
bash experiments_PureT/PureT_SwinV2_XE/train.sh
bash experiments_PureT/PureT_CSwin_XE/train.sh
bash experiments_PureT/PureT_DeiT_XE/train.sh
Copy the pre-trained model you saved from XE training into the experiments_PureT/PureT_*_SCST/snapshot/ folder and modify the config.yml and train.sh files to resume from that snapshot (a small checkpoint-staging sketch follows the commands below). If you downloaded the already pre-trained model weights, just run the script (it should already be set up to train from the downloaded models):
# for SCST training
bash experiments_PureT/PureT_SCST/train.sh
bash experiments_PureT/PureT_SwinV2_SCST/train.sh
bash experiments_PureT/PureT_CSwin_SCST/train.sh
bash experiments_PureT/PureT_DeiT_SCST/train.sh
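If you trained the XE model yourself, the following is one way to stage the resulting checkpoint for SCST. The checkpoint filename is a placeholder; substitute whatever your XE run actually produced and keep it consistent with the resume settings in config.yml and train.sh.

```python
# Copy the chosen XE checkpoint into the SCST snapshot folder so SCST
# training can resume from it. The filename below is a placeholder.
import shutil
from pathlib import Path

xe_ckpt = Path("experiments_PureT/PureT_XE/snapshot/model_epoch_XX.pth")  # placeholder name
scst_dir = Path("experiments_PureT/PureT_SCST/snapshot")
scst_dir.mkdir(parents=True, exist_ok=True)

shutil.copy2(xe_ckpt, scst_dir / xe_ckpt.name)
print(f"Copied {xe_ckpt} -> {scst_dir / xe_ckpt.name}")
```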
Once you are done training (or if you downloaded the pre-trained models from here), you can run inference/evaluation in this format:
CUDA_VISIBLE_DEVICES=0 python main_test.py --folder experiments_PureT/PureT_SCST/ --resume 27
BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
---|---|---|---|---|---|---|---|
82.1 | 67.3 | 52.0 | 40.9 | 30.2 | 60.1 | 138.2 | 24.2 |
Our work is forked from and builds on this research paper:
@inproceedings{wangyiyu2022PureT,
  title={End-to-End Transformer Based Model for Image Captioning},
  author={Yiyu Wang and Jungang Xu and Yingfei Sun},
  booktitle={AAAI},
  year={2022}
}
This repository is based on JDAI-CV/image-captioning, ruotianluo/self-critical.pytorch, microsoft/Swin-Transformer, and microsoft/CSWin-Transformer.