Forked from the implementation of End-to-End Transformer Based Model for Image Captioning [PDF/AAAI] [PDF/Arxiv] [AAAI 2022], available from here.
Authors: Austin Lamb & Hassan Shah
University of Southern California (USC)
- Python 3.7.16
- PyTorch 1.13.1
- TorchVision 0.14.1
- coco-caption
- numpy 1.21.6
- tqdm 4.66.1
- transformers 4.30.2
See env.yaml, env.txt, and anaconda_env.yaml for more info.
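To confirm that your environment resolved to the pinned versions above, a quick sanity check like the following can help (a minimal sketch, nothing repo-specific):

```python
# Print the installed versions of the key dependencies listed above
# and confirm a CUDA device is visible to PyTorch.
import torch
import torchvision
import numpy
import tqdm
import transformers

print("PyTorch:       ", torch.__version__)        # expect 1.13.1
print("TorchVision:   ", torchvision.__version__)  # expect 0.14.1
print("NumPy:         ", numpy.__version__)        # expect 1.21.6
print("tqdm:          ", tqdm.__version__)         # expect 4.66.1
print("transformers:  ", transformers.__version__) # expect 4.30.2
print("CUDA available:", torch.cuda.is_available())
```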
As described in the coco-caption README.md, you first need to download the Stanford CoreNLP 3.6.0 code and models used by SPICE. To do this, run:
cd coco_caption
bash get_stanford_models.sh
The files needed for training and evaluation are stored in the mscoco folder, which is organized as follows:
mscoco/
|--feature/
|  |--coco2014/
|     |--train2014/
|     |--val2014/
|     |--test2014/
|     |--annotations/
|--misc/
|--sent/
|--txt/
where the mscoco/feature/coco2014 folder contains the raw images and annotation files of the MSCOCO 2014 dataset. You can download a zip file from here and unzip it at the root level of this repo.
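After unzipping, a quick check that the layout above is in place can save a failed training run later; the sketch below only verifies that the folders listed in the tree exist:

```python
# Verify the expected mscoco/ directory layout described above.
from pathlib import Path

expected = [
    "mscoco/feature/coco2014/train2014",
    "mscoco/feature/coco2014/val2014",
    "mscoco/feature/coco2014/test2014",
    "mscoco/feature/coco2014/annotations",
    "mscoco/misc",
    "mscoco/sent",
    "mscoco/txt",
]

for rel in expected:
    status = "ok" if Path(rel).is_dir() else "MISSING"
    print(f"{status:7s} {rel}")
```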
NOTE: To speed up training, you can also pre-extract image features of MSCOCO 2014 using Swin-Transformer (or another backbone) and save them as ***.npz files in mscoco/feature; refer to coco_dataset.py and data_loader.py to see how features are read and prepared.
In this case, you also need to modify pure_transformer.py to remove the backbone module, which should be straightforward.
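If you take the pre-extracted-feature route, the sketch below illustrates the general save/load pattern for per-image .npz files. The id-based file naming, the array key (`features`), and the example feature shape are assumptions for illustration only; check coco_dataset.py and data_loader.py for the exact convention the loaders expect.

```python
# Minimal sketch of caching per-image features as .npz files.
# The file naming (COCO image id) and the array key ("features")
# are illustrative assumptions; match what the repo's loaders expect.
import numpy as np
from pathlib import Path

FEATURE_DIR = Path("mscoco/feature")
FEATURE_DIR.mkdir(parents=True, exist_ok=True)

def save_features(image_id: int, features: np.ndarray) -> None:
    """Save one image's grid features, e.g. a (num_patches, dim) backbone output."""
    np.savez_compressed(FEATURE_DIR / f"{image_id}.npz", features=features)

def load_features(image_id: int) -> np.ndarray:
    """Load the cached features back for training."""
    with np.load(FEATURE_DIR / f"{image_id}.npz") as data:
        return data["features"]

if __name__ == "__main__":
    dummy = np.random.rand(144, 1536).astype(np.float32)  # placeholder features
    save_features(9, dummy)
    print(load_features(9).shape)  # (144, 1536)
```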
Download pre-trained Backbone models from here and place them in the root directory of this repo.
You can download the saved models for each experiment here and place them in your experiments_PureT folder.
Note: If any of the links here don't work, you should find all the files in a folder at this link.
Note: our repository is mainly based on JDAI-CV/image-captioning, and we directly reused their config.yml files, so our configs contain many unused parameters (to be cleaned up later).
Before training, check and modify the parameters in each experiment's config.yml and train.sh files. Then run the script(s) for the experiment(s) you want:
# for XE training
bash experiments_PureT/PureT_XE/train.sh
bash experiments_PureT/PureT_SwinV2_XE/train.sh
bash experiments_PureT/PureT_CSwin_XE/train.sh
bash experiments_PureT/PureT_DeiT_XE/train.sh
Copy the pre-trained model you saved from XE training into the experiments_PureT/PureT_*_SCST/snapshot/ folder and modify the config.yml and train.sh files to resume from that snapshot (a small checkpoint-staging sketch follows the commands below). If you downloaded the already pre-trained model weights, just run the script (it should already be set up to train from the downloaded models):
# for SCST training
bash experiments_PureT/PureT_SCST/train.sh
bash experiments_PureT/PureT_SwinV2_SCST/train.sh
bash experiments_PureT/PureT_CSwin_SCST/train.sh
bash experiments_PureT/PureT_DeiT_SCST/train.sh
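If you trained the XE model yourself, the following is one way to stage the resulting checkpoint for SCST. The checkpoint filename is a placeholder; substitute whatever your XE run actually produced and keep it consistent with the resume settings in config.yml and train.sh.

```python
# Copy the chosen XE checkpoint into the SCST snapshot folder so SCST
# training can resume from it. The filename below is a placeholder.
import shutil
from pathlib import Path

xe_ckpt = Path("experiments_PureT/PureT_XE/snapshot/model_epoch_XX.pth")  # placeholder name
scst_dir = Path("experiments_PureT/PureT_SCST/snapshot")
scst_dir.mkdir(parents=True, exist_ok=True)

shutil.copy2(xe_ckpt, scst_dir / xe_ckpt.name)
print(f"Copied {xe_ckpt} -> {scst_dir / xe_ckpt.name}")
```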
Once you are done training (or if you downloaded the pre-trained models from here), you can run inference/evaluation in this format:
CUDA_VISIBLE_DEVICES=0 python main_test.py --folder experiments_PureT/PureT_SCST/ --resume 27
BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
---|---|---|---|---|---|---|---|
82.1 | 67.3 | 52.0 | 40.9 | 30.2 | 60.1 | 138.2 | 24.2 |
Our work is forked from and builds on this research paper:
@inproceedings{wangyiyu2022PureT,
  title={End-to-End Transformer Based Model for Image Captioning},
  author={Yiyu Wang and Jungang Xu and Yingfei Sun},
  booktitle={AAAI},
  year={2022}
}
This repository is based on JDAI-CV/image-captioning, ruotianluo/self-critical.pytorch, microsoft/Swin-Transformer, and microsoft/CSWin-Transformer.