This repository is for Image Captioning via Proximal Policy Optimization.
It is based on JDAI-CV / image-captioning.
Please follow the same data preparation as in the repository above.

Instead of the commonly adopted self-critical sequence training, this code trains with PPO (Proximal Policy Optimization). The point of using PPO is its ability to enforce a trust-region constraint, which allows the predictor and the trainer to be separate models: the trainer can observe a sufficiently diverse set of images and their CIDEr-score feedback before it replaces the predictor used to generate training trajectories. In addition, the gradient estimator replaces the usual sentence-level baseline, under which every word shares the same baseline value, with a word-level baseline computed by Monte-Carlo estimation.
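The idea above can be sketched as a clipped PPO surrogate loss over caption tokens, where each word gets its own baseline rather than a shared sentence-level one. This is an illustrative sketch, not the repository's actual API; all names (`ppo_caption_loss`, the tensor shapes, `clip_eps`) are assumptions made for the example.

```python
import torch

def ppo_caption_loss(logp_new, logp_old, rewards, baselines, clip_eps=0.2):
    """Clipped PPO surrogate for caption tokens (illustrative sketch).

    logp_new:  log-probs of sampled words under the trainer model,   shape (B, T)
    logp_old:  log-probs under the frozen predictor model,           shape (B, T)
    rewards:   sentence-level CIDEr reward broadcast to each word,   shape (B, T)
    baselines: word-level baselines, e.g. the mean CIDEr of several
               Monte-Carlo rollouts continued from each word,        shape (B, T)
    """
    advantage = rewards - baselines            # per-word advantage, not per-sentence
    ratio = torch.exp(logp_new - logp_old)     # importance ratio trainer / predictor
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # PPO keeps the pessimistic (elementwise minimum) objective; we minimize its negative.
    return -torch.min(unclipped, clipped).mean()
```

When trainer and predictor coincide the ratio is 1 and the loss reduces to the ordinary policy-gradient objective with a per-word baseline; the clipping only kicks in once the trainer drifts away from the predictor that generated the trajectories.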
A pre-trained model can be downloaded here.
(This model achieves a CIDEr score of 133.3% on the MSCOCO Karpathy test split, where the X-Transformer in JDAI-CV / image-captioning obtains 132.8%.)
Training is nearly the same as in JDAI-CV / image-captioning. However, to save computing resources, we fine-tune directly from a pre-trained X-Transformer model.
Copy the pre-trained X-Transformer model into experiments/xtransformer_rl/snapshot and run the script
bash experiments/xtransformer_rl/train.sh
To test a trained model, run
CUDA_VISIBLE_DEVICES=0 python3 main_test.py --folder experiments/xtransformer_rl --resume 23
where 23 refers to the checkpoint with the best performance on the Karpathy validation set, saved as caption_model_23.pth.
Thanks to JDAI-CV / image-captioning and self-critical.pytorch for their contributions.