Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding

🎉 The paper was accepted to CVPR 2024.

TL;DR: We propose two losses on our generated hard negative examples to enhance CLIP's compositional understanding.

This repo is forked from the wonderful OpenCLIP. For model and training details, please refer to the original repo.

☑️ Checkpoints

The checkpoints can be downloaded directly using gdown with the following script:

pip install --upgrade --no-cache-dir gdown # must update gdown to avoid bugs, thanks to https://github.com/wkentaro/gdown/issues/146
gdown 1DWPw3CtGh5cHz9bW_-iXRSG7BBUVl13K #download checkpoint for CE-CLIP
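Once downloaded, the checkpoint usually needs its weights pulled out before they can be loaded into a model. The following is a minimal sketch, assuming the file follows the common OpenCLIP training layout (weights stored under a "state_dict" key, with parameter names prefixed by "module." when training used DistributedDataParallel). The helper name extract_state_dict is hypothetical and not part of this repo; the toy dictionary stands in for the result of torch.load on the downloaded file.

```python
from collections.abc import Mapping


def extract_state_dict(checkpoint: Mapping) -> dict:
    """Return model weights from a checkpoint dict, dropping any 'module.' prefix."""
    # Assumption: training checkpoints wrap the weights under "state_dict";
    # a bare state dict is passed through unchanged.
    state = checkpoint.get("state_dict", checkpoint)
    # DistributedDataParallel prefixes every parameter name with "module.".
    prefix = "module."
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state.items()
    }


if __name__ == "__main__":
    # Toy stand-in for torch.load("<downloaded checkpoint>", map_location="cpu").
    toy = {"epoch": 5, "state_dict": {"module.logit_scale": 1.0, "module.visual.proj": 2.0}}
    print(sorted(extract_state_dict(toy)))
```

The cleaned dictionary can then be passed to a model's load_state_dict. OpenCLIP's own loading code may already handle this when a checkpoint path is given as the pretrained argument, so this step is only needed for manual loading.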

Training

1. Generating Training dataset

The training data is generated from COCO 2014. You can either download COCO yourself and set the COCO dataset_path in dataset.py, or simply run the following script to download and generate the dataset:

cd data/
bash prepare_dataset.sh

2. Training

You need to specify the training parameters in scripts/run_all.sh, such as --gres=gpu:a100:2 and batch_size; please refer to that script for more details. To run the training, use the following script:

cd scripts/
bash run_multiple_nodes.sh

The resulting checkpoints will be saved in Enhance-FineGrained/src/Outputs.
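When choosing the GPU request and batch_size, it can help to work out the global batch size, since contrastive training draws its in-batch negatives from it. The sketch below assumes batch_size is a per-GPU value (as in OpenCLIP's --batch-size option) and that the GPU count is the last field of the SLURM --gres string. Both helper names are hypothetical, and the numbers are illustrative rather than the settings used for the paper.

```python
def gpus_from_gres(gres: str) -> int:
    """Read the GPU count from a SLURM gres string such as 'gpu:a100:2'."""
    last_field = gres.split(":")[-1]
    # SLURM treats a missing count as 1 (e.g. 'gpu' or 'gpu:a100').
    return int(last_field) if last_field.isdigit() else 1


def global_batch_size(per_gpu_batch: int, gpus_per_node: int, nodes: int = 1) -> int:
    """Total samples per optimizer step, assuming batch_size is per GPU."""
    return per_gpu_batch * gpus_per_node * nodes


if __name__ == "__main__":
    gpus = gpus_from_gres("gpu:a100:2")
    # Illustrative per-GPU batch size, not taken from the repo.
    print(global_batch_size(per_gpu_batch=256, gpus_per_node=gpus))
```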

Evaluation

We evaluate our method on four downstream benchmarks: ARO, VALSE, VL-CheckList, and the very recent SugarCrepe, and we provide the evaluation code. However, you need to download each dataset from its official GitHub page before evaluating on it.

ARO&VALSE

Evaluation code for ARO is included in Enhance-FineGrained/vision-language-models-are-bows. To reproduce the results, you need to:

Set up the environment by running bash Enhance-FineGrained/vision-language-models-are-bows/scripts/create_environment.sh
cd Enhance-FineGrained/vision-language-models-are-bows/scripts and change the checkpoint path in reproduce_aro.sh, then run the script to reproduce the results. Note that the dataset will be downloaded automatically.

Evaluation code for VALSE is included in Enhance-FineGrained/VALSE. To reproduce the results on VALSE, please download the dataset here first. Then replace the dataset path in Enhance-FineGrained/VALSE/clip_valse_eval.py and Enhance-FineGrained/VALSE/xvlm_valse_eval.py.
Replace $checkpoint in Enhance-FineGrained/VALSE/scripts, then run the scripts. Evaluation results will be written to /home/mila/l/le.zhang/scratch/Enhance-FineGrained/VALSE/output

VL-CheckList [Not Suggested]

❗ Note: The original dataset is not complete, so we encourage skipping this dataset.

Please refer to the official GitHub repo to download the dataset and perform the evaluation. Note that downloading the dataset can be quite cumbersome.

We provide a script here.

🌟 SugarCrepe

SugarCrepe is a benchmark for faithful vision-language compositionality evaluation. This dataset fixes several biases in all of the above benchmarks that render them hackable, to the point that blind models with no access to the image outperform state-of-the-art vision-language models.

To evaluate on this dataset, simply clone their repo, follow their installation setup, and set --pretrained to our checkpoint:

python main_eval.py --model ViT-B-32 --pretrained Enhance-FineGrained/clip/epoch_5.pt \
    --output ./output \
    --coco_image_root ./data/coco/images/val2017/ \
    --data_root ./data/
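For intuition about what this evaluation measures: each test item pairs an image with a correct caption and a hard negative caption, and the model is counted as correct when it scores the correct caption higher. The sketch below computes that accuracy from precomputed image-text similarity scores. The function name and the toy scores are hypothetical; the actual metric is computed by main_eval.py in the SugarCrepe repo.

```python
from typing import Sequence


def hard_negative_accuracy(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Fraction of items where the correct caption outscores its hard negative."""
    if len(pos_scores) != len(neg_scores):
        raise ValueError("pos_scores and neg_scores must have the same length")
    if not pos_scores:
        raise ValueError("at least one item is required")
    # An item counts as correct only on a strict win for the positive caption.
    correct = sum(p > n for p, n in zip(pos_scores, neg_scores))
    return correct / len(pos_scores)


if __name__ == "__main__":
    # Toy similarities: image vs. correct caption, image vs. hard negative.
    pos = [0.3, 0.2, 0.5, 0.1]
    neg = [0.1, 0.25, 0.4, 0.05]
    print(hard_negative_accuracy(pos, neg))
```

With two candidate captions per image, a model that ignores the image and guesses at random would be expected to score about 0.5.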

Ablations

Our method entails curriculum learning, which is validated by the growth of the adaptive threshold.
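To illustrate why a growing threshold acts as a curriculum, here is a minimal sketch of a ranking loss with an adaptive margin. It assumes the threshold is set from the mean similarity gap between true captions and hard negatives, clipped to an upper bound, and that the loss is a hinge on that gap. These are assumptions made for illustration; the exact update rule and bounds used in this repo may differ, and the function names are hypothetical.

```python
from typing import Sequence


def adaptive_threshold(pos_sims: Sequence[float], neg_sims: Sequence[float],
                       upper_bound: float) -> float:
    """Margin derived from the mean (positive - hard negative) similarity gap."""
    gaps = [p - n for p, n in zip(pos_sims, neg_sims)]
    mean_gap = sum(gaps) / len(gaps)
    # Assumption: keep the margin non-negative and cap it at upper_bound.
    return min(max(mean_gap, 0.0), upper_bound)


def rank_loss(pos_sims: Sequence[float], neg_sims: Sequence[float],
              threshold: float) -> float:
    """Hinge loss: penalize pairs whose gap is smaller than the threshold."""
    losses = [max(0.0, threshold - (p - n)) for p, n in zip(pos_sims, neg_sims)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Early in training the gaps are small, so the margin is small (easy task).
    early = adaptive_threshold([0.30, 0.28], [0.29, 0.27], upper_bound=0.2)
    # Later the model separates the pairs better, so the margin grows (harder task).
    late = adaptive_threshold([0.40, 0.38], [0.25, 0.27], upper_bound=0.2)
    print(early < late)
```

Under these assumptions, the margin demanded of the model rises only as the model gets better at separating true captions from hard negatives, which gives the easy-to-hard progression.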

📎 Citation

@article{zhang2023contrasting,
  title={Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding},
  author={Zhang, Le and Awal, Rabiul and Agrawal, Aishwarya},
  journal={arXiv preprint arXiv:2306.08832},
  year={2023}
}

📧 Contact

Please let us know if you have further questions or comments; reach out to [email protected]
