This repo provides the benchmark toolkit for our proposed Visual Haystacks (VHs) dataset: Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark. Check out the project page here!
Authors: Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan at UC Berkeley.
Visual Haystacks (VHs) Benchmark Dataset: 🤗 tsunghanwu/visual_haystacks
Our Multi-Image Retrieval Augmented Generation (MIRAGE) Model: 🤗 tsunghanwu/mirage-llama3.1-8.3B, GitHub Repo
Visual Haystacks (VHs) is a vision-centric Needle-In-A-Haystack (NIAH) benchmark designed to evaluate the capabilities of Large Multimodal Models (LMMs) in visual retrieval and reasoning over diverse, unrelated sets of images. Conventional visual NIAH challenges often rely on artificial, OCR-centric setups, such as copy-and-pasted or out-of-domain image patches or overlaid transcripts. These setups frequently yield near-perfect performance and thus provide limited insight into models' practical effectiveness. In contrast, the VHs benchmark is carefully curated to ensure a realistic, reliable, and vision-focused evaluation. It challenges both open-source and proprietary long-context LMMs (including GPT-4o and Gemini 1.5 Pro), even with in-domain images and seemingly simple questions.
VHs consists of 1K binary visual question-answer pairs defined over image sets of varying sizes, ranging from 1 to 10K images per set. Each question asks about the presence of an object in certain relevant images: the model must first retrieve these needle images from a haystack of data and then answer the corresponding question. The dataset is carefully curated so that guessing or relying on common-sense reasoning without viewing the images results in a 50% accuracy rate. The dataset is derived from the COCO dataset and includes two types of challenges: the single-needle challenge and the multi-needle challenge.
- Single-Needle Challenge: Only a single needle image exists in the haystack of images. The question is framed as, "For the image with the anchor object, is there a target object?"
- Multi-Needle Challenge: Two or three needle images exist in the haystack of images. The question is framed as either, "For all images with the anchor object, do all of them contain the target object?" or "For all images with the anchor object, do any of them contain the target object?"
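To make the question semantics concrete, the ground-truth answer for a multi-needle question reduces to an all()/any() check over the needle images that contain the anchor object. The sketch below is illustrative only; the per-image flags are hypothetical:

```python
# Illustrative only: ground-truth logic behind the two multi-needle question types.
# Each flag indicates whether a needle image (one containing the anchor object)
# also contains the target object.
needle_contains_target = [True, False, True]  # hypothetical flags for three needles

# "For all images with the anchor object, do all of them contain the target object?"
answer_all = "Yes" if all(needle_contains_target) else "No"   # -> "No"

# "For all images with the anchor object, do any of them contain the target object?"
answer_any = "Yes" if any(needle_contains_target) else "No"   # -> "Yes"
```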
Through experiments on VHs, we observe several key limitations of current LMMs:

- Context Limitations: Current LMMs cannot process more than 100 images due to API rejections (payload exceeding limits), context length overflows, or memory constraints on 4 A100 GPUs.
- Susceptibility to Visual Distractors: While LMMs can perform nearly as well as specialized detectors on single-image tasks, their effectiveness decreases significantly as the number of images increases.
- Challenges in Cross-Image Reasoning: LMMs experience substantial performance declines when required to integrate information across multiple key images; reintroducing noisy images exacerbates this decline even further.
- Positional Biases: LMMs exhibit various positional biases: information placed at different positions within the context window yields different results. For instance, GPT-4 exhibits a "lost-in-the-middle" phenomenon in the visual domain, Gemini 1.5 Pro shows a preference for images at the beginning, and open-source models often favor the last image when given a small set.
In light of these observations, we introduce MIRAGE-8.3B, a pioneering open-source visual-RAG solution capable of handling up to 10,000 image inputs while partially mitigating the challenges above. We will release the code and checkpoints by October 21, 2024!
- 10/18/2024: We've published our MIRAGE codebase. Check it out here.
- 10/14/2024: We've updated our datasets to enhance diversity and balance. In this version, we include GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, LLaVA-Next, Qwen2-VL-7B-Instruct, Idefics-3, InternVL2-8B, Phi3-vision, mPLUG-OWL3, and LongViLA.
- 07/18/2024: Scripts were released for running inference using various models on the Visual Haystacks (VHs) benchmark, including GPT-4, Gemini, Claude, LLaVA, QwenVL, Idefics2, and others.
We invite collaborators working on multi-image reasoning to reach out about integrating their latest models into our repository!
- Package Installation
# Create conda environment
conda create --name vhs python=3.10
conda activate vhs
# Install packages and then flash-attn separately
pip3 install -r requirements.txt
pip3 install flash-attn --no-build-isolation --no-cache-dir
- Note: Idefics-3 and LongViLA require additional installation steps. Please refer to their official instructions.
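As an optional sanity check (a minimal sketch, assuming PyTorch is pulled in by requirements.txt), you can verify that CUDA and flash-attn are importable in the new environment:

```python
# Optional sanity check for the vhs environment: confirm CUDA and flash-attn work.
import torch

print("CUDA available:", torch.cuda.is_available())

try:
    import flash_attn
    print("flash-attn version:", flash_attn.__version__)
except ImportError as err:
    print("flash-attn is not installed correctly:", err)
```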
- Data Preparation
- Download the VQA questions from 🤗 tsunghanwu/visual_haystacks. Our data format is similar to LLaVA's, making it easy to work with.
huggingface-cli download --repo-type dataset tsunghanwu/visual_haystacks --local-dir dataset/VHs_qa
- Download the COCO 2017 dataset and organize it as follows, with the default root directory `./dataset/coco` (a small Python helper for both steps is sketched after the layout below):

dataset/
├── coco
│   ├── annotations
│   ├── test2017
│   └── val2017
└── VHs_qa
    ├── VHs_full
    │   ├── multi_needle
    │   └── single_needle
    └── VHs_small
        ├── multi_needle
        └── single_needle
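The snippet below is a minimal sketch of this preparation step in Python, assuming the huggingface_hub package is available (e.g., installed via requirements.txt); it mirrors the huggingface-cli command above and then checks that the expected COCO folders are in place:

```python
# Minimal data-preparation sketch (assumes huggingface_hub is installed).
from pathlib import Path

from huggingface_hub import snapshot_download

# 1) Download the VHs question files (same effect as the huggingface-cli command above).
snapshot_download(
    repo_id="tsunghanwu/visual_haystacks",
    repo_type="dataset",
    local_dir="dataset/VHs_qa",
)

# 2) Verify the expected COCO 2017 layout under the default root directory.
coco_root = Path("dataset/coco")
for subdir in ("annotations", "test2017", "val2017"):
    if not (coco_root / subdir).is_dir():
        print(f"Missing {coco_root / subdir}; please download and organize COCO 2017 first.")
```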
Run the script:
python3 main.py
Note:
- Add your OpenAI and Google API keys to `conf/solver/*.yaml`.
- This all-in-one script runs inference and then evaluation.
- Modify configs in `conf/` if needed. We use Hydra for configuration management in this project, so please refer to Hydra's documentation for more details.
defaults:
  - solver: llava # which solver to use
  - _self_

basic:
  debug_mode: False # debug mode (use only a single instance to avoid spending $$$)
  mode: single_needle # single_needle/multi_needle
  image_root: dataset # dataset root directory
  test_file_base: dataset/VHs_qa/single_needle/VHs_large # all json files are put in this directory
  output_dir: output/${solver.name}_${basic.mode}/result # output result directory (saving jsons)
  image_counts: ["oracle", 2, 3] # we will read the json file named "visual_haystack_{entry}.json"

hydra:
  run:
    dir: output/${solver.name}_${basic.mode}/logs # log dir
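For reference, a Hydra-managed entry point consumes this config roughly as sketched below. This is illustrative only: the config_name and the loop body are assumptions, not the actual main.py; the solver/basic field names follow the YAML above.

```python
# Illustrative sketch of how a Hydra entry point reads the config above.
# config_name="config" is an assumption; see main.py for the real implementation.
import hydra
from omegaconf import DictConfig


@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    print(f"Solver: {cfg.solver.name}, mode: {cfg.basic.mode}")
    for entry in cfg.basic.image_counts:
        # Each entry selects a question file named "visual_haystack_{entry}.json".
        test_file = f"{cfg.basic.test_file_base}/visual_haystack_{entry}.json"
        print("Would run inference and evaluation on:", test_file)


if __name__ == "__main__":
    main()
```

Per Hydra's convention, any of these fields can also be overridden on the command line (e.g., basic.mode=multi_needle).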
If you use our work or our implementation in this repo, or find them helpful, please consider citing our paper:
@article{wu2024visual,
title={Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark},
  author={Wu, Tsung-Han and Biamby, Giscard and Quenum, Jerome and Gupta, Ritwik and Gonzalez, Joseph E and Darrell, Trevor and Chan, David M},
journal={arXiv preprint arXiv:2407.13766},
year={2024}
}