We fine-tune a unified visual instruction synthesizer that generates diverse tasks based on image-caption pairs across various domains.
The following steps reproduce our visual instruction synthesizer. Alternatively, you can skip these steps and download our synthesizer from AdaptLLM/visual-instruction-synthesizer.
We combine VisionFLAN and ALLaVA into our required format for fine-tuning the synthesizer.
Download the following data files:
- VisionFLAN: `vflan_metadata.json` and the `images_191task_1k` image folder
- ALLaVA: `ALLaVA-Caption-VFLAN-4V.json` and `ALLaVA-Instruct-VFLAN-4V.json`
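After downloading, you can sanity-check that the files line up with the variables used in the fine-tuning command below. A minimal Python sketch (the `PATH_TO/` prefixes are placeholders to adjust):

```python
import json, os

# Adjust these placeholder paths to wherever you downloaded the files.
caption_path     = "PATH_TO/ALLaVA-Caption-VFLAN-4V.json"   # ALLaVA captions
informative_path = "PATH_TO/ALLaVA-Instruct-VFLAN-4V.json"  # ALLaVA informative answers
precise_path     = "PATH_TO/vflan_metadata.json"            # VisionFLAN precise answers
image_folder     = "PATH_TO/images_191task_1k"              # VisionFLAN images

for path in (caption_path, precise_path, informative_path):
    with open(path) as f:
        data = json.load(f)
    print(f"{os.path.basename(path)}: {len(data)} entries")

print(f"{image_folder}: {len(os.listdir(image_folder))} items")
```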
Using this seed data, we run multitask fine-tuning on an open-source MLLM (e.g., LLaVA-v1.6-8B) so that it learns to generate task triplets from the corresponding image-caption pairs; 10% of the images are replaced with a blank image to improve generalization.
```bash
conda activate adamllm

CAPTION=PATH_TO/ALLaVA-Caption-VFLAN-4V.json
PRECISE_A=PATH_TO/vflan_metadata.json
INFORMATIVE_A=PATH_TO/ALLaVA-Instruct-VFLAN-4V.json
IMAGE_FOLDER=PATH_TO/images_191task_1k

bash ./scripts/tune_synthesizer.sh ${CAPTION} ${PRECISE_A} ${INFORMATIVE_A} ${IMAGE_FOLDER}

conda deactivate
```
The tuned synthesizer is saved to `./exp/synthesizer`.
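For intuition, the 10% blank-image replacement mentioned above could be implemented roughly as follows (an illustrative sketch, not the repository's actual training code):

```python
import random
from PIL import Image

BLANK_RATIO = 0.1  # fraction of training samples whose image is replaced

def maybe_blank(image_path: str) -> Image.Image:
    """Load an image, but with probability BLANK_RATIO return a white canvas of the same size."""
    img = Image.open(image_path).convert("RGB")
    if random.random() < BLANK_RATIO:
        return Image.new("RGB", img.size, (255, 255, 255))
    return img
```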
We use the synthesizer to generate task triplets from image-caption pairs in the target domain, followed by consistency-based data filtering to enhance data quality.
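One plausible form of such a filter keeps a synthesized task only if a text-only checker (the `CONSISTENCY_CHECKER` model set below) judges its answer consistent with the source caption. The sketch below illustrates this idea with vLLM; the prompt, decision rule, and field names are assumptions, not the repository's implementation:

```python
# Conceptual sketch of consistency-based filtering -- NOT the repository's implementation.
# Keep a synthesized question-answer pair only if a text-only checker judges the answer
# consistent with the image caption.
from vllm import LLM, SamplingParams

checker = LLM(model="meta-llama/Meta-Llama-3-8B")
params = SamplingParams(temperature=0.0, max_tokens=4)

def is_consistent(caption: str, question: str, answer: str) -> bool:
    prompt = (
        f"Caption: {caption}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Is the answer consistent with the caption? Reply Yes or No: "
    )
    text = checker.generate([prompt], params)[0].outputs[0].text
    return text.strip().lower().startswith("yes")
```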
The following steps reproduce our data. You can also skip them and download the resulting synthetic data (including `image_caption_and_synthetic_task.json` and `images`) from the domain-specific repositories (e.g., AdaptLLM/biomed-visual-instructions or AdaptLLM/food-visual-instructions).
```bash
conda activate vllm
cd QA-Synthesizer/vllm_inference

SYNTHESIZER=AdaptLLM/visual-instruction-synthesizer  # Path to the synthesizer
CONSISTENCY_CHECKER=meta-llama/Meta-Llama-3-8B       # Language model for consistency checks
```
We have included a few data samples in this repository for a quick try:
```bash
IMAGE_CAPTION='../data_samples/image_caption_pairs.json'  # Path to the image-caption pairs
IMAGE_FOLDER='../data_samples/images'                     # Path to the image folder
OUTPUT_DIR='../data_samples/'                             # Output directory for the synthesized data

# Run synthesis with data parallelism; adjust CUDA devices as needed:
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
```
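Before preparing data for your own domain, it helps to peek at the bundled sample to see the expected image-caption schema. A small sketch (assumes the file is a JSON list; adjust if it is not):

```python
import json

# Inspect the bundled sample to see the expected image-caption schema.
with open("../data_samples/image_caption_pairs.json") as f:
    samples = json.load(f)

print(f"{len(samples)} image-caption pairs")
print(json.dumps(samples[0], indent=2))  # assumes a JSON list; adjust if the file is a dict
```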
For the biomedicine domain:

- Download the `image_caption_pairs.json` file and `images` from AdaptLLM/biomed-visual-instructions (a Hub download sketch follows the commands below).
- Then run:

```bash
IMAGE_CAPTION="./biomed-visual-instructions/image_caption_pairs.json"
IMAGE_FOLDER="./biomed-visual-instructions/images"
OUTPUT_DIR="./biomed-visual-instructions"

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
```
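If the files are published as a Hugging Face dataset repository, one way to fetch them is with `huggingface_hub` (a sketch; the repo type and layout are assumptions, so adjust if the data is hosted differently):

```python
from huggingface_hub import snapshot_download

# Assumes AdaptLLM/biomed-visual-instructions is a dataset repo on the Hub;
# adjust repo_type/local_dir if the files are hosted differently.
# For the food domain, use repo_id="AdaptLLM/food-visual-instructions".
snapshot_download(
    repo_id="AdaptLLM/biomed-visual-instructions",
    repo_type="dataset",
    local_dir="./biomed-visual-instructions",
)
```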
For the food domain:

- Download the `image_caption_pairs.json` file and `images` from AdaptLLM/food-visual-instructions.
- Then run:

```bash
IMAGE_CAPTION="./food-visual-instructions/image_caption_pairs.json"
IMAGE_FOLDER="./food-visual-instructions/images"
OUTPUT_DIR="./food-visual-instructions"

CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' bash run_synthesis.sh ${SYNTHESIZER} ${CONSISTENCY_CHECKER} ${IMAGE_CAPTION} ${IMAGE_FOLDER} ${OUTPUT_DIR}
```
The synthesized output for single-stage post-training is saved to `${OUTPUT_DIR}/image_caption_and_synthetic_task.json`.
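To spot-check the result, you can load the output file and peek at an entry; this sketch assumes the file is a JSON list and does not rely on a particular schema:

```python
import json

# Spot-check the synthesized output; replace the path with your ${OUTPUT_DIR}.
output_path = "./biomed-visual-instructions/image_caption_and_synthetic_task.json"
with open(output_path) as f:
    data = json.load(f)

print(f"{len(data)} entries")
print(json.dumps(data[0], indent=2)[:1000])  # peek at the first entry (assumes a JSON list)
```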