Paper: Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models (https://arxiv.org/abs/2404.10237)
Med-MoE is a novel and lightweight framework designed to handle both discriminative and generative multimodal medical tasks. It employs a three-step learning process: aligning multimodal medical images with LLM tokens, instruction tuning with a trainable router for expert selection, and domain-specific MoE tuning. Our model stands out by incorporating highly specialized domain-specific experts, significantly reducing the required model parameters by 30%-50% while achieving superior or on-par performance compared to state-of-the-art models. This expert specialization and efficiency make Med-MoE highly suitable for resource-constrained clinical settings.
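To make the idea of a router selecting experts concrete, here is a minimal sketch of top-k routing in plain Python. It is an illustration of a common MoE routing scheme (softmax over router scores, keep the k best, renormalise), not the Med-MoE implementation; the function names `route_top_k` and `moe_output`, the choice of `k`, and the toy experts are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are the softmax probabilities of the selected experts,
    renormalised so that they sum to 1. (Assumed scheme, for illustration.)
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_output(token, experts, router_logits, k=2):
    """Combine the outputs of the selected experts, weighted by the router."""
    return sum(w * experts[i](token) for i, w in route_top_k(router_logits, k))
```

Only the selected experts are evaluated for a given token, which is why a sparse mixture can activate far fewer parameters per token than a dense model of the same total size.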
Prepare the Environment
- Clone and navigate to the TinyMed project directory:
cd TinyMed
- Set up your environment:
conda create -n tinymed python=3.10 -y
conda activate tinymed
pip install --upgrade pip
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
- Replace the default MoE with our provided version.
- Download the domain-specific router that we provide, or train your own, and set its path in the moellava/model/language_model/llava_stablelm_moe.py file.
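After the installation steps above, a quick check that the key packages can be found may save a failed launch later. The sketch below uses only the standard library. The module names are assumptions: `flash_attn` is the import name of the flash-attn package, `moellava` is inferred from the moellava/ paths used in this README, and `torch` and `deepspeed` are assumed dependencies.

```python
import importlib.util


def check_modules(names):
    """Map each top-level module name to True if the import system can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names; adjust the list to match your installation.
    wanted = ["torch", "deepspeed", "flash_attn", "moellava"]
    for name, found in check_modules(wanted).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

This only confirms that each module is locatable, not that it imports cleanly or that a GPU is visible.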
Prepare the Datasets
Utilize the LLaVA-Med Datasets for training:
- For Pretrained Models: LLaVA-Med Alignment Dataset
- For Instruction-Tuning: LLaVA-Med Instruct Dataset
- For Pretrained and SFT Images: run wget https://hanoverprod.z21.web.core.windows.net/med_llava/llava_med_image_urls.jsonl and then python download_image.py (don't forget to replace the paths with your own)
- For MoE-Tuning Stage: Training Jsonl
- MoE-Tuning Stage Image Data: Stage3 ImageData. Note that some images from LLaVA-Med are no longer available; these have been excluded from training.
- Test.json for VQA: https://drive.google.com/file/d/1pyGsm8G0Gig63DAnOdLuUn3IyxrztWtR/view?usp=sharing
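Before running the image download, it can help to look at a few records of the downloaded JSON Lines file and see which fields it contains. The helper below is a minimal sketch; `peek_jsonl` is a hypothetical name, and it makes no assumption about the field names in llava_med_image_urls.jsonl.

```python
import json


def peek_jsonl(path, limit=3):
    """Return the first `limit` records of a JSON Lines file and the keys they use."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            records.append(json.loads(line))
            if len(records) >= limit:
                break
    keys = sorted({key for rec in records if isinstance(rec, dict) for key in rec})
    return records, keys
```

For example, `records, keys = peek_jsonl("llava_med_image_urls.jsonl")` followed by `print(keys)` lists the fields present in the first few records.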
Launch the Web Interface
Use DeepSpeed to start the Gradio web server:
- Phi2 Model:
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "./MedMoE-phi2"
- StableLM Model:
deepspeed --include localhost:0 moellava/serve/gradio_web_server.py --model-path "./MedMoE-stablelm-1.6b"
Command Line Inference
Execute models from the command line:
- Phi2 Model:
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "./MedMoE-phi2" --image-file "image.jpg"
- StableLM Model:
deepspeed --include localhost:0 moellava/serve/cli.py --model-path "./MedMoE-stablelm-1.6b" --image-file "image.jpg"
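If you switch between models or images often, the command above can be assembled programmatically. This is a small sketch that uses only the flags shown in this README; `build_cli_command` is a hypothetical helper, and it only builds the argument list rather than running it, since the CLI may prompt for input interactively.

```python
import shlex


def build_cli_command(model_path, image_file, gpu=0):
    """Build the deepspeed argument list for moellava/serve/cli.py."""
    return [
        "deepspeed", "--include", f"localhost:{gpu}",
        "moellava/serve/cli.py",
        "--model-path", str(model_path),
        "--image-file", str(image_file),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command for the Phi2 model.
    print(shlex.join(build_cli_command("./MedMoE-phi2", "image.jpg")))
```

Passing the list to `subprocess.run` instead of printing it avoids shell quoting problems with paths that contain spaces.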
Available Models
Evaluation
The evaluation process runs the model on multiple GPUs in parallel, one chunk of the question file per GPU, and then combines the results. The steps and commands are:
# Set the number of chunks and the GPUs to use (one chunk per GPU)
CHUNKS=2
GPUS=(0 1)
# Conversation template: stablelm or phi2, matching your model
CONV_MODE=stablelm
# Run inference on each GPU in the background
for IDX in $(seq 0 $((CHUNKS - 1))); do
    GPU_IDX=${GPUS[$IDX]}
    # Give each DeepSpeed launcher its own port to avoid collisions
    PORT=$((GPU_IDX + 29500))
    deepspeed --include localhost:$GPU_IDX --master_port $PORT model_vqa_med.py \
        --model-path your_model_path \
        --question-file ./test_rad.json \
        --image-folder ./3vqa/images \
        --answers-file ./test_llava-13b-chunk${CHUNKS}_${IDX}.jsonl \
        --temperature 0 \
        --num-chunks $CHUNKS \
        --chunk-idx $IDX \
        --conv-mode $CONV_MODE &
done
# Wait for all background jobs to finish before merging
wait
# Combine JSONL results into one file, in chunk order
for IDX in $(seq 0 $((CHUNKS - 1))); do
    cat ./test_llava-13b-chunk${CHUNKS}_${IDX}.jsonl
done > ./radvqa.jsonl
# Run evaluation
python run_eval.py \
    --gt ./3vqa/test_rad.json \
    --pred ./radvqa.jsonl \
    --output ./data_RAD/wrong_answers.json
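The chunking and merging steps in the script above can be sketched in Python. The splitting rule shown here (contiguous chunks of ceiling size) follows the convention of the upstream LLaVA evaluation scripts; that model_vqa_med.py uses the same rule is an assumption. `merge_jsonl` is a hypothetical helper that does the same job as the cat step but also checks that every line is valid JSON, which catches a chunk file truncated by an interrupted run.

```python
import json
import math


def split_list(items, n):
    """Split `items` into at most n contiguous chunks of near-equal size."""
    chunk_size = max(1, math.ceil(len(items) / n))
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def get_chunk(items, n, k):
    """Return chunk k (0-based) out of n, as --num-chunks / --chunk-idx would select."""
    return split_list(items, n)[k]


def merge_jsonl(chunk_paths, out_path):
    """Concatenate JSONL chunk files in order and return the number of records."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in chunk_paths:
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    if not line.strip():
                        continue  # skip blank lines
                    json.loads(line)  # raises ValueError on a malformed line
                    out.write(line.rstrip("\n") + "\n")
                    count += 1
    return count
```

Comparing the count returned by `merge_jsonl` with the number of questions in the test file is a simple way to confirm that no chunk is missing before running the evaluation.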
Special thanks to these foundational works:
@misc{jiang2024medmoemixturedomainspecificexperts,
title={Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models},
author={Songtao Jiang and Tuo Zheng and Yan Zhang and Yeying Jin and Li Yuan and Zuozhu Liu},
year={2024},
eprint={2404.10237},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2404.10237},
}