BiRD shows promising referring and bounding-box grounding capabilities in the biomedical field.
- ✨ We construct the Med-GRIT-270k dataset. Large-scale biomedical image-mask pairs are transformed into multi-modal conversations by leveraging ChatGPT in a novel process. It is the first dataset in biomedicine to integrate referring, grounding, and conversations.
- ✨ We present BiRD, the first Biomedical Refer-and-grounD Multimodal Large Language Model. It is fine-tuned with multi-task instruction learning for the biomedical domain on self-generated data, which validates the effectiveness of multi-task instruction tuning and highlights best practices for adapting MLLMs to specialized domains.
You can refer to the official PaddleMIX documentation to initialize the virtual environment; a minimal setup sketch is also shown below.
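The sketch assumes a pip-based install from the official PaddleMIX repository; the exact steps and required PaddlePaddle version may differ, so treat the official documentation as authoritative.

```shell
# Sketch only: the install steps below are assumptions; follow the official
# PaddleMIX documentation for the required PaddlePaddle version.
git clone https://github.com/PaddlePaddle/PaddleMIX.git
cd PaddleMIX
pip install -r requirements.txt   # project dependencies
pip install -e .                  # install PaddleMIX itself, if an editable install is supported
```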
To download the images, please refer to SAM-Med2D.
For the QA pairs, please fill out the following form to obtain the Med-GRIT-270k dataset: Google Form. We will send the dataset to you by email after your application is approved.
We build this project on the PaddleMIX framework. You can fine-tune Qwen-VL with the following command:
sh train.sh {GPU_ids} paddlemix/config/BiRD/sft_argument_stage2.json
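For example, to launch fine-tuning on four GPUs (the comma-separated GPU id format is an assumption; train.sh defines how {GPU_ids} is actually parsed):

```shell
# Illustrative invocation; check train.sh for the exact {GPU_ids} syntax.
sh train.sh 0,1,2,3 paddlemix/config/BiRD/sft_argument_stage2.json
```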
You can also refer to the official documentation to fine-tune other multimodal large models.
Run inference to generate the prediction JSONL file.
sh tests/models/BiRD/infer_all.sh
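As an optional sanity check before scoring, you can confirm that the prediction file is valid JSONL (the file name below is a placeholder; infer_all.sh determines the actual output path):

```shell
# 'predictions.jsonl' is a placeholder for the file produced by infer_all.sh.
python -c "import json,sys; [json.loads(l) for l in open(sys.argv[1])]; print('valid JSONL')" predictions.jsonl
```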
Use the prediction JSONL file to calculate the metrics.
sh tests/models/BiRD/eval_all.sh
We thank the following excellent works: FERRET, PaddleMIX, and SAM-Med2D.
The data, code, and model checkpoints are intended to be used solely for (I) future research on visual-language processing and (II) reproducibility of the experimental results reported in the reference paper. The data, code, and model checkpoints are not intended to be used in clinical care or for any clinical decision making purposes.
The primary intended use is to support AI researchers in reproducing and building on top of this work. BiRD and its associated models should be helpful for exploring various biomedical pixel grounding and visual question answering (VQA) research questions.
Any deployed use case of the model --- commercial or otherwise --- is out of scope. Although we evaluated the models using a broad set of publicly-available research benchmarks, the models and evaluations are intended for research use only and not intended for deployed use cases.
- The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
- The service is a research preview intended for non-commercial use only, subject to Terms of Use of the data generated by OpenAI, and Terms of Use of SAM-Med2D-20M. Please contact us if you find any potential violation.
If you find our paper and code useful in your research, please consider giving us a star and a citation.
@inproceedings{huang2024refer,
title={A Refer-and-Ground Multimodal Large Language Model for Biomedicine},
author={Huang, Xiaoshuang and Huang, Haifeng and Shen, Lingdong and Yang, Yehui and Shang, Fangxin and Liu, Junwei and Liu, Jia},
booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
pages={399--409},
year={2024},
organization={Springer}
}