Code for "Self-supervised Co-learning of Uncurated Images and Reports Enables Oversight AI in Radiology"
Paper link: https://arxiv.org/abs/2208.05140
[Paper] | Official PyTorch code
Medical X-VL: Medical Domain X-attention Vision-Language model
Medical X-VL is a vision-language pre-training model tailored to the intrinsic properties of medical domain data. For the demo, we provide Python code to run vision-language pre-training, fine-tune and evaluate the model on each downstream task, and visualize the cross-attention between words and visual semantics.
- Ubuntu 20.04
- Python 3.8 (tested)
- Conda
- PyTorch 1.8.0 (tested)
- CUDA 11.3 (tested)
- CPU or GPU supporting CUDA, cuDNN, and PyTorch 1.8
- Tested on a GeForce RTX 3090
- We recommend more than 32 GB of RAM.
- Install PyTorch and the other dependencies. They can be installed easily from the requirements.txt file (an optional environment check is sketched below the command).
> pip install -r requirements.txt
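After installation, the following optional snippet checks that the tested PyTorch/CUDA setup is visible to Python; it uses only standard torch calls and nothing from this repository.

```python
# Optional environment sanity check before running the scripts below.
import torch

print(f"PyTorch version: {torch.__version__}")          # tested with 1.8.0
print(f"CUDA available:  {torch.cuda.is_available()}")  # tested with CUDA 11.3
if torch.cuda.is_available():
    print(f"GPU device:      {torch.cuda.get_device_name(0)}")  # tested on a GeForce RTX 3090
```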
The open-source datasets used in the paper can be obtained from the following links.
- We follow MedViLL to preprocess and split the MIMIC-CXR and VQA-RAD datasets. See this link for details.
- COVID-19 and normal data can be downloaded from the Brixia and NIH databases.
Other parts of the institutional data used in this study are not publicly available due to patient privacy obligations. Interested users can request access to these data for research purposes by contacting the corresponding author, J.C.Y. ([email protected]).
You can download the weights pretrained on the CheXpert dataset from the links below; place them at the checkpoint paths used by the commands that follow (e.g., /PATH/TO/PRETRAIN/).
https://drive.google.com/file/d/1RKowiRjRCIj6WUlzhFsJsgaA33g9K9l2/view?usp=sharing
https://drive.google.com/file/d/1Y9uc_eVgp0irNE0BUka9_0qbY5urdS6_/view?usp=sharing
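To verify a downloaded checkpoint before use, a minimal inspection sketch is shown below; the "model" key is an assumption about the checkpoint layout, so adjust it to whatever your file actually contains.

```python
# Minimal sketch: inspect a downloaded checkpoint on CPU before training/evaluation.
import torch

checkpoint = torch.load("/PATH/TO/PRETRAIN/checkpoint.pth", map_location="cpu")
state_dict = checkpoint.get("model", checkpoint)  # "model" key is an assumption; inspect your file
print(f"Loaded {len(state_dict)} entries from the checkpoint")
```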
First, download the ImageNet-pretrained weights for the visual encoder from this link. We utilized the pre-trained ViT-S/16 model as the visual encoder. Vision-language pre-training can then be run as below.
> --config ./configs/Pretrain.yaml --output_dir ./output/
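For illustration, a minimal sketch of instantiating a ViT-S/16 backbone via timm is shown below; timm and the model name "vit_small_patch16_224" are assumptions for the sketch, and the actual training script initializes the visual encoder from the downloaded weights through its own config.

```python
# Minimal sketch (assumption: timm) of a ViT-S/16 visual encoder.
import timm
import torch

visual_encoder = timm.create_model("vit_small_patch16_224", pretrained=True)
visual_encoder.eval()

dummy = torch.randn(1, 3, 224, 224)                 # placeholder image batch
with torch.no_grad():
    feats = visual_encoder.forward_features(dummy)  # token features; exact shape depends on the timm version
print(feats.shape)
```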
Our model supports zero-shot image-to-text and text-to-image retrieval without any fine-tuning step, as below.
> --config ./configs/Retrieval.yaml --output_dir ./output/ --checkpoint /PATH/TO/PRETRAIN/ --evaluate
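Conceptually, zero-shot retrieval ranks reports (or images) by embedding similarity. The sketch below is a minimal illustration with placeholder embeddings; `retrieve` is a hypothetical helper, not a function from this repository.

```python
# Minimal sketch of zero-shot image-to-text retrieval by cosine similarity.
import torch
import torch.nn.functional as F

def retrieve(image_embeds: torch.Tensor, text_embeds: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Return the indices of the top-k reports for each image."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    sim = image_embeds @ text_embeds.t()          # [num_images, num_texts] similarity matrix
    return sim.topk(k, dim=-1).indices

# Usage with random placeholders (replace with embeddings from the pre-trained model).
topk = retrieve(torch.randn(8, 256), torch.randn(100, 256))
print(topk.shape)  # torch.Size([8, 5])
```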
From the VLP weights, the model can be fine-tuned for the report generation task as below.
> --config ./configs/Generation.yaml --output_dir ./output/ --checkpoint /PATH/TO/PRETRAIN/
After fine-tuning, inference can be done as below.
> --config ./configs/Generation.yaml --output_dir ./output/ --checkpoint /PATH/TO/FINETUNE/ --evaluate
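As a quick offline sanity check on generated reports, n-gram overlap can be computed against the reference reports. The sketch below uses NLTK's BLEU as an assumption; it is not necessarily the metric implementation used in the paper's evaluation.

```python
# Minimal sketch: score generated reports against references with smoothed BLEU-4.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [["no acute cardiopulmonary process".split()]]   # one list of references per sample
candidates = ["no acute cardiopulmonary abnormality".split()]  # generated report tokens

bleu4 = corpus_bleu(
    references, candidates,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.3f}")
```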
From the VLP weights, the model can be fine-tuned for the VQA task as below.
> --config ./configs/VQA.yaml --output_dir ./output/ --checkpoint /PATH/TO/PRETRAIN/
After fine-tuning, inference can be done as below.
> --config ./configs/VQA.yaml --output_dir ./output/ --checkpoint /PATH/TO/FINETUNE/ --evaluate
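For closed-type questions (e.g., yes/no in VQA-RAD), evaluation reduces to picking the highest-scoring candidate answer and measuring accuracy. The sketch below uses placeholder logits and labels; the actual scores come from the fine-tuned checkpoint.

```python
# Minimal sketch of closed-set VQA accuracy with placeholder model outputs.
import torch

answer_candidates = ["yes", "no"]                 # e.g. closed-type VQA-RAD answers
logits = torch.tensor([[2.3, -0.7],               # placeholder per-question answer scores
                       [-1.1, 0.4]])
labels = torch.tensor([0, 1])                     # ground-truth answer indices

predictions = logits.argmax(dim=-1)
accuracy = (predictions == labels).float().mean().item()
print([answer_candidates[i] for i in predictions.tolist()], f"accuracy={accuracy:.2f}")
```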
Human errors (e.g., patient mismatch, orientation confusion) can be detected without any fine-tuning step, since the model is already trained to correlate the image and report during pre-training. Run detection as below.
> --config ./configs/Detection.yaml --output_dir ./output/ --checkpoint /PATH/TO/PRETRAIN/ --evaluate
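Conceptually, the detection relies on how well the pre-trained model judges an image and a report to match: low similarity suggests a possible human error. The sketch below is a minimal illustration with placeholder embeddings; `flag_mismatch` and the threshold are hypothetical, not part of this repository.

```python
# Minimal sketch: flag an image-report pair whose similarity falls below a threshold.
import torch
import torch.nn.functional as F

def flag_mismatch(image_embed: torch.Tensor, text_embed: torch.Tensor,
                  threshold: float = 0.3) -> bool:
    """Return True if the image-report pair looks inconsistent (possible human error)."""
    score = F.cosine_similarity(image_embed, text_embed, dim=-1).item()
    return score < threshold

# Usage with random placeholders (replace with embeddings from the pre-trained model).
print(flag_mismatch(torch.randn(256), torch.randn(256)))
```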
Successful visualization will show the cross-attention between the words and the visual semantics (image patches), as below.
> --config ./configs/Pretrain.yaml --output_dir ./output/ --checkpoint /PATH/TO/PRETRAIN/ --evaluate
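A minimal sketch of overlaying one word's attention over the 14x14 patch grid of a ViT-S/16 input (224x224) is shown below; the attention weights here are random placeholders, while the visualization script extracts the real cross-attention maps from the model.

```python
# Minimal sketch: overlay a word-to-patch cross-attention map on the input image.
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn.functional as F

image = np.random.rand(224, 224)                      # placeholder chest X-ray
attn = torch.rand(14 * 14)                            # placeholder attention of one word over 14x14 patches
attn_map = F.interpolate(attn.reshape(1, 1, 14, 14),  # upsample the patch grid to the image size
                         size=(224, 224), mode="bilinear", align_corners=False)

plt.imshow(image, cmap="gray")
plt.imshow(attn_map.squeeze().numpy(), cmap="jet", alpha=0.4)  # heatmap overlay
plt.axis("off")
plt.savefig("cross_attention.png", bbox_inches="tight")
```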