- [2024.08.22] Our approach now supports audio attribution on foundation models, using ImageBind as an example! Welcome to try it according to the tutorial!
- [2024.06.16] Our approach now supports interpretation of the medical multimodal model Quilt! Welcome to try it according to the tutorial!
- [2024.06.04] Our approach now supports multi-GPU interpretation processing, please refer to the `./scripts` folder!
- [2024.06.04] Our approach now supports CLIP interpretation! Welcome to try it according to the tutorial!
- [2024.04.22] Our approach now supports LanguageBind interpretation! Welcome to try it according to the tutorial!
- [2024.04.11] Our approach now supports multi-modal models with a ViT backbone (ImageBind, PyTorch only)! Welcome to try it according to the tutorial!
- [2024.01.17] The original code is available now! Welcome to try it according to the tutorial!
- [2024.01.16] The paper has been accepted by ICLR 2024 and selected for oral presentation!
Our method supports both the Keras and PyTorch deep learning frameworks. You can first install PyTorch.
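For example, a typical installation (the exact version and CUDA wheel depend on your platform; see the official PyTorch instructions):

```bash
pip install torch torchvision
```

Then add the remaining dependencies listed below.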
```
opencv-python
opencv-contrib-python
mtutils
tqdm
scipy
scikit-learn
scikit-image
matplotlib==3.7.1
seaborn
xplique>=1.0.3
```
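If you save the list above as a requirements.txt file, the whole set installs in one step:

```bash
pip install -r requirements.txt
```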
Our original code is based on Keras, and the verification on the ViT models relies entirely on PyTorch.
```bash
conda create -n smdl python=3.10
conda activate smdl
python3 -m pip install tensorflow[and-cuda]
pip install git+https://github.com/facebookresearch/segment-anything.git
```
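After installation, a quick sanity check (a minimal sketch) confirms the core dependencies are importable and a GPU is visible:

```python
import torch
from segment_anything import sam_model_registry  # installed by the command above

# Report the PyTorch version and whether CUDA is usable.
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# List the registered SAM variants; loading one requires a downloaded checkpoint.
print("SAM variants:", sorted(sam_model_registry.keys()))
```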
Note: Our method will no longer support TensorFlow/Keras and will focus on PyTorch.
Recognition Models (please download the models and place them in the path `ckpt/keras_model`):

| Datasets | Model |
|---|---|
| Celeb-A | keras-ArcFace-R100-Celeb-A.h5 |
| VGG-Face2 | keras-ArcFace-R100-VGGFace2.h5 |
| CUB-200-2011 | cub-resnet101.h5, cub-resnet101-new.h5, cub-efficientnetv2m.h5, cub-mobilenetv2.h5, cub-vgg19.h5 |
Uncertainty Estimation Models (please download the models and place them in the path `ckpt/pytorch_model`):

| Datasets | Model |
|---|---|
| Celeb-A | edl-101-10177.pth |
| VGG-Face2 | edl-101-8631.pth |
| CUB-200-2011 | cub-resnet101-edl.pth |
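For reference, a checkpoint such as cub-resnet101-edl.pth could be loaded along these lines (a minimal sketch; the 200-way head and the state_dict layout are assumptions, so check the repository's model definitions for the exact architecture):

```python
import torch
from torchvision import models

# Assumed architecture: a ResNet-101 backbone whose final layer outputs
# per-class evidence (EDL = evidential deep learning) for the 200 CUB classes.
model = models.resnet101(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 200)

state = torch.load("ckpt/pytorch_model/cub-resnet101-edl.pth", map_location="cpu")
model.load_state_dict(state, strict=False)  # strict=False because the layout is assumed
model.eval()
```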
Audio classification attribution (on the multimodal foundation model ImageBind):
Medical multimodal model debugging:
If you want to see how to apply this to your own model, please refer to the Jupyter notebooks in `./tutorial/` first.
Note: We first describe how to compute attributions for multimodal models and how to evaluate them.
For multi-GPU processing, please refer to the `./scripts` folder, for example:

```bash
./scripts/clip_multigpu.sh
```
Then, you may get a saved intermediate result in the path `submodular_results/imagenet-clip-vitl/slico-0.0-0.05-1.0-1.0`.
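Each multi-GPU script follows a simple data-parallel pattern: one process per GPU, each on a disjoint shard of the data. A minimal sketch of the idea (the script name and the --shard/--num-shards flags are hypothetical; the real argument handling lives in the scripts under `./scripts`):

```bash
# Hypothetical launcher: one attribution process per GPU, each on its own shard.
for GPU in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$GPU \
    python attribution.py --shard $GPU --num-shards 4 &
done
wait  # block until all background processes finish
```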
Evaluate the Insertion and Deletion metrics:

```bash
python -m evals.eval_AUC_faithfulness --explanation-dir submodular_results/imagenet-clip-vitl/slico-0.0-0.05-1.0-1.0
```

You may get the following results:

```
Insertion AUC Score: 0.7550
Deletion AUC Score: 0.0814
```
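For intuition, Insertion reveals pixels in decreasing attribution order and measures the area under the model-confidence curve (higher is better), while Deletion removes them (lower is better). A minimal sketch of the computation, assuming a model_fn that returns the target-class confidence for an image array; the step count and the blank baseline are simplifications relative to the repository's evals implementation:

```python
import numpy as np

def curve_auc(scores):
    # Trapezoidal area under the curve, with x normalized to [0, 1].
    scores = np.asarray(scores, dtype=float)
    return float(((scores[1:] + scores[:-1]) / 2.0).mean())

def insertion_deletion_auc(model_fn, image, saliency, steps=50):
    """image: (H, W, C) array; saliency: (H, W) attribution map."""
    order = np.argsort(saliency.ravel())[::-1]  # most-attributed pixels first
    blank = np.zeros_like(image)
    ins, dele = [], []
    for k in np.linspace(0, order.size, steps, dtype=int):
        keep = np.zeros(order.size, dtype=bool)
        keep[order[:k]] = True
        keep = keep.reshape(saliency.shape)[..., None]
        ins.append(model_fn(np.where(keep, image, blank)))   # reveal top-k pixels
        dele.append(model_fn(np.where(keep, blank, image)))  # delete top-k pixels
    return curve_auc(ins), curve_auc(dele)
```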
- Xplique: a Neural Networks Explainability Toolbox.
- Score-CAM: a third-party implementation with Keras.
- Segment-Anything: a new AI model from Meta AI that can "cut out" any object, in any image, with a single click.
- CLIP: a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task.
- ImageBind: ImageBind learns a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection, and generation.
- LanguageBind: LanguageBind is a language-centric multimodal pretraining approach, taking language as the bind across different modalities, because the language modality is well explored and contains rich semantics.
```bibtex
@inproceedings{chen2024less,
  title={Less is More: Fewer Interpretable Region via Submodular Subset Selection},
  author={Chen, Ruoyu and Zhang, Hua and Liang, Siyuan and Li, Jingzhi and Cao, Xiaochun},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024}
}
```